> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java, line 252
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204487#file1204487line252>
> >
> >     Why not use the existing Hadoop class? That's what we did for Kite:
> >     https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/InputFormatReader.java#L69
> >
> >     Then you don't need to include your own anonymous class. There's a bit of overhead to get the right constructor, but I think that's a better strategy.

I agree that the end result is prettier, but I'm not sure the added complexity is worth it in this case. I would like to get a committer's view on this one.
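For reference, a minimal sketch of the approach Ryan describes, assuming the anonymous class in question is a TaskAttemptContext implementation (an assumption; the linked Kite reader constructs one reflectively). The "overhead to get the right constructor" is the reflective lookup that bridges Hadoop 1 and 2. This is illustrative only, not the code under review:

```
// Illustrative only: pick whichever concrete TaskAttemptContext class the
// running Hadoop version provides, instead of declaring an anonymous subclass.
import java.lang.reflect.Constructor;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;

final class TaskAttemptContexts {
  static TaskAttemptContext create(Configuration conf, TaskAttemptID id) throws Exception {
    Class<?> impl;
    try {
      // Hadoop 2.x: TaskAttemptContext is an interface; use the concrete class.
      impl = Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
    } catch (ClassNotFoundException e) {
      // Hadoop 1.x: TaskAttemptContext is itself a concrete class.
      impl = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
    }
    Constructor<?> ctor = impl.getConstructor(Configuration.class, TaskAttemptID.class);
    return (TaskAttemptContext) ctor.newInstance(conf, id);
  }
}
```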
> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/pom.xml, line 88
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204486#file1204486line88>
> >
> >     Does the exclusion from dependencyManagement apply here? I don't think you need both.

Forgot to remove that. Thanks!


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java, line 506
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204487#file1204487line506>
> >
> >     I think a better and cheaper check is to read the first 4 bytes (and/or the last 4 bytes) and make sure they are "PAR1". The first 4 and the last 4 bytes of every valid Parquet file contain that magic pattern.

We "could" run into a non-Parquet file that starts/ends with that magic pattern, right? Isn't this approach more complete, in that it prevents false positives? (A sketch of the magic-byte check is further down.)


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java, line 53
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204492#file1204492line53>
> >
> >     Does this make the default UNCOMPRESSED? I think I'd rather see SNAPPY by default.

For our other formats we default to not compressing. I think it may be strange for one format to use compression while the others do not. Do you feel strongly about this?


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java, line 58
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204492#file1204492line58>
> >
> >     Why use the OutputFormat and RecordWriter here instead of an AvroParquetWriter? That is more straightforward.

Good idea. Thanks.
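For the AvroParquetWriter suggestion above, a sketch of what that route can look like (illustrative only; the builder API shown here is the parquet-avro 1.8+ form, and older releases use the `parquet.*` package namespace and constructor-based creation instead):

```
// Illustrative only: writing Avro GenericRecords straight to a Parquet file,
// which replaces the OutputFormat + RecordWriter plumbing with a single writer.
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

final class AvroParquetWriterSketch {
  static ParquetWriter<GenericRecord> open(Path file, Schema schema,
      CompressionCodecName codec) throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(file)
        .withSchema(schema)
        .withCompressionCodec(codec)
        .build();
  }
}
```

Records are then appended with `writer.write(record)` and the Parquet footer is written when `writer.close()` is called.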
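Going back to the magic-byte check suggested above for HdfsExtractor, a minimal sketch of that check over the Hadoop FileSystem API (the helper name and placement are made up for illustration):

```
// Illustrative only: every valid Parquet file begins and ends with the four
// ASCII bytes "PAR1", so checking both is a cheap pre-filter before trying to
// read the file as Parquet.
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class ParquetMagic {
  private static final byte[] MAGIC = {'P', 'A', 'R', '1'};

  static boolean looksLikeParquet(FileSystem fs, Path file) throws IOException {
    long length = fs.getFileStatus(file).getLen();
    if (length < 2L * MAGIC.length) {
      return false; // too small to hold both the header and footer magic
    }
    byte[] head = new byte[MAGIC.length];
    byte[] tail = new byte[MAGIC.length];
    try (FSDataInputStream in = fs.open(file)) {
      in.readFully(0, head);
      in.readFully(length - MAGIC.length, tail);
    }
    return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
  }
}
```

A collision with a non-Parquet file that happens to carry both markers is possible in principle, which is the false-positive concern raised above, but the check itself is much cheaper than parsing the footer.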
> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 165
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line165>
> >
> >     Should this be assertTrue instead of assert?

Thanks.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java, line 96
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204500#file1204500line96>
> >
> >     Why is this a multiset?

It is true that setLines does not necessarily need to be a multiset. I made it one to match the assertMapReduceOutput method from HdfsAsserts, which inspired this chunk of code. Even though we do not have any duplicates in our output here, I think the multiset reflects the constraints we are testing for (unordered, may contain duplicates), so I kept it.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 197
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line197>
> >
> >     Why is this value changed?

Reverted.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 188
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line188>
> >
> >     Why is default compression not supported?

Looking at Parquet's CompressionCodecName enum:

```
public enum CompressionCodecName {
  UNCOMPRESSED(null, CompressionCodec.UNCOMPRESSED, ""),
  SNAPPY("parquet.hadoop.codec.SnappyCodec", CompressionCodec.SNAPPY, ".snappy"),
  GZIP("org.apache.hadoop.io.compress.GzipCodec", CompressionCodec.GZIP, ".gz"),
  LZO("com.hadoop.compression.lzo.LzoCodec", CompressionCodec.LZO, ".lzo");
```

ToCompression.DEFAULT maps to org.apache.hadoop.io.compress.DefaultCodec, which I believe is zlib, and zlib is not supported here.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > pom.xml, line 720
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204498#file1204498line720>
> >
> >     Why exclude Avro? The dependency version should be managed by Maven to match the version depended on by Sqoop directly.

This is an issue that I believe is related to our connector class loader. Not excluding Avro causes the following exception:

```
java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.sqoop.connector.idf.AVROIntermediateDataFormat.getAvroSchema()Lorg/apache/avro/Schema;" the class loader (instance of org/apache/sqoop/classloader/ConnectorClassLoader) of the current class, org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/apache/sqoop/connector/idf/AVROIntermediateDataFormat, have different Class objects for the type org/apache/avro/Schema used in the signature
    at org.apache.sqoop.connector.hdfs.hdfsWriter.HdfsParquetWriter.initialize(HdfsParquetWriter.java:60)
    at org.apache.sqoop.connector.hdfs.HdfsLoader$1.run(HdfsLoader.java:95)
    at org.apache.sqoop.connector.hdfs.HdfsLoader$1.run(HdfsLoader.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:61)
    at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:46)
    at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread$1.call(SqoopOutputFormatLoadExecutor.java:276)
    at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread$1.call(SqoopOutputFormatLoadExecutor.java:257)
    at org.apache.sqoop.utils.ClassUtils.executeWithClassLoader(ClassUtils.java:281)
    at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread.run(SqoopOutputFormatLoadExecutor.java:256)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```

This is occurring within individual mappers.


- Abraham


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42327/#review115691
-----------------------------------------------------------
On Jan. 21, 2016, 7:06 p.m., Abraham Fine wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42327/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 7:06 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Bugs: SQOOP-2788
>     https://issues.apache.org/jira/browse/SQOOP-2788
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> read and write parquet in hdfsconnector
> 
> 
> Diffs
> -----
> 
>   connector/connector-hdfs/pom.xml 599631418ca63cc43d645c1ee1e7a73dc824b313 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java a813c479a07d68e14ed49936f642e762e5b37437 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsLoader.java 774221aaf5c8cdb8d26ca108fae239598b42229b 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsPartition.java 644de60581faf90ceb2fcef8d3e0544067791fcc 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/configuration/ToFormat.java 27d121f529ecb4d5bd79e2b8c74ab8f7cc15fb10 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/GenericHdfsWriter.java 2ccccc4a94a582c8b47ccdefa523d1fd1632e627 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java PRE-CREATION 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsSequenceWriter.java 75c2e7ef192d7d9628e622cc3c5ef176e33a73d0 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsTextWriter.java 78cf9732fdb89689b04d43e4af70ca5a43732dbf 
>   connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java 11fcef2a38209c79928f582cf8aa03e889247f22 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/common/SqoopAvroUtils.java 985149cbb0d28b55a19d17076d996364d7f2ae90 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/AVROIntermediateDataFormat.java d78fa8b72ecfe62eeec240e01597e7f2a7e4dd76 
>   pom.xml cb8a973abc96af1de905cebd80d30177cbaf1cb4 
>   test/pom.xml 644a9c7dbc746d4a3268532bdcf0babd4faaafba 
>   test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/42327/diff/
> 
> 
> Testing
> -------
> 
> integration tests pass
> 
> 
> Thanks,
> 
> Abraham Fine
> 
