> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java, line 252
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204487#file1204487line252>
> >
> >     Why not use the existing Hadoop class? That's what we did for Kite:
> >     https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/InputFormatReader.java#L69
> >
> >     Then you don't need to include your own anonymous class. There's a bit of overhead to get the right constructor, but I think that's a better strategy.

I agree that the end result is prettier, but I'm not sure the added complexity is worth it in this case. I would like to get a committer's view on this one.
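For reference, a minimal sketch of the approach Ryan describes, assuming the anonymous class in question is a TaskAttemptContext implementation (an assumption; the linked Kite reader constructs one reflectively). The "overhead to get the right constructor" is the reflective lookup that bridges Hadoop 1 and 2. This is illustrative only, not the code under review:

```
// Illustrative only: pick whichever concrete TaskAttemptContext class the
// running Hadoop version provides, instead of declaring an anonymous subclass.
import java.lang.reflect.Constructor;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;

final class TaskAttemptContexts {
  static TaskAttemptContext create(Configuration conf, TaskAttemptID id) throws Exception {
    Class<?> impl;
    try {
      // Hadoop 2.x: TaskAttemptContext is an interface; use the concrete class.
      impl = Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
    } catch (ClassNotFoundException e) {
      // Hadoop 1.x: TaskAttemptContext is itself a concrete class.
      impl = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
    }
    Constructor<?> ctor = impl.getConstructor(Configuration.class, TaskAttemptID.class);
    return (TaskAttemptContext) ctor.newInstance(conf, id);
  }
}
```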
> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/pom.xml, line 88
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204486#file1204486line88>
> >
> >     Does the exclusion from dependencyManagement apply here? I don't think you need both.

Forgot to remove that. Thanks!


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java, line 506
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204487#file1204487line506>
> >
> >     I think a better and cheaper check is to read the first 4 bytes (and/or the last 4 bytes) and make sure they are "PAR1". The first 4 and the last 4 bytes of every valid Parquet file contain that magic pattern.

We "could" run into a non-Parquet file that starts/ends with that magic pattern, right? Isn't this approach more complete, in that it prevents false positives? (A sketch of the magic-byte check is further down.)


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java, line 53
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204492#file1204492line53>
> >
> >     Does this make the default UNCOMPRESSED? I think I'd rather see SNAPPY by default.

For our other formats we default to not compressing. I think it may be strange for one format to use compression while the others do not. Do you feel strongly about this?


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java, line 58
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204492#file1204492line58>
> >
> >     Why use the OutputFormat and RecordWriter here instead of an AvroParquetWriter? That is more straightforward.

Good idea. Thanks.
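For the AvroParquetWriter suggestion above, a sketch of what that route can look like (illustrative only; the builder API shown here is the parquet-avro 1.8+ form, and older releases use the `parquet.*` package namespace and constructor-based creation instead):

```
// Illustrative only: writing Avro GenericRecords straight to a Parquet file,
// which replaces the OutputFormat + RecordWriter plumbing with a single writer.
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

final class AvroParquetWriterSketch {
  static ParquetWriter<GenericRecord> open(Path file, Schema schema,
      CompressionCodecName codec) throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(file)
        .withSchema(schema)
        .withCompressionCodec(codec)
        .build();
  }
}
```

Records are then appended with `writer.write(record)` and the Parquet footer is written when `writer.close()` is called.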
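Going back to the magic-byte check suggested above for HdfsExtractor, a minimal sketch of that check over the Hadoop FileSystem API (the helper name and placement are made up for illustration):

```
// Illustrative only: every valid Parquet file begins and ends with the four
// ASCII bytes "PAR1", so checking both is a cheap pre-filter before trying to
// read the file as Parquet.
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class ParquetMagic {
  private static final byte[] MAGIC = {'P', 'A', 'R', '1'};

  static boolean looksLikeParquet(FileSystem fs, Path file) throws IOException {
    long length = fs.getFileStatus(file).getLen();
    if (length < 2L * MAGIC.length) {
      return false; // too small to hold both the header and footer magic
    }
    byte[] head = new byte[MAGIC.length];
    byte[] tail = new byte[MAGIC.length];
    try (FSDataInputStream in = fs.open(file)) {
      in.readFully(0, head);
      in.readFully(length - MAGIC.length, tail);
    }
    return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
  }
}
```

A collision with a non-Parquet file that happens to carry both markers is possible in principle, which is the false-positive concern raised above, but the check itself is much cheaper than parsing the footer.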
> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 165
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line165>
> >
> >     Should this be assertTrue instead of assert?

Thanks.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java, line 96
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204500#file1204500line96>
> >
> >     Why is this a multiset?

It is true that setLines does not necessarily need to be a multiset. I made it one to match the assertMapReduceOutput method from HdfsAsserts, which inspired this chunk of code. Even though we do not have any duplicates in our output here, I think the multiset reflects the constraints we are testing for (unordered, may contain duplicates), so I kept it.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 197
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line197>
> >
> >     Why is this value changed?

Reverted.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 188
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line188>
> >
> >     Why is default compression not supported?

Looking at Parquet's CompressionCodecName enum:

```
public enum CompressionCodecName {
  UNCOMPRESSED(null, CompressionCodec.UNCOMPRESSED, ""),
  SNAPPY("parquet.hadoop.codec.SnappyCodec", CompressionCodec.SNAPPY, ".snappy"),
  GZIP("org.apache.hadoop.io.compress.GzipCodec", CompressionCodec.GZIP, ".gz"),
  LZO("com.hadoop.compression.lzo.LzoCodec", CompressionCodec.LZO, ".lzo");
```

ToCompression.DEFAULT maps to org.apache.hadoop.io.compress.DefaultCodec, which I believe is zlib, and zlib is not supported here.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > pom.xml, line 720
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204498#file1204498line720>
> >
> >     Why exclude Avro? The dependency version should be managed by Maven to match the version depended on by Sqoop directly.

This is an issue that I believe is related to our connector class loader. Not excluding Avro causes the following exception:

```
java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.sqoop.connector.idf.AVROIntermediateDataFormat.getAvroSchema()Lorg/apache/avro/Schema;" the class loader (instance of org/apache/sqoop/classloader/ConnectorClassLoader) of the current class, org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/apache/sqoop/connector/idf/AVROIntermediateDataFormat, have different Class objects for the type org/apache/avro/Schema used in the signature
    at org.apache.sqoop.connector.hdfs.hdfsWriter.HdfsParquetWriter.initialize(HdfsParquetWriter.java:60)
    at org.apache.sqoop.connector.hdfs.HdfsLoader$1.run(HdfsLoader.java:95)
    at org.apache.sqoop.connector.hdfs.HdfsLoader$1.run(HdfsLoader.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:61)
    at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:46)
    at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread$1.call(SqoopOutputFormatLoadExecutor.java:276)
    at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread$1.call(SqoopOutputFormatLoadExecutor.java:257)
    at org.apache.sqoop.utils.ClassUtils.executeWithClassLoader(ClassUtils.java:281)
    at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread.run(SqoopOutputFormatLoadExecutor.java:256)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```

This is occurring within individual mappers.


- Abraham


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42327/#review115691
-----------------------------------------------------------
On Jan. 21, 2016, 7:06 p.m., Abraham Fine wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42327/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 7:06 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Bugs: SQOOP-2788
>     https://issues.apache.org/jira/browse/SQOOP-2788
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> read and write parquet in hdfsconnector
> 
> 
> Diffs
> -----
> 
>   connector/connector-hdfs/pom.xml 599631418ca63cc43d645c1ee1e7a73dc824b313 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java a813c479a07d68e14ed49936f642e762e5b37437 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsLoader.java 774221aaf5c8cdb8d26ca108fae239598b42229b 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsPartition.java 644de60581faf90ceb2fcef8d3e0544067791fcc 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/configuration/ToFormat.java 27d121f529ecb4d5bd79e2b8c74ab8f7cc15fb10 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/GenericHdfsWriter.java 2ccccc4a94a582c8b47ccdefa523d1fd1632e627 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java PRE-CREATION 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsSequenceWriter.java 75c2e7ef192d7d9628e622cc3c5ef176e33a73d0 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsTextWriter.java 78cf9732fdb89689b04d43e4af70ca5a43732dbf 
>   connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java 11fcef2a38209c79928f582cf8aa03e889247f22 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/common/SqoopAvroUtils.java 985149cbb0d28b55a19d17076d996364d7f2ae90 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/AVROIntermediateDataFormat.java d78fa8b72ecfe62eeec240e01597e7f2a7e4dd76 
>   pom.xml cb8a973abc96af1de905cebd80d30177cbaf1cb4 
>   test/pom.xml 644a9c7dbc746d4a3268532bdcf0babd4faaafba 
>   test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/42327/diff/
> 
> 
> Testing
> -------
> 
> integration tests pass
> 
> 
> Thanks,
> 
> Abraham Fine
> 
