[ 
https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950869#comment-16950869
 ] 

Arina Ielchiieva commented on DRILL-7291:
-----------------------------------------

This reader is mainly used for Parquet complex types (it is switched on 
automatically for them) and is considered experimental. I am not aware of its 
negative impact; it may be slower, for example, but we did not test this. So it 
won't be enabled by default in future releases.

Regarding the bug you have reported, as I have mentioned before, I could not 
reproduce the issue. Maybe reproducing it depends on the Java version, system 
version, etc. But until there is a reproduction, a fix is not feasible.
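
A minimal sketch of one check that might help narrow down a reproduction, 
assuming the tables from the report quoted below (the `....` workspace path is 
elided there and is left as-is): run the same filter once under each reader 
setting and compare the two counts.

{code:sql}
-- Run the filter with the experimental reader switched off explicitly.
ALTER SESSION SET `store.parquet.use_new_reader` = false;
SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';

-- Repeat the same filter with the experimental reader switched on.
ALTER SESSION SET `store.parquet.use_new_reader` = true;
SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';
{code}

If the two counts differ, that would point at the reader rather than the 
writer. Posting them together with the output of `java -version` could help 
check whether the behaviour depends on the Java version.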

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Documentation, Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet, short_no_binary_quote.csvh, 
> sqlline_error.log
>
>
> Creating a Parquet file with compression=gzip produces bad results.
> Example:
>  * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_
> So although the gzip Parquet file contains the right number of rows, the 
> values in its columns are not identical to those of the other files.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem is from the reader and not the writer because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but can't be 
> read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same 
> behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
