benj created DRILL-7291:
---------------------------

             Summary: parquet with compression gzip doesn't work well
                 Key: DRILL-7291
                 URL: https://issues.apache.org/jira/browse/DRILL-7291
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.16.0, 1.15.0
            Reporter: benj
         Attachments: 0_0_0.parquet

Creating a parquet file with compression=gzip produces incorrect results.

Example:
 * input: file_pqt (compression=none)
{code:java}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'snappy';
CREATE TABLE ....`file_snappy_pqt` 
 AS(SELECT * FROM ....`file_pqt`);
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE ....`file_gzip_pqt` 
 AS(SELECT * FROM ....`file_pqt`);{code}
Then compare the content of the different parquet files:
{code:java}
ALTER SESSION SET `store.parquet.use_new_reader` = true;
SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
=> OK
SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
=> NOK
SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
=> NOK{code}
_(There is no NULL value in these files.)_
 _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_

So although the gzip parquet file contains the right number of rows, the 
values in its columns do not match those of the other files.

Some seemingly "random" values of the _gzip parquet_ are reduced to empty 
strings.

I think the problem comes from the reader rather than the writer, because:
{code:java}
SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
{code}
but
{code:java}
hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
2
{code}
So the values are indeed present in the _Apache Parquet_ file, but they cannot 
be read via _Apache Drill_.
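
The same parquet-tools output could also be used to count empty values per 
column outside of Drill, so that the figures can be compared with the 
14744966 and 14744921 reported above. The following is only a minimal sketch: 
the function name count_empty_values is hypothetical, and it assumes that 
"parquet-tools cat" prints each field on its own line as "name = value", 
which should be checked against the actual output before relying on it.

```python
from collections import Counter
from typing import Iterable


def count_empty_values(lines: Iterable[str]) -> Counter:
    """Count, per column, how many records carry an empty value.

    Assumption: every field appears on its own line as `name = value`.
    Lines without a ` =` separator (blank record separators, most log
    lines) are skipped.
    """
    empty = Counter()
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Split on the first " =" so that an empty value ("Code = ") still parses.
        name, sep, value = line.partition(" =")
        if not sep:
            continue
        if value.strip() == "":
            empty[name.strip()] += 1
    return empty
```

Calling count_empty_values(sys.stdin) in a script fed by the "parquet-tools 
cat" pipeline would give one count per column. If those counts are 0 for 
`Code` and `Code2`, that would support the idea that the file is written 
correctly and the problem is on the read side.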

Attached is an extract (the original file is 2.2 GB) that reproduces the same 
behaviour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
