[
https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949569#comment-16949569
]
benj commented on DRILL-7291:
-----------------------------
Please find the error log attached: [^sqlline_error.log]
I was intrigued by the "sc: REQUIRED BINARY O:UTF8 R:0 D:0" in
relation to the error "Error: INTERNAL_ERROR ERROR: null",
so I tried the request below, which solves the problem here.
{code:sql}
CREATE TABLE dfs.tmp.`t2` AS SELECT sha1, md5, crc32, fn, fs, pc, osc
, COALESCE(sc,'') /* use COALESCE to avoid NULL values, although empty csv values
are protected by quotes */
FROM dfs.tmp.`short_no_binary_quote.csvh`;
SELECT * FROM dfs.tmp.`t2`;
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+--------+
| sha1                                     | md5                              | crc32    | fn                             | fs    | pc    | osc | EXPR$7 |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+--------+
| 0000000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |        |
| 00000091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |        |
| 0000065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |        |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+--------+
3 rows selected (0.111 seconds)
{code}
But something is still wrong, because:
{code:sql}
CREATE TABLE dfs.tmp.`t2` AS SELECT sha1, md5, crc32, fn, fs, pc, osc
, COALESCE(sc,'FLAGFLAGFLAG') /* use COALESCE to avoid NULL values, although empty
csv values are protected by quotes - NOTE that FLAGFLAGFLAG will not appear in
EXPR$7 */
FROM dfs.tmp.`short_no_binary_quote.csvh`;
SELECT * FROM dfs.tmp.`t2` /* FLAGFLAGFLAG will not appear in EXPR$7 */ ;
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+--------+
| sha1                                     | md5                              | crc32    | fn                             | fs    | pc    | osc | EXPR$7 |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+--------+
| 0000000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |        |
| 00000091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |        |
| 0000065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |        |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+--------+
3 rows selected (0.156 seconds)
{code}
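To make explicit what the FLAGFLAGFLAG test relies on, here is a minimal sketch of the standard SQL COALESCE semantics. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in (not Drill), and the table `t` with its three rows is hypothetical, not taken from the attached file.

```python
import sqlite3


def coalesce_demo():
    """Return (id, COALESCE(sc, 'FLAGFLAGFLAG')) for an empty, a NULL and a normal value."""
    # Assumption: SQLite stands in for Drill here, only to show standard COALESCE behaviour.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, sc TEXT)")
    # Hypothetical rows: id 1 holds an empty string, id 2 holds NULL, id 3 a regular value.
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(1, ""), (2, None), (3, "abc")],
    )
    rows = conn.execute(
        "SELECT id, COALESCE(sc, 'FLAGFLAGFLAG') FROM t ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in coalesce_demo():
        print(row)
```

In this sketch only the NULL row gets the flag; the empty string is returned unchanged. If Drill follows the same semantics, the empty EXPR$7 column above suggests that the sc values are read as empty strings rather than NULL, in which case COALESCE should not change the result at all, and yet it avoids the error.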
> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
> Key: DRILL-7291
> URL: https://issues.apache.org/jira/browse/DRILL-7291
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.15.0, 1.16.0
> Reporter: benj
> Priority: Major
> Attachments: 0_0_0.parquet, short_no_binary_quote.csvh,
> sqlline_error.log
>
>
> Creating a parquet file with compression=gzip produces bad results.
> Example:
> * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt`
> AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt`
> AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`; => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = ''; => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = ''; => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives same results)_
> So although the parquet file contains the right number of rows, the values in the
> different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem is from the reader and not the writer because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C'; => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but can't be
> used via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same
> behaviour.