Ramana Inukonda Nagaraj created DRILL-2267:
----------------------------------------------
Summary: Parquet writer with dictionary encoding results in
corrupted varchar columns
Key: DRILL-2267
URL: https://issues.apache.org/jira/browse/DRILL-2267
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Reporter: Ramana Inukonda Nagaraj
Assignee: Steven Phillips
Using CTAS, I created a Parquet file through Drill containing columns of the varchar datatype.
Inspected with parquet-tools, the created file looks like this:
VARCHAR_col: OPTIONAL BINARY O:UTF8 R:0 D:1
VAR16CHAR_col: OPTIONAL BINARY O:UTF8 R:0 D:1
VARCHAR_col: BINARY SNAPPY DO:0 FPO:894307 SZ:16344/231716/14.18
VC:378624 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
VAR16CHAR_col: BINARY SNAPPY DO:0 FPO:910651 SZ:25830/381493/14.77
VC:378624 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
Querying the file returns several records with corrupted data in these
fields:
| VAR16CHAR_col |
+---------------+
| ������������ |
| |
| �������� |
| ����� |
| �� |
| |
| |
| �� |
| ������������ |
| |
| �������� |
| ����� |
| �� |
| |
| |
| �� |
| ������������ |
| |
| �������� |
| ����� |
| �� |
| |
| |
| �� |
| ������������ |
| |
| �������� |
| ����� |
| �� |
| |
| |
| �� |
| ������������ |
| |
| �������� |
| ����� |
| �� |
If dictionary encoding is turned off, the resulting file can be read without
these issues.
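
To quantify how many values are affected, for example when comparing a file written with dictionary encoding against one written without it, a check along the following lines could be run over column values exported from a query. This is only a minimal sketch: the helper names `is_valid_utf8` and `count_corrupted` are hypothetical, and it assumes the values are available either as raw bytes or as already-decoded strings in which invalid sequences were rendered as U+FFFD (the replacement character seen in the output above).

```python
# Hypothetical helpers for spotting corrupted UTF8-annotated values.
# Assumption: values arrive as bytes (raw column data), str (already
# decoded by a client), or None (SQL NULL).

REPLACEMENT_CHAR = "\ufffd"  # what most clients print for undecodable bytes


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def count_corrupted(values) -> int:
    """Count values that are invalid UTF-8 or contain U+FFFD."""
    corrupted = 0
    for value in values:
        if value is None:
            continue  # NULLs are not corruption
        if isinstance(value, bytes):
            if not is_valid_utf8(value):
                corrupted += 1
        elif REPLACEMENT_CHAR in value:
            corrupted += 1
    return corrupted
```

Note that this only flags values that are no longer valid UTF-8. Values that were swapped for a different but well-formed string would need a comparison against the source data instead.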