[
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313058#comment-17313058
]
Antoine Pitrou commented on DRILL-7864:
---------------------------------------
Note that Pandas and Parquet C++ give the following result on this example:
{code:python}
>>> import pyarrow.parquet as pq
>>> df = pq.read_table("../output.parquet").to_pandas()
>>> df[(df['operating_point'] == 214) & (df['statistic'] == 'mean')]
        Unnamed: 0  operating_point  time  statistic  InjectionTimingCursor  I_Injection_IA  InjectionRate   ADC02_IA
640000      640000              214     0       mean                    0.0        1.140145       0.079771  -9.997519
640001      640001              214    10       mean                    0.0        2.079204       0.177609  -9.997519
640002      640002              214    20       mean                    0.0        4.028618       0.272391  -9.997519
640003      640003              214    30       mean                    0.0        6.042390       0.325176  -9.997519
640004      640004              214    40       mean                    0.0        7.825881       0.327692  -9.997519
...            ...              ...   ...        ...                    ...             ...            ...        ...
640995      640995              214  9950       mean                    0.0       -0.036111      -0.032860  -9.997519
640996      640996              214  9960       mean                    0.0       -0.034652      -0.013963  -9.997519
640997      640997              214  9970       mean                    0.0       -0.034540       0.002608  -9.997519
640998      640998              214  9980       mean                    0.0       -0.034211       0.013444  -9.997519
640999      640999              214  9990       mean                    0.0       -0.033422       0.016216  -9.997519
[1000 rows x 8 columns]
{code}
... which seems the same as the attached {{parquet-dotnet.csv}}.
Also note that only "InjectionRate" seems to differ in Drill's output, not
"I_Injection_IA", even though the two columns use the same types and encodings:
{code}
Column 0: Unnamed: 0 (INT64)
Column 1: operating_point (INT64)
Column 2: time (INT64)
Column 3: statistic (BYTE_ARRAY/UTF8)
Column 4: InjectionTimingCursor (DOUBLE)
Column 5: I_Injection_IA (FLOAT)
Column 6: InjectionRate (FLOAT)
Column 7: ADC02_IA (DOUBLE)
--- Row Group: 0 ---
--- Total Bytes: 9439795 ---
--- Total Compressed Bytes: 9439795 ---
--- Rows: 708000 ---
[...]
Column 5
Values: 708000, Null Values: 0, Distinct Values: 0
Max: 21.7209, Min: -0.388101
Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
Uncompressed Size: 3448473, Compressed Size: 3448636
Column 6
Values: 708000, Null Values: 0, Distinct Values: 0
Max: 82.7128, Min: -2.17565
Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
Uncompressed Size: 2789654, Compressed Size: 2789801
[...]
{code}
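The column-by-column comparison described above could be reproduced with a small script along these lines. This is only a sketch: the helper name {{differing_columns}} is hypothetical, and it assumes that both CSV exports have a header row, share column names, and list the rows in the same order, none of which has been checked against the attached files.

```python
import csv
import math


def _same(a, b, rel_tol, abs_tol):
    """Compare two CSV cells numerically when possible, else as strings."""
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        # Non-numeric cell (e.g. the "statistic" column): exact string match.
        return a == b


def differing_columns(path_a, path_b, rel_tol=1e-5, abs_tol=1e-6):
    """Return {column name: number of differing rows} for the shared columns.

    Assumes both files have a header row and the same row order.
    """
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = list(csv.DictReader(fa))
        rows_b = list(csv.DictReader(fb))
    if len(rows_a) != len(rows_b):
        raise ValueError(f"row count mismatch: {len(rows_a)} vs {len(rows_b)}")
    if not rows_a:
        return {}

    # Only compare columns that exist in both files.
    shared = [col for col in rows_a[0] if col in rows_b[0]]
    diffs = {}
    for col in shared:
        count = sum(
            1
            for ra, rb in zip(rows_a, rows_b)
            if not _same(ra[col], rb[col], rel_tol, abs_tol)
        )
        if count:
            diffs[col] = count
    return diffs
```

Calling {{differing_columns("drill_query.csv", "parquet-dotnet.csv")}} would then list only the columns whose values disagree. The tolerances are arbitrary choices meant to absorb float32 formatting differences between the two exports.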
> Parquet file could not be read correctly
> ----------------------------------------
>
> Key: DRILL-7864
> URL: https://issues.apache.org/jira/browse/DRILL-7864
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.18.0
> Reporter: Matthias Rosenthaler
> Priority: Major
> Attachments: drill_query.csv, output.parquet, parquet-dotnet.csv
>
>
> The following Parquet file, generated by ParquetSharp (which uses the
> underlying Apache Arrow C++ library), cannot be read correctly by Drill. The
> values of the columns are displaced. If I write the affected float32 columns
> "InjectionRate" and "I_Injection_IA" as float64, everything is fine.
> Update: It seems that the bug is *caused by dictionary encoding*. If I turn
> this feature off, Drill is able to read the file. So please take a look at how
> Drill reads dictionary-encoded columns to solve the bug.
> I also created a ticket for the Arrow project, but they redirected me to the
> Drill project: https://issues.apache.org/jira/browse/ARROW-11629
>