[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

Antoine Pitrou (Jira) Thu, 01 Apr 2021 02:58:05 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313058#comment-17313058
 ]


Antoine Pitrou commented on DRILL-7864:
---------------------------------------

Note that Pandas and Parquet C++ give the following result on this example:
{code:python}
>>> import pyarrow.parquet as pq
>>> df = pq.read_table("../output.parquet").to_pandas()
>>> df[(df['operating_point'] == 214) & (df['statistic'] == 'mean')]
        Unnamed: 0  operating_point  time statistic  InjectionTimingCursor  I_Injection_IA  InjectionRate  ADC02_IA
640000      640000              214     0      mean                    0.0        1.140145       0.079771 -9.997519
640001      640001              214    10      mean                    0.0        2.079204       0.177609 -9.997519
640002      640002              214    20      mean                    0.0        4.028618       0.272391 -9.997519
640003      640003              214    30      mean                    0.0        6.042390       0.325176 -9.997519
640004      640004              214    40      mean                    0.0        7.825881       0.327692 -9.997519
...            ...              ...   ...       ...                    ...             ...            ...       ...
640995      640995              214  9950      mean                    0.0       -0.036111      -0.032860 -9.997519
640996      640996              214  9960      mean                    0.0       -0.034652      -0.013963 -9.997519
640997      640997              214  9970      mean                    0.0       -0.034540       0.002608 -9.997519
640998      640998              214  9980      mean                    0.0       -0.034211       0.013444 -9.997519
640999      640999              214  9990      mean                    0.0       -0.033422       0.016216 -9.997519

[1000 rows x 8 columns]
{code}

... which seems to match the attached {{parquet-dotnet.csv}}.

Also note that only "InjectionRate" seems to differ in Drill's output, not 
"I_Injection_IA", even though the two columns use the same type and encodings:
{code}
Column 0: Unnamed: 0 (INT64)
Column 1: operating_point (INT64)
Column 2: time (INT64)
Column 3: statistic (BYTE_ARRAY/UTF8)
Column 4: InjectionTimingCursor (DOUBLE)
Column 5: I_Injection_IA (FLOAT)
Column 6: InjectionRate (FLOAT)
Column 7: ADC02_IA (DOUBLE)
--- Row Group: 0 ---
--- Total Bytes: 9439795 ---
--- Total Compressed Bytes: 9439795 ---
--- Rows: 708000 ---
[...]
Column 5
  Values: 708000, Null Values: 0, Distinct Values: 0
  Max: 21.7209, Min: -0.388101
  Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
  Uncompressed Size: 3448473, Compressed Size: 3448636
Column 6
  Values: 708000, Null Values: 0, Distinct Values: 0
  Max: 82.7128, Min: -2.17565
  Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
  Uncompressed Size: 2789654, Compressed Size: 2789801
[...]
{code}


> Parquet file could not be read correctly
> ----------------------------------------
>
>                 Key: DRILL-7864
>                 URL: https://issues.apache.org/jira/browse/DRILL-7864
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.18.0
>            Reporter: Matthias Rosenthaler
>            Priority: Major
>         Attachments: drill_query.csv, output.parquet, parquet-dotnet.csv
>
>
> The following Parquet file, generated by ParquetSharp (which uses the 
> underlying Apache Arrow C++ library), is not read correctly by Drill. The 
> values of the columns are displaced. If I write the affected float32 columns 
> "InjectionRate" and "I_Injection_IA" as float64, everything is fine.
> Update: It seems that the bug is *caused by dictionary encoding*. If I turn 
> this feature off, Drill is able to read the file. So please take a look at how 
> Drill reads dictionary-encoded columns to solve the bug.
> I also created a ticket for the Arrow project, but they redirected me to the 
> Drill project: https://issues.apache.org/jira/browse/ARROW-11629
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
