[jira] [Commented] (ARROW-11629) [C++] Writing float32 values with "Dictionary Encoding" makes parquet files not readable for some tools

Micah Kornfield (Jira) Wed, 31 Mar 2021 23:13:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-11629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312891#comment-17312891
 ]


Micah Kornfield commented on ARROW-11629:
-----------------------------------------

[~matthros] could you give some more details on the mismatch you were seeing in 
Java? Also, is the way you are generating the Parquet file similar to the 
Python code I posted above that round trips?

 
I hacked together [a gist|https://gist.github.com/emkornfield/efd3a4c3c1012dc19cf9769198e3bffe] 
to read the Parquet file generated with the code above, and wrote two columns 
back out to an Arrow file (the first column and "I_Injection_IA"). Reading the 
file back in Python and comparing to the columns read from the CSV shows that 
the data round trips correctly.


Note that "a" in this gist corresponds to the first column. To use the Avro 
bindings, all columns need a name.

P.S. I was originally hoping to use the AvroToArrow Java functionality, but 
that appears to be broken, which is why I only sampled two columns. I'll have 
to file some issues against it.

> [C++] Writing float32 values with "Dictionary Encoding" makes parquet files 
> not readable for some tools
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11629
>                 URL: https://issues.apache.org/jira/browse/ARROW-11629
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 3.0.0
>            Reporter: Matthias Rosenthaler
>            Priority: Major
>         Attachments: foo.parquet, image-2021-02-15-15-49-41-908.png, 
> output.csv, output.parquet
>
>
> If I try to read the attached csv file with pyarrow, changing the float64 
> columns to float32 and export it to parquet, the parquet file gets corrupted. 
> It is not readable for apache drill or Parquet.Net any longer.
>  
> Update: Bug in "*Dictionary Encoding*" feature. If I switch it off for 
> float32 columns, everything works as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
