[
https://issues.apache.org/jira/browse/ARROW-11629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312891#comment-17312891
]
Micah Kornfield commented on ARROW-11629:
-----------------------------------------
[~matthros] could you give some more details on the mismatch you were seeing in
Java? Also, is the way you are generating the Parquet file similar to the
Python code I posted above, which round-trips?
I hacked [together a gist
|https://gist.github.com/emkornfield/efd3a4c3c1012dc19cf9769198e3bffe] that reads
the Parquet file generated with the code above and writes two columns back out
to an Arrow file (the first column and "I_Injection_IA"). Reading that file
back in Python and comparing it to the same columns read from the CSV shows that
the data round-trips correctly.
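For illustration only, here is a minimal sketch of that kind of comparison. It is not the code from the gist or the script posted earlier. The helper names (float32_bits, column_round_trips) are hypothetical, and it assumes the values read back from the Arrow file are already available as a plain Python list. Because the columns were written as float32, the CSV values are rounded to float32 before comparing. Comparing raw bytes rather than floats also keeps NaN values from breaking the equality check.

```python
import csv
import io
import struct


def float32_bits(value: float) -> bytes:
    """Round a Python float (float64) to float32 and return its 4 raw bytes."""
    return struct.pack("<f", value)


def column_round_trips(csv_text: str, column: str, read_back: list) -> bool:
    """Compare one CSV column with values read back from a float32 column.

    csv_text  : contents of the CSV file, including a header row
    column    : name of the column to check (hypothetical, e.g. "a")
    read_back : values for that column as read from the Arrow/Parquet file
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    # CSV values parse as float64, so round them to float32 first.
    expected = [float32_bits(float(row[column])) for row in rows]
    actual = [float32_bits(v) for v in read_back]
    return expected == actual
```

In practice read_back would come from whatever reader is under test, for example a column converted to a Python list; that part is left out here on purpose.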
Note that "a" in this gist corresponds to the first column. To use the Avro
bindings, all columns need a name.
P.S. I was originally hoping to use the AvroToArrow Java functionality, but that
appears to be broken, which is why I only sampled two columns. I'll have to file
some issues against it.
> [C++] Writing float32 values with "Dictionary Encoding" makes parquet files
> not readable for some tools
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-11629
> URL: https://issues.apache.org/jira/browse/ARROW-11629
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 3.0.0
> Reporter: Matthias Rosenthaler
> Priority: Major
> Attachments: foo.parquet, image-2021-02-15-15-49-41-908.png,
> output.csv, output.parquet
>
>
> If I read the attached CSV file with pyarrow, change the float64 columns to
> float32, and export it to Parquet, the Parquet file gets corrupted. It is no
> longer readable by Apache Drill or Parquet.Net.
>
> Update: Bug in "*Dictionary Encoding*" feature. If I switch it off for
> float32 columns, everything works as expected.