[ https://issues.apache.org/jira/browse/ARROW-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Kietzman resolved ARROW-5562.
--------------------------------------
    Resolution: Fixed

Issue resolved by pull request 5375
[https://github.com/apache/arrow/pull/5375]

> [C++][Parquet] parquet writer does not handle negative zero correctly
> ---------------------------------------------------------------------
>
>                 Key: ARROW-5562
>                 URL: https://issues.apache.org/jira/browse/ARROW-5562
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.13.0
>            Reporter: Bob Briody
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I have the following csv file: (Note that {{col_a}} contains a negative zero
> value.)
> {code:java}
> col_a,col_b
> 0.0,0.0
> -0.0,0.0{code}
> ...and process it via:
> {code:java}
> from pyarrow import csv, parquet
> in_csv = 'in.csv'
> table = csv.read_csv(in_csv)
> parquet.write_to_dataset(table, root_path='./'){code}
>
> The output parquet file is then loaded into S3 and queried via AWS Athena
> (i.e. PrestoDB / Hive).
> Any query that touches {{col_a}} fails with the following error:
> {code:java}
> HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0,
> length=593): low must be less than or equal to high{code}
>
> As a sanity check, I transformed the csv file to parquet using an AWS Glue
> Spark Job and I was able to query the output parquet file successfully.
> As such, it appears as though the pyarrow writer is producing an invalid
> parquet file when a column contains at least one instance of 0.0, at least
> one instance of -0.0, and no other values.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
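
The "low must be less than or equal to high" error suggests the reader rejected the column's min/max statistics. The report does not confirm this, so the following is only a sketch of one plausible mechanism, using the Python standard library and no pyarrow calls. It assumes the reader orders -0.0 strictly below 0.0 (as Java's Double.compare does), while IEEE 754 comparison treats the two as equal. The helper names is_negative_zero, total_order_le, and normalize_zero_bounds are hypothetical and are not part of Arrow or Parquet.

```python
import math


def is_negative_zero(x: float) -> bool:
    """True only for -0.0; the sign bit is invisible to == and <."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0


def total_order_le(a: float, b: float) -> bool:
    """Assumed reader-side check: like <=, but -0.0 sorts strictly below 0.0.

    NaN handling is deliberately left out of this sketch.
    """
    if a == 0.0 and b == 0.0:
        return math.copysign(1.0, a) <= math.copysign(1.0, b)
    return a <= b


def normalize_zero_bounds(lo: float, hi: float) -> tuple[float, float]:
    """Widen zero-valued bounds so both zeros are always covered.

    A zero minimum becomes -0.0 and a zero maximum becomes +0.0, so the
    pair stays valid under either comparison rule.
    """
    if lo == 0.0:
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi


if __name__ == "__main__":
    # Under IEEE comparison the two zeros are indistinguishable, so a
    # writer could record min=0.0 and max=-0.0 for col_a without noticing.
    lo, hi = 0.0, -0.0
    print("IEEE <= accepts the pair:", lo <= hi)
    print("total order accepts the pair:", total_order_le(lo, hi))

    lo, hi = normalize_zero_bounds(lo, hi)
    print("after normalization:", total_order_le(lo, hi))
```

Normalizing only widens the bounds: a minimum of -0.0 and a maximum of +0.0 still bracket every zero in the column, so a reader using either ordering should accept them.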