[
https://issues.apache.org/jira/browse/ARROW-11629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285936#comment-17285936
]
Tanguy Fautre commented on ARROW-11629:
---------------------------------------
ParquetSharp is just a wrapper around Arrow/Parquet C++, so the API will be
similar (in this particular case, you need to use the
{{WriterPropertiesBuilder}} to disable dictionary encoding on the float
column). Here is an example from the ParquetSharp unit tests:
{code:c#}
var columns = new Column[]
{
    new Column<int>("int"),
    new Column<double>("double"),
    new Column<string>("string"),
    new Column<bool>("bool")
};

// Disable dictionary encoding for the "double" column only;
// the other columns keep the default encoding.
using var builder = new WriterPropertiesBuilder();
using var writerProperties = builder
    .Compression(Compression.Snappy)
    .DisableDictionary("double")
    .Build();
using var fileWriter = new ParquetFileWriter(output, columns, writerProperties);
{code}
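For completeness, reading the file back with ParquetSharp could look like the
sketch below. This is not part of the original test; it assumes {{output}}
above is a file path, and that column 1 is the "double" column from the schema
defined above.
{code:c#}
// Minimal read-back sketch (assumes "output" is a file path).
using var fileReader = new ParquetFileReader(output);
using var rowGroupReader = fileReader.RowGroup(0);
var numRows = (int) rowGroupReader.MetaData.NumRows;

// Column 1 is the "double" column in the schema defined above.
using var doubleReader = rowGroupReader.Column(1).LogicalReader<double>();
var doubles = doubleReader.ReadAll(numRows);
{code}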
I suspect that either Arrow somehow creates an incorrect Parquet file when a
large dictionary is created, or (more likely) that Drill and Parquet.NET have
a bug when reading large dictionaries.
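If the large-dictionary theory is right, a repro should only need a float
column with enough distinct values to produce a large dictionary page, and
disabling the dictionary on that column should then make the file readable
again. A hypothetical sketch (the file name, column name, and row count are
made up for illustration):
{code:c#}
using System.Linq;
using ParquetSharp;

// Hypothetical repro sketch: one float column with many distinct values.
// With default settings this should force a large dictionary; removing the
// DisableDictionary call below would exercise the suspected failure mode.
var columns = new Column[] { new Column<float>("value") };
var values = Enumerable.Range(0, 1_000_000).Select(i => (float) i).ToArray();

using var builder = new WriterPropertiesBuilder();
using var writerProperties = builder.DisableDictionary("value").Build();
using var fileWriter = new ParquetFileWriter("floats.parquet", columns, writerProperties);
using var rowGroupWriter = fileWriter.AppendRowGroup();
using (var valueWriter = rowGroupWriter.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(values);
}
fileWriter.Close();
{code}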
> [C++] Writing float32 values makes parquet files not readable for some tools
> ----------------------------------------------------------------------------
>
> Key: ARROW-11629
> URL: https://issues.apache.org/jira/browse/ARROW-11629
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 3.0.0
> Reporter: Matthias Rosenthaler
> Priority: Major
> Attachments: foo.parquet, image-2021-02-15-15-49-41-908.png,
> output.csv, output.parquet
>
>
> If I read the attached CSV file with pyarrow, change the float64 columns to
> float32, and export the table to Parquet, the resulting file is corrupted:
> it is no longer readable by Apache Drill or Parquet.Net.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)