[jira] [Commented] (ARROW-11057) [Python] Data inconsistency with read and write

Weston Pace (Jira) Mon, 28 Dec 2020 19:43:36 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255791#comment-17255791
 ]


Weston Pace commented on ARROW-11057:
-------------------------------------

Actually, I take that back.  The two are similar but different.  PARQUET-1798 
is about setting the parquet "field_id".  That is not what is happening 
here.  Instead, when the file is read back, the field_ids are automatically 
generated and Arrow then exposes them as "PARQUET:field_id" in the field 
metadata of the pyarrow Table.  This field metadata is then written back out 
when the table is saved.

So, unlike PARQUET-1798, the field_id field in parquet's SchemaElement is not 
being set.

Instead, the "PARQUET:field_id" key is present in the Arrow schema that gets 
serialized and attached to the parquet file's file-wide key-value metadata 
under the ARROW:schema key.

The only differences between the two files are completely contained in the 
ARROW:schema value.
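
One way to check this is to compare the file-wide key-value metadata of the 
two files.  The following is only a minimal sketch: the helper names 
diff_key_value_metadata and compare_parquet_files are made up for 
illustration, and it assumes pyarrow's pq.read_metadata(path).metadata 
returns the key-value metadata as a dict (or None when there is none).

```python
def diff_key_value_metadata(meta_a, meta_b):
    """Return the sorted keys whose values differ between two metadata dicts.

    A key that is present in only one of the dicts counts as a difference.
    """
    a = meta_a or {}
    b = meta_b or {}
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def compare_parquet_files(path_a, path_b):
    """Hypothetical wrapper: diff the file-wide metadata of two parquet files."""
    # Imported here so the pure-Python helper above works without pyarrow.
    import pyarrow.parquet as pq

    return diff_key_value_metadata(
        pq.read_metadata(path_a).metadata,
        pq.read_metadata(path_b).metadata,
    )
```

Calling compare_parquet_files('test.parquet', 'test2.parquet') on the files 
from the report should then list only the ARROW:schema key if the analysis 
above holds for the pyarrow version in use.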

> [Python] Data inconsistency with read and write
> -----------------------------------------------
>
>                 Key: ARROW-11057
>                 URL: https://issues.apache.org/jira/browse/ARROW-11057
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: David Quijano
>            Priority: Major
>
> I have been reading and writing some tables to parquet and I found some 
> inconsistencies.
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100, 'y': [2]*100, 'z': [3]*100})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test.parquet is different from test2.parquet{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy 
> and zstd).
> Also, reading the second file and writing it again produces the same 
> file.
> Is this a bug or an expected behavior?
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
