[ https://issues.apache.org/jira/browse/ARROW-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-9009:
-----------------------------------------
    Description: 
When reading a parquet file (which was written by Arrow) with the datasets API, 
it preserves the "ARROW:schema" field in the metadata:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")

dataset = ds.dataset("test.parquet", format="parquet")
{code}

{code}
In [7]: dataset.schema
Out[7]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114

In [8]: dataset.to_table().schema
Out[8]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
{code}

By contrast, when reading with the `parquet` module reader, this metadata is not preserved:

{code}
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

Since the "ARROW:schema" information is only used to reconstruct the Arrow
schema from the ParquetSchema, it becomes redundant once the Arrow schema has
been obtained, so there is probably no need to keep it as metadata in the Arrow
schema.
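For illustration, removing the key at the schema level could be sketched as follows. This is a minimal sketch of the idea, not the actual C++ Dataset fix; the helper name {{drop_arrow_schema_metadata}} is hypothetical, and it relies on the existing {{pa.Schema.with_metadata}} API:

{code:python}
import pyarrow as pa

def drop_arrow_schema_metadata(schema: pa.Schema) -> pa.Schema:
    # Hypothetical helper: strip only the "ARROW:schema" entry,
    # keeping any other schema-level metadata intact.
    metadata = schema.metadata or {}
    metadata = {k: v for k, v in metadata.items() if k != b"ARROW:schema"}
    # with_metadata() replaces the schema-level metadata wholesale
    return schema.with_metadata(metadata)

schema = pa.schema(
    [pa.field("a", pa.int64())],
    metadata={"ARROW:schema": "...ipc payload...", "user_key": "kept"},
)
cleaned = drop_arrow_schema_metadata(schema)
{code}

With this, {{cleaned.metadata}} would retain {{user_key}} but no longer carry the {{ARROW:schema}} entry, matching what {{pq.read_table}} already produces.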

> [C++][Dataset] ARROW:schema should be removed from schema's metadata when 
> reading Parquet files
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9009
>                 URL: https://issues.apache.org/jira/browse/ARROW-9009
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)