[ 
https://issues.apache.org/jira/browse/ARROW-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123549#comment-17123549
 ] 

Joris Van den Bossche edited comment on ARROW-8980 at 6/2/20, 9:29 AM:
-----------------------------------------------------------------------

[~kevinglasson] thanks for the report!

A slightly modified example to visualize the issue:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fname = "test_metadata_size.parquet"
df = pd.DataFrame({"A": [0] * 100000})
df.to_parquet(fname)

# first read
file1 = pq.ParquetFile("test_metadata_size.parquet")
table1 = file1.read()
schema1 = file1.schema.to_arrow_schema()

# writing
writer = pq.ParquetWriter(fname, schema=schema1)
writer.write_table(pa.Table.from_pandas(df))
writer.close()

# second read
file2 = pq.ParquetFile(fname)
table2 = file2.read()
schema2 = file2.schema.to_arrow_schema()
{code}

and then looking at the different schemas:

{code}
>>> schema1
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818

>>> table1.schema
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408

>>> schema2
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130

>>> table2.schema
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130
{code}

So indeed, as you said, it's the ARROW:schema size that is accumulating.
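
A quick way to quantify this (a rough check, assuming {{Schema.metadata}} is exposed as a plain bytes-to-bytes mapping, as it is in recent pyarrow versions) is to compare the length of the serialized entry directly:

{code:python}
# the serialized ARROW:schema payload grows with every read/write round trip
print(len(schema1.metadata[b"ARROW:schema"]))
print(len(schema2.metadata[b"ARROW:schema"]))
{code}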

Some observations:

- In the actual {{Table.schema}}, the ARROW:schema field is removed from the metadata after reading. Side note: using this instead of {{file.schema.to_arrow_schema()}} could be a temporary workaround for you (see the sketch after this list).
- When converting the ParquetSchema to a pyarrow Schema, we don't remove the "ARROW:schema" key, which we probably should do, since that information is only used to properly reconstruct the arrow schema; once you have this arrow schema, the metadata entry can be dropped, similar to what we do when reading the actual file.
- When writing with a schema that already has an "ARROW:schema" metadata field, another field (with a duplicated key) gets added. I suppose this might be expected, since the metadata doesn't check for duplicate keys right now, but it would also help in this case if the field were overwritten instead.



> [Python] Metadata grows exponentially when using schema from disk
> -----------------------------------------------------------------
>
>                 Key: ARROW-8980
>                 URL: https://issues.apache.org/jira/browse/ARROW-8980
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: python: 3.7.3 | packaged by conda-forge | (default, Dec 
> 6 2019, 08:36:57)
> [Clang 9.0.0 (tags/RELEASE_900/final)]
> pa version: 0.16.0
> pd version: 0.25.2
>            Reporter: Kevin Glasson
>            Priority: Major
>              Labels: metadata, parquet, pyarrow, python, schema
>         Attachments: growing_metadata.py, test.pq
>
>
> When overwriting parquet files we first read the schema that is already on 
> disk; this is mainly to deal with some type harmonizing between pyarrow and 
> pandas (that I won't go into).
> Regardless, here is a simple example (below) with no weirdness. If I 
> continuously re-write the same file by first fetching the schema from disk, 
> creating a writer with that schema and then writing the same dataframe, the 
> file size keeps growing even though the number of rows has not changed.
> Note: my solution was to remove the `b'ARROW:schema'` data from the 
> `schema.metadata`; this seems to stop the file size growing. So I wonder if 
> the writer keeps appending to it or something? TBH I'm not entirely sure, but 
> I have a hunch that the ARROW:schema is just the metadata serialised or 
> something.
> I should also note that once the metadata gets too big this leads to a buffer 
> overflow in another part of the code ('thrift'), which was referenced here: 
> https://issues.apache.org/jira/browse/PARQUET-1345
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import pathlib
> import sys
>
>
> def main():
>     print(f"python: {sys.version}")
>     print(f"pa version: {pa.__version__}")
>     print(f"pd version: {pd.__version__}")
>
>     fname = "test.pq"
>     path = pathlib.Path(fname)
>
>     df = pd.DataFrame({"A": [0] * 100000})
>     df.to_parquet(fname)
>     print(f"Wrote test frame to {fname}")
>     print(f"Size of {fname}: {path.stat().st_size}")
>
>     for _ in range(5):
>         file = pq.ParquetFile(fname)
>         tmp_df = file.read().to_pandas()
>         print(f"Number of rows on disk: {tmp_df.shape}")
>         print("Reading schema from disk")
>         schema = file.schema.to_arrow_schema()
>         print("Creating new writer")
>         writer = pq.ParquetWriter(fname, schema=schema)
>         print("Re-writing the dataframe")
>         writer.write_table(pa.Table.from_pandas(df))
>         writer.close()
>         print(f"Size of {fname}: {path.stat().st_size}")
>
>
> if __name__ == "__main__":
>     main()
> {code}
> {code}
> (sdm) ➜ ~ python growing_metadata.py
> python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
> [Clang 9.0.0 (tags/RELEASE_900/final)]
> pa version: 0.16.0
> pd version: 0.25.2
> Wrote test frame to test.pq
> Size of test.pq: 1643
> Number of rows on disk: (100000, 1)
> Reading schema from disk
> Creating new writer
> Re-writing the dataframe
> Size of test.pq: 3637
> Number of rows on disk: (100000, 1)
> Reading schema from disk
> Creating new writer
> Re-writing the dataframe
> Size of test.pq: 8327
> Number of rows on disk: (100000, 1)
> Reading schema from disk
> Creating new writer
> Re-writing the dataframe
> Size of test.pq: 19301
> Number of rows on disk: (100000, 1)
> Reading schema from disk
> Creating new writer
> Re-writing the dataframe
> Size of test.pq: 44944
> Number of rows on disk: (100000, 1)
> Reading schema from disk
> Creating new writer
> Re-writing the dataframe
> Size of test.pq: 104815
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
