[
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703544#comment-16703544
]
David Lee commented on ARROW-3728:
----------------------------------
I'm running into the same problem as well. This is similar to:
https://jira.apache.org/jira/browse/ARROW-3065
I think the underlying pandas schema metadata has changed between pyarrow
releases, so I can't merge old files with new files.
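One workaround that may help with the mismatch is stripping the pandas metadata before writing, so the writer only compares the actual fields. Rough sketch (file names are placeholders; check that Table.replace_schema_metadata exists in your pyarrow version):
{code:python}
import pyarrow.parquet as pq

paths = ['old.parquet', 'new.parquet']  # stand-in file names

writer = None
for path in paths:
    table = pq.read_table(path)
    # Drop the b'pandas' key from the schema metadata so the schemas only
    # differ when the actual fields differ.
    table = table.replace_schema_metadata(None)
    if writer is None:
        writer = pq.ParquetWriter('merged.parquet', table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()
{code}
The trade-off is that the merged file loses the pandas index / dtype metadata.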
On the topic of merging parquet files: this is something I do to try to create
128 MB parquet files to match the HDFS block size configured in Hadoop.
It is not possible to predetermine the size of a parquet file when you mix in
dictionary encoding + snappy compression, but you can work around it by merging
smaller parquet files together as row groups.
Save two million rows of data per parquet file. This ends up creating multiple
parquet files around 10 megs each after encoding and compression.
Figure out which files should be merged by adding their file sizes together
until the sum comes in just under 128 MB, i.e. between 95% and 100% of
128 * 1024 * 1024 bytes.
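Something like this for the grouping step (the function name and the simple greedy packing are just a sketch of the heuristic, not anything from pyarrow):
{code:python}
import os

TARGET_BYTES = 128 * 1024 * 1024  # HDFS block size

def group_files_by_size(paths, target=TARGET_BYTES):
    """Greedily collect files until adding one more would overshoot the
    target, then start the next group."""
    groups, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups
{code}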
Read each parquet file in as an Arrow table and write the Arrow table to a new
file as a row group. This is both fast and memory-efficient since you only need
to put two million rows of data in memory at a time.
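The merge step then looks roughly like this (merge_group is my name; dropping the schema metadata is optional but avoids the pandas-metadata mismatch above):
{code:python}
import pyarrow.parquet as pq

def merge_group(paths, out_path):
    """Concatenate small parquet files into one file, one row group per input."""
    writer = None
    for path in paths:
        table = pq.read_table(path)  # only ~2 million rows held in memory at once
        table = table.replace_schema_metadata(None)  # sidestep pandas-metadata mismatches
        if writer is None:
            writer = pq.ParquetWriter(out_path, table.schema, compression='snappy')
        # Write the whole input table as a single row group in the output file.
        writer.write_table(table, row_group_size=len(table))
    if writer is not None:
        writer.close()

# e.g. for i, group in enumerate(group_files_by_size(all_paths)):
#          merge_group(group, f'merged_{i}.parquet')
{code}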
On a separate topic, I should probably open a new issue / enhancement request.
A. Would it be possible to read a row group out of a parquet file, modify it as
a pandas DataFrame, and then write it back to the original parquet file?
B. Would it be possible to add a boolean hidden status column to every parquet
file? A status of True would mean the row is valid. A status of False would
mean the row is deleted. Dremio uses an internal flag in Arrow data sets when
doing SQL Union operations. It is more efficient to flag a record as deleted
instead of trying to delete it out of a columnar memory format. If we could
introduce something similar for parquet, you could in theory update parquet
files by flagging the old record as deleted and appending the replacement
record at the end of the existing file, without having to shuffle / re-write
the entire file.
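There is no hidden column in the parquet format today, but the read-side half of the idea can be emulated with an ordinary boolean column. Rough sketch (the 'is_valid' column name is made up, and Table.filter needs a much newer pyarrow than the versions on this ticket):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# An explicit validity flag standing in for the proposed hidden status column:
# True means the row is valid, False means it has been soft-deleted.
table = pa.table({
    'id': [1, 2, 3],
    'value': ['a', 'b', 'c'],
    'is_valid': [True, False, True],
})
pq.write_table(table, 'flagged.parquet')

# Readers simply drop the soft-deleted rows instead of the file being rewritten.
live = pq.read_table('flagged.parquet')
live = live.filter(live.column('is_valid'))
{code}
The part that would still need format / library support is flipping the flag on an existing row (and appending the replacement) without rewriting the whole file, which is what this request is really asking for.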
> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---------------------------------------------------------------
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
> Reporter: Micah Williamson
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> From:
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>
> I am trying to merge multiple parquet files into one. Their schemas are
> identical field-wise but my {{ParquetWriter}} is complaining that they are
> not. After some investigation I found that the pandas metadata in the schemas
> is different, causing this error.
>
> Sample-
> {code:python}
> import pyarrow.parquet as pq
>
> writer = None  # the writer is created from the first table's schema below
> pq_tables = []
> for file_ in files:
>     pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
>     pq_tables.append(pq_table)
>     if writer is None:
>         writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
>                                   use_deprecated_int96_timestamps=True)
>     writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
>     writer.write_table(table=pq_table)
>   File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 335, in write_table
>     raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}