[
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703544#comment-16703544
]
David Lee commented on ARROW-3728:
----------------------------------
I'm running into the same problem as well. This is similar to:
https://jira.apache.org/jira/browse/ARROW-3065
I think the underlying pandas schema metadata has changed between pyarrow
releases, so I can't merge old files with new files.
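One workaround that may help with the mismatch is stripping the pandas metadata before writing, so the writer only compares the actual fields. Rough sketch (file names are placeholders; check that Table.replace_schema_metadata exists in your pyarrow version):
{code:python}
import pyarrow.parquet as pq

paths = ['old.parquet', 'new.parquet']  # stand-in file names

writer = None
for path in paths:
    table = pq.read_table(path)
    # Drop the b'pandas' key from the schema metadata so the schemas only
    # differ when the actual fields differ.
    table = table.replace_schema_metadata(None)
    if writer is None:
        writer = pq.ParquetWriter('merged.parquet', table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()
{code}
The trade-off is that the merged file loses the pandas index / dtype metadata.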
On the topic of merging parquet files: this is something I do to try to create
128 MB parquet files to match the HDFS block size configured in Hadoop.
It is not possible to predetermine the size of a parquet file when you mix in
dictionary encoding + snappy compression, but you can work around it by merging
smaller parquet files together as row groups.
Save two million rows of data per parquet file. This ends up creating multiple
parquet files around 10 megs each after encoding and compression.
Figure out which files should be merged by adding their file sizes together
until the sum comes in just under 128 MB, i.e. between 95% and 100% of
128 * 1024 * 1024 bytes.
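Something like this for the grouping step (the function name and the simple greedy packing are just a sketch of the heuristic, not anything from pyarrow):
{code:python}
import os

TARGET_BYTES = 128 * 1024 * 1024  # HDFS block size

def group_files_by_size(paths, target=TARGET_BYTES):
    """Greedily collect files until adding one more would overshoot the
    target, then start the next group."""
    groups, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups
{code}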
Read each parquet file in as an Arrow table and write the Arrow table to a new
file as a row group. This is both fast and memory-efficient since you only need
to put two million rows of data in memory at a time.
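The merge step then looks roughly like this (merge_group is my name; dropping the schema metadata is optional but avoids the pandas-metadata mismatch above):
{code:python}
import pyarrow.parquet as pq

def merge_group(paths, out_path):
    """Concatenate small parquet files into one file, one row group per input."""
    writer = None
    for path in paths:
        table = pq.read_table(path)  # only ~2 million rows held in memory at once
        table = table.replace_schema_metadata(None)  # sidestep pandas-metadata mismatches
        if writer is None:
            writer = pq.ParquetWriter(out_path, table.schema, compression='snappy')
        # Write the whole input table as a single row group in the output file.
        writer.write_table(table, row_group_size=len(table))
    if writer is not None:
        writer.close()

# e.g. for i, group in enumerate(group_files_by_size(all_paths)):
#          merge_group(group, f'merged_{i}.parquet')
{code}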
On a separate topic, I should probably open a new issue / enhancement request.
A. Would it be possible to read a row group out of a parquet file, modify it as
a pandas DataFrame, and then write it back to the original parquet file?
B. Would it be possible to add a boolean hidden status column to every parquet
file? A status of True would mean the row is valid. A status of False would
mean the row is deleted. Dremio uses an internal flag in Arrow data sets when
doing SQL Union operations. It is more efficient to flag a record as deleted
instead of trying to delete it out of a columnar memory format. If we could
introduce something similar for parquet, you could in theory update parquet
files by flagging the old record as deleted and appending the replacement
record at the end of the existing file, without having to shuffle / re-write
the entire file.
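There is no hidden column in the parquet format today, but the read-side half of the idea can be emulated with an ordinary boolean column. Rough sketch (the 'is_valid' column name is made up, and Table.filter needs a much newer pyarrow than the versions on this ticket):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# An explicit validity flag standing in for the proposed hidden status column:
# True means the row is valid, False means it has been soft-deleted.
table = pa.table({
    'id': [1, 2, 3],
    'value': ['a', 'b', 'c'],
    'is_valid': [True, False, True],
})
pq.write_table(table, 'flagged.parquet')

# Readers simply drop the soft-deleted rows instead of the file being rewritten.
live = pq.read_table('flagged.parquet')
live = live.filter(live.column('is_valid'))
{code}
The part that would still need format / library support is flipping the flag on an existing row (and appending the replacement) without rewriting the whole file, which is what this request is really asking for.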
> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---------------------------------------------------------------
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
> Reporter: Micah Williamson
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> From:
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>
> I am trying to merge multiple parquet files into one. Their schemas are
> identical field-wise but my {{ParquetWriter}} is complaining that they are
> not. After some investigation I found that the pandas metadata in the schemas
> is different, causing this error.
>
> Sample-
> {code:python}
> import pyarrow.parquet as pq
>
> writer = None  # the writer is created from the first table's schema below
> pq_tables = []
> for file_ in files:
>     pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
>     pq_tables.append(pq_table)
>     if writer is None:
>         writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
>                                   use_deprecated_int96_timestamps=True)
>     writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
>     writer.write_table(table=pq_table)
>   File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 335, in write_table
>     raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}