[jira] [Commented] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

Milenko Markovic (Jira) Wed, 01 Jul 2020 04:41:01 -0700


    [ https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149349#comment-17149349 ]


Milenko Markovic commented on ARROW-3728:
-----------------------------------------

I have a similar issue. I am trying to merge Parquet files with the following code:



import os

import pyarrow.parquet as pq


def combine_parquet_files(input_folder, target_path):
    try:
        # Read every file in the folder into an in-memory table.
        files = []
        for file_name in os.listdir(input_folder):
            files.append(pq.read_table(os.path.join(input_folder, file_name)))
        # Write all tables into one file, using the first table's schema.
        with pq.ParquetWriter(target_path, files[0].schema,
                              compression='snappy', use_dictionary=False,
                              data_page_size=524288,
                              write_statistics=True) as writer:
            for f in files:
                writer.write_table(f)
    except Exception as e:
        print(e)

It does not work for small files (20-30k). Why? I tried changing data_page_size,
but I still got the same output. What does data_page_size actually do?

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---------------------------------------------------------------
>
>                 Key: ARROW-3728
>                 URL: https://issues.apache.org/jira/browse/ARROW-3728
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0, 0.11.0, 0.11.1
>         Environment: Python 3.6.3
> OSX 10.14
>            Reporter: Micah Williamson
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.12.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> From: 
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>  
> I am trying to merge multiple parquet files into one. Their schemas are 
> identical field-wise but my {{ParquetWriter}} is complaining that they are 
> not. After some investigation I found that the pandas meta in the schemas are 
> different, causing this error.
>  
> Sample-
> {code:python}
> import pyarrow.parquet as pq
> pq_tables=[]
> writer = None  # created lazily from the first table's schema
> for file_ in files:
>     pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
>     pq_tables.append(pq_table)
>     if writer is None:
>         writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
>                                   use_deprecated_int96_timestamps=True)
>     writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
>     writer.write_table(table=pq_table)
>   File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 335, in write_table
>     raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
