ronyarmon opened a new issue, #14094:
URL: https://github.com/apache/arrow/issues/14094
Hello, I'm trying to merge a large number of Parquet files into a single Parquet file, which works fine from the Python shell:
```
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> chunks_path = './chunks'
>>> pq.write_table(pq.ParquetDataset(chunks_path).read(), 'merge_test.parquet', row_group_size=100000)
```

```
ubuntu@ip-172-31-15-123:~/milestones_chains/results/experiment1$ ls *.parquet
merge_test.parquet
ubuntu@ip-172-31-15-123:~/milestones_chains/results/experiment1$ du -sh merge_test.parquet
3.7G    merge_test.parquet
```
The files to merge are stored in the `chunks` directory, and the merge produces a 3.7 GB file.
And yet the same command fails when I run it from a Python script:
```python
# Preprocessing stages producing the parquet files to merge run before this point
import pyarrow.parquet as pq

chunks_path = './chunks'

# Results file
print('combine results')
pq.write_table(pq.ParquetDataset(chunks_path).read(), 'results.parquet',
               row_group_size=100000)
```
I'm getting:
```
combine results
killed
```
What could be the reason the merge fails when run from a script, and how can it be fixed?
I'm using Ubuntu on an EC2 instance (c5.12xlarge) with 48 vCPUs and 96 GB of RAM.
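
For reference, since `ParquetDataset(chunks_path).read()` materializes the whole dataset as a single in-memory table before writing, I suspect the bare `killed` message could come from the Linux OOM killer. Below is a minimal sketch of an incremental merge I could try instead, assuming all chunk files share the same schema and sit directly under `./chunks` (the output name `results.parquet` is kept from the script above):

```python
import pathlib

import pyarrow.parquet as pq

chunks_path = pathlib.Path('./chunks')
writer = None
try:
    # Stream one chunk file at a time instead of loading the entire
    # dataset into memory with ParquetDataset(...).read().
    for chunk_file in sorted(chunks_path.glob('*.parquet')):
        table = pq.read_table(chunk_file)
        if writer is None:
            # Open the output once, using the first chunk's schema;
            # this assumes every chunk shares that schema.
            writer = pq.ParquetWriter('results.parquet', table.schema)
        writer.write_table(table, row_group_size=100000)
finally:
    if writer is not None:
        writer.close()
```

This should keep peak memory roughly at the size of one chunk rather than the full 3.7 GB table, though row groups would not span chunk boundaries.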