ronyarmon opened a new issue, #14094:
URL: https://github.com/apache/arrow/issues/14094

   Hello, I'm trying to merge a large number of Parquet files into a single Parquet file, which works fine from the Python shell:

   ```python
   >>> import pandas as pd
   >>> import pyarrow.parquet as pq
   >>> chunks_path = './chunks'
   >>> pq.write_table(pq.ParquetDataset(chunks_path).read(), 'merge_test.parquet', row_group_size=100000)
   ```

   ```
   ubuntu@ip-172-31-15-123:~/milestones_chains/results/experiment1$ ls *.parquet
   merge_test.parquet
   ubuntu@ip-172-31-15-123:~/milestones_chains/results/experiment1$ du -sh merge_test.parquet
   3.7G    merge_test.parquet
   ```
   The files to merge are stored in the 'chunks' directory, and the merge produces a 3.7 GB file.
   However, the same command fails when I run it from a Python script:

   ```python
   # Preprocessing stages producing the parquet files to merge
   import pyarrow.parquet as pq

   chunks_path = './chunks'

   # Results file
   print('combine results')
   pq.write_table(pq.ParquetDataset(chunks_path).read(), 'results.parquet', row_group_size=100000)
   ```
   The output I get is:

   ```
   combine results
   killed
   ```
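
   For what it's worth, a bare `killed` usually means the kernel's OOM killer terminated the process. One way to check whether the merged table simply doesn't fit in memory is to sum the uncompressed row-group sizes from the Parquet metadata; this is only a rough sketch (`total_byte_size` is the uncompressed size of each row group, a lower bound on the memory needed to hold the data):

   ```python
   import glob
   import pyarrow.parquet as pq

   chunks_path = './chunks'
   total = 0
   for path in glob.glob(f'{chunks_path}/*.parquet'):
       meta = pq.ParquetFile(path).metadata
       # Sum the uncompressed size of every row group in every chunk file.
       total += sum(meta.row_group(i).total_byte_size
                    for i in range(meta.num_row_groups))

   print(f'approx. uncompressed size: {total / 1e9:.1f} GB')
   ```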
   
   What may be the reason for the failure to merge the files when run from a script, and how can it be fixed?
   I'm using Ubuntu on an EC2 instance (c5.12xlarge) with 48 vCPUs and 96 GB of RAM.
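
   In case it is relevant, here is a batch-at-a-time variant I could try instead. This is a minimal sketch, assuming the `pyarrow.dataset` API and `ParquetWriter.write_batch`; it streams record batches so peak memory stays near a single batch rather than the whole merged table:

   ```python
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   chunks_path = './chunks'
   dataset = ds.dataset(chunks_path, format='parquet')

   # Write batches one at a time instead of materializing the full
   # table with ParquetDataset(...).read(), which needs the whole
   # merged table (3.7G on disk, more in memory) at once.
   with pq.ParquetWriter('results.parquet', dataset.schema) as writer:
       for batch in dataset.to_batches():
           writer.write_batch(batch)
   ```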

