zackmedi commented on issue #38427:
URL: https://github.com/apache/arrow/issues/38427#issuecomment-1826820610

   @mapleFU Here's a script for reproducing the issue!
   
   Something I just noticed while creating this script is that it seems to 
depend on my `ulimit -n` setting on my terminal. As you can see below the 
`max_open_files = 20`, however the script will hang if I set `ulimit -n 20` or 
`ulimit -n 25`. I didn't test every individual value, but at `ulimit -n 30` the 
script starts working again. So a ulimit value somewhere between 5 and 10 files 
over the `max_open_files` arg fixes the issue in this case. I'm not sure how it 
scales in the case where max_open_files is at 30,000 like when I originally 
encountered it though.
   
   ```
   import random
   import shutil
   import string
   from pathlib import Path
   import pyarrow
   import pyarrow.dataset as ds
   
   def gen_batches(my_schema):
       batch_dict = create_empty_dict()
       for i in range(1000):
           batch_dict['part'].append(i)
   
           chars = ''.join(random.choice(string.ascii_letters) for _ in 
range(3))
           batch_dict['chars'].append(chars)
   
           if len(batch_dict['part']) >= 3:
               yield pyarrow.RecordBatch.from_pydict(batch_dict, my_schema)
               batch_dict = create_empty_dict()
   
   def create_empty_dict():
       return {
           'part': [],
           'chars': []
       }
   
   if __name__ == '__main__':
       shutil.rmtree('./tmp/', ignore_errors=True)
       Path('./tmp/').mkdir(exist_ok=True)
   
       my_schema = pyarrow.schema([
           ('part', pyarrow.uint32()),
           ('chars', pyarrow.string())
       ])
   
       writer_args = {
           'schema': my_schema,
           'format': 'parquet',
           'partitioning': [
               'part'
           ],
           'partitioning_flavor': 'hive',
           'max_partitions': 1_000,
           'max_open_files': 20,
           'existing_data_behavior': 'delete_matching'
       }
   
       ds.write_dataset(
           gen_batches(my_schema),
           base_dir='./tmp/',
           **writer_args
       )
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to