zackmedi commented on issue #38427:
URL: https://github.com/apache/arrow/issues/38427#issuecomment-1826820610
@mapleFU Here's a script for reproducing the issue!
Something I noticed while creating this script is that it seems to depend on the `ulimit -n` setting in my terminal. As you can see below, `max_open_files = 20`, yet the script will hang if I set `ulimit -n 20` or `ulimit -n 25`. I didn't test every individual value, but at `ulimit -n 30` the script starts working again, so a ulimit value somewhere between 5 and 10 files above the `max_open_files` arg fixes the issue in this case. I'm not sure how that scales when `max_open_files` is 30,000, which is where I originally encountered the problem.
```
import random
import shutil
import string
from pathlib import Path

import pyarrow
import pyarrow.dataset as ds


def gen_batches(my_schema):
    """Yield small record batches, one new 'part' value per row."""
    batch_dict = create_empty_dict()
    for i in range(1000):
        batch_dict['part'].append(i)
        chars = ''.join(random.choice(string.ascii_letters) for _ in range(3))
        batch_dict['chars'].append(chars)
        if len(batch_dict['part']) >= 3:
            yield pyarrow.RecordBatch.from_pydict(batch_dict, my_schema)
            batch_dict = create_empty_dict()


def create_empty_dict():
    return {
        'part': [],
        'chars': []
    }


if __name__ == '__main__':
    shutil.rmtree('./tmp/', ignore_errors=True)
    Path('./tmp/').mkdir(exist_ok=True)

    my_schema = pyarrow.schema([
        ('part', pyarrow.uint32()),
        ('chars', pyarrow.string())
    ])

    writer_args = {
        'schema': my_schema,
        'format': 'parquet',
        'partitioning': [
            'part'
        ],
        'partitioning_flavor': 'hive',
        'max_partitions': 1_000,
        # Keep max_open_files low so the limit is hit quickly with ~1000 partitions.
        'max_open_files': 20,
        'existing_data_behavior': 'delete_matching'
    }

    ds.write_dataset(
        gen_batches(my_schema),
        base_dir='./tmp/',
        **writer_args
    )
```
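Incidentally, if it helps with testing, the descriptor limit can also be inspected (and raised, on Unix) from inside Python via the standard `resource` module instead of the shell's `ulimit`. This is just a convenience sketch, not part of the repro itself; the value 30 mirrors the `ulimit -n 30` that made the script work for me:

```
import resource

# Current soft/hard limits on open file descriptors for this process (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'soft={soft}, hard={hard}')

# Raise the soft limit to 30 (the ulimit value that worked for me); this assumes
# the hard limit is at least 30, which it normally is.
resource.setrlimit(resource.RLIMIT_NOFILE, (30, hard))
```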