[ 
https://issues.apache.org/jira/browse/ARROW-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316181#comment-17316181
 ] 

Antoine Pitrou commented on ARROW-11857:
----------------------------------------

Thank you very much. I think I understand what's happening: you're walking a 
very wide directory (lots of subdirectories) and the AWS SDK launches a new 
thread for each subdirectory.
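
For context, a std::system_error with "Resource temporarily unavailable" usually 
corresponds to EAGAIN from pthread_create, i.e. thread creation failing once a 
resource limit (such as the per-user thread/process limit) is exhausted. Purely 
as a diagnostic sketch, that limit and the current thread count can be inspected 
from Python like this:

{code}
import resource
import threading

# Per-user limit on processes/threads (RLIMIT_NPROC on Linux). Once it is
# exhausted, pthread_create fails with EAGAIN, which std::thread reports as
# "Resource temporarily unavailable".
print(resource.getrlimit(resource.RLIMIT_NPROC))

# Number of threads currently alive in this interpreter.
print(threading.active_count())
{code}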

In any case, the implementation for this functionality has been reworked 
recently and it should be fixed. Can you try one of the nightly builds and 
report the results? See 
https://arrow.apache.org/docs/python/install.html#installing-nightly-packages
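
For reference, installing a nightly wheel with pip looks roughly like the 
following; the authoritative command and index URL are on the linked page, so 
treat this as a sketch:

{code}
# Install a nightly PyArrow wheel from the arrow-nightlies index
# (see the linked documentation page for the exact, up-to-date command).
pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ --prefer-binary --pre pyarrow
{code}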

> [Python] Resource temporarily unavailable when using the new Dataset API with 
> Pandas
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-11857
>                 URL: https://issues.apache.org/jira/browse/ARROW-11857
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: OS: Debian GNU/Linux 10 (buster) x86_64 
> Kernel: 4.19.0-14-amd64 
> CPU: Intel i7-6700K (8) @ 4.200GHz 
> Memory: 32122MiB
> Python: v3.7.3
>            Reporter: Anton Friberg
>            Assignee: Weston Pace
>            Priority: Critical
>             Fix For: 4.0.0
>
>         Attachments: gdb.txt.gz
>
>
> When using the new Dataset API under v3.0.0, the process crashes immediately with
> {code:java}
>  terminate called after throwing an instance of 'std::system_error'
>  what(): Resource temporarily unavailable{code}
> This does not happen with earlier versions. The error message leads me to 
> believe that the issue is not on the Python side but might be in the C++ 
> libraries.
> As background, I am using the new Dataset API by calling the following
> {code:java}
> import pyarrow.parquet as pq
> from pyarrow import fs
>
> # bucket, base_path, filters and columns are defined elsewhere in my code
> s3_fs = fs.S3FileSystem(<minio credentials>)
> dataset = pq.ParquetDataset(
>         f"{bucket}/{base_path}",
>         filesystem=s3_fs,
>         partitioning="hive",
>         use_legacy_dataset=False,
>         filters=filters
> )
> dataframe = dataset.read_pandas(columns=columns).to_pandas(){code}
> The dataset itself contains tens of thousands of files, each around 100 MB in 
> size, and is created by incremental bulk processing with pandas and pyarrow 
> v1.0.1. With the filters I am limiting the number of files that are fetched to 
> around 20.
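> For reference, the filters variable in the snippet above uses the DNF-style 
> list-of-tuples format that ParquetDataset accepts; a hypothetical example 
> (the real column names and values are not part of this report) would be
> {code}
> # Hypothetical filter limiting the read to a narrow partition range; the
> # actual columns and values used are not included in this report.
> filters = [
>     ("date", ">=", "2021-02-01"),
>     ("date", "<", "2021-03-01"),
> ]
> {code}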
> I suspect the issue is a limit on the total number of threads being spawned, 
> but I have been unable to resolve it by calling
> {code:java}
> pyarrow.set_cpu_count(1) {code}
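> For completeness, pyarrow.set_cpu_count only sizes the CPU thread pool; the IO 
> thread pool is separate. If the installed version exposes it, the IO pool can 
> be capped as well (a hypothetical mitigation, not verified to help here):
> {code}
> import pyarrow as pa
>
> pa.set_cpu_count(1)        # size of the CPU thread pool
> pa.set_io_thread_count(1)  # size of the IO thread pool, if available in this version
> {code}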



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
