[ https://issues.apache.org/jira/browse/ARROW-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316128#comment-17316128 ]

Anton Friberg commented on ARROW-11857:
---------------------------------------

[~apitrou] I have never done it before but I am willing to try. Please let me 
know if I did anything wrong.

Environment
{code:java}
$ which python
/home/antonfr/.virtualenvs/pyarrow-stacktrace/bin/python
$ python --version
Python 3.7.3
$ pip freeze               
appdirs==1.4.4
dateutils==0.6.12
numpy==1.20.2
packaging==20.9
pkg-resources==0.4.0
pyarrow==3.0.0
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2021.1
six==1.15.0{code}
To run gdb I installed _python3-gdb_ and _python3-dev_ in addition to gdb, but 
I did not manage to run _python3-gdb_ with pyarrow installed in that virtualenv.

Then I ran the following (my script is called _anja_range_download.py_):
{code:java}
$ gdb /home/antonfr/.virtualenvs/pyarrow-stacktrace/bin/python
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
--Type <RET> for more, q to quit, c to continue without paging--
Type "apropos word" to search for commands related to "word"...
Reading symbols from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/bin/python...Reading symbols from 
/usr/lib/debug/.build-id/99/21c75e6930d3e9d9fa8c942aca9dc4500bb65f.debug...done.
done.
(gdb) set logging on
Copying output to gdb.txt.
(gdb) run anja_range_download.py
Starting program: /home/antonfr/.virtualenvs/pyarrow-stacktrace/bin/python 
anja_range_download.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47ff700 (LWP 27230)]

... Multiple thousands of threads started and exited ...

[New Thread 0x7fec62358700 (LWP 5825)]
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

... Hangs for a while ...

Thread 920 "python" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffd34cf9700 (LWP 28235)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

... Tested py-bt unsuccessfully ...

(gdb) py-bt
Undefined command: "py-bt".  Try "help".

... Running normal bt ...
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff79c8535 in __GI_abort () at abort.c:79
#2  0x00007ffff51ff983 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff52058c6 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff5205901 in std::terminate() () from 
/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff5205b34 in __cxa_throw () from 
/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff5742f4c in std::__throw_system_error(int) () from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#7  0x00007ffff64f65f9 in 
std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, 
std::default_delete<std::thread::_State> >, void (*)()) ()
   from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#8  0x00007ffff60b23df in 
Aws::Utils::Threading::DefaultExecutor::SubmitToThread(std::function<void 
()>&&) ()
   from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#9  0x00007ffff60427c5 in 
Aws::S3::S3Client::ListObjectsV2Async(Aws::S3::Model::ListObjectsV2Request 
const&, std::function<void (Aws::S3::S3Client const*, 
Aws::S3::Model::ListObjectsV2Request const&, 
Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> 
const&, std::shared_ptr<Aws::Client::AsyncCallerContext const> const&)> const&, 
std::shared_ptr<Aws::Client::AsyncCallerContext const> const&) const () from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#10 0x00007ffff5d9e19b in arrow::fs::(anonymous 
namespace)::TreeWalker::WalkChild(std::string, int) ()
   from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#11 0x00007ffff5d9efc0 in arrow::fs::(anonymous 
namespace)::TreeWalker::ListObjectsV2Handler::HandleResult(Aws::S3::Model::ListObjectsV2Result
 const&) ()
   from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#12 0x00007ffff5d9f18b in std::_Function_handler<void (Aws::S3::S3Client 
const*, Aws::S3::Model::ListObjectsV2Request const&, 
Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> 
const&, std::shared_ptr<Aws::Client::AsyncCallerContext const> const&), 
arrow::fs::(anonymous 
namespace)::TreeWalker::ListObjectsV2Handler>::_M_invoke(std::_Any_data const&, 
Aws::S3::S3Client const*&&, Aws::S3::Model::ListObjectsV2Request const&, 
Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> 
const&, std::shared_ptr<Aws::Client::AsyncCallerContext const> const&) () from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#13 0x00007ffff5fa9972 in 
Aws::S3::S3Client::ListObjectsV2AsyncHelper(Aws::S3::Model::ListObjectsV2Request
 const&, std::function<void (Aws::S3::S3Client const*, 
Aws::S3::Model::ListObjectsV2Request const&, 
Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> 
const&, std::shared_ptr<Aws::Client::AsyncCallerContext const> const&)> const&, 
std::shared_ptr<Aws::Client::AsyncCallerContext const> const&) const () from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#14 0x00007ffff60e2f07 in 
std::thread::_State_impl<std::thread::_Invoker<std::tuple<Aws::Utils::Threading::DefaultExecutor::SubmitToThread(std::function<void
 ()>&&)::{lambda()#1}> > >::_M_run() () from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#15 0x00007ffff64f6580 in execute_native_thread_routine () from 
/home/antonfr/.virtualenvs/pyarrow-stacktrace/lib/python3.7/site-packages/pyarrow/libarrow.so.300
#16 0x00007ffff7f58fa3 in start_thread (arg=<optimized out>) at 
pthread_create.c:486
#17 0x00007ffff7a9f4cf in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{code}
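The {{std::system_error}} with "Resource temporarily unavailable" thrown from {{std::thread::_M_start_thread}} (frame #7) is what libstdc++ raises when {{pthread_create}} fails with EAGAIN, i.e. a thread limit was hit. A quick stdlib-only diagnostic sketch (Linux-specific paths, illustrative only):
{code:python}
import resource

# EAGAIN from pthread_create() usually means a thread limit was reached.
# Two common culprits on Linux:

# 1. Per-user process/thread limit (same limit as `ulimit -u`):
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC soft/hard:", soft, hard)

# 2. System-wide thread cap:
with open("/proc/sys/kernel/threads-max") as f:
    print("threads-max:", f.read().strip())

# Threads currently used by this process:
with open("/proc/self/status") as f:
    for line in f:
        if line.startswith("Threads:"):
            print(line.strip())
{code}
Comparing these limits against the "multiple thousands of threads" gdb reports should confirm whether the crash is simply the process running out of its thread budget.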
Is this the information you need, or is there anything else I can provide?
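Frames #8-#10 also suggest why the thread count explodes: the AWS SDK's {{DefaultExecutor::SubmitToThread}} appears to launch a fresh thread per submitted task, and {{TreeWalker::WalkChild}} submits one async ListObjectsV2 call per child prefix. A rough stdlib-only sketch of the difference between that pattern and a bounded pool (illustrative only, not Arrow's actual code):
{code:python}
import threading
from concurrent.futures import ThreadPoolExecutor

def thread_per_task(tasks):
    # Pattern resembling a thread-per-submission executor: one new OS thread
    # per task. With tens of thousands of tasks this can exhaust
    # RLIMIT_NPROC, making pthread_create() fail with EAGAIN.
    threads = [threading.Thread(target=t) for t in tasks]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

def bounded_pool(tasks, max_workers=8):
    # A fixed-size pool keeps the thread count constant no matter how many
    # tasks are submitted.
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = [ex.submit(t) for t in tasks]
        for f in futures:
            f.result()

if __name__ == "__main__":
    done = []
    lock = threading.Lock()

    def task():
        with lock:
            done.append(1)

    thread_per_task([task] * 100)
    bounded_pool([task] * 100)
    print(len(done))  # 200
{code}
If that reading is right, capping pyarrow's CPU pool would not help, since these threads come from the SDK's executor rather than Arrow's thread pool.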

> [Python] Resource temporarily unavailable when using the new Dataset API with 
> Pandas
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-11857
>                 URL: https://issues.apache.org/jira/browse/ARROW-11857
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: OS: Debian GNU/Linux 10 (buster) x86_64 
> Kernel: 4.19.0-14-amd64 
> CPU: Intel i7-6700K (8) @ 4.200GHz 
> Memory: 32122MiB
> Python: v3.7.3
>            Reporter: Anton Friberg
>            Assignee: Weston Pace
>            Priority: Critical
>             Fix For: 4.0.0
>
>
> When using the new Dataset API under v3.0.0 it instantly crashes with
> {code:java}
>  terminate called after throwing an instance of 'std::system_error'
>  what(): Resource temporarily unavailable{code}
> This does not happen in an earlier version. The error message leads me to 
> believe that the issue is not on the Python side but might be in the C++ 
> libraries.
> As background, I am using the new Dataset API by calling the following
> {code:java}
> s3_fs = fs.S3FileSystem(<minio credentials>)
> dataset = pq.ParquetDataset(
>         f"{bucket}/{base_path}",
>         filesystem=s3_fs,
>         partitioning="hive",
>         use_legacy_dataset=False,
>         filters=filters
> )
> dataframe = dataset.read_pandas(columns=columns).to_pandas(){code}
> The dataset itself contains 10,000s of files around 100 MB in size and is 
> created using incremental bulk processing from pandas and pyarrow v1.0.1. 
> With the filters I am limiting the number of files that are fetched to around 
> 20.
> I suspect an issue with a limit on the total number of threads being 
> spawned, but I have been unable to resolve it by calling
> {code:java}
> pyarrow.set_cpu_count(1) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
