[
https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vadym Dytyniak updated ARROW-18228:
-----------------------------------
Description:
We use Dask to parallelise read/write operations and pyarrow to write datasets
from worker nodes.
After pyarrow 10.0.0 was released, our data flows automatically switched to
the latest version, and some of them started to fail with the following error:
{code:java}
  File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line 768, in _write_partition
    ds.write_dataset(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 988, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 2859, in pyarrow._dataset._filesystemdataset_write
    check_status(CFileSystemDataset.Write(c_options, c_scanner))
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
    raise IOError(message)
OSError: When creating key 'equities.us.level2.by_security/' in bucket 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce your request rate.
{code}
In total, the flow failed many times: most runs failed with the error above,
but one failed with:
{code:java}
  File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line 857, in _load_partition
    table = ds.dataset(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 752, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 444, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 411, in _ensure_single_source
    file_info = filesystem.get_file_info(path)
  File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info
    info = GetResultValue(self.fs.GetFileInfo(path))
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
    raise IOError(message)
OSError: When getting information for key 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet' in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
{code}
Do you have any idea what changed in dataset writing between 9.0.0 and
10.0.0, so that we can fix the issue?
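
For illustration only, one possible client-side stopgap for the SLOW_DOWN failures is to retry the write with exponential backoff and jitter. This is a minimal sketch, not a fix for whatever changed in 10.0.0: the helper name call_with_backoff and all of its parameters are hypothetical, and it assumes the throttling error keeps surfacing as an OSError whose message contains "SLOW_DOWN", as in the traceback above.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on S3 throttling.

    Hypothetical helper: only OSErrors whose message contains "SLOW_DOWN"
    are retried; every other error is re-raised immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError as exc:
            throttled = "SLOW_DOWN" in str(exc)
            # Give up on non-throttling errors or once attempts are exhausted.
            if not throttled or attempt == max_attempts:
                raise
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter so parallel Dask workers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

A worker could then wrap its existing call, for example call_with_backoff(lambda: ds.write_dataset(table, base_dir, format="parquet", filesystem=fs)), where table, base_dir and fs are placeholders for whatever the flow already passes. Retrying a write that partially completed may leave extra files behind, so this assumes the write is safe to repeat.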
> AWS Error SLOW_DOWN during PutObject operation
> ----------------------------------------------
>
> Key: ARROW-18228
> URL: https://issues.apache.org/jira/browse/ARROW-18228
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 10.0.0
> Reporter: Vadym Dytyniak
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)