[ https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vadym Dytyniak updated ARROW-18228:
-----------------------------------
    Description: 
We use Dask to parallelise read/write operations and pyarrow to write datasets 
from the worker nodes.

After pyarrow 10.0.0 was released, our data flows automatically picked up the 
new version, and some of them started to fail with the following error:
{code:python}
  File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line 768, in _write_partition
    ds.write_dataset(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 988, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 2859, in pyarrow._dataset._filesystemdataset_write
    check_status(CFileSystemDataset.Write(c_options, c_scanner))
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
    raise IOError(message)
OSError: When creating key 'equities.us.level2.by_security/' in bucket 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce your request rate.
{code}
Do you have any idea what changed in dataset writing between 9.0.0 and 
10.0.0? Knowing that would help us fix the issue.

In total, the flow failed many times: most runs failed with the error above, 
but one failed with:
{code:python}
  File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line 857, in _load_partition
    table = ds.dataset(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 752, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 444, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 411, in _ensure_single_source
    file_info = filesystem.get_file_info(path)
  File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info
    info = GetResultValue(self.fs.GetFileInfo(path))
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
    raise IOError(message)
OSError: When getting information for key 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet' in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
{code}
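
Both failures surface in Python as an OSError whose message names the AWS error, so one possible stopgap on our side would be to retry such calls with exponential backoff. The following is only a minimal sketch under that assumption: the helper name call_with_backoff, its parameters, and the choice to treat the "SLOW_DOWN" and "NETWORK_CONNECTION" substrings as retryable are hypothetical, not part of pyarrow, and this does not explain what changed in 10.0.0.

```python
import random
import time

# Substrings taken from the two error messages quoted above. Treating them as
# retryable is an assumption of this sketch, not documented pyarrow behaviour.
TRANSIENT_MARKERS = ("SLOW_DOWN", "NETWORK_CONNECTION")


def call_with_backoff(func, *args, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying transient-looking OSErrors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            transient = any(marker in str(exc) for marker in TRANSIENT_MARKERS)
            # Re-raise anything that does not look transient, and give up
            # once the attempt budget is exhausted.
            if not transient or attempt == max_attempts:
                raise
            # Double the wait on every retry and add random jitter so that
            # parallel workers do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

On a worker this could wrap the failing call, for example call_with_backoff(ds.write_dataset, table, base_dir, format="parquet", filesystem=fs), where table, base_dir and fs stand in for our own objects. A retried write may find files left behind by the failed attempt, so the existing_data_behavior setting of write_dataset would need checking before relying on this.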

  was:
We use Dask to parallelise read/write operations and pyarrow to write dataset 
from worker nodes.

After pyarrow released version 10.0.0, our data flows automatically switched to 
the latest version and some of them started to fail with the following error:
{code:java}
File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line 768, 
in _write_partition
    ds.write_dataset(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 988, 
in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 2859, in 
pyarrow._dataset._filesystemdataset_write
    check_status(CFileSystemDataset.Write(c_options, c_scanner))
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
    raise IOError(message)
OSError: When creating key 'equities.us.level2.by_security/' in bucket 
'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce your 
request rate. {code}
Do you have any idea what was changed for dataset write between 9.0.0 and 
10.0.0 to help us to fix the issue?


> AWS Error SLOW_DOWN during PutObject operation
> ----------------------------------------------
>
>                 Key: ARROW-18228
>                 URL: https://issues.apache.org/jira/browse/ARROW-18228
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 10.0.0
>            Reporter: Vadym Dytyniak
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
