via GitHub Tue, 26 Sep 2023 12:44:51 -0700


maubarsom opened a new issue, #37888:
URL: https://github.com/apache/arrow/issues/37888


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Bug seen in pyarrow version 12.0.0 on macOS Ventura 13.6, Apple M1 Pro.
   
   # Description
   
   The error was originally detected in `pandas` but traced to `pyarrow`, as shown in the screenshot. If I try to read an existing file from `S3` when my credentials are stored in the `~/.aws/credentials` and config directory, pyarrow raises the following error:
   
   ```
   OSError: When getting information for key 'XXX/YYY.parquet' in bucket 'ZZZZZ': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
   ```
   
   **Expected result**: The file is successfully read.
   
   **Note:** This error DOES NOT occur if the credentials are set as environment variables (instead of being read from `~/.aws/credentials`). If they are set as environment variables, pyarrow successfully reads the parquet file.
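The working configuration in the note above (credentials supplied as environment variables) can be approximated with a short sketch. It assumes an INI-style credentials file with a `[default]` profile and the standard `aws_access_key_id` / `aws_secret_access_key` keys. `parse_aws_profile` and `export_as_env` are hypothetical helper names, not part of pyarrow, and this is only a stopgap, not a fix for the underlying problem.

```python
import configparser
import os


def parse_aws_profile(text, profile="default"):
    """Return the key/value pairs of one profile from credentials-file text.

    Assumes the INI layout commonly used by ~/.aws/credentials.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    return dict(parser[profile])


def export_as_env(profile_values, environ=os.environ):
    """Copy the standard credential keys into environment variables."""
    mapping = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "aws_session_token": "AWS_SESSION_TOKEN",
    }
    for file_key, env_name in mapping.items():
        if file_key in profile_values:
            environ[env_name] = profile_values[file_key]
    return environ


# Usage (the path is an assumption about a typical setup):
# with open(os.path.expanduser("~/.aws/credentials")) as fh:
#     export_as_env(parse_aws_profile(fh.read()))
```

Calling this before the first S3 access in the process would reproduce the environment-variable setup that was observed to work.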
   
   **Note 2:** As shown in the screenshot, I managed to circumvent the issue in **pandas** by passing `storage_options={"anon":False}` explicitly. However, a similar approach in `pyarrow`, explicitly setting `filesystem=S3FileSystem(anonymous=False)`, did not succeed and resulted in the same error.
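For reference, the two calls compared in Note 2 might look roughly like the sketch below. The function names and the `strip_scheme` helper are hypothetical, the placeholder bucket and key come from the error message above, and the imports are deferred so the block can be loaded without `pandas` or `pyarrow` installed.

```python
def strip_scheme(uri):
    """Turn 's3://bucket/key' into 'bucket/key' for use with an explicit filesystem."""
    prefix = "s3://"
    return uri[len(prefix):] if uri.startswith(prefix) else uri


def read_with_pandas(uri):
    """The pandas call that was reported to work."""
    import pandas as pd

    return pd.read_parquet(uri, storage_options={"anon": False})


def read_with_pyarrow(uri):
    """The pyarrow call that was reported to fail with ACCESS_DENIED."""
    import pyarrow.parquet as pq
    from pyarrow.fs import S3FileSystem

    filesystem = S3FileSystem(anonymous=False)
    return pq.read_table(strip_scheme(uri), filesystem=filesystem)


# Usage (placeholder path, requires real S3 access):
# read_with_pandas("s3://ZZZZZ/XXX/YYY.parquet")
# read_with_pyarrow("s3://ZZZZZ/XXX/YYY.parquet")
```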
   
   # Screenshot
   
   
![pyarrow_bug_report](https://github.com/apache/arrow/assets/5690589/af2d00de-ce4e-4214-9e89-b9fa11e10043)
   
   
   The traceback:
   
   ```
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2939, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
      2932     raise ValueError( 
      2933         "The 'metadata' keyword is no longer supported with the new "
      2934         "datasets-based implementation. Specify "
      2935         "'use_legacy_dataset=True' to temporarily recover the old "
      2936         "behaviour."
      2937     )
      2938 try:
   -> 2939     dataset = _ParquetDatasetV2(
      2940         source,
      2941         schema=schema,
      2942         filesystem=filesystem,
      2943         partitioning=partitioning,
      2944         memory_map=memory_map,
      2945         read_dictionary=read_dictionary,
      2946         buffer_size=buffer_size,
      2947         filters=filters,
      2948         ignore_prefixes=ignore_prefixes,
      2949         pre_buffer=pre_buffer,
      2950         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
      2951         thrift_string_size_limit=thrift_string_size_limit,
      2952         thrift_container_size_limit=thrift_container_size_limit,
      2953     )
      2954 except ImportError:
      2955     # fall back on ParquetFile for simple cases when pyarrow.dataset
      2956     # module is not available
      2957     if filters is not None:
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2465, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
      2463     except ValueError:
      2464         filesystem = LocalFileSystem(use_mmap=memory_map)
   -> 2465 finfo = filesystem.get_file_info(path_or_paths)
      2466 if finfo.is_file:
      2467     single_file = path_or_paths
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/_fs.pyx:571, in pyarrow._fs.FileSystem.get_file_info()
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()
   ```
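Since the traceback ends in `FileSystem.get_file_info()`, the failure can probably be isolated without going through the parquet reader at all. The sketch below is one way to do that. `probe` is a hypothetical helper that takes any object exposing `get_file_info` and reports the outcome instead of raising.

```python
def probe(filesystem, path):
    """Call get_file_info on the given filesystem and summarise the result."""
    try:
        info = filesystem.get_file_info(path)
    except OSError as exc:
        # pyarrow surfaces the S3 ACCESS_DENIED failure as an OSError.
        return f"failed: {exc}"
    return f"ok: {info.type}"


# Usage (placeholder path, requires pyarrow and real S3 access):
# from pyarrow.fs import S3FileSystem
# print(probe(S3FileSystem(), "ZZZZZ/XXX/YYY.parquet"))
```

Running the probe once with credentials in `~/.aws/credentials` and once with them exported as environment variables would show whether the difference is already visible at this level.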
   
   ### Component(s)
   
   Parquet, Python


[GitHub] [arrow] maubarsom opened a new issue, #37888: pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials
