Fokko opened a new issue, #33972:
URL: https://github.com/apache/arrow/issues/33972
### Describe the enhancement requested
Hey all,
First of all, thanks everyone for working on PyArrow! Really loving it so
far. I'm currently working on PyIceberg, which loads Iceberg tables into
PyArrow. For those unfamiliar with Apache Iceberg: it is a table format that
focuses on huge tables (petabyte scale). PyIceberg makes your life easier
by taking care of statistics to boost performance and by handling all the
schema maintenance. For example, if you change the partitioning of an Iceberg
table, you don't have to rewrite all the files right away; you can do that
incrementally.
Now I'm running into some performance issues, and I noticed that PyArrow is
making more requests to S3 than necessary. I went down the rabbit hole and was
able to narrow it down to:
```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

ONE_MEGABYTE = 1024 * 1024

client_kwargs = {
    "endpoint_override": "http://localhost:9000",
    "access_key": "admin",
    "secret_key": "password",
}

parquet_format = ds.ParquetFileFormat(
    use_buffered_stream=True,
    pre_buffer=True,
    buffer_size=8 * ONE_MEGABYTE,
)

fs = S3FileSystem(**client_kwargs)

with fs.open_input_file(
    "warehouse/wh/nyc/taxis/data/tpep_pickup_datetime_day=2022-04-30/00003-4-89e0ad58-fb77-4512-8679-6f26d8d6ef28-00033.parquet"
) as fout:
    # First get the fragment
    fragment = parquet_format.make_fragment(fout, None)
    print(f"Schema: {fragment.physical_schema}")

    arrow_table = ds.Scanner.from_fragment(
        fragment=fragment,
    ).to_table()
```
I need the schema first because a column may have been renamed while the file
hasn't been rewritten against the latest schema. The same goes for filtering:
if you change a column name and the file still contains the old name, you'd
still like to leverage PyArrow's predicate pushdown so the data isn't loaded
into memory at all.
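For context, this is roughly what I'd like to end up with once the physical schema is known (just a sketch; `vendor_id` and the predicate are made-up placeholders, and `fragment` is the one from the snippet above):
```python
import pyarrow.dataset as ds

# `fragment` comes from the snippet above. "vendor_id" is a made-up column;
# in practice the name would be resolved against fragment.physical_schema,
# which may still carry a pre-rename column name.
arrow_table = ds.Scanner.from_fragment(
    fragment=fragment,
    columns=["vendor_id"],
    filter=ds.field("vendor_id") == 1,
).to_table()
```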
Looking at the MinIO logs, I can see that it makes four requests:
1. A HEAD request to check if the file exists
2. A read of the last 64kb of the Parquet file, to get the schema
3. Another read of the last 64kb of the Parquet file, again to get the schema
4. A nice beefy 1978578kb request to fetch the data
Looking at the tests, we shouldn't fetch the footer twice:
```python
# with default discovery, no metadata loaded
with assert_opens([fragment.path]):
    fragment.ensure_complete_metadata()
assert fragment.row_groups == [0, 1]

# second time -> use cached / no file IO
with assert_opens([]):
    fragment.ensure_complete_metadata()
```
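Based on that test, I'd expect something along these lines to avoid the duplicate footer read, though I haven't verified it against the MinIO logs (a sketch only; the path is a placeholder for the data file above):
```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(
    endpoint_override="http://localhost:9000",
    access_key="admin",
    secret_key="password",
)
parquet_format = ds.ParquetFileFormat(pre_buffer=True)

# Placeholder path, standing in for the Iceberg data file above.
path = "warehouse/wh/nyc/taxis/data/.../00033.parquet"

fragment = parquet_format.make_fragment(path, filesystem=fs)

# Parse the footer once; the row-group metadata should now be cached
# on the fragment.
fragment.ensure_complete_metadata()

# I would expect these to be served from the cached metadata,
# without a second footer read from S3.
print(f"Schema: {fragment.physical_schema}")
arrow_table = ds.Scanner.from_fragment(fragment=fragment).to_table()
```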
Any thoughts or advice? I went through the code a bit already, but my C++ is
a bit rusty.
### Component(s)
Python