Fokko commented on issue #6475:
URL: https://github.com/apache/iceberg/issues/6475#issuecomment-1365733458

   It looks like PyArrow is still making more requests than s3fs.
   
   I've created a local table of taxis in PyArrow:
   ```sql
   %%sql
   
   CREATE DATABASE nyc;
   
   %%sql
   
   CREATE TABLE nyc.taxis (
       VendorID              bigint,
       tpep_pickup_datetime  timestamp,
       tpep_dropoff_datetime timestamp,
       passenger_count       double,
       trip_distance         double,
       RatecodeID            double,
       store_and_fwd_flag    string,
       PULocationID          bigint,
       DOLocationID          bigint,
       payment_type          bigint,
       fare_amount           double,
       extra                 double,
       mta_tax               double,
       tip_amount            double,
       tolls_amount          double,
       improvement_surcharge double,
       total_amount          double,
       congestion_surcharge  double,
       airport_fee           double
   )
   USING iceberg
   PARTITIONED BY (days(tpep_pickup_datetime))
   ```
   ```python
   %%python
   
   # Loop over it to avoid OOM, otherwise *.parquet would also work
   # (and be more efficient)
   for filename in [
       "yellow_tripdata_2022-04.parquet",
       "yellow_tripdata_2022-03.parquet",
       "yellow_tripdata_2022-02.parquet",
       "yellow_tripdata_2022-01.parquet",
       "yellow_tripdata_2021-12.parquet",
       "yellow_tripdata_2021-11.parquet",
       "yellow_tripdata_2021-10.parquet",
       "yellow_tripdata_2021-09.parquet",
       "yellow_tripdata_2021-08.parquet"
   ]:
       df = spark.read.parquet(f"/home/iceberg/data/{filename}")
       df.write.mode("append").saveAsTable("nyc.taxis")
   ```
   
   Looking at the MinIO requests when running `pyiceberg --catalog local files nyc.taxis`:
   
   ```
   Snapshots: local.nyc.taxis
   └── Snapshot 6682082212753545990, schema 0: s3a://warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
       │   ...
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
       │   ...
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
       │   ...
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
       │   ...
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
       │   ...
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
       │   ...
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
       │   ...
       ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
       │   ...
       └── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
           ...
   
   ```
   
   ### PyArrow
   
   ```
   2022-12-27T08:45:32.822 [206 Partial Content] s3.GetObject 
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
 172.18.0.3        1.142ms      ↑ 169 B ↓ 14 KiB
   2022-12-27T08:45:32.913 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        867µs       ↑ 153 B ↓ 412 B
   2022-12-27T08:45:32.925 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        1.626ms      ↑ 159 B ↓ 4.6 KiB
   2022-12-27T08:45:32.973 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        1.216ms      ↑ 153 B ↓ 413 B
   2022-12-27T08:45:32.989 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        3.719ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.020 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        3.904ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.042 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        1.903ms      ↑ 159 B ↓ 1.7 KiB
   2022-12-27T08:45:33.104 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        1.232ms      ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.113 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        683µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.120 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        975µs       ↑ 159 B ↓ 7.0 KiB
   2022-12-27T08:45:33.141 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        383µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.144 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        774µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.148 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        833µs       ↑ 159 B ↓ 7.4 KiB
   2022-12-27T08:45:33.170 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        432µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.173 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        1.208ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.178 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        814µs       ↑ 159 B ↓ 8.2 KiB
   2022-12-27T08:45:33.202 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        427µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.205 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        671µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.209 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        502µs       ↑ 159 B ↓ 7.9 KiB
   2022-12-27T08:45:33.233 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        616µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.236 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        955µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.240 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        934µs       ↑ 159 B ↓ 7.4 KiB
   2022-12-27T08:45:33.262 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        308µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.265 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        641µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.269 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        831µs       ↑ 159 B ↓ 7.6 KiB
   2022-12-27T08:45:33.295 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        625µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.298 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        828µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.302 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        897µs       ↑ 159 B ↓ 7.8 KiB
   2022-12-27T08:45:33.324 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        474µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.326 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        644µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.330 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        904µs       ↑ 159 B ↓ 7.1 KiB
   ```
   
   ### S3FS
   
   ```
   2022-12-27T09:05:13.127 [206 Partial Content] s3.GetObject 
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
 172.18.0.3        1.167ms      ↑ 169 B ↓ 14 KiB
   2022-12-27T09:05:13.245 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        4.86ms       ↑ 138 B ↓ 412 B
   2022-12-27T09:05:13.258 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
 172.18.0.1        860µs       ↑ 153 B ↓ 4.6 KiB
   2022-12-27T09:05:13.265 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        282µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.268 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
 172.18.0.1        943µs       ↑ 153 B ↓ 18 KiB
   2022-12-27T09:05:13.297 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        472µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.300 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
 172.18.0.1        907µs       ↑ 153 B ↓ 15 KiB
   2022-12-27T09:05:13.320 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        349µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.323 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
 172.18.0.1        779µs       ↑ 153 B ↓ 15 KiB
   2022-12-27T09:05:13.348 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        346µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.351 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
 172.18.0.1        681µs       ↑ 153 B ↓ 16 KiB
   2022-12-27T09:05:13.374 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        291µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.377 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
 172.18.0.1        771µs       ↑ 153 B ↓ 16 KiB
   2022-12-27T09:05:13.399 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        375µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.403 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
 172.18.0.1        789µs       ↑ 153 B ↓ 15 KiB
   2022-12-27T09:05:13.431 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        373µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.434 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
 172.18.0.1        650µs       ↑ 153 B ↓ 16 KiB
   2022-12-27T09:05:13.457 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        308µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.460 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
 172.18.0.1        753µs       ↑ 153 B ↓ 16 KiB
   2022-12-27T09:05:13.482 [200 OK] s3.HeadObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        279µs       ↑ 138 B ↓ 413 B
   2022-12-27T09:05:13.488 [206 Partial Content] s3.GetObject 
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
 172.18.0.1        885µs       ↑ 153 B ↓ 15 KiB
   ```
   
   We can observe that PyArrow makes two `GetObject` calls to each manifest Avro file (three for the first one), while s3fs makes just one.
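
To tally this instead of eyeballing the trace, something like the sketch below could work. It assumes each trace entry has been joined onto a single line in the format shown above; `count_get_requests` and `TRACE_RE` are hypothetical names for illustration, not part of PyIceberg or MinIO. The sample lines are the entries for one manifest taken from the two traces.

```python
import re
from collections import Counter

# Assumed layout of one trace entry: timestamp, [status], s3.<Operation>, host/path, ...
TRACE_RE = re.compile(r"\[[^\]]+\]\s+s3\.(?P<op>\w+)\s+\S+?/(?P<path>\S+)")


def count_get_requests(trace_lines):
    """Count s3.GetObject requests per object path (host prefix stripped)."""
    counts = Counter()
    for line in trace_lines:
        match = TRACE_RE.search(line)
        if match and match.group("op") == "GetObject":
            counts[match.group("path")] += 1
    return counts


MANIFEST = (
    "127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/"
    "f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro"
)

pyarrow_trace = [
    f"2022-12-27T08:45:33.104 [200 OK] s3.HeadObject {MANIFEST} 172.18.0.1 1.232ms ↑ 153 B ↓ 413 B",
    f"2022-12-27T08:45:33.113 [206 Partial Content] s3.GetObject {MANIFEST} 172.18.0.1 683µs ↑ 159 B ↓ 8.5 KiB",
    f"2022-12-27T08:45:33.120 [206 Partial Content] s3.GetObject {MANIFEST} 172.18.0.1 975µs ↑ 159 B ↓ 7.0 KiB",
]

s3fs_trace = [
    f"2022-12-27T09:05:13.297 [200 OK] s3.HeadObject {MANIFEST} 172.18.0.1 472µs ↑ 138 B ↓ 413 B",
    f"2022-12-27T09:05:13.300 [206 Partial Content] s3.GetObject {MANIFEST} 172.18.0.1 907µs ↑ 153 B ↓ 15 KiB",
]

print("PyArrow:", dict(count_get_requests(pyarrow_trace)))
print("s3fs:   ", dict(count_get_requests(s3fs_trace)))
```

Only `GetObject` is counted because both implementations issue the same single `HeadObject` per file; the difference is in the ranged reads.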

