marsupialtail opened a new pull request, #13799:
URL: https://github.com/apache/arrow/pull/13799
This exposes the Fragment Readahead and Batch Readahead flags in the C++
Scanner to the user in Python.
This can be used to fine-tune RAM usage and IO utilization when downloading
large files from S3 or other network sources. I believe the default settings
are overly conservative for small-RAM configurations, and I observe less than
20% IO utilization on some AWS instances.
The flags are exposed only on the Python methods where they make sense.
Scanning from a RecordBatchIterator does not need these flags, and only the
latter flag (batch readahead) makes sense when making a scanner from a single
fragment.
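As a usage sketch (the S3 path and readahead values here are placeholders, not
from the PR), both keywords can be passed when scanning a whole dataset, for
example through `Scanner.from_dataset` or `Dataset.to_batches`:
```
import pyarrow.dataset as ds

# A dataset backed by many files; the path is a placeholder.
dataset = ds.dataset("s3://my-bucket/data/", format="parquet")

# fragment_readahead: how many files (fragments) are fetched concurrently.
# batch_readahead: how many batches are read ahead within a fragment.
scanner = ds.Scanner.from_dataset(
    dataset,
    batch_size=1_000_000,
    batch_readahead=16,
    fragment_readahead=8,
)

for batch in scanner.to_batches():
    ...  # process each RecordBatch as it arrives
```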
To test this, set up an i3.2xlarge instance on AWS:
```
import pyarrow
import pyarrow.dataset as ds
import pyarrow.csv as csv
import time

pyarrow.set_cpu_count(8)
pyarrow.set_io_thread_count(16)

lineitem_scheme = ["l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
                   "l_quantity", "l_extendedprice", "l_discount", "l_tax",
                   "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
                   "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment",
                   "null"]
csv_format = ds.CsvFileFormat(
    read_options=csv.ReadOptions(column_names=lineitem_scheme,
                                 block_size=32 * 1024 * 1024),
    parse_options=csv.ParseOptions(delimiter="|"))

dataset = ds.dataset("s3://TPC", format=csv_format)
s = dataset.to_batches(batch_size=1000000000)

count = 0
while count < 100:
    z = next(s)
    count += 1
```
For our purposes, let's just make the TPC dataset consist of hundreds of
Parquet files, each with one row group (something that Spark would generate).
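One way to produce a layout like that, purely as a sketch and not part of this
PR (the table contents, row counts, and output path are illustrative), is
`pyarrow.dataset.write_dataset` with capped file and row-group sizes:
```
import pyarrow as pa
import pyarrow.dataset as ds

# Stand-in table; in practice this would be the TPC-H lineitem data.
table = pa.table({"l_orderkey": pa.array(range(1_000_000), type=pa.int64())})

# Write many small Parquet files, each holding a single row group.
ds.write_dataset(
    table,
    "tpch_lineitem",
    format="parquet",
    max_rows_per_file=10_000,
    max_rows_per_group=10_000,
)
```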
The test script above gets somewhere around 1 Gbps. If you instead do
```
s = dataset.to_batches(batch_size=1000000000, fragment_readahead=16)
```
you can get to 2.5 Gbps, which is the advertised steady rate cap for this
instance type.
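To estimate throughput when reproducing this, one simple approach (not part of
the PR, just a measurement sketch that reuses the `dataset` from the test
script above) is to time the consumption loop and divide the bytes received by
the elapsed time:
```
import time

count, total_bytes = 0, 0
start = time.time()
for batch in dataset.to_batches(batch_size=1000000000, fragment_readahead=16):
    total_bytes += batch.nbytes
    count += 1
    if count >= 100:
        break
elapsed = time.time() - start

# nbytes is the decoded in-memory size, so this is only a rough proxy for
# bytes pulled over the network.
print(f"{total_bytes * 8 / elapsed / 1e9:.2f} Gbit/s (in-memory)")
```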