[jira] [Updated] (ARROW-9458) [Python] Dataset Scanner is single-threaded only

Joris Van den Bossche (Jira) Tue, 14 Jul 2020 07:18:19 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-9458:
-----------------------------------------
    Summary: [Python] Dataset Scanner is single-threaded only  (was: [Python] 
Dataset singlethreaded only)

> [Python] Dataset Scanner is single-threaded only
> ------------------------------------------------
>
>                 Key: ARROW-9458
>                 URL: https://issues.apache.org/jira/browse/ARROW-9458
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Maarten Breddels
>            Assignee: Maarten Breddels
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>         Attachments: image-2020-07-14-14-31-29-943.png, 
> image-2020-07-14-14-38-16-767.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm not sure whether this is a misunderstanding on my part, a compilation 
> issue (flags?), or an issue in the C++ layer.
> I have 1000 parquet files with a total of 1 billion rows (1 million rows per 
> file, ~20 columns). I wanted to see if I could go through all rows of 1 or 2 
> columns efficiently (vaex use case).
>  
> {code:python}
> import glob
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet
> # Named `dataset` so the `ds` module alias is not overwritten.
> dataset = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
> scanned = 0
> # Each scan task yields record batches; count the rows in every batch.
> for scan_task in dataset.scan(batch_size=1_000_000, 
> columns=['passenger_count'], use_threads=True):
>     for record_batch in scan_task.execute():
>         scanned += record_batch.num_rows
> scanned
> {code}
> This seems to use only 1 CPU core.
> Using a thread pool from Python:
> {code:python}
> # %%timeit
> import concurrent.futures
> pool = concurrent.futures.ThreadPoolExecutor()
> dataset = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
> def process(scan_task):
>     # Count the rows produced by a single scan task.
>     scan_count = 0
>     for record_batch in scan_task.execute():
>         scan_count += len(record_batch)
>     return scan_count
> # Run the scan tasks on the pool's worker threads and sum the counts.
> sum(pool.map(process, dataset.scan(batch_size=1_000_000, 
> columns=['passenger_count'], use_threads=False)))
> {code}
> This gives me similar performance: again only 100% CPU usage (= 1 core).
> py-spy (a profiler for Python) shows no time spent holding the GIL, so this 
> might be something in the C++ layer.
> Am I 'holding it wrong', or could this be a bug? Note that IO speed is not a 
> problem on this system (everything actually comes from the OS cache; no disk 
> reads observed).
>  
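The "only 100% CPU usage" observation in the report can be cross-checked from inside Python by comparing process CPU time with wall-clock time. The following is a minimal sketch using only the standard library, not something taken from the issue: `measure_core_usage` and `count_rows_stand_in` are hypothetical names, and the stand-in workload is pure Python (so it holds the GIL) rather than a real pyarrow scan.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_core_usage(func, *args, **kwargs):
    """Run func and return (result, cores_used).

    cores_used is process CPU time divided by wall-clock time. A value
    near 1.0 suggests one busy core; a value near N suggests N busy cores.
    time.process_time() counts CPU time across all threads of the process.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    cores_used = cpu_elapsed / wall_elapsed if wall_elapsed > 0 else 0.0
    return result, cores_used


def count_rows_stand_in(n_tasks=8, rows_per_task=1_000):
    """Hypothetical stand-in for the scan loop: each task counts its rows."""
    def process(task_id):
        # Pure-Python counting, so the threads contend for the GIL.
        return sum(1 for _ in range(rows_per_task))

    with ThreadPoolExecutor() as pool:
        return sum(pool.map(process, range(n_tasks)))


total, cores = measure_core_usage(count_rows_stand_in)
print(f"rows={total}, approx. cores used={cores:.2f}")
```

To apply this to the report's case, one would wrap the scan loop in a function and pass it to `measure_core_usage` in place of the stand-in. A ratio that stays near 1.0 with `use_threads=True` would be consistent with the single-core behaviour described above.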



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
