[ 
https://issues.apache.org/jira/browse/ARROW-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612361#comment-17612361
 ] 

Håkon Magne Holmen commented on ARROW-17913:
--------------------------------------------

Maybe use something akin to the old implementation when memory_map=True, since 
the I/O is expected to be low latency and zero-copy?

> feather.read_table 150x slower when reading columns in newer versions
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17913
>                 URL: https://issues.apache.org/jira/browse/ARROW-17913
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0, 8.0.0, 9.0.0
>         Environment: python 3.9, ubuntu 20.04
>            Reporter: Håkon Magne Holmen
>            Priority: Major
>              Labels: feather, performance
>
> h3. Description
> Performance when reading columns using {{feather.read_table}} on Arrow 
> 7.0.0-9.0.0 is drastically slower than it was in 6.0.0.
> Profiling the code below shows that the bottleneck is somewhere in the 
> {{read_names}} function of {{pyarrow._feather.FeatherReader}}.
> h5. Example
> Setup code:
> {code}
> import pandas as pd
> from pyarrow import feather
> rows, cols = (1_000_000, 10)
> data = {f'c{c}': range(rows) for c in range(cols)}
> df = pd.DataFrame(data=data)
> feather.write_feather(df, 'test.feather', compression="uncompressed")
> {code}
> Benchmarks Arrow 9.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> > 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> > 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> Benchmarks Arrow 6.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> > 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> > 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
