[ https://issues.apache.org/jira/browse/ARROW-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612361#comment-17612361 ]
Håkon Magne Holmen commented on ARROW-17913:
--------------------------------------------

Maybe use something akin to the old implementation when memory_map=True, since the I/O is expected to be low latency and zero-copy?

> feather.read_table 150x slower when reading columns in newer versions
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17913
>                 URL: https://issues.apache.org/jira/browse/ARROW-17913
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0, 8.0.0, 9.0.0
>         Environment: python 3.9, ubuntu 20.04
>            Reporter: Håkon Magne Holmen
>            Priority: Major
>              Labels: feather, performance
>
> h3. Description
> Performance when reading columns using {{feather.read_table}} on Arrow
> 7.0.0-9.0.0 is drastically slower than it was in 6.0.0.
> Profiling the code below shows that the bottleneck is somewhere in the
> {{read_names}} function of {{pyarrow._feather.FeatherReader}}.
> h5. Example
> Setup code:
> {code}
> import pandas as pd
> from pyarrow import feather
> rows, cols = (1_000_000, 10)
> data = {f'c{c}': range(rows) for c in range(cols)}
> df = pd.DataFrame(data=data)
> feather.write_feather(df, 'test.feather', compression="uncompressed")
> {code}
> Benchmarks Arrow 9.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
>
> 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> Benchmarks Arrow 6.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
>
> 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)