[ https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342631#comment-17342631 ]

David Li commented on ARROW-11469:
----------------------------------

I took a quick stab at this and looked at it with a profiler.

TL;DR we're doing two things quadratically (copying shared_ptrs and looking up 
fields). This is the bulk of the runtime - not any actual computation!

I tested a Parquet file with 10000 columns, 10000 rows, using the read options 
{{use_threads=True, memory_map=True}}.

PyArrow 0.17, conda-forge: consistently ~550ms
master 5.0.0: consistently ~5000-5500ms

Profiler: [https://share.firefox.dev/3bffpI3]

The time spent falls into 3 buckets:
 # ScanTask::SafeExecute takes ~500ms. This is actually reading the data - 
everything seems fine here.
 # SetProjection takes ~1000ms. 

SetProjection takes each column and looks it up in the schema with 
FieldRef::GetOne. The lookup guards against duplicate field names, so it 
iterates every column on each lookup - a quadratic operation. This is ~500ms 
of the runtime. Then, when we bind the projection, the same thing happens 
again - that's the other half. Most of the leaf time is spent in 
std::operator==<char>, i.e. in comparing column names.
 # ProjectSingleBatch takes ~3300ms.

Most of the time ends up in GetDatumField, which is yet again a lookup. Of the 
time here, FieldRef::FindOneOrNone takes ~730ms - this is the same problem as 
before (a quadratic lookup). The method is sufficiently generic that we can't 
necessarily assume the field indices are the same (or can we?).

The bulk of the remaining time - 2530ms - is in FieldPathGetDatumImpl which 
boils down to FieldPath::Get. This boils down to shared_ptr operations in 
SimpleRecordBatch::column_data (which is just a getter returning a 
vector<shared_ptr<ArrayData>>) - likely, since we're doing this once per field 
per record batch, we're creating a copy of 10000 shared_ptrs on every iteration!
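To make the complexity concrete, here's a toy Python sketch (the real code is C++, and the function names here are hypothetical, not Arrow's API) of what a per-name scan effectively does, next to the lookup-table variant:

```python
def find_one(names, target):
    """Mimics a GetOne-style lookup: scans every column and guards
    against duplicate field names, so each call is O(n)."""
    matches = [i for i, name in enumerate(names) if name == target]
    if len(matches) > 1:
        raise ValueError(f"duplicate field name: {target}")
    return matches[0] if matches else None

def set_projection_quadratic(names, projected):
    # One full scan per projected column -> O(n^2) for n columns.
    return [find_one(names, p) for p in projected]

def set_projection_with_table(names, projected):
    # Build the name -> index table once (O(n)); each lookup is then O(1).
    table = {}
    for i, name in enumerate(names):
        table[name] = None if name in table else i  # None marks duplicates
    out = []
    for p in projected:
        if table.get(p) is None and p in table:
            raise ValueError(f"duplicate field name: {p}")
        out.append(table.get(p))
    return out
```

Both variants preserve the duplicate-name guard; the table just pays the cost once instead of per lookup.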



I tried two optimizations - if we make a simple lookup table in SetProjection, 
we can shave off those 500ms. (This will be harder if/when we support nested 
field refs.) And if we add a SimpleRecordBatch::column_data_ref which returns 
a const ArrayDataVector&, then we can shave about 2500ms off the runtime. 
Profile after making these changes: [https://share.firefox.dev/3tHsT5U]
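The by-value copy can be mimicked in Python (the real code is C++; this is only an analogue with hypothetical names): returning a fresh list takes a new reference to every column - the moral equivalent of copying 10000 shared_ptrs per call - while a by-reference getter takes none.

```python
import sys

class SimpleRecordBatch:
    def __init__(self, columns):
        self._columns = columns

    def column_data(self):
        # By-value-style getter: a fresh list holding a new reference to
        # every column (the analogue of copying each shared_ptr).
        return list(self._columns)

    def column_data_ref(self):
        # By-reference-style getter: hands out the existing list,
        # no per-column copies.
        return self._columns

batch = SimpleRecordBatch([object() for _ in range(4)])
before = sys.getrefcount(batch._columns[0])

cols_copy = batch.column_data()
after = sys.getrefcount(batch._columns[0])
# The copy added one reference per element ...
assert after == before + 1
# ... while the by-reference getter returns the very same list object.
assert batch.column_data_ref() is batch._columns
```

In C++ each of those per-element copies is also an atomic refcount increment, which is why the savings are so large on a 10000-column batch.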

The rest of the time is in Bind/FindAll. We could probably eliminate or lessen 
this. We could add index-based field refs; this would make things much less 
flexible though. Or we could add careful use of lookup tables, e.g. lazily 
construct one on Schema. Since presumably all batches from a fragment share the 
same underlying schema, this would help a lot. (However, once we get beyond 
simple projections, preserving that property will be difficult.)
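A hedged sketch of the "lazily construct a lookup table on Schema" idea (hypothetical names, not Arrow's actual API; the duplicate-name guard is omitted for brevity): the first lookup builds a name -> index map once, and every batch sharing that Schema reuses it.

```python
class Schema:
    def __init__(self, names):
        self._names = list(names)
        self._index = None  # built lazily on first lookup

    def get_field_index(self, name):
        if self._index is None:
            # One O(n) pass, amortized over all later lookups.
            self._index = {n: i for i, n in enumerate(self._names)}
        return self._index.get(name, -1)

# All batches from one fragment share this Schema, so the table is
# built exactly once.
schema = Schema(["a", "b", "c"])
```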

The change to column_data is a very easy win and shouldn't pessimize any 
existing path (the implementation already operates on a vector&, so the 
by-value return was making a pointless copy), so I'll submit that as a PR 
while we decide what to do about the rest.

> [Python] Performance degradation parquet reading of wide dataframes
> -------------------------------------------------------------------
>
>                 Key: ARROW-11469
>                 URL: https://issues.apache.org/jira/browse/ARROW-11469
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>            Reporter: Axel G
>            Priority: Minor
>         Attachments: image-2021-05-03-14-31-41-260.png, 
> image-2021-05-03-14-39-59-485.png, image-2021-05-03-14-40-09-520.png, 
> profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when 
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 10000))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms and for anything above and 
> including 1.0.0, this suddenly takes around 2 seconds.
>  
> Thanks for looking into this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
