karlovnv opened a new issue, #7698:
URL: https://github.com/apache/arrow-datafusion/issues/7698
### Describe the bug
I am testing DataFusion as the core of an antifraud system that has
several thousand columns and billions of rows.
I'm excited about the flexibility and possibilities this technology provides.
The problems we have run into:
1) Optimization of the logical plan is slow because some rules copy the
whole schema.
We worked around this with prepared queries (we cache the parameterized
logical plan).
2) Creating the physical plan consumes up to 35% of CPU time, which is more
than its execution (we use several hundred aggregation functions, and DF
shows pretty good execution time).
Some investigation showed that there is a lot of copying (see the
flamegraph excerpt):
```
29 %   datafusion_physical_expr::planner::create_physical_expr
28.5 % --> datafusion_common::dfschema::DFSchema::index_of_column
28.5 % -- --> datafusion_common::dfschema::DFSchema::index_of_column_by_name
7.4 %  -- -- --> __memcmp_sse4_1
14.6 % -- -- --> datafusion_common::table_reference::TableReference::resolved_eq
6.8 %  -- -- -- --> __memcmp_sse4_1
```

I think this is caused by copying of the strings, which are held in a CoW
pointer.
Another possible improvement is to use a hash map instead of a list here:
https://github.com/apache/arrow-datafusion/blob/22d03c127e7c5e56cf97ae33eb4446d5b7022eaa/datafusion/common/src/dfschema.rs#L211
Currently the algorithm has O(N^2) complexity (N from iterating over all
the columns in
`datafusion_common::dfschema::DFSchema::index_of_column_by_name`
and N from `datafusion_common::table_reference::TableReference::resolved_eq`).
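To illustrate the difference between the list scan and a hash-based lookup,
here is a minimal sketch. `QualifiedName`, `index_of_linear` and
`IndexedSchema` are hypothetical, simplified stand-ins and not DataFusion's
actual types or API; the real lookup also handles unqualified and partially
qualified names, which this sketch ignores.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a qualified column: an optional table
/// qualifier plus a column name. Not DataFusion's real field type.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct QualifiedName {
    qualifier: Option<String>,
    name: String,
}

/// List-based lookup: each call walks the fields and compares strings,
/// so resolving N columns against N fields costs O(N^2) comparisons.
fn index_of_linear(fields: &[QualifiedName], target: &QualifiedName) -> Option<usize> {
    fields.iter().position(|f| f == target)
}

/// Hash-based lookup: the index is built once in O(N); afterwards each
/// lookup is O(1) on average.
struct IndexedSchema {
    index: HashMap<QualifiedName, usize>,
}

impl IndexedSchema {
    fn new(fields: &[QualifiedName]) -> Self {
        let mut index = HashMap::with_capacity(fields.len());
        for (i, f) in fields.iter().enumerate() {
            // Keep the first occurrence, matching what the linear scan returns.
            index.entry(f.clone()).or_insert(i);
        }
        Self { index }
    }

    fn index_of(&self, target: &QualifiedName) -> Option<usize> {
        self.index.get(target).copied()
    }
}
```

Building the index costs one pass and one clone per field, so it only pays
off when the same schema is queried many times, which is the case when
planning a query that references hundreds of columns.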
Some ideas to resolve:
- Use a hash map in DFSchema instead of a list (reduces the complexity of
resolving a column index by its name)
- Implement parameterization of the physical plan and prepared physical
plans (to enable caching them the same way as prepared logical plans)
- Consider eliminating string copying (maybe by replacing CoW with Arc or a
shared reference)
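As a rough illustration of the last idea, the sketch below contrasts cloning
an owned `Cow<str>` with cloning an `Arc<str>` using only the standard
library. The helper names are hypothetical, and whether this matches how
DataFusion stores table and column names internally is an assumption on my
side.

```rust
use std::borrow::Cow;
use std::sync::Arc;

/// Cloning an owned `Cow` allocates a new `String` and copies the bytes.
fn clone_cow(name: &Cow<'static, str>) -> Cow<'static, str> {
    name.clone()
}

/// Cloning an `Arc<str>` only increments a reference count; the bytes
/// stay shared between the original and the clone.
fn clone_arc(name: &Arc<str>) -> Arc<str> {
    Arc::clone(name)
}

/// Returns true when both string slices start at the same address,
/// i.e. no copy of the underlying buffer was made.
fn same_buffer(a: &str, b: &str) -> bool {
    a.as_ptr() == b.as_ptr()
}
```

With an owned `Cow`, every clone of a schema with thousands of columns
copies every name; with `Arc<str>` the clone is a counter increment per
name, at the cost of atomic reference counting.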
Thank you for developing such a great tool!
### To Reproduce
It's hard to extract code from the project, but I will try to build a
simple repro.
### Expected behavior
Creating the physical plan should take much less CPU time than executing it.
### Additional context
_No response_