[GitHub] [arrow-datafusion] alamb opened a new issue #1690: DiskManager and TempFiles getting created several times per query

GitBox Thu, 27 Jan 2022 14:42:31 -0800


alamb opened a new issue #1690:
URL: https://github.com/apache/arrow-datafusion/issues/1690

**Describe the bug**
Running a query in DataFusion against Parquet files creates several
unnecessary temporary files.

IOx also hits the same thing (with the same root cause):
https://github.com/influxdata/influxdb_iox/issues/3507#issuecomment-1023679575

Several places create a `DiskManager` instance today, not always obviously.
The one that affects the Parquet use case above is the creation of the
pruning predicate, which requires an `ExecutionContext`:
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_optimizer/pruning.rs#L132

This has two problems:
1. It is unneeded overhead (the disk manager is not used).
2. The overhead is larger than it needs to be (it creates a temp file).

I propose a two-pronged solution (I will open two PRs):
1. Create temp files on demand in the `DiskManager` (so we at least do no
IO unless it is needed)
2. Remove unnecessary creation of `ExecutionContext`
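
To illustrate the first item, here is a minimal sketch of on-demand creation using only the Rust standard library. `LazyDiskManager`, its fields, and its methods are hypothetical names chosen for illustration, not DataFusion's actual `DiskManager` API. The assumed idea is that constructing the manager touches no files, and the scratch directory is created only when the first temp file is requested.

```rust
use std::fs::{self, File};
use std::io;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a disk manager; NOT DataFusion's real API.
pub struct LazyDiskManager {
    /// Parent location for scratch data. Nothing is created here up front.
    base: PathBuf,
    /// `None` until the first temp file is requested.
    scratch_dir: Mutex<Option<PathBuf>>,
    /// Counter used to give each temp file a distinct name.
    next_id: AtomicUsize,
}

impl LazyDiskManager {
    /// Constructing the manager performs no file system IO.
    pub fn new(base: PathBuf) -> Self {
        Self {
            base,
            scratch_dir: Mutex::new(None),
            next_id: AtomicUsize::new(0),
        }
    }

    /// True once the scratch directory has been created.
    pub fn is_initialized(&self) -> bool {
        self.scratch_dir.lock().unwrap().is_some()
    }

    /// Creates the scratch directory on first use, then a new file in it.
    pub fn create_tmp_file(&self) -> io::Result<(PathBuf, File)> {
        let mut guard = self.scratch_dir.lock().unwrap();
        if guard.is_none() {
            let dir = self.base.join("scratch");
            fs::create_dir_all(&dir)?;
            *guard = Some(dir);
        }
        let dir: &PathBuf = (*guard).as_ref().expect("initialized above");
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        let path = dir.join(format!("tmp-{}", id));
        let file = File::create(&path)?;
        Ok((path, file))
    }
}
```

With this shape, code paths that build a manager but never spill to disk (such as the pruning predicate case described above) would pay no IO cost.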

I think the second will be a slightly larger project, as the
`ExecutionContext` gets passed to `create_physical_expr`.

That said, I think the main sources of the problem are related to
`create_physical_expr`, which only uses the context to look up variables,
if necessary.
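
One way to picture the second item is to hand expression planning only the capability it uses. The sketch below is a toy model, not DataFusion code: `VarLookup`, `Expr`, `PhysicalExpr`, and `create_physical_expr_sketch` are all hypothetical names, and the expression types are deliberately simplified to integers. It assumes that variable lookup really is the only thing needed from the context.

```rust
use std::collections::HashMap;

/// Hypothetical trait: the only capability expression planning needs here.
pub trait VarLookup {
    fn get_var(&self, name: &str) -> Option<i64>;
}

/// A plain map is enough to serve as a variable source in this sketch.
impl VarLookup for HashMap<String, i64> {
    fn get_var(&self, name: &str) -> Option<i64> {
        self.get(name).copied()
    }
}

/// Toy logical expression (illustrative only).
pub enum Expr {
    Literal(i64),
    Variable(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Toy physical expression: variables have already been resolved.
#[derive(Debug, PartialEq)]
pub enum PhysicalExpr {
    Literal(i64),
    Add(Box<PhysicalExpr>, Box<PhysicalExpr>),
}

/// Plans an expression given only a variable lookup, not a full context.
pub fn create_physical_expr_sketch(
    expr: &Expr,
    vars: &dyn VarLookup,
) -> Result<PhysicalExpr, String> {
    match expr {
        Expr::Literal(v) => Ok(PhysicalExpr::Literal(*v)),
        Expr::Variable(name) => vars
            .get_var(name)
            .map(PhysicalExpr::Literal)
            .ok_or_else(|| format!("unknown variable: {}", name)),
        Expr::Add(l, r) => Ok(PhysicalExpr::Add(
            Box::new(create_physical_expr_sketch(l, vars)?),
            Box::new(create_physical_expr_sketch(r, vars)?),
        )),
    }
}
```

Under this assumption, a caller that has no variables to resolve could pass an empty map instead of constructing a whole context, and therefore would never create a disk manager as a side effect.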

