[GitHub] [arrow-datafusion] metesynnada opened a new issue, #4692: Support for executing infinite files

GitBox Wed, 21 Dec 2022 07:22:13 -0800


metesynnada opened a new issue, #4692:
URL: https://github.com/apache/arrow-datafusion/issues/4692


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   As discussed many times previously, we want DataFusion to support streaming 
use cases. We already have a tentative roadmap (here)[link here]. One of the 
first steps to achieve this goal is to have the ability to run queries on 
infinite files such as FIFOs.
   
   A query like
   
   ```sql
   SELECT * FROM test WHERE test_col_1 > 10;
   ```
   
   can be executed on a FIFO file without any problem. However, a query like
   
   ```sql
   SELECT test_col_1, MIN(test_col_2) FROM test GROUP BY test_col_1;
   ```
   
   cannot be executed on an infinite data source since the `AggregateExec` 
materializes all data to calculate its result. 
   
   Ideally, we would like the planner, by means of some rule mechanism, to 
decide whether a query is executable with given data sources (which may have a 
mix of infinite and finite sources). For example, we would like the planner to 
detect that we are attempting to run a pipeline-breaking query on an infinite 
source. Furthermore, we would like to “save” our query and transform it into a 
runnable query whenever possible.
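
To illustrate the kind of check described above, here is a minimal sketch in Rust. It assumes a toy plan representation: the `Plan` enum and the `check_unbounded` function are hypothetical names made up for this example and are not part of DataFusion's API. The idea is to walk the plan bottom-up, propagate whether each node's output is unbounded, and report an error when a pipeline-breaking operator would consume an infinite input.

```rust
/// Hypothetical, simplified plan node used only for illustration.
/// This is NOT DataFusion's `ExecutionPlan`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A data source; `infinite` is true for sources such as FIFO files.
    Source { name: String, infinite: bool },
    /// A row-by-row filter, which can stream its input.
    Filter { input: Box<Plan> },
    /// A grouped aggregation, which must see all input before emitting results.
    Aggregate { input: Box<Plan> },
}

/// Walks the plan bottom-up.
/// Returns `Ok(true)` if the plan's output is unbounded, `Ok(false)` if it is
/// bounded, and `Err` if a pipeline-breaking operator has an unbounded input.
fn check_unbounded(plan: &Plan) -> Result<bool, String> {
    match plan {
        Plan::Source { infinite, .. } => Ok(*infinite),
        // A filter is unbounded exactly when its input is.
        Plan::Filter { input } => check_unbounded(input),
        Plan::Aggregate { input } => {
            if check_unbounded(input)? {
                Err("Aggregate cannot run on an infinite input".to_string())
            } else {
                Ok(false)
            }
        }
    }
}
```

With this sketch, a filter over a FIFO source is accepted and reported as unbounded, while an aggregation over the same source is rejected at planning time instead of hanging at execution time.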
   
   To be more concrete, let’s discuss the query
   
   ```sql
   SELECT t2.c1 FROM infinite_data_source AS t1 JOIN finite_data_source AS t2 ON t1.c1 = t2.c1;
   ```
   
   This query produces a `HashJoinExec` with an infinite left side and a finite 
right side. In the current implementation, `HashJoinExec` cannot execute this 
join because it reads its entire left input to build the hash table before it 
starts probing with the right input. However, the join could execute if we 
swapped the sides. Thus, we need a mechanism to detect and resolve such cases.
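
The side swap can be sketched as follows. This is a simplified illustration, not DataFusion code: `Input`, `HashJoin`, and `make_runnable` are hypothetical names, and the sketch assumes an inner join whose build side must be finite while the probe side may stream. A real rule would also have to mirror the join type for outer joins and restore the original column order after swapping.

```rust
/// Hypothetical description of a join input (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct Input {
    name: String,
    infinite: bool,
}

/// Hypothetical inner hash join: the build side is read completely to build
/// the hash table, and the probe side is then streamed against it.
#[derive(Debug, Clone, PartialEq)]
struct HashJoin {
    build: Input,
    probe: Input,
}

/// Returns a runnable join, swapping the sides when only the build side is
/// infinite. Fails when both sides are infinite, since neither side can be
/// fully materialized.
fn make_runnable(join: HashJoin) -> Result<HashJoin, String> {
    match (join.build.infinite, join.probe.infinite) {
        // Finite build side: runnable as-is, even with an infinite probe side.
        (false, _) => Ok(join),
        // Infinite build side, finite probe side: swap the inputs.
        (true, false) => Ok(HashJoin {
            build: join.probe,
            probe: join.build,
        }),
        (true, true) => Err(format!(
            "cannot hash-join two infinite inputs: {} and {}",
            join.build.name, join.probe.name
        )),
    }
}
```

Applied to the query above, `infinite_data_source` starts on the build side and `finite_data_source` on the probe side, so the sketch returns a join with the two inputs exchanged.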
   
   **Describe the solution you'd like**
   
   We will discuss the changes in multiple categories:
   
   - Data source changes: We need the CSV, JSON, and AVRO formats to support reading from infinite files.
   - Execution plan changes: A basic API addition to handle/accommodate 
infinite streams at the plan level when writing custom operators.
   - Physical optimization changes: A new rule that checks whether a given 
query can handle its infinite inputs and makes the necessary transformations 
to make it executable when possible (e.g. reordering join inputs to transform 
pipeline-breaking queries into runnable queries).
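
As one possible shape for the plan-level API mentioned in the list above, the sketch below lets each operator declare how it behaves given the boundedness of its children. Everything here is an assumption made for illustration: the `Operator` trait, its `unbounded_output` method, and the three example operators are hypothetical and do not describe DataFusion's actual `ExecutionPlan` trait.

```rust
/// Hypothetical operator trait; DataFusion's real `ExecutionPlan` differs.
trait Operator {
    fn name(&self) -> &str;

    /// `children[i]` is true when the i-th child produces unbounded output.
    /// Returns whether this operator's output is unbounded, or an error if
    /// the operator cannot run with the given inputs.
    /// Default: a streaming operator, unbounded if any child is unbounded.
    fn unbounded_output(&self, children: &[bool]) -> Result<bool, String> {
        Ok(children.iter().any(|&c| c))
    }
}

/// A source reading from a FIFO: always unbounded.
struct FifoScan;
impl Operator for FifoScan {
    fn name(&self) -> &str {
        "FifoScan"
    }
    fn unbounded_output(&self, _children: &[bool]) -> Result<bool, String> {
        Ok(true)
    }
}

/// A custom streaming operator that relies on the default behavior.
struct Projection;
impl Operator for Projection {
    fn name(&self) -> &str {
        "Projection"
    }
}

/// A custom pipeline-breaking operator: it needs all input before emitting.
struct Sort;
impl Operator for Sort {
    fn name(&self) -> &str {
        "Sort"
    }
    fn unbounded_output(&self, children: &[bool]) -> Result<bool, String> {
        if children.iter().any(|&c| c) {
            Err(format!("{} cannot run on an infinite input", self.name()))
        } else {
            Ok(false)
        }
    }
}
```

The design choice in this sketch is that authors of streaming custom operators get the right behavior by default, while pipeline-breaking operators opt in to rejecting infinite inputs by overriding a single method. An optimizer rule could then call the method bottom-up over the plan.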
   
   **Describe alternatives you've considered**
   NA
   
   **Additional context**
   NA


