metesynnada opened a new issue, #4692: URL: https://github.com/apache/arrow-datafusion/issues/4692
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
As discussed many times previously, we want DataFusion to support streaming use cases. We already have a tentative roadmap (here)[link here]. One of the first steps to achieve this goal is to have the ability to run queries on infinite files such as FIFOs.

A query like
```sql
SELECT * FROM test WHERE test_col_1 > 10;
```
can be executed on a FIFO file without any problem. However, a query like
```sql
SELECT test_col_1, MIN(test_col_2) FROM test GROUP BY test_col_1;
```
cannot be executed on an infinite data source, since the `AggregateExec` materializes all of its input before it can calculate its result.

Ideally, we would like the planner, by means of some rule mechanism, to decide whether a query is executable with the given data sources (which may be a mix of infinite and finite sources). For example, we would like the planner to detect that we are attempting to run a pipeline-breaking query on an infinite source.

Furthermore, we would like to “save” our query and transform it into a runnable query whenever possible. To be more concrete, let’s discuss the query
```sql
SELECT t2.c1 FROM infinite_data_source as t1 JOIN finite_data_source as t2 ON t1.c1 = t2.c1
```
This query produces a `HashJoinExec` with an infinite left side and a finite right side. The `HashJoinExec` cannot execute this join in the current implementation, since it has to consume its left (build) side completely before producing any output. However, it could execute if we swapped the sides. Thus, we need a mechanism to decide how we manage these issues.

**Describe the solution you'd like**
We will discuss the changes in multiple categories:
- Data source changes: We need the CSV, JSON, and AVRO formats to work with infinite files.
- Execution plan changes: A basic API addition to handle/accommodate infinite streams at the plan level when writing custom operators.
- Physical optimization changes: A new rule checks whether a given query can handle its infinite inputs and makes necessary transformations to make it executable when possible (e.g. reorder join inputs to transform breaking queries into runnable queries).

**Describe alternatives you've considered**
NA

**Additional context**
NA
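
To make the proposed physical optimization rule more concrete, here is a minimal sketch of the idea on a toy plan tree. Everything in it is hypothetical and only for illustration: `Plan`, `source`, `is_unbounded`, and `make_runnable` are made-up names, not DataFusion APIs, and the real rule would operate on DataFusion's physical plans rather than on this enum. The sketch assumes that an aggregate breaks the pipeline and that a hash join fully consumes its left (build) side, as described in the issue.

```rust
/// Hypothetical, simplified stand-in for a physical plan (not DataFusion's API).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A data source; `infinite` marks sources such as FIFO files.
    Source { name: String, infinite: bool },
    /// A streaming filter: works row by row, never breaks the pipeline.
    Filter { input: Box<Plan> },
    /// An aggregate: must see all of its input before emitting a result.
    Aggregate { input: Box<Plan> },
    /// A hash join: `left` is the build side and is consumed completely.
    HashJoin { left: Box<Plan>, right: Box<Plan> },
}

/// Convenience constructor for a boxed source node.
fn source(name: &str, infinite: bool) -> Box<Plan> {
    Box::new(Plan::Source { name: name.to_string(), infinite })
}

/// Returns true if the output of `plan` may never end.
fn is_unbounded(plan: &Plan) -> bool {
    match plan {
        Plan::Source { infinite, .. } => *infinite,
        Plan::Filter { input } | Plan::Aggregate { input } => is_unbounded(input),
        Plan::HashJoin { left, right } => is_unbounded(left) || is_unbounded(right),
    }
}

/// Walks the plan bottom-up. Rejects pipeline-breaking operators on infinite
/// inputs and swaps hash join sides when that makes the join runnable.
fn make_runnable(plan: Plan) -> Result<Plan, String> {
    match plan {
        src @ Plan::Source { .. } => Ok(src),
        Plan::Filter { input } => Ok(Plan::Filter {
            input: Box::new(make_runnable(*input)?),
        }),
        Plan::Aggregate { input } => {
            let input = make_runnable(*input)?;
            if is_unbounded(&input) {
                Err("aggregate over an infinite input cannot finish".to_string())
            } else {
                Ok(Plan::Aggregate { input: Box::new(input) })
            }
        }
        Plan::HashJoin { left, right } => {
            let left = make_runnable(*left)?;
            let right = make_runnable(*right)?;
            match (is_unbounded(&left), is_unbounded(&right)) {
                // Finite build side: runnable as is.
                (false, _) => Ok(Plan::HashJoin {
                    left: Box::new(left),
                    right: Box::new(right),
                }),
                // Infinite build side, finite probe side: swap the inputs.
                // A real rule would also have to fix up the join type and
                // the output column order after swapping.
                (true, false) => Ok(Plan::HashJoin {
                    left: Box::new(right),
                    right: Box::new(left),
                }),
                // Both sides infinite: nothing to swap to.
                (true, true) => Err("hash join with two infinite inputs".to_string()),
            }
        }
    }
}
```

Applied to the three queries above, this sketch would accept the filter over an infinite `test` source, reject the `GROUP BY` over it, and rewrite the join so that `finite_data_source` becomes the build side.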
