[jira] [Commented] (ARROW-10368) [Rust][DataFusion] Make InMemoryScan work on iterators of RecordBatch

Remi Dettai (Jira) Thu, 22 Oct 2020 10:29:11 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219210#comment-17219210
 ]


Remi Dettai commented on ARROW-10368:
-------------------------------------

If I summarize all of the above, these are the paths I could take to implement 
my S3 parquet datasource:
- modify InMemoryScan to use iterators of RecordBatch. The iterators can be 
provided by an S3 reader or any other datasource. *No way to have projection 
pushdown here.*
- use ExtensionPlanner and LogicalPlan::Extension to implement a new logical 
and execution plan. *No way to have projection pushdown here.*
- extend the current ParquetExec with a specific code path if the filename 
starts with "s3://". I am not sure how this could be done without bringing 
the S3 dependency right into the middle of datafusion, which would 
definitely not scale.
- replace all the LogicalPlan::XxxScan by a single LogicalPlan::SourceScan 
(equivalent to the LogicalPlan::CustomScan above) that dynamically dispatches 
to any source implementation.
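
As a rough sketch of that last option, under stated assumptions: the Source 
trait, the SourceScan variant, MemSource and execute below are hypothetical 
names, not existing DataFusion APIs, and RecordBatch is a stand-in struct so 
that the snippet compiles without the arrow crate. The point is only that the 
projection is handed to the source, so each implementation can push it down 
in its own way.

```rust
use std::sync::Arc;

// Stand-in for arrow's RecordBatch (assumption): each inner Vec is one column.
#[derive(Debug, Clone, PartialEq)]
struct RecordBatch {
    columns: Vec<Vec<i64>>,
}

// Hypothetical trait, not part of DataFusion: any datasource implements it.
trait Source {
    // Returns one batch iterator per partition, restricted to the projected columns.
    fn scan(&self, projection: &[usize]) -> Vec<Box<dyn Iterator<Item = RecordBatch>>>;
}

// Hypothetical single scan node replacing the per-format XxxScan variants.
enum LogicalPlan {
    SourceScan {
        source: Arc<dyn Source>,
        projection: Vec<usize>,
    },
}

// Toy in-memory source. An S3 Parquet source would implement the same trait
// in its own crate, keeping the S3 dependency out of the query engine.
struct MemSource {
    partitions: Vec<Vec<RecordBatch>>,
}

impl Source for MemSource {
    fn scan(&self, projection: &[usize]) -> Vec<Box<dyn Iterator<Item = RecordBatch>>> {
        self.partitions
            .iter()
            .map(|batches| {
                // Own the data so the boxed iterator does not borrow from self.
                let projection = projection.to_vec();
                let batches = batches.clone();
                let iter = batches.into_iter().map(move |b| RecordBatch {
                    // Keep only the requested columns, in the requested order.
                    columns: projection.iter().map(|&i| b.columns[i].clone()).collect(),
                });
                Box::new(iter) as Box<dyn Iterator<Item = RecordBatch>>
            })
            .collect()
    }
}

// The planner only sees the trait object and dispatches dynamically.
fn execute(plan: &LogicalPlan) -> Vec<RecordBatch> {
    match plan {
        LogicalPlan::SourceScan { source, projection } => {
            source.scan(projection).into_iter().flatten().collect()
        }
    }
}
```

Here the projection is applied after the batch has been built, which is only 
for illustration; a Parquet-backed source would instead use it to skip 
reading the unneeded columns.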

The last solution seems to be the best, but I'm curious to hear your opinions!

> [Rust][DataFusion] Make InMemoryScan work on iterators of RecordBatch
> ---------------------------------------------------------------------
>
>                 Key: ARROW-10368
>                 URL: https://issues.apache.org/jira/browse/ARROW-10368
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>            Reporter: Remi Dettai
>            Priority: Major
>
> Currently, InMemoryScan takes a Vec<Vec<RecordBatch>> as data.
> - the outer Vec separates the partitions
> - the inner Vec contains all the RecordBatch for one partition
> The inner Vec is then converted into an iterator when the LogicalPlan is 
> turned into a PhysicalPlan.
> I suggest that InMemoryScan should take Vec<Iter<RecordBatch>>.  This would 
> make it possible to plug custom Scan implementations into datafusion without 
> the need to read them entirely into memory. It would still work pretty 
> seamlessly with Vec<Vec<RecordBatch>>, which would just need to be converted 
> with data.into_iter().map(|x| x.into_iter()) first.
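
The conversion mentioned in the description could look roughly like the 
following. This is a minimal sketch: RecordBatch is a stand-in struct rather 
than the arrow type, and BatchIter and into_partition_iters are hypothetical 
names. Boxing is an assumption too, made so that partitions backed by 
different iterator types can share one Vec.

```rust
// Stand-in for arrow's RecordBatch (assumption), so the sketch compiles alone.
#[derive(Debug, Clone, PartialEq)]
struct RecordBatch {
    num_rows: usize,
}

// Hypothetical alias: one lazily consumed stream of batches per partition.
type BatchIter = Box<dyn Iterator<Item = RecordBatch>>;

// Turns fully materialized partitions into per-partition iterators.
// into_iter() takes ownership, so no batch is cloned.
fn into_partition_iters(data: Vec<Vec<RecordBatch>>) -> Vec<BatchIter> {
    data.into_iter()
        .map(|partition| Box::new(partition.into_iter()) as BatchIter)
        .collect()
}
```

A source that streams from remote storage would return the same BatchIter 
type without ever building the inner Vec.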



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
