[ 
https://issues.apache.org/jira/browse/ARROW-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219426#comment-17219426
 ] 

Jorge Leitão commented on ARROW-10368:
--------------------------------------

[~rdettai], great proposal and comments.

IMO we should go with the last item, for the same reasons that you concluded :)

A scan is very fundamental and has different semantics from the extension, 
which was designed to be a generic compute node.
 # check what the common pattern of each Scan node is (both logical and 
physical)
 # check how that pattern differs from what e.g. S3 needs
 # abstract the pattern out into a new generic logical node
 # migrate all scanners to it
 # implement S3 on top of the new pattern
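
To make the "abstract the pattern out" step concrete, here is a rough sketch of what a generic scan abstraction could look like. Everything in it is an assumption for illustration: {{ScanSource}}, {{MemorySource}}, {{BatchIter}} and {{total_rows}} are hypothetical names, {{RecordBatch}} is a stand-in struct rather than the arrow type, and it uses a plain iterator only to stay within the standard library, whereas the real node would hand out a stream.

```rust
// Stand-in for arrow's RecordBatch so the sketch compiles without external crates.
#[derive(Debug, Clone, PartialEq)]
pub struct RecordBatch {
    pub rows: Vec<i64>,
}

// Hypothetical: a lazily evaluated sequence of batches for one partition.
pub type BatchIter = Box<dyn Iterator<Item = RecordBatch> + Send>;

// Hypothetical generic scan abstraction; not an existing DataFusion trait.
pub trait ScanSource {
    /// Number of partitions that can be read independently.
    fn partition_count(&self) -> usize;
    /// Open one partition; returns None if the partition does not exist.
    fn scan(&self, partition: usize) -> Option<BatchIter>;
}

// In-memory source backed by the Vec<Vec<RecordBatch>> layout:
// the outer Vec holds partitions, the inner Vec the batches of one partition.
pub struct MemorySource {
    partitions: Vec<Vec<RecordBatch>>,
}

impl MemorySource {
    pub fn new(partitions: Vec<Vec<RecordBatch>>) -> Self {
        Self { partitions }
    }
}

impl ScanSource for MemorySource {
    fn partition_count(&self) -> usize {
        self.partitions.len()
    }

    fn scan(&self, partition: usize) -> Option<BatchIter> {
        // Clone the batches so the returned iterator owns its data.
        let batches = self.partitions.get(partition)?.clone();
        Some(Box::new(batches.into_iter()))
    }
}

// A consumer written only against the trait: it works for any source,
// in-memory or remote, without knowing where the batches come from.
pub fn total_rows(source: &dyn ScanSource) -> usize {
    (0..source.partition_count())
        .filter_map(|p| source.scan(p))
        .flatten()
        .map(|batch| batch.rows.len())
        .sum()
}
```

With something of this shape, an S3 scanner would only have to provide its own {{ScanSource}} implementation, and consumers such as {{total_rows}} would not change.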

Note that the code now expects a stream, not an iterator, of RecordBatch, which 
may help when reading S3 sources.

As it stands, this is a large task, so let us know if you need any help. One idea 
is to have a PR with only the new interface (and only {{todo!}} in the 
implementation) for the new node, so that we can all go through it before 
committing to the bulk of the work.
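
As an illustration of what such an interface-only PR might contain, here is a minimal sketch. The names {{GenericScan}} and {{S3Scan}}, their methods and their fields are all hypothetical, {{RecordBatch}} is a placeholder for the arrow type, and an iterator is used instead of a stream purely to avoid external crates.

```rust
// Stand-in for arrow's RecordBatch so the sketch compiles on its own.
pub struct RecordBatch;

// Hypothetical interface for the new generic scan node.
pub trait GenericScan {
    /// Number of partitions that can be read independently.
    fn partition_count(&self) -> usize;
    /// Open one partition and yield its batches lazily.
    fn scan(&self, partition: usize) -> Box<dyn Iterator<Item = RecordBatch>>;
}

// Hypothetical S3-backed source; the fields are illustrative only.
pub struct S3Scan {
    pub bucket: String,
    pub prefix: String,
}

impl GenericScan for S3Scan {
    fn partition_count(&self) -> usize {
        // Interface only: the body is deliberately left unimplemented.
        todo!("list objects under the prefix and decide on a partitioning")
    }

    fn scan(&self, _partition: usize) -> Box<dyn Iterator<Item = RecordBatch>> {
        // Interface only: the body is deliberately left unimplemented.
        todo!("fetch and decode the objects of this partition")
    }
}
```

This compiles, so signatures and ownership can be reviewed, but calling any of the methods panics until the bodies are filled in.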

> [Rust][DataFusion] Make InMemoryScan work on iterators of RecordBatch
> ---------------------------------------------------------------------
>
>                 Key: ARROW-10368
>                 URL: https://issues.apache.org/jira/browse/ARROW-10368
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>            Reporter: Remi Dettai
>            Priority: Major
>
> Currently, InMemoryScan takes a Vec<Vec<RecordBatch>> as data.
> - the outer Vec separates the partitions
> - the inner Vec contains all the RecordBatch for one partition
> The inner Vec is then converted into an iterator when the LogicalPlan is 
> turned into a PhysicalPlan.
> I suggest that InMemoryScan should take Vec<Iter<RecordBatch>>.  This would 
> make it possible to plug custom Scan implementations into datafusion without 
> the need to read them entirely into memory. It would still work pretty 
> seamlessly with Vec<Vec<RecordBatch>> that would just need to be converted 
> with data.map(|x| x.iter()) first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
