[jira] [Updated] (ARROW-10368) [Rust][DataFusion] Refactor scan nodes to allow extensions

Remi Dettai (Jira) Fri, 23 Oct 2020 01:47:30 -0700


     [ https://issues.apache.org/jira/browse/ARROW-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Remi Dettai updated ARROW-10368:
--------------------------------
    Description: 
The first intention was to refactor InMemoryScan to use an iterator.

{quote}Currently, InMemoryScan takes a Vec<Vec<RecordBatch>> as data.
- the outer Vec separates the partitions
- the inner Vec contains all the RecordBatch for one partition
The inner Vec is then converted into an iterator when the LogicalPlan is turned 
into a PhysicalPlan.

I suggest that InMemoryScan should take Vec<Iter<RecordBatch>>.  This would 
make it possible to plug custom Scan implementations into datafusion without 
the need to read them entirely into memory. It would still work pretty 
seamlessly with Vec<Vec<RecordBatch>>, which would just need to be converted 
with data.iter().map(|x| x.iter()) first.{quote}
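
The conversion described in the quote could look roughly like the sketch below. It is only an illustration under stated assumptions: RecordBatch here is a stand-in struct rather than the arrow type, and PartitionIter, into_partition_iters, generated_partition and total_rows are hypothetical names that do not exist in DataFusion. It consumes the data with into_iter() rather than borrowing with iter(), so that each partition owns its batches.

```rust
// Stand-in for arrow's RecordBatch so the sketch compiles on its own.
// (Assumption: the real type carries columns and a schema, not a row count.)
#[derive(Debug, Clone, PartialEq)]
struct RecordBatch {
    rows: usize,
}

// Hypothetical alias: one partition, exposed as a boxed iterator of batches.
type PartitionIter = Box<dyn Iterator<Item = RecordBatch>>;

// Turn fully materialized partitions (outer Vec = partitions, inner Vec =
// batches of one partition) into one owning iterator per partition.
fn into_partition_iters(data: Vec<Vec<RecordBatch>>) -> Vec<PartitionIter> {
    data.into_iter()
        .map(|partition| Box::new(partition.into_iter()) as PartitionIter)
        .collect()
}

// Hypothetical custom source: batches are produced on demand instead of
// being read entirely into memory first.
fn generated_partition(batches: usize, rows_per_batch: usize) -> PartitionIter {
    Box::new((0..batches).map(move |_| RecordBatch { rows: rows_per_batch }))
}

// Drain every partition in turn and count the rows seen.
fn total_rows(partitions: Vec<PartitionIter>) -> usize {
    partitions.into_iter().flatten().map(|batch| batch.rows).sum()
}
```

With this shape, an in-memory table and a lazily generated source end up behind the same Vec<PartitionIter>, which is the property the suggestion is after.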

After further inspection (see discussion below), it seems more appropriate to 
completely refactor the way scan operations work. The idea is to replace all 
specific XxxScan nodes with a generic SourceScan node:




  was:
Currently, InMemoryScan takes a Vec<Vec<RecordBatch>> as data.
- the outer Vec separates the partitions
- the inner Vec contains all the RecordBatch for one partition
The inner Vec is then converted into an iterator when the LogicalPlan is turned 
into a PhysicalPlan.

I suggest that InMemoryScan should take Vec<Iter<RecordBatch>>.  This would 
make it possible to plug custom Scan implementations into datafusion without 
the need to read them entirely into memory. It would still work pretty 
seamlessly with Vec<Vec<RecordBatch>>, which would just need to be converted 
with data.iter().map(|x| x.iter()) first.




> [Rust][DataFusion] Refactor scan nodes to allow extensions
> ----------------------------------------------------------
>
>                 Key: ARROW-10368
>                 URL: https://issues.apache.org/jira/browse/ARROW-10368
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>            Reporter: Remi Dettai
>            Priority: Major



