[jira] [Commented] (ARROW-10368) [Rust][DataFusion] Make InMemoryScan work on iterators of RecordBatch

Remi Dettai (Jira) Thu, 22 Oct 2020 09:22:13 -0700


    [ https://issues.apache.org/jira/browse/ARROW-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219142#comment-17219142 ]


Remi Dettai commented on ARROW-10368:
-------------------------------------

bq.  I wonder if we could not go one step further and try to add a new logical 
plan that makes it possible to add custom sources

This is definitely outside the scope of the original ticket, but here is what 
I came up with for the custom source logical plan:

{code:java}
/// Produces rows from a custom user implementation
LogicalPlan::CustomScan {
    /// A shared reference to the custom implementation
    scanner: Arc<dyn CustomScanner>,
},
{code}

with:

{code:java}
#[async_trait]
/// A user implemented scanner that can be used by datafusion
pub trait CustomScanner: Send + Sync + fmt::Debug {
  /// reference to the schema of the data as it will be read by this scanner
  fn projected_schema(&self) -> &SchemaRef;
  /// string display of this scanner
  fn format(&self) -> &str;
  /// apply projection on this scanner
  fn project(
    &self,
    required_columns: &HashSet<String>,
    has_projection: bool,
  ) -> Result<Arc<dyn CustomScanner>>;
  /// get scanner partitioning
  fn output_partitioning(&self) -> Partitioning;
  /// get iterator for a given partition
  async fn execute(
    &self,
    partition: usize,
  ) -> Result<Box<dyn RecordBatchReader + Send>>;
}
{code}

I am now wondering if we shouldn't make this the common interface for all 
LogicalPlan::XxxScan rather than having them in the enum.
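
To make the shape of such a scanner concrete, here is a minimal, std-only 
sketch of an in-memory implementation. It is an illustration under stated 
assumptions, not DataFusion code: Batch and SchemaRef are hypothetical 
stand-ins for the Arrow types, the trait is synchronous instead of async, 
format() and has_projection are dropped, and output_partitioning() is reduced 
to a hypothetical partition_count().

```rust
use std::collections::HashSet;
use std::fmt;
use std::sync::Arc;

/// Hypothetical stand-in for arrow's SchemaRef: only the column names.
pub type SchemaRef = Arc<Vec<String>>;
pub type Result<T> = std::result::Result<T, String>;

/// Hypothetical stand-in for a RecordBatch: one Vec<i64> per column.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<Vec<i64>>,
}

/// Simplified, synchronous variant of the trait proposed above.
pub trait CustomScanner: Send + Sync + fmt::Debug {
    fn projected_schema(&self) -> &SchemaRef;
    fn project(&self, required_columns: &HashSet<String>) -> Result<Arc<dyn CustomScanner>>;
    fn partition_count(&self) -> usize;
    fn execute(&self, partition: usize) -> Result<Box<dyn Iterator<Item = Batch> + Send>>;
}

/// In-memory scanner: the data is shared, projection only narrows `kept`.
#[derive(Debug)]
pub struct MemScanner {
    schema: SchemaRef,
    partitions: Arc<Vec<Vec<Batch>>>,
    /// Indices (into the stored batches) of the columns still projected.
    kept: Vec<usize>,
}

impl MemScanner {
    pub fn new(columns: Vec<String>, partitions: Vec<Vec<Batch>>) -> Self {
        let kept = (0..columns.len()).collect();
        MemScanner { schema: Arc::new(columns), partitions: Arc::new(partitions), kept }
    }
}

impl CustomScanner for MemScanner {
    fn projected_schema(&self) -> &SchemaRef {
        &self.schema
    }

    fn project(&self, required_columns: &HashSet<String>) -> Result<Arc<dyn CustomScanner>> {
        let mut names = Vec::new();
        let mut kept = Vec::new();
        for (name, &idx) in self.schema.iter().zip(&self.kept) {
            if required_columns.contains(name) {
                names.push(name.clone());
                kept.push(idx);
            }
        }
        if names.len() != required_columns.len() {
            return Err("unknown column in projection".to_string());
        }
        Ok(Arc::new(MemScanner {
            schema: Arc::new(names),
            partitions: Arc::clone(&self.partitions),
            kept,
        }))
    }

    fn partition_count(&self) -> usize {
        self.partitions.len()
    }

    fn execute(&self, partition: usize) -> Result<Box<dyn Iterator<Item = Batch> + Send>> {
        if partition >= self.partitions.len() {
            return Err(format!("no partition {}", partition));
        }
        let data = Arc::clone(&self.partitions);
        let kept = self.kept.clone();
        let n = data[partition].len();
        // Batches are produced one at a time as the iterator is polled.
        let iter = (0..n).map(move |i| {
            let batch = &data[partition][i];
            Batch { columns: kept.iter().map(|&c| batch.columns[c].clone()).collect() }
        });
        Ok(Box::new(iter))
    }
}

/// Usage: project a two-column scanner down to "b" and read partition 0.
pub fn demo() -> Vec<Batch> {
    let scanner = MemScanner::new(
        vec!["a".to_string(), "b".to_string()],
        vec![vec![Batch { columns: vec![vec![1, 2], vec![10, 20]] }]],
    );
    let required: HashSet<String> = vec!["b".to_string()].into_iter().collect();
    let projected = scanner.project(&required).unwrap();
    projected.execute(0).unwrap().collect()
}
```

Because the planner would only see Arc<dyn CustomScanner>, a file-backed or 
remote scanner could be swapped in without adding a variant to the enum.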



> [Rust][DataFusion] Make InMemoryScan work on iterators of RecordBatch
> ---------------------------------------------------------------------
>
>                 Key: ARROW-10368
>                 URL: https://issues.apache.org/jira/browse/ARROW-10368
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>            Reporter: Remi Dettai
>            Priority: Major
>
> Currently, InMemoryScan takes a Vec<Vec<RecordBatch>> as data.
> - the outer Vec separates the partitions
> - the inner Vec contains all the RecordBatch for one partition
> The inner Vec is then converted into an iterator when the LogicalPlan is 
> turned into a PhysicalPlan.
> I suggest that InMemoryScan should take Vec<Iter<RecordBatch>>.  This would 
> make it possible to plug custom Scan implementations into datafusion without 
> the need to read them entirely into memory. It would still work almost 
> seamlessly with Vec<Vec<RecordBatch>>, which would just need to be converted 
> with data.iter().map(|x| x.iter()) first.
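
The conversion described in the issue can be sketched as follows. This is a 
std-only illustration, not DataFusion code: Batch is a hypothetical stand-in 
for RecordBatch, and BatchIter, into_partition_iters and lazy_partition are 
names made up for the example. Boxed, owning iterators are assumed so that 
eager and lazy partitions share one type.

```rust
/// Hypothetical stand-in for arrow's RecordBatch.
type Batch = Vec<i64>;
/// One partition, exposed as a boxed iterator of batches.
type BatchIter = Box<dyn Iterator<Item = Batch> + Send>;

/// Turn fully materialized partitions into one iterator per partition.
fn into_partition_iters(data: Vec<Vec<Batch>>) -> Vec<BatchIter> {
    data.into_iter()
        .map(|partition| Box::new(partition.into_iter()) as BatchIter)
        .collect()
}

/// A partition whose batches are generated on demand, never held in a Vec.
fn lazy_partition(n: usize) -> BatchIter {
    Box::new((0..n).map(|i| vec![i as i64; 2]))
}
```

Both functions yield the same BatchIter type, so a scan taking Vec<BatchIter> 
could mix in-memory and streamed partitions.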



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
