pgwhalen commented on code in PR #65:
URL: https://github.com/apache/datafusion-java/pull/65#discussion_r3266868144


##########
core/src/main/java/org/apache/datafusion/SessionContext.java:
##########
@@ -392,6 +392,39 @@ public void registerUdf(ScalarUdf udf) {
     registerScalarUdf(nativeHandle, name, signatureBytes, volatility.code(), 
impl);
   }
 
+  /**
+   * Register a Java-implemented data source as a table. SQL queries that 
reference {@code name}
+   * call back into {@code source} to fetch batches.
+   *
+   * <p>{@link DataSource#schema()} is called once here, on the calling 
thread, and cached on the
+   * native side. {@link 
DataSource#scan(org.apache.arrow.memory.BufferAllocator)} is called once
+   * per query that touches the table, on a Tokio worker thread; it must 
return a fresh, independent
+   * {@link org.apache.arrow.vector.ipc.ArrowReader} on every call, with its 
buffers allocated from
+   * the {@link org.apache.arrow.memory.BufferAllocator} the framework 
supplies.
+   *
+   * @throws IllegalArgumentException if {@code name} or {@code source} is 
{@code null}.
+   * @throws IllegalStateException if {@code source.schema()} returns {@code 
null}, or this context
+   *     is closed.
+   * @throws RuntimeException if native registration fails.
+   */
+  public void registerDataSource(String name, DataSource source) {

Review Comment:
   Since this is basically a simplified API on top of the 
`SessionContext::register_table` rust function, what if we called the java 
function that instead (`registerTable`), and made the interface it accepts 
`TableProvider`?
   
   I get that this PR is basically barebones support for custom table 
registration in java, and that `data_source.rs` is handling a lot so the java 
user gets a simple `scan()` callback.  I think only providing that for now 
makes sense as a first step (and will always be useful for simple cases), but 
I'd like to make sure this can evolve towards all the flexibility of the 
`TableProvider` trait that interacts with `ExecutionPlan` and ultimately an 
`ArrowReader`.  The 
[LiteralGuaranteeTest](https://github.com/pgwhalen/datafusion-java/blob/1c20733aa8b008315af6092a912b58ef3df6a482/datafusion-ffi-java/src/test/java/org/apache/arrow/datafusion/LiteralGuaranteeTest.java#L178-L267)
 from my bindings demonstrates what this could look like and what it enables 
(filter pushdown).
   
   To keep things minimal for PR, maybe we could just
   - rename `registerDataSource` to `registerTable`
   - rename the `DataSource` interface to `TableProvider`
   - provide a simple implementation of `TableProvider` that just holds what 
the current `DataSource` does - not sure about a name for that, but maybe like 
`SimpleTableProvider` or `FullScanTableProvider` or something
   
   Then we can make `TableProvider` more featured over time.  Totally open to 
other ideas too.
   
   Part of my motivation in renaming is that in the back of my head I'm 
thinking about eventual support for the separate 
[DataSource](https://github.com/apache/datafusion/blob/c8b784a01f5d0bcbe0dac806730fb61afc0be8ef/datafusion/datasource/src/source.rs#L126),
 so don't want to clash on naming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to