[ https://issues.apache.org/jira/browse/ARROW-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041100#comment-17041100 ]
Ben Kietzman commented on ARROW-7808:
-------------------------------------
The dataset API is not stable; a full 1:1 mapping will be *more* work to
maintain. For example, https://issues.apache.org/jira/browse/ARROW-7886 would
remove Source and SourceFactory altogether, which would necessitate refactoring
both the JNI binding and the Java code that uses it. I recommend exposing only
the classes that are directly useful for a minimal use case, then exposing more
classes as they become necessary in follow-ups.
[~fsaintjacques]'s recommendation on the mailing list would be an excellent
starting point. Alternatively, I recommend following the initial R binding
work:
https://github.com/romainfrancois/arrow/blob/9dfba2ea8949a0a0a17393976a97d3a34dc63d39/r/R/dataset.R
This minimally exposes Source, Dataset, Scanner, and the corresponding
factories. Scans result in a materialized Table (so ScanTasks, Fragments, etc.
may remain hidden) and take full advantage of predicate/projection pushdown.
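To make the shape of such a minimal surface concrete, here is a rough sketch in
plain Java. Every type and method name below (MinimalDatasetSketch, Source,
Dataset, Scanner, Table, project, filter, toTable, demo) is hypothetical and is
not the Arrow Java API; rows are held in memory as maps purely for illustration,
whereas a real binding would hold native handles and push the filter and
projection down through JNI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch only: none of these types are the real Arrow Java API.
public class MinimalDatasetSketch {

    // Stand-in for a materialized Table: a list of rows keyed by column name.
    static final class Table {
        final List<Map<String, Object>> rows;
        Table(List<Map<String, Object>> rows) { this.rows = rows; }
        int numRows() { return rows.size(); }
    }

    // Stand-in for Source; a real binding would wrap a native pointer instead.
    static final class Source {
        final List<Map<String, Object>> rows;
        Source(List<Map<String, Object>> rows) { this.rows = rows; }
    }

    // Dataset only groups sources and hands out scanners.
    static final class Dataset {
        private final List<Source> sources;
        Dataset(List<Source> sources) { this.sources = sources; }
        Scanner newScan() { return new Scanner(sources); }
    }

    // Scanner collects a projection and a filter, then materializes a Table.
    // ScanTasks and Fragments never appear in the public surface.
    static final class Scanner {
        private final List<Source> sources;
        private List<String> columns = null; // null means "all columns"
        private Predicate<Map<String, Object>> filter = row -> true;

        Scanner(List<Source> sources) { this.sources = sources; }

        Scanner project(List<String> columns) { this.columns = columns; return this; }

        Scanner filter(Predicate<Map<String, Object>> filter) { this.filter = filter; return this; }

        Table toTable() {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Source source : sources) {
                for (Map<String, Object> row : source.rows) {
                    if (!filter.test(row)) {
                        continue; // predicate applied before projection
                    }
                    Collection<String> names = (columns == null) ? row.keySet() : columns;
                    Map<String, Object> projected = new LinkedHashMap<>();
                    for (String name : names) {
                        projected.put(name, row.get(name));
                    }
                    out.add(projected);
                }
            }
            return new Table(out);
        }
    }

    // Small helper to build one illustrative row.
    static Map<String, Object> row(String name, int age) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("name", name);
        r.put("age", age);
        return r;
    }

    // Two sources, one filter, one projection, one materialized Table.
    static Table demo() {
        Source first = new Source(Arrays.asList(row("alice", 30), row("bob", 17)));
        Source second = new Source(Arrays.asList(row("carol", 45)));
        Dataset dataset = new Dataset(Arrays.asList(first, second));
        return dataset.newScan()
                .project(Arrays.asList("name"))
                .filter(r -> ((Integer) r.get("age")) >= 18)
                .toTable();
    }

    public static void main(String[] args) {
        Table table = demo();
        System.out.println(table.numRows() + " rows: " + table.rows);
    }
}
```

The point of the sketch is that callers see only Dataset, Scanner, and Table, so
removing or renaming Source on the C++ side would touch the construction code
but not the scan API that users depend on.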
> [Java][Dataset] Implement Datasets Java API
> --------------------------------------------
>
> Key: ARROW-7808
> URL: https://issues.apache.org/jira/browse/ARROW-7808
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++ - Dataset, Java
> Reporter: Hongze Zhang
> Priority: Major
> Labels: dataset
>
> Porting the following C++ Datasets APIs to Java:
> * DataSource
> * DataSourceDiscovery
> * DataFragment
> * Dataset
> * Scanner
> * ScanTask
> * ScanOptions