[
https://issues.apache.org/jira/browse/ARROW-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074387#comment-17074387
]
Hongze Zhang commented on ARROW-7808:
-------------------------------------
Thanks, everyone, for your suggestions, and sorry for such a late reply. I've
been busy with other work and am now continuing to work on this. In my
organization we have been maintaining a runnable implementation[1] for several
months. It may not be ready for an upstream PR yet, but it does show my main
design.
I see that you prefer a high-level approach, and I agree with that. In my
current implementation, some classes may look "lower level" in Java, such as
DataFragment[2] or ScanTask[3], but future developers never have to implement
them for specific source formats - we have NativeDataFragment[4] and
NativeScanTask[5] to cover all cases. The same design is applied to
DataSource[6][7], so further development only requires bridging C++
DataSourceDiscovery implementations. Here is an example[8] from us that adds an
arrow::dataset::SingleFileDataSource and uses it from Java.
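The design described above can be illustrated with a minimal sketch. This is
not the code from the linked branch: only the names DataFragment, ScanTask,
NativeDataFragment and NativeScanTask come from this comment. The method
signatures, the NativeBridge interface (a stand-in for the JNI layer) and the
in-memory FakeBridge are all assumptions made so the example runs without a
native library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DatasetSketch {

    // High-level interfaces. Names follow the comment; signatures are assumed.
    interface ScanTask {
        List<String> execute();
    }

    interface DataFragment {
        List<ScanTask> scan();
    }

    // Hypothetical stand-in for the JNI layer. Real code would declare
    // `native` methods backed by the C++ dataset library instead.
    interface NativeBridge {
        long[] scanTaskIds(long fragmentId);

        List<String> readTask(long taskId);
    }

    // A single native-backed ScanTask serves every source format, because
    // the format-specific work happens behind the bridge.
    static final class NativeScanTask implements ScanTask {
        private final NativeBridge bridge;
        private final long taskId;

        NativeScanTask(NativeBridge bridge, long taskId) {
            this.bridge = bridge;
            this.taskId = taskId;
        }

        @Override
        public List<String> execute() {
            return bridge.readTask(taskId);
        }
    }

    // Likewise, one native-backed DataFragment covers all formats.
    static final class NativeDataFragment implements DataFragment {
        private final NativeBridge bridge;
        private final long fragmentId;

        NativeDataFragment(NativeBridge bridge, long fragmentId) {
            this.bridge = bridge;
            this.fragmentId = fragmentId;
        }

        @Override
        public List<ScanTask> scan() {
            List<ScanTask> tasks = new ArrayList<>();
            for (long id : bridge.scanTaskIds(fragmentId)) {
                tasks.add(new NativeScanTask(bridge, id));
            }
            return tasks;
        }
    }

    // In-memory fake (an assumption for this sketch): two tasks per
    // fragment, one string "batch" per task.
    static final class FakeBridge implements NativeBridge {
        @Override
        public long[] scanTaskIds(long fragmentId) {
            return new long[] {fragmentId * 10, fragmentId * 10 + 1};
        }

        @Override
        public List<String> readTask(long taskId) {
            return Collections.singletonList("batch-" + taskId);
        }
    }

    // Callers only ever see the high-level interfaces.
    static List<String> scanAll(DataFragment fragment) {
        List<String> out = new ArrayList<>();
        for (ScanTask task : fragment.scan()) {
            out.addAll(task.execute());
        }
        return out;
    }

    public static void main(String[] args) {
        DataFragment fragment = new NativeDataFragment(new FakeBridge(), 1L);
        System.out.println(scanAll(fragment)); // prints [batch-10, batch-11]
    }
}
```

Under these assumptions, supporting a new source format means adding a bridge
on the native side only; the Java-facing DataFragment and ScanTask types stay
unchanged.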
I also know that the C++ API has been reworked a lot in the latest upstream
code (the removal of DataSource, the renaming of DataSourceDiscovery, and so
on), so rebasing will take some extra work on my side to make things match.
Sorry again for the delay. Please let me know if you have any thoughts. Thanks.
[1] [https://github.com/zhztheplayer/arrow-1/commits/ARROW-7808]
[2]
[https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/fragment/DataFragment.java]
[3]
[https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanTask.java]
[4]
[https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataFragment.java]
[5]
[https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeScanTask.java]
[6]
[https://github.com/zhztheplayer/arrow-1/commit/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8#diff-deea6cb88ea63d76f71b7b4cfd173206]
[7]
[https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataSource.java]
[8]
[https://github.com/zhztheplayer/arrow-1/commit/7cb13b96e81fd153c4ad9c68aff00f032abb5110]
> [Java][Dataset] Implement Datasets Java API
> --------------------------------------------
>
> Key: ARROW-7808
> URL: https://issues.apache.org/jira/browse/ARROW-7808
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++ - Dataset, Java
> Reporter: Hongze Zhang
> Priority: Major
> Labels: dataset
>
> Porting following C++ Datasets APIs to Java:
> * DataSource
> * DataSourceDiscovery
> * DataFragment
> * Dataset
> * Scanner
> * ScanTask
> * ScanOptions
--
This message was sent by Atlassian Jira
(v8.3.4#803005)