Hi Hongze,

The Datasets functionality is indeed extremely useful, and it may make sense to have it available in many languages eventually. With Java, I would raise the issue that the libraries are comparatively weaker when it comes to actually reading the files themselves. Whereas we have reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet in C++, the same is not true in Java. Not a deal breaker, but worth taking into consideration.

I wonder aloud whether it might be worth investing in a JNI-based interface to the C++ libraries as one potential approach to save on development time.

- Wes

On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com> wrote:
>
> Hi all,
>
> Recently the datasets API has been improved a lot and I found some of the new
> features are very useful to my own work. For example, to me an important one is
> the fix of ARROW-6952[1]. And as I currently work on Java/Scala projects like
> Spark, I am now investigating a way to call some of the datasets APIs in Java
> so that I could gain performance improvement from native dataset
> filters/projectors. Meanwhile, I am also interested in the ability to scan
> different data sources provided by the dataset API.
>
> Regarding using datasets in Java, my initial idea is to port (by writing
> Java-version implementations) some of the high-level concepts in Java such as
> DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call lower
> level record batch iterators via JNI. This way we seem to retain performance
> advantages from C++ dataset code.
>
> Is anyone else interested in this topic? Or is this something already on the
> development plan? Any feedback or thoughts would be much appreciated.
>
> Best,
> Hongze
>
> [1] https://issues.apache.org/jira/browse/ARROW-6952
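
To make the layering discussed above a bit more concrete (high-level classes written in Java, with record batches pulled from a lower-level iterator via JNI), here is a rough sketch. Every name in it (`NativeBatchIterator`, `JniBatchIterator`, `DatasetScanner`, `InMemoryBatchIterator`, `nativeNext`, `nativeClose`) is hypothetical and not an existing Arrow API, and a batch is represented as an opaque `byte[]` purely for illustration. The JNI-backed class only declares its native methods; it would need a real native library behind it, so the in-memory stand-in is what makes the example runnable.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical low-level handle; a real version would wrap a C++ batch iterator. */
interface NativeBatchIterator extends AutoCloseable {
    /** Returns the next batch payload, or null once the scan is exhausted. */
    byte[] nextBatch();

    @Override
    void close();
}

/** Assumed shape of a JNI-backed iterator; unusable without a native library. */
final class JniBatchIterator implements NativeBatchIterator {
    private final long handle; // opaque pointer owned by the native side

    JniBatchIterator(long handle) { this.handle = handle; }

    // Hypothetical native entry points, to be implemented in C++.
    private static native byte[] nativeNext(long handle);
    private static native void nativeClose(long handle);

    @Override public byte[] nextBatch() { return nativeNext(handle); }
    @Override public void close() { nativeClose(handle); }
}

/** Pure-Java stand-in so the Java-side layer can be exercised without JNI. */
final class InMemoryBatchIterator implements NativeBatchIterator {
    private final Iterator<byte[]> batches;
    boolean closed = false;

    InMemoryBatchIterator(List<byte[]> batches) { this.batches = batches.iterator(); }

    @Override public byte[] nextBatch() { return batches.hasNext() ? batches.next() : null; }
    @Override public void close() { closed = true; }
}

/** Java-side scanner: exposes the low-level iterator as a normal Iterable. */
final class DatasetScanner implements Iterable<byte[]>, AutoCloseable {
    private final NativeBatchIterator source;

    DatasetScanner(NativeBatchIterator source) { this.source = source; }

    @Override
    public Iterator<byte[]> iterator() {
        return new Iterator<byte[]>() {
            // Read one batch ahead so hasNext() needs no extra native call.
            private byte[] pending = source.nextBatch();

            @Override public boolean hasNext() { return pending != null; }

            @Override public byte[] next() {
                if (pending == null) throw new NoSuchElementException();
                byte[] out = pending;
                pending = source.nextBatch();
                return out;
            }
        };
    }

    /** Releases the underlying (possibly native) iterator. */
    @Override public void close() { source.close(); }
}
```

The point of the split is that only `NativeBatchIterator` crosses the JNI boundary, so the Java-side classes stay thin while the filtering and file reading would remain in C++. How batches would actually be handed across that boundary is left open here.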