Hi Hongze,

The Datasets functionality is indeed extremely useful, and it may make
sense to have it available in many languages eventually. With Java, one
concern is that the file readers themselves are comparatively weak.
Whereas we have reasonably fast Arrow-based interfaces to CSV, JSON, ORC,
and Parquet in C++, the same is not true in Java. Not a deal breaker, but
worth taking into consideration.

I wonder aloud whether it might be worth investing in a JNI-based
interface to the C++ libraries as one potential approach to save on
development time.
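
To make the idea a bit more concrete, here is a rough sketch of what the
Java side of such a wrapper might look like. Everything in it is an
assumption for illustration: the names JniScanSketch, NativeScanner, and
BatchIterator are hypothetical and not an existing Arrow API, batches are
represented as opaque byte arrays, and the native calls are replaced by an
in-memory stub so the example runs without any C++ library.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Sketch only: every name here is hypothetical, not an Arrow API. */
public class JniScanSketch {

    /**
     * Stand-in for the JNI boundary. A real version would declare these
     * as `native` methods backed by the C++ dataset scanner.
     */
    public interface NativeScanner {
        /** Returns the next serialized batch, or null when exhausted. */
        byte[] nextBatch(long handle);

        /** Releases the native resources behind the handle. */
        void close(long handle);
    }

    /** Java-side iterator that owns an opaque native handle. */
    public static final class BatchIterator implements Iterator<byte[]>, AutoCloseable {
        private final NativeScanner scanner;
        private final long handle;
        private byte[] pending;   // batch fetched by hasNext() but not yet returned
        private boolean closed;

        public BatchIterator(NativeScanner scanner, long handle) {
            this.scanner = scanner;
            this.handle = handle;
        }

        @Override
        public boolean hasNext() {
            if (closed) {
                return false;
            }
            if (pending == null) {
                pending = scanner.nextBatch(handle);
            }
            return pending != null;
        }

        @Override
        public byte[] next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            byte[] out = pending;
            pending = null;
            return out;
        }

        @Override
        public void close() {
            // Release the handle at most once.
            if (!closed) {
                closed = true;
                scanner.close(handle);
            }
        }
    }

    /** In-memory stub yielding `count` one-byte batches; no native code involved. */
    public static NativeScanner stub(final int count) {
        return new NativeScanner() {
            private int produced = 0;

            @Override
            public byte[] nextBatch(long handle) {
                return produced < count ? new byte[] {(byte) produced++} : null;
            }

            @Override
            public void close(long handle) {
                // Nothing to free in the stub.
            }
        };
    }

    /** Drains a scanner through the iterator and returns the number of batches seen. */
    public static int countBatches(NativeScanner scanner) {
        int n = 0;
        try (BatchIterator it = new BatchIterator(scanner, 1L)) {
            while (it.hasNext()) {
                it.next();
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countBatches(stub(3)));
    }
}
```

The point of the shape above is that only the small NativeScanner surface
would need to cross JNI, while iteration and resource cleanup stay in
ordinary Java.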

- Wes



On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com> wrote:
>
> Hi all,
>
>
> Recently the datasets API has improved a lot, and I have found some of the
> new features very useful for my own work. An important one for me is the
> fix for ARROW-6952[1]. As I currently work on Java/Scala projects like
> Spark, I am investigating a way to call some of the datasets APIs from Java
> so that I can gain performance improvements from native dataset
> filters/projectors. I am also interested in the dataset API's ability to
> scan different data sources.
>
>
> Regarding using datasets in Java, my initial idea is to port some of the
> high-level concepts, such as DataSourceDiscovery/DataSet/Scanner/FileFormat,
> by writing Java implementations of them, and then create and call
> lower-level record batch iterators via JNI. This way it seems we would
> retain the performance advantages of the C++ dataset code.
>
>
> Is anyone else interested in this topic? Or is this something already on
> the development plan? Any feedback or thoughts would be much appreciated.
>
>
> Best,
> Hongze
>
>
> [1] https://issues.apache.org/jira/browse/ARROW-6952
