To set up bridges between Java and C++, the C data interface specification may help: https://github.com/apache/arrow/pull/5442
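To make the role of the C data interface concrete, here is a purely hypothetical Java sketch of the ownership flow it implies: the producer exports data into C structs, the consumer reads them and must invoke each struct's release callback when done. None of these names exist in Arrow; the native calls are represented by an interface so the flow can be exercised without a JNI library.

```java
// Hypothetical sketch (not an existing Arrow API): a Java consumer of data
// exported through the C data interface. Raw struct addresses would cross
// the JNI boundary as longs; here the native side is abstracted behind an
// interface so the release-callback contract can be shown and mocked.
interface CDataExporter {
    // These would be `native` methods in a real JNI bridge.
    long exportSchema();              // address of an exported ArrowSchema struct
    long exportNextArray();           // address of an ArrowArray struct, or 0 at end of stream
    void release(long structAddress); // invokes the struct's release callback
}

final class CDataConsumer {
    private final CDataExporter exporter;

    CDataConsumer(CDataExporter exporter) {
        this.exporter = exporter;
    }

    /** Consumes all exported arrays, releasing each one; returns the count. */
    int consumeAll() {
        long schema = exporter.exportSchema();
        int n = 0;
        try {
            long array;
            while ((array = exporter.exportNextArray()) != 0) {
                // A real consumer would interpret the buffers at this address.
                exporter.release(array);
                n++;
            }
        } finally {
            // Per the C data interface, the consumer is responsible for
            // releasing every struct it received.
            exporter.release(schema);
        }
        return n;
    }
}
```

The key point the sketch illustrates is that only pointers move across the boundary; the data itself is shared, and lifetime is managed through the release callbacks rather than by either runtime's garbage collector.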
There's an implementation for C++ here, and it also includes a Python-R bridge able to share Arrow data between two different runtimes (i.e. PyArrow and R-Arrow potentially compiled using different toolchains, with different ABIs): https://github.com/apache/arrow/pull/5608

Regards

Antoine.

On 27/11/2019 at 11:16, Hongze Zhang wrote:
> Hi Micah,
>
> Regarding our use cases, we'd use the API on Parquet files with some pushed
> filters and projectors, and we'd extend the C++ Datasets code to provide
> the necessary support for our own data formats.
>
>> If JNI is seen as too cumbersome, another possible avenue to pursue is
>> writing a gRPC wrapper around the DataSet metadata capabilities. One could
>> then create a facade on top of that for Java. For data reads, I can see
>> either building a Flight server or directly using the JNI readers.
>
> Thanks for your suggestion, but I'm not entirely getting it. Does this mean
> starting an individual gRPC/Flight server process to deal with the
> metadata/data exchange problem between Java and C++ Datasets? If so, then
> in some cases, doesn't it easily introduce bigger problems around the life
> cycle and resource management of those processes? Please correct me if I
> misunderstood your point.
>
> And IMHO I don't strongly mind the possible inconsistencies and bugs
> brought by a Java port of something like the Datasets framework.
> Inconsistencies are usually in a way inevitable between two different
> languages' implementations of the same component, but there is supposed to
> be a trade-off based on whether the implementations are worth providing. I
> haven't had the chance to fully investigate the requirements of
> Datasets-Java from other projects, so I'm not 100% sure, but functionality
> such as source discovery, predicate pushdown, and multi-format support
> could be powerful in many scenarios. Anyway, I'm totally with you that the
> amount of work could be huge and bugs might be introduced.
> So my goal is to start from a small piece of the APIs to minimize the
> initial work. What do you think?
>
> Thanks,
> Hongze
>
> At 2019-11-27 16:00:35, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
>> Hi Hongze,
>> I have a strong preference for not porting non-trivial logic from one
>> language to another, especially if the main goal is performance. I think
>> this will replicate bugs and cause confusion if inconsistencies occur. It
>> is also a non-trivial amount of work to develop, review, set up CI, etc.
>>
>> If JNI is seen as too cumbersome, another possible avenue to pursue is
>> writing a gRPC wrapper around the DataSet metadata capabilities. One could
>> then create a facade on top of that for Java. For data reads, I can see
>> either building a Flight server or directly using the JNI readers.
>>
>> In either case this is a non-trivial amount of work, so I at least would
>> appreciate a short write-up (1-2 pages) explicitly stating goals/use-cases
>> for the library and a high-level design (component overview, relationships
>> between components, and how it will co-exist with existing Java code). If
>> I understand correctly, one goal is to use this as a basis for a new Spark
>> DataSet API with better performance than the vectorized Spark Parquet
>> reader? Are there others?
>>
>> Wes, what are your thoughts on this?
>>
>> Thanks,
>> Micah
>>
>> On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang <notify...@126.com> wrote:
>>
>>> Hi Wes and Micah,
>>>
>>> Thanks for your kind reply.
>>>
>>> Micah: We don't use the Spark (vectorized) Parquet reader because it is
>>> a pure Java implementation; performance could be worse than doing the
>>> similar work natively. Another reason is that we may need to integrate
>>> some other specific data sources with Arrow Datasets; to limit the
>>> workload, we would like to maintain a common read pipeline for both
>>> these and other widely used data sources like Parquet and CSV.
>>> Wes: Yes, the Datasets framework along with the Parquet/CSV/... reader
>>> implementations are entirely native, so a JNI bridge will be needed so
>>> that we don't actually read files in Java.
>>>
>>> Another concern of mine is how many C++ Datasets components should be
>>> bridged via JNI. For example, bridge the ScanTask only? Or bridge more
>>> components including Scanner, Table, even the DataSource discovery
>>> system? Or just bridge the C++ Arrow Parquet and ORC readers (as Micah
>>> said, orc-jni is already there) and reimplement everything needed by
>>> Datasets in Java? This might not be that easy to decide, but currently,
>>> based on my limited perspective, I would prefer to start from the
>>> ScanTask layer, so that we could leverage some valuable work finished
>>> in C++ Datasets and wouldn't have to maintain too much tedious JNI
>>> code. The real IO process would still take place inside the C++ readers
>>> when we do a scan operation.
>>>
>>> So Wes, Micah, is this similar to your consideration?
>>>
>>> Thanks,
>>> Hongze
>>>
>>> At 2019-11-27 12:39:52, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
>>>> Hi Hongze,
>>>> To add to Wes's point, there are already some efforts to do JNI for
>>>> ORC (which need to be integrated with CI) and some open PRs for
>>>> Parquet in the project. However, given that you are using Spark, I
>>>> would expect there is already dataset functionality equivalent to the
>>>> dataset API for doing row group/partition-level filtering. Can you
>>>> elaborate on what problems you are seeing with those and what
>>>> additional use cases you have?
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>>> hi Hongze,
>>>>>
>>>>> The Datasets functionality is indeed extremely useful, and it may
>>>>> make sense to have it available in many languages eventually.
>>>>> With Java, I would raise the issue that things are comparatively
>>>>> weaker there when it comes to actually reading the files themselves.
>>>>> Whereas we have reasonably fast Arrow-based interfaces to CSV, JSON,
>>>>> ORC, and Parquet in C++, the same is not true in Java. Not a deal
>>>>> breaker, but worth taking into consideration.
>>>>>
>>>>> I wonder aloud whether it might be worth investing in a JNI-based
>>>>> interface to the C++ libraries as one potential approach to save on
>>>>> development time.
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently the Datasets API has been improved a lot, and I found some
>>>>>> of the new features very useful in my own work. For example, an
>>>>>> important one for me is the fix of ARROW-6952[1]. And as I currently
>>>>>> work on Java/Scala projects like Spark, I am now investigating a way
>>>>>> to call some of the Datasets APIs in Java so that I could gain
>>>>>> performance improvements from native dataset filters/projectors.
>>>>>> Meanwhile, I am also interested in the ability to scan different
>>>>>> data sources provided by the Datasets API.
>>>>>>
>>>>>> Regarding using Datasets in Java, my initial idea is to port (by
>>>>>> writing Java-version implementations) some of the high-level
>>>>>> concepts in Java, such as
>>>>>> DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call
>>>>>> the lower-level record batch iterators via JNI. This way we seem to
>>>>>> retain the performance advantages of the C++ dataset code.
>>>>>>
>>>>>> Is anyone else interested in this topic? Or is this something
>>>>>> already on the development plan? Any feedback or thoughts would be
>>>>>> much appreciated.
>>>>>>
>>>>>> Best,
>>>>>> Hongze
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-6952
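The "bridge at the ScanTask layer" idea discussed in this thread can be sketched roughly as follows. This is a minimal, hypothetical illustration only: the class and method names are invented (nothing here is an existing Arrow Java API), and the native handle is mocked with a simple counter so the shape of the facade can be exercised without an actual JNI library.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of a Java-side ScanTask facade over a native (C++)
// scan task reached via JNI. The `long` values stand in for native pointers;
// real code would declare `native` methods (e.g. a nextBatch(handle) call)
// and wrap the returned Arrow data for Java consumers.
final class NativeScanTask implements AutoCloseable {
    private final long nativeHandle; // would point at a C++ ScanTask
    private int remaining;           // mock: number of batches left to produce
    private boolean closed;

    NativeScanTask(long nativeHandle, int batches) {
        this.nativeHandle = nativeHandle;
        this.remaining = batches;
    }

    /** Iterator over record batches; each `long` stands in for a native batch pointer. */
    Iterator<Long> scan() {
        return new Iterator<Long>() {
            @Override public boolean hasNext() {
                return !closed && remaining > 0;
            }
            @Override public Long next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                remaining--;
                // Real code would invoke the native reader here; the actual
                // IO would happen entirely inside the C++ Datasets code.
                return nativeHandle + remaining;
            }
        };
    }

    @Override public void close() {
        // Real code would release the underlying C++ ScanTask here.
        closed = true;
    }
}
```

The design point this mirrors from the thread: the high-level concepts (discovery, Scanner, ScanTask) stay thin on the Java side, while batch production and file IO remain in native code, so only iteration and lifetime management cross the JNI boundary.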