To set up bridges between Java and C++, the C data interface specification may help: https://github.com/apache/arrow/pull/5442
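To make the role of the C data interface concrete, here is a purely hypothetical Java sketch of the ownership flow it implies: the producer exports data into C structs, the consumer reads them and must invoke each struct's release callback when done. None of these names exist in Arrow; the native calls are represented by an interface so the flow can be exercised without a JNI library.

```java
// Hypothetical sketch (not an existing Arrow API): a Java consumer of data
// exported through the C data interface. Raw struct addresses would cross
// the JNI boundary as longs; here the native side is abstracted behind an
// interface so the release-callback contract can be shown and mocked.
interface CDataExporter {
    // These would be `native` methods in a real JNI bridge.
    long exportSchema();              // address of an exported ArrowSchema struct
    long exportNextArray();           // address of an ArrowArray struct, or 0 at end of stream
    void release(long structAddress); // invokes the struct's release callback
}

final class CDataConsumer {
    private final CDataExporter exporter;

    CDataConsumer(CDataExporter exporter) {
        this.exporter = exporter;
    }

    /** Consumes all exported arrays, releasing each one; returns the count. */
    int consumeAll() {
        long schema = exporter.exportSchema();
        int n = 0;
        try {
            long array;
            while ((array = exporter.exportNextArray()) != 0) {
                // A real consumer would interpret the buffers at this address.
                exporter.release(array);
                n++;
            }
        } finally {
            // Per the C data interface, the consumer is responsible for
            // releasing every struct it received.
            exporter.release(schema);
        }
        return n;
    }
}
```

The key point the sketch illustrates is that only pointers move across the boundary; the data itself is shared, and lifetime is managed through the release callbacks rather than by either runtime's garbage collector.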
There's an implementation for C++ here, and it also includes a Python-R bridge able to share Arrow data between two different runtimes (i.e. PyArrow and R-Arrow potentially compiled using different toolchains, with different ABIs): https://github.com/apache/arrow/pull/5608

Regards

Antoine.

On 27/11/2019 at 11:16, Hongze Zhang wrote:
> Hi Micah,
>
> Regarding our use cases, we'd use the API on Parquet files with some pushed
> filters and projectors, and we'd extend the C++ Datasets code to provide
> the necessary support for our own data formats.
>
>> If JNI is seen as too cumbersome, another possible avenue to pursue is
>> writing a gRPC wrapper around the DataSet metadata capabilities. One could
>> then create a facade on top of that for Java. For data reads, I can see
>> either building a Flight server or directly using the JNI readers.
>
> Thanks for your suggestion, but I'm not entirely getting it. Does this mean
> starting an individual gRPC/Flight server process to deal with the
> metadata/data exchange problem between Java and C++ Datasets? If so, then
> in some cases, doesn't it easily introduce bigger problems around the life
> cycle and resource management of those processes? Please correct me if I
> misunderstood your point.
>
> And IMHO I don't strongly mind the possible inconsistencies and bugs
> brought by a Java port of something like the Datasets framework.
> Inconsistencies are usually in a way inevitable between two different
> languages' implementations of the same component, but there is supposed to
> be a trade-off based on whether the implementations are worth providing. I
> haven't had the chance to fully investigate the requirements of
> Datasets-Java from other projects, so I'm not 100% sure, but functionality
> such as source discovery, predicate pushdown, and multi-format support
> could be powerful in many scenarios. Anyway, I'm totally with you that the
> amount of work could be huge and bugs might be introduced.
> So my goal is to start from a small piece of the APIs to minimize the
> initial work. What do you think?
>
> Thanks,
> Hongze
>
> At 2019-11-27 16:00:35, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
>> Hi Hongze,
>> I have a strong preference for not porting non-trivial logic from one
>> language to another, especially if the main goal is performance. I think
>> this will replicate bugs and cause confusion if inconsistencies occur. It
>> is also a non-trivial amount of work to develop, review, set up CI, etc.
>>
>> If JNI is seen as too cumbersome, another possible avenue to pursue is
>> writing a gRPC wrapper around the DataSet metadata capabilities. One could
>> then create a facade on top of that for Java. For data reads, I can see
>> either building a Flight server or directly using the JNI readers.
>>
>> In either case this is a non-trivial amount of work, so I at least would
>> appreciate a short write-up (1-2 pages) explicitly stating goals/use-cases
>> for the library and a high-level design (component overview, relationships
>> between components, and how it will co-exist with existing Java code). If
>> I understand correctly, one goal is to use this as a basis for a new Spark
>> DataSet API with better performance than the vectorized Spark Parquet
>> reader? Are there others?
>>
>> Wes, what are your thoughts on this?
>>
>> Thanks,
>> Micah
>>
>> On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang <notify...@126.com> wrote:
>>
>>> Hi Wes and Micah,
>>>
>>> Thanks for your kind reply.
>>>
>>> Micah: We don't use the Spark (vectorized) Parquet reader because it is
>>> a pure Java implementation; performance could be worse than doing the
>>> similar work natively. Another reason is that we may need to integrate
>>> some other specific data sources with Arrow Datasets; to limit the
>>> workload, we would like to maintain a common read pipeline for both
>>> these and other widely used data sources like Parquet and CSV.
>>> Wes: Yes, the Datasets framework along with the Parquet/CSV/... reader
>>> implementations are entirely native, so a JNI bridge will be needed so
>>> that we don't actually read files in Java.
>>>
>>> Another concern of mine is how many C++ Datasets components should be
>>> bridged via JNI. For example, bridge the ScanTask only? Or bridge more
>>> components including Scanner, Table, even the DataSource discovery
>>> system? Or just bridge the C++ Arrow Parquet and ORC readers (as Micah
>>> said, orc-jni is already there) and reimplement everything needed by
>>> Datasets in Java? This might not be that easy to decide, but currently,
>>> based on my limited perspective, I would prefer to start from the
>>> ScanTask layer, so that we could leverage some valuable work finished
>>> in C++ Datasets and wouldn't have to maintain too much tedious JNI
>>> code. The real IO process would still take place inside the C++ readers
>>> when we do a scan operation.
>>>
>>> So Wes, Micah, is this similar to your consideration?
>>>
>>> Thanks,
>>> Hongze
>>>
>>> At 2019-11-27 12:39:52, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
>>>> Hi Hongze,
>>>> To add to Wes's point, there are already some efforts to do JNI for
>>>> ORC (which need to be integrated with CI) and some open PRs for
>>>> Parquet in the project. However, given that you are using Spark, I
>>>> would expect there is already dataset functionality equivalent to the
>>>> dataset API for doing row group/partition-level filtering. Can you
>>>> elaborate on what problems you are seeing with those and what
>>>> additional use cases you have?
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>>> hi Hongze,
>>>>>
>>>>> The Datasets functionality is indeed extremely useful, and it may
>>>>> make sense to have it available in many languages eventually.
>>>>> With Java, I would raise the issue that things are comparatively
>>>>> weaker there when it comes to actually reading the files themselves.
>>>>> Whereas we have reasonably fast Arrow-based interfaces to CSV, JSON,
>>>>> ORC, and Parquet in C++, the same is not true in Java. Not a deal
>>>>> breaker, but worth taking into consideration.
>>>>>
>>>>> I wonder aloud whether it might be worth investing in a JNI-based
>>>>> interface to the C++ libraries as one potential approach to save on
>>>>> development time.
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently the Datasets API has been improved a lot, and I found some
>>>>>> of the new features very useful in my own work. For example, an
>>>>>> important one for me is the fix of ARROW-6952[1]. And as I currently
>>>>>> work on Java/Scala projects like Spark, I am now investigating a way
>>>>>> to call some of the Datasets APIs in Java so that I could gain
>>>>>> performance improvements from native dataset filters/projectors.
>>>>>> Meanwhile, I am also interested in the ability to scan different
>>>>>> data sources provided by the Datasets API.
>>>>>>
>>>>>> Regarding using Datasets in Java, my initial idea is to port (by
>>>>>> writing Java-version implementations) some of the high-level
>>>>>> concepts in Java, such as
>>>>>> DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call
>>>>>> the lower-level record batch iterators via JNI. This way we seem to
>>>>>> retain the performance advantages of the C++ dataset code.
>>>>>>
>>>>>> Is anyone else interested in this topic? Or is this something
>>>>>> already on the development plan? Any feedback or thoughts would be
>>>>>> much appreciated.
>>>>>>
>>>>>> Best,
>>>>>> Hongze
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-6952
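The "bridge at the ScanTask layer" idea discussed in this thread can be sketched roughly as follows. This is a minimal, hypothetical illustration only: the class and method names are invented (nothing here is an existing Arrow Java API), and the native handle is mocked with a simple counter so the shape of the facade can be exercised without an actual JNI library.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of a Java-side ScanTask facade over a native (C++)
// scan task reached via JNI. The `long` values stand in for native pointers;
// real code would declare `native` methods (e.g. a nextBatch(handle) call)
// and wrap the returned Arrow data for Java consumers.
final class NativeScanTask implements AutoCloseable {
    private final long nativeHandle; // would point at a C++ ScanTask
    private int remaining;           // mock: number of batches left to produce
    private boolean closed;

    NativeScanTask(long nativeHandle, int batches) {
        this.nativeHandle = nativeHandle;
        this.remaining = batches;
    }

    /** Iterator over record batches; each `long` stands in for a native batch pointer. */
    Iterator<Long> scan() {
        return new Iterator<Long>() {
            @Override public boolean hasNext() {
                return !closed && remaining > 0;
            }
            @Override public Long next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                remaining--;
                // Real code would invoke the native reader here; the actual
                // IO would happen entirely inside the C++ Datasets code.
                return nativeHandle + remaining;
            }
        };
    }

    @Override public void close() {
        // Real code would release the underlying C++ ScanTask here.
        closed = true;
    }
}
```

The design point this mirrors from the thread: the high-level concepts (discovery, Scanner, ScanTask) stay thin on the Java side, while batch production and file IO remain in native code, so only iteration and lifetime management cross the JNI boundary.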