Hi Micah,

Regarding our use cases, we'd use the API on Parquet files with some pushed-down 
filters and projections, and we'd extend the C++ Datasets code to provide the 
necessary support for our own data formats.
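To make that use case a bit more concrete, here is a rough sketch of the shape such a Java-facing scan request might take. To be clear, every name below is hypothetical — this is not an existing Arrow Java API, just an illustration of what "pushed-down filters and projections" would mean at the boundary:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: models a scan request that carries the
// pushed-down projection and filter to a native (C++) Datasets reader.
// None of these class names exist in Arrow Java today.
public class DatasetSketch {

    static final class ScanOptions {
        final List<String> projectedColumns; // columns the native reader materializes
        final String filterExpression;       // filter evaluated inside the native scan

        ScanOptions(List<String> projectedColumns, String filterExpression) {
            this.projectedColumns = projectedColumns;
            this.filterExpression = filterExpression;
        }
    }

    public static void main(String[] args) {
        // Example: read only two columns of a Parquet dataset and filter
        // rows natively, before anything crosses the JNI boundary.
        ScanOptions options = new ScanOptions(
            Arrays.asList("user_id", "event_time"),
            "event_time >= '2019-01-01'");
        System.out.println("project=" + options.projectedColumns
            + " filter=" + options.filterExpression);
    }
}
```

The point of the sketch is just that only the small ScanOptions-like struct needs to cross into C++; the heavy lifting stays native.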


> If JNI is seen as too cumbersome, another possible avenue to pursue is
> writing a gRPC wrapper around the DataSet metadata capabilities.  One could
> then create a facade on top of that for Java.  For data reads, I can see
> either building a Flight server or directly use the JNI readers.


Thanks for the suggestion, but I'm not sure I fully follow. Do you mean starting 
a separate gRPC/Flight server process to handle the metadata/data exchange 
between Java and the C++ Datasets? If so, wouldn't that introduce bigger 
problems in some cases around process life cycle and resource management? 
Please correct me if I've misunderstood your point.


And IMHO, I'm not strongly opposed to the possible inconsistencies and bugs 
brought by a Java port of something like the Datasets framework. Some 
inconsistency between two languages' implementations of the same component is 
usually inevitable; it comes down to a trade-off over whether the 
implementations are worth providing. I haven't had the chance to fully 
investigate the requirements other projects would have of Datasets-Java, so I'm 
not 100% sure, but functionality such as source discovery, predicate pushdown, 
and multi-format support could be powerful in many scenarios. That said, I 
completely agree with you that the amount of work could be huge and bugs might 
be introduced, so my goal is to start from a small piece of the APIs to 
minimize the initial work. What do you think?
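As a sketch of the "small piece" I have in mind, a minimal ScanTask-level surface might look like the following. Again, all names and signatures here are hypothetical; in the real binding nextBatch() and close() would be JNI calls into the C++ scanner, and here an in-memory stub stands in so the shape of the API (iteration plus explicit release of native resources) is visible:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch only: a ScanTask-level surface for a JNI-backed
// Datasets binding. The stub below replaces what would be native calls.
public class ScanTaskSketch {

    interface ScanTask extends AutoCloseable {
        // Returns the next record batch, or null when the task is exhausted.
        // A JNI-backed implementation would wrap a native batch handle here.
        Object nextBatch();

        @Override
        void close(); // releases the underlying native scanner state
    }

    // In-memory stand-in used only for this sketch.
    static final class StubScanTask implements ScanTask {
        private final Iterator<String> batches;
        private boolean closed = false;

        StubScanTask(List<String> batches) { this.batches = batches.iterator(); }

        @Override
        public Object nextBatch() {
            if (closed) throw new IllegalStateException("task already closed");
            return batches.hasNext() ? batches.next() : null;
        }

        @Override
        public void close() { closed = true; }
    }

    public static void main(String[] args) {
        // try-with-resources keeps release of native resources explicit,
        // which is the main life-cycle concern with a JNI bridge.
        try (ScanTask task = new StubScanTask(Arrays.asList("batch-0", "batch-1"))) {
            for (Object batch = task.nextBatch(); batch != null; batch = task.nextBatch()) {
                System.out.println("read " + batch);
            }
        }
    }
}
```

Starting at this layer would keep the JNI surface to roughly one iterator-like interface, while discovery, Scanner, and FileFormat stay on the C++ side.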


Thanks,
Hongze



At 2019-11-27 16:00:35, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
>Hi Hongze,
>I have a strong preference for not porting non-trivial logic from one
>language to another, especially if the main goal is performance.  I think
>this will replicate bugs and cause confusion if inconsistencies occur.  It
>is also a non-trivial amount of work to develop, review, setup CI, etc.
>
>If JNI is seen as too cumbersome, another possible avenue to pursue is
>writing a gRPC wrapper around the DataSet metadata capabilities.  One could
>then create a facade on top of that for Java.  For data reads, I can see
>either building a Flight server or directly use the JNI readers.
>
>In either case this is a non-trivial amount of work, so I at least,
>would appreciate a short write-up (1-2 pages) explicitly stating
>goals/use-cases for the library and a high level design (component overview
>and relationships between components and how it will co-exist with existing
>Java code).  If I understand correctly, one goal is to use this as a basis
>for a new Spark DataSet API with better performance than the vectorized
>spark parquet reader?  Are there others?
>
>Wes, what are your thoughts on this?
>
>Thanks,
>Micah
>
>
>On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang <notify...@126.com> wrote:
>
>> Hi Wes and Micah,
>>
>>
>> Thanks for your kindly reply.
>>
>>
>> Micah: We don't use Spark (vectorized) parquet reader because it is a pure
>> Java implementation. Performance could be worse than doing the similar work
>> natively. Another reason is we may need to
>> integrate some other specific data sources with Arrow datasets, for
>> limiting the workload, we would like to maintain a common read pipeline for
>> both this one and other widely used data sources like Parquet and CSV.
>>
>>
>> Wes: Yes, Datasets framework along with Parquet/CSV/... reader
>> implementations are totally native, So a JNI bridge will be needed then we
>> don't actually read files in Java.
>>
>>
>> My another concern is how many C++ datasets components should be bridged
>> via JNI. For example,
>> bridge the ScanTask only? Or bridge more components including Scanner,
>> Table, even the DataSource
>> discovery system? Or just bridge the C++ arrow Parquet, Orc readers (as
>> Micah said, orc-jni is
>> already there) and reimplement everything needed by datasets in Java? This
>> might be not that easy to
>> decide but currently based on my limited perspective I would prefer to get
>> started from the ScanTask
>> layer as a result we could leverage some valuable work finished in C++
>> datasets and don't have to
>> maintain too much tedious JNI code. The real IO process still takes place
>> inside C++ readers when we
>> do scan operation.
>>
>>
>> So Wes, Micah, is this similar to your consideration?
>>
>>
>> Thanks,
>> Hongze
>>
>> At 2019-11-27 12:39:52, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
>> >Hi Hongze,
>> >To add to Wes's point, there are already some efforts to do JNI for ORC
>> >(which needs to be integrated with CI) and some open PRs for Parquet in
>> the
>> >project.  However, given that you are using Spark I would expect there is
>> >already dataset functionality that is equivalent to the dataset API to do
>> >rowgroup/partition level filtering.  Can you elaborate on what problems
>> you
>> >are seeing with those and what additional use cases you have?
>> >
>> >Thanks,
>> >Micah
>> >
>> >
>> >On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> hi Hongze,
>> >>
>> >> The Datasets functionality is indeed extremely useful, and it may make
>> >> sense to have it available in many languages eventually. With Java, I
>> >> would raise the issue that things are comparatively weaker there when
>> >> it comes to actually reading the files themselves. Whereas we have
>> >> reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet
>> >> in C++ the same is not true in Java. Not a deal breaker but worth
>> >> taking into consideration.
>> >>
>> >> I wonder aloud whether it might be worth investing in a JNI-based
>> >> interface to the C++ libraries as one potential approach to save on
>> >> development time.
>> >>
>> >> - Wes
>> >>
>> >>
>> >>
>> >> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com> wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> >
>> >> > Recently the datasets API has been improved a lot and I found some of
>> >> the new features are very useful to my own work. For example to me a
>> >> important one is the fix of ARROW-6952[1]. And as I currently work on
>> >> Java/Scala projects like Spark, I am now investigating a way to call
>> some
>> >> of the datasets APIs in Java so that I could gain performance
>> improvement
>> >> from native dataset filters/projectors. Meantime I am also interested in
>> >> the ability of scanning different data sources provided by dataset API.
>> >> >
>> >> >
>> >> > Regarding using datasets in Java, my initial idea is to port (by
>> writing
>> >> Java-version implementations) some of the high-level concepts in Java
>> such
>> >> as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call
>> >> lower level record batch iterators via JNI. This way we seem to retain
>> >> performance advantages from c++ dataset code.
>> >> >
>> >> >
>> >> > Is anyone interested in this topic also? Or is this something already
>> on
>> >> the development plan? Any feedback or thoughts would be much
>> appreciated.
>> >> >
>> >> >
>> >> > Best,
>> >> > Hongze
>> >> >
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/ARROW-6952
>> >>
>>
