[
https://issues.apache.org/jira/browse/ARROW-11901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445041#comment-17445041
]
Micah Kornfield commented on ARROW-11901:
-----------------------------------------
{quote}It's not about eliminating anything, it's about developing the existing
Java API, such as this very specific use case for compression codecs.
[[email protected]] was able to wrap LZ4 using JavaCPP, all by
himself! it's a lot easier to do than code everything manually with JNI:
[https://github.com/bytedeco/javacpp-presets/pull/1094]
{quote}
I think there is some miscommunication about what I thought were 2 separate
issues: how to implement an efficient LZ4 decoder, and whether to base the Java
API on a wrapper around the C++ API. The second would essentially need a
heavy rewrite of the Java API, as its design is fundamentally different from
that of the C++ API. I think there could be some interest from consumers of
Arrow in an API that more accurately mimics the C++ version, but again that is
a different thread. It could be that for some of the more complex bindings
(DataSets) JavaCPP is a better choice than hand-coded JNI.
{quote}[~emkornfield], since the C++ builds of Arrow already include LZ4, it is
indeed pretty trivial to expose a few JNI methods to access it.
{quote}
I was not referring to binding to the C++ implementation here but directly to
the LZ4 library. It looks like JavaCPP makes this efficient from a developer
perspective. But the
[API|https://github.com/bytedeco/javacpp-presets/pull/1094/files#diff-3d9af736e997982d68098d986670f05ff40ae0cc62773a1dd0eb418e55990317R38]
isn't quite what I imagined: it looks like it goes through ByteBuffer, when
all we really need is something like the [ZSTD
API|https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/Zstd.java#L454].
For such a minimal API I'm ambivalent about taking on a new dependency here.
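A rough sketch of how small such an API can be, under stated assumptions: the JDK ships no LZ4, so this uses the ByteBuffer overloads of java.util.zip.Deflater/Inflater (Java 11+) purely as a stand-in codec, and the class and method names are hypothetical, not part of Arrow, JavaCPP, or zstd-jni. It also assumes the caller already knows the uncompressed length. The point is only the shape: two static methods that read from and write to direct (off-heap) buffers, with no heap byte[] staging in the calling code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical minimal codec surface; DEFLATE stands in for LZ4. */
public class DirectBufferCodecSketch {

    /** Compresses the remaining bytes of src into a new direct buffer. */
    public static ByteBuffer compress(ByteBuffer src) {
        int n = src.remaining();
        // Generous upper bound on the DEFLATE output size for n input bytes.
        ByteBuffer dst = ByteBuffer.allocateDirect(n + (n >> 3) + 64);
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(src);
            deflater.finish();
            while (!deflater.finished()) {
                if (deflater.deflate(dst) == 0 && !dst.hasRemaining()) {
                    throw new IllegalStateException("output buffer too small");
                }
            }
        } finally {
            deflater.end(); // release the native zlib state
        }
        dst.flip();
        return dst;
    }

    /** Decompresses src into a new direct buffer of the given length. */
    public static ByteBuffer decompress(ByteBuffer src, int uncompressedLength) {
        ByteBuffer dst = ByteBuffer.allocateDirect(uncompressedLength);
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(src);
            while (dst.hasRemaining() && !inflater.finished()) {
                int produced = inflater.inflate(dst);
                if (produced == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        dst.flip();
        return dst;
    }

    public static void main(String[] args) {
        byte[] text = "arrow arrow arrow arrow arrow".getBytes(StandardCharsets.UTF_8);
        ByteBuffer original = ByteBuffer.allocateDirect(text.length);
        original.put(text).flip();

        // duplicate() shares the memory but keeps original's position untouched.
        ByteBuffer packed = compress(original.duplicate());
        ByteBuffer restored = decompress(packed, text.length);
        System.out.println("round trip ok: " + restored.equals(original));
    }
}
```

An LZ4 binding, whether hand-written JNI or generated by JavaCPP, could expose the same two-method surface; whether that is small enough to justify a new dependency is the open question.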
{quote}If you have some ideas as to why most engineers are OK using Cython in
the case of Python, but not the equivalent in the case of Java, I would be very
much interested in hearing your opinions.
{quote}
I'm not an expert but a few thoughts:
# Cython is more than just a C++ wrapper. It speeds up Python even if you
never want to write native code, by effectively allowing one to write C code as
Python. In Java, at least in theory, the JIT can do some heavy lifting here.
# The Python GIL is a pain point that Java doesn't have, and Cython + native
code can effectively work around it.
# There has always been a tight relationship between Python and native code,
whereas JNI is much more esoteric and can cause unexpected deployment issues
(e.g. correctly pointing the JVM to .so files, correctly integrating with the
JVM's memory capacity features, etc).
# Cython was also a pretty easy way to get compatibility between Python 2.x
and Python 3.x.
Sometimes there is a watershed moment; more mature projects can be reluctant to
try new technologies unless they are proven elsewhere and they solve a
significant pain point.
{quote}We could do the same for Arrow!
{quote}
The dev@ mailing list is the place to discuss this. I tried searching and
couldn't find any previous discussions on the topic there.
> [Java] Investigate potential performance improvement of compression codec
> -------------------------------------------------------------------------
>
> Key: ARROW-11901
> URL: https://issues.apache.org/jira/browse/ARROW-11901
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Liya Fan
> Assignee: Benjamin Wilhelm
> Priority: Major
>
> In response to the discussion in
> https://github.com/apache/arrow/pull/8949/files#r588046787
> There are some performance penalties in the implementation of the compression
> codecs (e.g. data copying between heap/off-heap data). We need to revise the
> code to improve the performance.
> We should also provide some benchmarks to validate that the performance
> actually improves.