[jira] [Commented] (ARROW-11901) [Java] Investigate potential performance improvement of compression codec

Micah Kornfield (Jira) Wed, 17 Nov 2021 01:36:38 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445041#comment-17445041
 ]


Micah Kornfield commented on ARROW-11901:
-----------------------------------------

{quote}It's not about eliminating anything, it's about developing the existing 
Java API, such as this very specific use case for compression codecs. 
[[email protected]] was able to wrap LZ4 using JavaCPP, all by 
himself! it's a lot easier to do than code everything manually with JNI:
[https://github.com/bytedeco/javacpp-presets/pull/1094]
{quote}
I think there is some miscommunication about what I thought were two separate 
issues: how to implement an efficient LZ4 decoder, and whether to base the Java 
API on a wrapper around the C++ API.  The second would essentially need a 
heavy rewrite of the Java API, as its design is fundamentally different from 
that of the C++ API.  I think there could be some interest from consumers of 
Arrow in an API that more accurately mimics the C++ version, but again that is 
a different thread.  It could be that for some of the more complex bindings 
(DataSets) JavaCPP is a better choice than hand-coded JNI.

 
{quote}[~emkornfield], since the C++ builds of Arrow already include LZ4, it is 
indeed pretty trivial to expose a few JNI methods to access it. 
{quote}
I was not referring to binding to the C++ implementation here but directly to 
the LZ4 library.  It looks like JavaCPP makes this efficient from a developer 
perspective.  But the 
[API|https://github.com/bytedeco/javacpp-presets/pull/1094/files#diff-3d9af736e997982d68098d986670f05ff40ae0cc62773a1dd0eb418e55990317R38]
 isn't quite what I imagined: it looks like it goes through ByteBuffer, when 
all we really need is something like [ZSTD 
API|https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/Zstd.java#L454].
  For such a minimal API I'm ambivalent about taking on a new dependency here.
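
To make "minimal API" concrete, here is a rough sketch of a two-method codec surface that moves data between off-heap buffers without an intermediate heap array.  Everything in it is an assumption for illustration: the class and method names (DirectCodec, compress, decompress, roundTrip) are hypothetical and come from neither Arrow, JavaCPP, nor zstd-jni, and the JDK's DEFLATE (java.util.zip) stands in for LZ4 because the JDK ships no LZ4 codec.  It uses direct ByteBuffers only because the JDK has no portable way to hand over raw addresses; an address-based API would take (address, size) pairs instead.  It needs Java 11 or later for the ByteBuffer overloads on Deflater and Inflater.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical minimal codec surface; JDK DEFLATE stands in for LZ4. */
final class DirectCodec {
    private DirectCodec() {}

    /** Compresses src's remaining bytes into dst and returns the bytes written. */
    static int compress(ByteBuffer src, ByteBuffer dst) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(src); // Java 11+: accepts a direct buffer as-is
            deflater.finish();
            int start = dst.position();
            while (!deflater.finished()) {
                if (!dst.hasRemaining()) {
                    throw new IllegalArgumentException("dst too small");
                }
                deflater.deflate(dst); // advances dst.position()
            }
            return dst.position() - start;
        } finally {
            deflater.end(); // release native zlib state
        }
    }

    /**
     * Decompresses src's remaining bytes into dst and returns the bytes written.
     * This sketch treats a full dst as an error, so size dst with some headroom.
     */
    static int decompress(ByteBuffer src, ByteBuffer dst) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(src);
            int start = dst.position();
            while (!inflater.finished()) {
                if (!dst.hasRemaining()) {
                    throw new IllegalArgumentException("dst too small");
                }
                int n = inflater.inflate(dst);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported input");
                }
            }
            return dst.position() - start;
        } finally {
            inflater.end();
        }
    }

    /** Round-trips a string through direct (off-heap) buffers only. */
    static String roundTrip(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer src = ByteBuffer.allocateDirect(raw.length);
        src.put(raw);
        src.flip();
        ByteBuffer packed = ByteBuffer.allocateDirect(raw.length + 64);
        compress(src, packed);
        packed.flip();
        ByteBuffer out = ByteBuffer.allocateDirect(raw.length + 1);
        try {
            int n = decompress(packed, out);
            out.flip();
            byte[] back = new byte[n];
            out.get(back);
            return new String(back, StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The heap arrays in roundTrip exist only to get a String in and out for the demonstration; compress and decompress themselves never allocate one.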

 
{quote}If you have some ideas as to why most engineers are OK using Cython in 
the case of Python, but not the equivalent in the case of Java, I would be very 
much interested in hearing your opinions.
{quote}
I'm not an expert but a few thoughts:
 # Cython is more than just a C++ wrapper.  It speeds up Python even if you 
never want to write native code, by effectively allowing one to write C code as 
Python.  In Java, at least in theory, the JIT can do some heavy lifting here.
 # The Python GIL is a pain point that Java doesn't have and Cython + Native 
code can effectively work around it.
 # There has always been a tight relationship between Python and Native code, 
whereas JNI is much more esoteric and can cause unexpected deployment issues 
(e.g. correctly pointing the JVM to .so files, correctly integrating with the 
JVM's memory capacity features, etc). 
 # Cython was also a pretty easy way to get compatibility between Python 2.x 
and Python 3.x.
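
As a small illustration of the deployment point in item 3, a JNI-backed library typically has to be located through java.library.path at run time, and a missing or misplaced .so surfaces only as an UnsatisfiedLinkError.  The sketch below is a hypothetical helper (NativeLoader and tryLoad are illustrative names, not part of Arrow or JavaCPP) that uses only standard JDK calls to report where the JVM looked.

```java
/** Hypothetical helper showing how a missing native library surfaces at run time. */
final class NativeLoader {
    private NativeLoader() {}

    /** Tries to load a native library by its base name; returns false if the JVM cannot find it. */
    static boolean tryLoad(String libName) {
        try {
            System.loadLibrary(libName); // searches the directories in java.library.path
            return true;
        } catch (UnsatisfiedLinkError e) {
            // mapLibraryName gives the platform file name, e.g. lib<name>.so on Linux
            System.err.println("Could not load " + System.mapLibraryName(libName)
                    + " from java.library.path=" + System.getProperty("java.library.path"));
            return false;
        }
    }
}
```

The search path is usually set when the JVM is launched, for example with -Djava.library.path=/path/to/libs.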

Sometimes there is a watershed moment; more mature projects can be reluctant 
to try new technologies unless they are proven elsewhere and solve a 
significant pain point.
{quote}We could do the same for Arrow!
{quote}
The dev@ mailing list is the place to discuss this.  I tried searching and 
couldn't find any previous discussions on the topic there.

> [Java] Investigate potential performance improvement of compression codec
> -------------------------------------------------------------------------
>
>                 Key: ARROW-11901
>                 URL: https://issues.apache.org/jira/browse/ARROW-11901
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Benjamin Wilhelm
>            Priority: Major
>
> In response to the discussion in 
> https://github.com/apache/arrow/pull/8949/files#r588046787
> There are some performance penalties in the implementation of the compression 
> codecs (e.g. data copying between heap/off-heap data). We need to revise the 
> code to improve the performance. 
> We should also provide some benchmarks to validate that the performance 
> actually improves. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
