On 4 May 2022 at 17:21, Alessandro Molina wrote:
The proposal seems reasonable to me, we should do our best at providing
users the same experience on the various systems whenever possible.

As long as we don't receive complaints about the package size, I think we
can live with it. If it becomes a problem for our users, we can always make
per-system binaries in the future.

Hmm, I think it wouldn't hurt to be proactive wrt. package sizes. Negative feedback doesn't always get propagated to us, and instead we may lose users due to the bad first impression.

Regards

Antoine.




PS: I think you forgot to enable comments on the google docs, that's
something you usually want to allow as it eases providing feedback.

On Tue, May 3, 2022 at 4:19 PM Larry White <ljw1...@gmail.com> wrote:

Hi all,

Please see

https://docs.google.com/document/d/1y25kRrXlORnUD9p7wTMOWjC6wONEyI9rU-Pv4q1udZ8/edit?usp=sharing
for a copy of this email with proper formatting.

thanks.

On Mon, May 2, 2022 at 4:23 PM Larry White <ljw1...@gmail.com> wrote:

Hi all,


I would like to request your feedback on incorporating Windows binaries in
those Maven packages that have native Arrow dependencies, while drawing
your attention to the likely impact on jar size.


Five of the 23 Arrow packages on Maven Central have native dependencies.
Four of those five bundle the native libraries in the Maven package jar
itself. (The exception is the plasma package.) For those four, both .so
(Linux shared-object) and .dylib (OSX dynamic library) files are provided
in the same jar. Windows native libraries are not included.


The packages in question are:

    - arrow-dataset
    - arrow-orc
    - arrow-c
    - arrow-gandiva


For developers using Arrow on OSX or Linux, the experience using the
arrow-dataset jar with its bundled native library is the same as using a
pure Java library. Including Windows binaries in the jars would expand the
community of developers who could use Arrow features like datasets
“out of the box.”
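
To make the bundling idea concrete, here is a minimal sketch of how a loader
might pick the platform-specific file name of a bundled JNI library. It is an
illustration only, not Arrow's actual loader: the class name, the method, and
the base name "arrow_dataset_jni" are assumptions chosen for the example.

```java
import java.util.Locale;

public final class NativeLibraryName {
    private NativeLibraryName() {}

    /**
     * Maps an os.name value to the file name a bundled JNI library would
     * conventionally have on that platform. Hypothetical helper.
     */
    public static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check macOS first: "darwin" also contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        // Treat everything else as Linux-like for this sketch.
        return "lib" + baseName + ".so";
    }

    public static void main(String[] args) {
        // "arrow_dataset_jni" is an assumed base name used for illustration.
        String name = fileNameFor(System.getProperty("os.name"), "arrow_dataset_jni");
        // A real loader would find this resource in the jar, copy it to a
        // temporary file, and pass that path to System.load.
        System.out.println(name);
    }
}
```

With only .so and .dylib files in the jar, the Windows branch has nothing to
load, which is the gap the proposal aims to close.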


Moreover, it is not trivial for devs on Windows to create their own
solution. To the best of my knowledge, pre-compiled JNI DLLs are not
available for download, and there are no build scripts or instructions,
as there are for Linux and Mac users (see

https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules
).
Effort

To produce the JNI DLLs, the main effort will be to create new
Windows-focused build scripts similar to
arrow/ci/scripts/java_jni_macos_build.sh (in
<https://github.com/apache/arrow/tree/master/ci/scripts>), and to
incorporate them into the larger build process.


Creating these build files is a prerequisite for the suggested packaging
changes, but is also desirable in its own right, even if the proposed
packaging change is not implemented.
File size concern

The downside of including Windows binaries is that these files are large.
In the 7.0.0 release, the two native library files included in the
dataset
jar total 78 MB on disk, which is roughly 100% of the total size of the
jar. See table below for more details.

module     .dylib (MB)   .so (MB)   Combined (MB)
dataset    34.6          43.7       78.3
ORC        29.3          37.9       67.2
Gandiva    77.4          87.1       164.5
c-data     <1.0          <1.0       <1.0
Total      141.3         167.7


It’s estimated that DLLs would be slightly larger than the dylib files,
so
that the proposed change would increase the size of the dataset jar from
78.3 MB to about 114 MB.
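
As a quick check on that estimate, the arithmetic can be sketched as follows.
The class and method names are hypothetical; the figures are the ones quoted
above. The implied DLL size comes out at about 35.7 MB, slightly above the
34.6 MB dylib.

```java
import java.util.Locale;

public final class JarSizeEstimate {
    private JarSizeEstimate() {}

    /** Size the added DLL must have for the jar to reach the projected size. */
    public static double impliedDllSizeMb(double currentJarMb, double projectedJarMb) {
        return projectedJarMb - currentJarMb;
    }

    public static void main(String[] args) {
        double dylibMb = 34.6;            // dataset .dylib, from the table
        double soMb = 43.7;               // dataset .so, from the table
        double current = dylibMb + soMb;  // combined native libraries today
        double dll = impliedDllSizeMb(current, 114.0);
        System.out.println(String.format(Locale.ROOT,
                "current: %.1f MB, implied DLL: %.1f MB", current, dll));
    }
}
```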

For reference, here are the native Arrow libraries (.so) in a PyArrow
x86-64 wheel:

Library          Size (MB)
Dataset          2.3
Flight           13.0
Python           2.1
Python-flight    0.1
Plasma           0.2
Parquet          4.3
Arrow            49.0
Total            71.0

Note that this isn't an apples-to-apples comparison: the PyArrow libraries
do not include Gandiva, while the Java libraries do not include Flight,
Plasma, Parquet, or (presumably) some amount of the code in the Arrow
file.

As more C++ functionality is used by Java code, the number of modules with
native dependencies may rise, and the size of the individual libraries may
increase.

For the sake of simplicity, it is preferable to produce a single jar for
each module that contains binaries for the three platforms: Windows, OSX,
and Linux. If file size is a significant concern, there are several
options:



    - Stripping some symbols (`strip -x`) on the Linux dataset JNI library
      brings it down from 43 to 34 MB, at the cost of debug information. It
      may be worth considering this option for release builds.
    - It may be possible to combine modules to reduce the amount of
      duplicated code for projects that need more than one module with
      native dependencies.
    - OS-specific Maven packages could be built.


Thank you for your feedback,

larry


