I updated the link and opened Jira tickets for the work.

On Wed, May 4, 2022 at 11:21 AM Alessandro Molina <
alessan...@ursacomputing.com> wrote:

> The proposal seems reasonable to me. We should do our best to provide
> users the same experience on the various systems whenever possible.
>
> As long as we don't receive complaints about the package size, I think we
> can live with it. If it becomes a problem for our users, we can always make
> per-system binaries in the future.
>
> PS: I think you forgot to enable comments on the Google Doc. That's
> something you usually want to allow, as it makes providing feedback easier.
>
> On Tue, May 3, 2022 at 4:19 PM Larry White <ljw1...@gmail.com> wrote:
>
> > Hi all,
> >
> > Please see
> > https://docs.google.com/document/d/1y25kRrXlORnUD9p7wTMOWjC6wONEyI9rU-Pv4q1udZ8/edit?usp=sharing
> > for a copy of this email with proper formatting.
> >
> > thanks.
> >
> > On Mon, May 2, 2022 at 4:23 PM Larry White <ljw1...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > >
> > > I would like to request your feedback on incorporating Windows binaries
> > > in those Maven packages that have native Arrow dependencies, while drawing
> > > your attention to the likely impact on jar size.
> > >
> > >
> > > Five of the 23 Arrow packages on Maven Central have native dependencies.
> > > Four of those five have bundled native libraries included in the Maven
> > > package jar itself. (The exception is the plasma package.) For the other
> > > four, both .so (Linux shared-object) and .dylib (OSX dynamic library)
> > > files are provided in the same jar. Windows native libraries are not
> > > included.
> > >
> > >
> > > The packages in question are:
> > >
> > >    - arrow-dataset
> > >    - arrow-orc
> > >    - arrow-c
> > >    - arrow-gandiva
> > >
> > >
> > > For developers using Arrow on OSX or Linux, the experience using the
> > > arrow-dataset jar with its bundled native library is the same as using a
> > > pure Java library. Including Windows binaries in the jars would expand the
> > > community of developers who could use Arrow features like datasets
> > > “out of the box.”
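
As an aside on why a jar with bundled native libraries can feel like a pure Java library: a loader typically picks the library file that matches the running OS and extracts it from the jar at startup. Below is a minimal sketch of the name-selection step only. The class name `NativeLibraryNames` and the base name `arrow_dataset_jni` are assumptions for illustration; this is not Arrow's actual loader code.

```java
import java.util.Locale;

public class NativeLibraryNames {
    /**
     * Maps an os.name value to the file name a bundled JNI library
     * would conventionally have on that platform.
     */
    public static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check macOS first: "darwin" contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        } else if (os.contains("win")) {
            return baseName + ".dll";
        } else {
            // Treat everything else as a Linux-style shared object.
            return "lib" + baseName + ".so";
        }
    }

    public static void main(String[] args) {
        // "arrow_dataset_jni" is a hypothetical base name used for illustration.
        String name = fileNameFor(System.getProperty("os.name"), "arrow_dataset_jni");
        System.out.println("Would look for bundled resource: " + name);
    }
}
```

With only .so and .dylib files in the jar, the Windows branch has nothing to find, which is the gap the proposal is meant to close.
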
> > >
> > >
> > > Moreover, it is not trivial for devs on Windows to create their own
> > > solution. To the best of my knowledge, pre-compiled JNI DLLs are not
> > > available for download, and there are no build scripts or instructions,
> > > as there are for Linux and Mac users (see
> > > https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules
> > > ).
> > >
> > > Effort
> > >
> > > To produce the JNI DLLs, the main effort will be to create new
> > > Windows-focused build scripts similar to
> > > ci/scripts/java_jni_macos_build.sh in the apache/arrow repository
> > > (https://github.com/apache/arrow/tree/master/ci/scripts), and to
> > > incorporate them into the larger build process.
> > >
> > >
> > > Creating these build files is a prerequisite for the suggested packaging
> > > changes, but it is also desirable in its own right, even if the proposed
> > > packaging change is not implemented.
> > >
> > > File size concern
> > >
> > > The downside of including Windows binaries is that these files are large.
> > > In the 7.0.0 release, the two native library files included in the dataset
> > > jar total 78 MB on disk, which is roughly 100% of the total size of the
> > > jar. See the table below for more details.
> > >
> > > module     .dylib (MB)   .so (MB)   Combined (MB)
> > > dataset    34.6          43.7       78.3
> > > ORC        29.3          37.9       67.2
> > > Gandiva    77.4          87.1       164.5
> > > c-data     <1.0          <1.0       <1.0
> > > Total      141.3         167.7
> > >
> > >
> > >
> > > It’s estimated that DLLs would be slightly larger than the dylib files, so
> > > the proposed change would increase the size of the dataset jar from
> > > 78.3 MB to about 114 MB.
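
The arithmetic behind that estimate can be sketched as follows. The dylib and .so sizes are the dataset figures quoted in this thread. The 1.03 ratio of DLL size to dylib size is purely an assumption chosen to stand in for "slightly larger", and the class and method names are hypothetical.

```java
public class JarSizeEstimate {
    /**
     * Estimated native payload in MB once a Windows DLL is bundled
     * alongside the existing .dylib and .so files.
     */
    public static double withDll(double dylibMb, double soMb, double dllToDylibRatio) {
        // Assumption: the DLL size scales with the dylib size.
        double dllMb = dylibMb * dllToDylibRatio;
        return dylibMb + soMb + dllMb;
    }

    public static void main(String[] args) {
        // 34.6 (.dylib) and 43.7 (.so) are the dataset sizes from the thread;
        // 1.03 is an assumed ratio, not a measured one.
        double estimate = withDll(34.6, 43.7, 1.03);
        System.out.printf("Estimated dataset jar size: %.1f MB%n", estimate);
    }
}
```

With that assumed ratio the result lands close to the 114 MB figure; a measured DLL size would replace the ratio once Windows builds exist.
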
> > >
> > > For reference, here are the native Arrow libraries (.so) in a PyArrow
> > > x86-64 wheel:
> > >
> > > library         size (MB)
> > > Dataset         2.3
> > > Flight          13.0
> > > Python          2.1
> > > Python-flight   0.1
> > > Plasma          0.2
> > > Parquet         4.3
> > > Arrow           49.0
> > > Total           71.0
> > >
> > > Note that this isn't an apples-to-apples comparison: the PyArrow libraries
> > > do not include Gandiva, while the Java libraries do not include Flight,
> > > Plasma, Parquet, or (presumably) some amount of the code in the Arrow
> > > file.
> > >
> > > As more C++ functionality is used by Java code, the number of modules with
> > > native dependencies may rise, and the size of the individual libraries may
> > > increase.
> > >
> > > For the sake of simplicity, it is preferable to produce a single jar for
> > > each module that contains binaries for the three platforms: Windows, OSX,
> > > and Linux. If file size is a significant concern, there are several
> > > options:
> > >
> > >
> > >
> > >    - Stripping some symbols (`strip -x`) on the Linux dataset JNI library
> > >      brings it down from 43 to 34 MB, at the cost of debug information. It
> > >      may be worth considering this option for release builds.
> > >    - It may be possible to combine modules to reduce the amount of
> > >      duplicated code for projects that need more than one module with
> > >      native dependencies.
> > >    - OS-specific Maven packages could be built.
> > >
> > >
> > > Thank you for your feedback,
> > >
> > > larry
> > >
> >
>
