I updated the link and opened Jira tickets for the work.

On Wed, May 4, 2022 at 11:21 AM Alessandro Molina <alessan...@ursacomputing.com> wrote:
> The proposal seems reasonable to me. We should do our best to give
> users the same experience on the various systems whenever possible.
>
> As long as we don't receive complaints about the package size, I think we
> can live with it. If it becomes a problem for our users, we can always make
> per-system binaries in the future.
>
> PS: I think you forgot to enable comments on the Google Doc. That's
> something you usually want to allow, as it makes providing feedback easier.
>
> On Tue, May 3, 2022 at 4:19 PM Larry White <ljw1...@gmail.com> wrote:
>
> > Hi all,
> >
> > Please see
> > https://docs.google.com/document/d/1y25kRrXlORnUD9p7wTMOWjC6wONEyI9rU-Pv4q1udZ8/edit?usp=sharing
> > for a copy of this email with proper formatting.
> >
> > thanks.
> >
> > On Mon, May 2, 2022 at 4:23 PM Larry White <ljw1...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I would like to request your feedback on incorporating Windows binaries
> > > in those Maven packages that have native Arrow dependencies, while
> > > drawing your attention to the likely impact on jar size.
> > >
> > > Five of the 23 Arrow packages on Maven Central have native dependencies.
> > > Four of those five bundle their native libraries in the Maven package
> > > jar itself. (The exception is the plasma package.) For those four, both
> > > .so (Linux shared-object) and .dylib (OSX dynamic library) files are
> > > provided in the same jar. Windows native libraries are not included.
> > >
> > > The packages in question are:
> > >
> > > - arrow-dataset
> > > - arrow-orc
> > > - arrow-c
> > > - arrow-gandiva
> > >
> > > For developers using Arrow on OSX or Linux, the experience using the
> > > arrow-dataset jar with its bundled native library is the same as using a
> > > pure Java library.
> > > Including Windows binaries in the jars would expand the community of
> > > developers who could use Arrow features like datasets “out of the box.”
> > >
> > > Moreover, it is not trivial for devs on Windows to create their own
> > > solution. To the best of my knowledge, pre-compiled JNI DLLs are not
> > > available for download, and there are no build scripts or instructions,
> > > as there are for Linux and Mac users (see
> > > https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules
> > > ).
> > >
> > > Effort
> > >
> > > To produce the JNI DLLs, the main effort will be to create new
> > > Windows-focused build scripts similar to
> > > arrow/ci/scripts/java_jni_macos_build.sh
> > > (https://github.com/apache/arrow/tree/master/ci/scripts) and to
> > > incorporate them into the larger build process.
> > >
> > > Creating these build files is a prerequisite for the suggested packaging
> > > changes, but is also desirable in its own right, even if the proposed
> > > packaging change is not implemented.
> > >
> > > File size concern
> > >
> > > The downside of including Windows binaries is that these files are large.
> > > In the 7.0.0 release, the two native library files included in the
> > > dataset jar total 78 MB on disk, which is roughly 100% of the total size
> > > of the jar. See the table below for more details.
> > >
> > > module     .dylib (MB)   .so (MB)   Combined (MB)
> > > dataset           34.6       43.7            78.3
> > > ORC               29.3       37.9            67.2
> > > Gandiva           77.4       87.1           164.5
> > > c-data            <1.0       <1.0            <1.0
> > > Total            141.3      167.7
> > >
> > > It’s estimated that DLLs would be slightly larger than the dylib files,
> > > so the proposed change would increase the size of the dataset jar from
> > > 78.3 MB to about 114 MB.
> > >
> > > For reference, here are the native Arrow libraries (.so) in a PyArrow
> > > x86-64 wheel:
> > >
> > > library         size (MB)
> > > Dataset               2.3
> > > Flight               13.0
> > > Python                2.1
> > > Python-flight         0.1
> > > Plasma                0.2
> > > Parquet               4.3
> > > Arrow                49.0
> > > Total                71.0
> > >
> > > Note that this isn't an apples-to-apples comparison: the PyArrow
> > > libraries do not include Gandiva, while the Java libraries do not include
> > > Flight, Plasma, Parquet, or (presumably) some amount of the code in the
> > > Arrow file.
> > >
> > > As more C++ functionality is used by Java code, the number of modules
> > > with native dependencies may rise, and the size of the individual
> > > libraries may increase.
> > >
> > > For the sake of simplicity, it is preferable to produce a single jar for
> > > each module that contains binaries for the three platforms: Windows, OSX,
> > > and Linux. If file size is a significant concern, there are several
> > > options:
> > >
> > > - Stripping some symbols (`strip -x`) on the Linux dataset JNI library
> > >   brings it down from 43 to 34 MB, at the cost of debug information. It
> > >   may be worth considering this option for release builds.
> > > - It may be possible to combine modules to reduce the amount of
> > >   duplicated code for projects that need more than one module with
> > >   native dependencies.
> > > - OS-specific Maven packages could be built.
> > >
> > > Thank you for your feedback,
> > >
> > > larry
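
The dataset-jar estimate in the quoted proposal can be checked with a few lines of arithmetic. The sketch below is only an illustration: the sizes are copied from the 7.0.0 figures quoted above, the 114 MB total is the proposal's own estimate, and the function and variable names are made up for this example rather than taken from any Arrow build script.

```python
# Sizes in MB for the 7.0.0 dataset jar, copied from the quoted table.
DATASET_DYLIB_MB = 34.6
DATASET_SO_MB = 43.7

# The proposal's estimate for the dataset jar once a Windows DLL is added.
ESTIMATED_JAR_WITH_DLL_MB = 114.0

# Approximate Linux library sizes before and after `strip -x`, as quoted.
SO_UNSTRIPPED_MB = 43.0
SO_STRIPPED_MB = 34.0


def implied_dll_size_mb(dylib_mb, so_mb, estimated_total_mb):
    """Return the DLL size implied by the estimated total jar size."""
    return estimated_total_mb - (dylib_mb + so_mb)


def percent_change(before, after):
    """Return the change from `before` to `after` as a percentage of `before`."""
    return 100.0 * (after - before) / before


current_mb = DATASET_DYLIB_MB + DATASET_SO_MB
dll_mb = implied_dll_size_mb(DATASET_DYLIB_MB, DATASET_SO_MB, ESTIMATED_JAR_WITH_DLL_MB)
jar_growth_pct = percent_change(current_mb, ESTIMATED_JAR_WITH_DLL_MB)
strip_change_pct = percent_change(SO_UNSTRIPPED_MB, SO_STRIPPED_MB)

print(f"current native payload: {current_mb:.1f} MB")
print(f"implied DLL size: {dll_mb:.1f} MB ({dll_mb / DATASET_DYLIB_MB:.2f}x the .dylib)")
print(f"jar growth from adding the DLL: {jar_growth_pct:.0f}%")
print(f"size change from strip -x on the .so: {strip_change_pct:.0f}%")
```

With these inputs the implied DLL comes out at about 35.7 MB, only a few percent larger than the .dylib, which is consistent with the "slightly larger" wording in the proposal.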