Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)

Jacques Nadeau Sun, 30 Aug 2020 22:12:37 -0700

I didn't realize that Ishizaki isn't just proposing BE platform support;
he is proposing a new BE version of the format. In this situation, computers
speaking Arrow potentially have to convert from one version to the other.
For example, two machines communicating over Arrow Flight now have
to do byte-order swaps if they have different endianness. This makes me
feel even less positive about the proposal.
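
To make that conversion cost concrete, here is a minimal sketch of what a
receiver would have to do per buffer. It is written in Python for brevity,
it is not Arrow's IPC or Flight code, and every function name in it is
hypothetical. It assumes a buffer of fixed-width 32-bit values and assumes
the sender's byte order is known to the receiver (for example, from a flag
carried with the schema).

```python
import struct
import sys


def encode_int32s(values, byteorder):
    """Pack ints as 32-bit values in the given byte order ('little' or 'big')."""
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack(f"{prefix}{len(values)}i", *values)


def decode_int32s(buf, byteorder):
    """Unpack a buffer of 32-bit values that is laid out in the given byte order."""
    prefix = "<" if byteorder == "little" else ">"
    return list(struct.unpack(f"{prefix}{len(buf) // 4}i", buf))


def swap_int32_buffer(buf):
    """Return a copy of buf with the bytes of every 4-byte element reversed."""
    if len(buf) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4")
    out = bytearray(len(buf))
    for i in range(0, len(buf), 4):
        out[i:i + 4] = buf[i:i + 4][::-1]
    return bytes(out)


def receive_int32s(buf, sender_byteorder, receiver_byteorder=sys.byteorder):
    """Hypothetical receive step: swap only when the two byte orders differ."""
    if sender_byteorder == receiver_byteorder:
        return buf  # same endianness: hand back the same buffer, no copy
    return swap_int32_buffer(buf)  # different endianness: one full pass
```

When both sides agree on byte order the buffer passes through untouched;
when they differ, every element of every fixed-width buffer has to be
rewritten once on receipt, and each value width needs its own swap routine.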


In terms of my main concern: it was around demand versus cost. I don't
believe that has been addressed. There have been ways discussed to try to
minimize cost but that cost is non-zero whereas the demand feels very close
to zero. As far as I can tell, there was one ask for big-endian support in
August of 2016 (Sanjay) and none since [1]. Even in these threads I don't
hear a resounding demand for big endian. There are many features that are
far less invasive/expensive to support that have much more demand.

This feels like something that could easily rot and/or just become a load
on the core maintainers of the project. I know I don't want to have to go
and debug on a remote BE system if some test starts failing for that
platform...

[1] Only a skim of archives, there may have been others:
https://markmail.org/thread/7gbf5est6mcsnc6l




On Sun, Aug 30, 2020 at 8:39 PM Micah Kornfield <[email protected]>
wrote:

> Looking over the outstanding PRs, while the code isn't necessarily pretty, I
> don't think they are too invasive.
>
> Also it seems that Kazuaki Ishizaki is willing to add benchmarks where
> necessary to verify the lack of performance regressions.  (Please correct
> me if I misunderstood).
>
> Jacques and Liya Fan, does this address your concerns?  Are there further
> details that you would like to discuss?  Are you still opposed to support
> in Java?
>
> Do maintainers of other implementations have concerns (in particular Go
> seems to be the other language in progress)?
>
> Thanks,
> Micah
>
> On Wed, Aug 26, 2020 at 6:57 AM Kazuaki Ishizaki <[email protected]>
> wrote:
>
> > Hi,
> > I waited for comments regarding Java Big-Endian (BE) support during my
> > one-week vacation. Thank you for the good suggestions and comments.
> > I already responded to some questions in another mail. This mail addresses
> > the remaining questions: use cases, holistic strategy for BE support, and
> > testing plans.
> >
> > 1. Use cases
> > The use case of Arrow Java is in Apache Spark, which was already published
> > in the Arrow blog [1]. This is used as the typical performance acceleration
> > of Spark with other languages such as Python [2] and R [3]. In Databricks
> > notebooks, 68% of commands come from Python [4].
> >
> > 2. Holistic strategy of BE support across languages
> > I mostly completed BE support in C++. This implementation uses the
> > following strategy:
> > A. Write and read data in a record batch using platform-native endian (NE)
> > when the data is created on a host. The endianness is stored in an endian
> > field in the schema.
> > B. Send data using the IPC-host endian among processes using IPC.
> > C. At B, if the IPC-client endian is the same as the received data
> > endian, the IPC client receives data without data copy.
> > D. At B, if the IPC-client endian is different from the received data
> > endian, the IPC client swaps the endian of the received data to match the
> > IPC-client endian by default.
> > E. The primitive data types in memory (e.g. Decimal128 in C++ and
> > UnsafeDirectLittleEndian in Java) are read/written using the NE.
> >
> > A and B-C are the typical use cases in Apache Arrow. No endian swap
> > occurs in these use cases, so there is no performance overhead. B-D is
> > rarely used (e.g. sending data from x86_64 to s390x). In that case, the
> > data swap occurs only once, at receive time. After that, no further data
> > swap occurs, which preserves performance. For some use cases, this swap
> > can be disabled by using an option. In these cases, Arrow will not process
> > any data.
> > E allows us to access primitive data (e.g. int32, double, decimal128)
> > without performance loss by using the platform-native endian loads/stores.
> >
> > 2-1. Implementation strategy in Java Language
> > The existing primitive data structures such as UnsafeDirectLittleEndian,
> > ArrowBuf, and ValueVector should handle platform-native endian for
> > strategies A, B-C, and E without performance overhead.
> > For the remaining strategy D, the method
> > MessageSerializer.deserializeRecordBatch() will handle the data swap when
> > the endian of the host is different from that of the client, which
> > corresponds to the PR [6] in C++.
> >
> > 3. Testing plan
> > For testing strategies A, B-C, and E, it would be good to increase
> > the test coverage regardless of endianness (e.g. increase the types of a
> > schema to be tested in flight-core).
> > For testing strategy D, I already prepared data for BE and LE. When a
> > PR enables the data swap, that PR will also enable the integration test.
> > For performance testing, we can use the existing framework [7] by
> > extending the support for other languages. We can run performance
> > benchmarks on a little-endian platform to avoid performance regression.
> >
> > [1] https://arrow.apache.org/blog/2017/07/26/spark-arrow/
> > [2] https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
> > [3] https://databricks.com/jp/blog/2020/06/01/vectorized-r-i-o-in-upcoming-apache-spark-3-0.html
> > [4] https://databricks.com/jp/session_na20/wednesday-morning-keynotes
> > [5] https://github.com/apache/arrow/pull/7507#discussion_r46819873
> > [6] https://github.com/apache/arrow/pull/7507
> > [7] https://github.com/apache/arrow/pull/7940#issuecomment-672690540
> >
> > Best Regards,
> > Kazuaki Ishizaki
> >
> > Wes McKinney <[email protected]> wrote on 2020/08/26 21:27:49:
> >
> > > From: Wes McKinney <[email protected]>
> > > To: dev <[email protected]>, Micah Kornfield <[email protected]
> >
> > > Cc: Fan Liya <[email protected]>
> > > Date: 2020/08/26 21:28
> > > Subject: [EXTERNAL] Re: [DISCUSS] Big Endian support in Arrow (was:
> > > Re: [Java] Supporting Big Endian)
> > >
> > > hi Micah,
> > >
> > > I agree with your reasoning. If supporting BE in some languages (e.g.
> > > Java) is impractical due to performance regressions on LE platforms,
> > > then I don't think it's worth it. But if it can be handled at compile
> > > time or without runtime overhead, and tested / maintained properly on
> > > an ongoing basis, then it seems reasonable to me. It seems that the
> > > number of Arrow stakeholders will only increase from here so I would
> > > hope that there will be more people invested in helping maintain BE in
> > > the future.
> > >
> > > - Wes
> > >
> > > On Tue, Aug 25, 2020 at 11:33 PM Micah Kornfield
> > > <[email protected]> wrote:
> > > >
> > > > I'm expanding the scope of this thread since it looks like work has
> > > > also started for making golang support BigEndian architectures.
> > > >
> > > > I think as a community we should come to a consensus on whether we
> > > > want to support Big Endian architectures in general.  I don't think
> > > > it is a good outcome if some implementations accept PRs for Big
> > > > Endian fixes and some don't.
> > > >
> > > > But maybe this is OK with others?
> > > >
> > > > My current opinion on the matter is that we should support it under
> > > > the following conditions:
> > > >
> > > > 1.  As long as there is CI in place to catch regressions (right now I
> > > > think the CI is fairly unreliable?)
> > > > 2.  No degradation in performance for little-endian architectures
> > > > (verified by additional micro benchmarks)
> > > > 3.  Not a large amount of invasive code to distinguish between
> > > > platforms.
> > > >
> > > > Kazuaki Ishizaki, I asked this question previously, but could you
> > > > give some data points around:
> > > > 1.  The current state of C++ support (how much code needed to
> > > > change)?
> > > > 2.  How many more PRs you expect to need for Java (and approximate
> > > > size)?
> > > >
> > > > I think this would help me and others in the decision making
> > > > process.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield <
> [email protected]
> > >
> > > > wrote:
> > > >
> > > > > My thoughts on the points raised so far:
> > > > >
> > > > > * Does supporting Big Endian increase the reach of Arrow by a lot?
> > > > >
> > > > > Probably not a significant amount, but it does provide one more
> > > > > avenue of adoption.
> > > > >
> > > > > * Does it increase code complexity?
> > > > >
> > > > > Yes.  I agree this is a concern.  The PR in question did not seem
> > > > > too bad to me but this is subjective.  I think the remaining
> > > > > question is how many more places need to be fixed up in the code
> > > > > base and how invasive the changes are.  In C++ IIUC it turned out
> > > > > to be a relatively small number of places.
> > > > >
> > > > > Kazuaki Ishizaki, have you been able to get the Java implementation
> > > > > working fully locally?  How many additional PRs will be needed and
> > > > > what do they look like (I think there are already a few more in the
> > > > > queue)?
> > > > >
> > > > > * Will it introduce performance regressions?
> > > > >
> > > > > If done properly I suspect no, but I think if we continue with
> > > > > BigEndian support the places that need to be touched should have
> > > > > benchmarks added to confirm this (including for PRs already merged).
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <[email protected]>
> > wrote:
> > > > >
> > > > > >> Thanks to Kazuaki Ishizaki for working on this.
> > > > > >> IMO, supporting big-endian would be a large change, as in many
> > > > > >> places of the code base, we have implicitly assumed a
> > > > > >> little-endian platform (e.g.
> > > > > >> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
> > > > > >> ).
> > > > >> Supporting the big-endian platform may introduce branches in such
> > places
> > > > >> (or virtual calls) which will affect the performance.
> > > > >> So it would be helpful to evaluate the performance impact.
> > > > >>
> > > > >> Best,
> > > > >> Liya Fan
> > > > >>
> > > > >>
> > > > >> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <
> [email protected]>
> > > > >> wrote:
> > > > >>
> > > > >>> Hey Micah, thanks for starting the discussion.
> > > > >>>
> > > > > >>> I just skimmed that thread and it isn't entirely clear that there
> > > > > >>> was a conclusion that the overhead was worth it. I think everybody
> > > > > >>> agrees that it would be nice to have the code work on both
> > > > > >>> platforms. On the flipside, the code noise for a rare case makes
> > > > > >>> the cost-benefit questionable.
> > > > > >>>
> > > > > >>> In the Java code, we wrote the code to explicitly disallow big
> > > > > >>> endian platforms and put preconditions checks in. I definitely
> > > > > >>> think if we want to support this, it should be done holistically
> > > > > >>> across the code with an appropriate test plan (both functional and
> > > > > >>> perf).
> > > > >>>
> > > > > >>> To me, the question is really about how many use cases are
> > > > > >>> blocked by this.
> > > > > >>> I'm not sure I've heard anyone say that the limiting factor to
> > > > > >>> leveraging Java Arrow was the block on endianness. Keep in mind
> > > > > >>> that until very recently, using any Arrow Java code would throw a
> > > > > >>> preconditions check before you could even get started on
> > > > > >>> big-endian and I don't think we've seen a bunch of messages on
> > > > > >>> that exception. Adding if conditions throughout the codebase like
> > > > > >>> this patch: [1] isn't exactly awesome and it can also risk
> > > > > >>> performance impacts depending on how carefully it is done.
> > > > >>>
> > > > > >>> If there isn't a preponderance of evidence of many users being
> > > > > >>> blocked by this capability, I don't think we should accept the
> > > > > >>> code. We already have a backlog of items that we need to address
> > > > > >>> just to ensure existing use cases work well. Expanding to new use
> > > > > >>> cases that there is no clear demand for will likely just increase
> > > > > >>> code development cost at little benefit.
> > > > >>>
> > > > >>> What do others think?
> > > > >>>
> > > > > >>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
> > > > >>>
> > > > >>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <
> > [email protected]>
> > > > >>> wrote:
> > > > >>>
> > > > > >>> > Kazuaki Ishizaki has started working on Big Endian support in
> > > > > >>> > Java (including setting up CI for it).  Thank you!
> > > > > >>> >
> > > > > >>> > We previously discussed support for Big Endian architectures in
> > > > > >>> > C++ [1] and generally agreed that it was a reasonable thing to
> > > > > >>> > do.
> > > > > >>> >
> > > > > >>> > Similar to C++ I think as long as we have a working CI setup it
> > > > > >>> > is reasonable for Java to support Big Endian machines.
> > > > > >>> >
> > > > > >>> > But I think there might be differing opinions so it is worth a
> > > > > >>> > discussion to see if there are technical blockers or other
> > > > > >>> > reasons for not supporting Big Endian architectures in the
> > > > > >>> > existing Java implementation.
> > > > >>> >
> > > > >>> > Thanks,
> > > > >>> > Micah
> > > > >>> >
> > > > >>> >
> > > > >>> > [1]
> > > > >>> >
> > > > >>> >
> > > > >>> https://urldefense.proofpoint.com/v2/url?
> > >
> >
> u=https-3A__lists.apache.org_thread.html_rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019-2540-253Cdev.arrow.apache.org-253E&d=DwIBaQ&c=jf_iaSHvJObTbx-
> > > siA1ZOg&r=b70dG_9wpCdZSkBJahHYQ4IwKMdp2hQM29f-
> > > ZCGj9Pg&m=3rVsa9EYwGOrvQw8rg0L9EtFs7I7B-
> > > n7ezRb8qyWtog&s=oDBWI9pmI39bTsEieQNDxZit0My21hLIW0fJRPJI0AM&e=
> > > > >>> >
> > > > >>>
> > > > >>
> > >
> >
> >
>
