Le 31/08/2020 à 07:11, Jacques Nadeau a écrit :
> I didn't realize that Ishizaki isn't just proposing BE platform support,
> he is proposing a new BE version of the format.

What do you mean? The Endianness field (a Big|Little enum) was added 4 years
ago: https://issues.apache.org/jira/browse/ARROW-245

The format doc has the following wording:

"""
Serialized Schema metadata has an endianness field indicating endianness of
RecordBatches. Typically this is the endianness of the system where the
RecordBatch was generated. [...] The reference implementation is focused on
Little Endian and provides tests for it. Eventually we may provide automatic
conversion via byte swapping.
"""

So here is how I read it:

* the format allows conveying big-endian as well as little-endian data
  (since it's "typically the endianness of the system where the RecordBatch
  was generated")
* the reference implementations may or may not choose to support reading
  foreign-endian data by byte swapping on reception

Regards

Antoine.

> In this situation, computers speaking Arrow potentially have to convert
> from one version to the other. For example, two machines communicating
> with Arrow Flight now have to do byte-order swaps if they are speaking
> different endiannesses. This makes me feel even less positive about the
> proposal.
>
> In terms of my main concern: it was around demand versus cost. I don't
> believe that has been addressed. There have been ways discussed to try
> to minimize cost, but that cost is non-zero, whereas the demand feels
> very close to zero. As far as I can tell, there was one ask for
> big-endian support, in August of 2016 (Sanjay), and none since [1]. Even
> in these threads I don't hear a resounding demand for big endian. There
> are many features that are far less invasive/expensive to support that
> have much more demand.
>
> This feels like something that could easily rot and/or just become a
> load on the core maintainers of the project. I know I don't want to have
> to go and debug on a remote BE system if some tests start failing for
> that platform...
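The "automatic conversion via byte swapping" the spec alludes to can be illustrated with the JDK alone. This is a hedged sketch, not Arrow's reader API (the class and method names here are invented for illustration): a consumer that interprets the payload in the byte order the producer declared, e.g. via the schema's endianness field, gets any needed swap from `java.nio.ByteBuffer`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not Arrow's API: shows how a consumer can honor
// the byte order declared by the producer and let java.nio do the swap.
public class EndianCheck {

    // Reads an int32 from a payload in the producer's declared byte
    // order; getInt() honors the buffer's configured order, so the
    // consumer decodes the same value on any platform.
    static int readInt32(byte[] payload, ByteOrder producerOrder) {
        return ByteBuffer.wrap(payload).order(producerOrder).getInt();
    }

    public static void main(String[] args) {
        byte[] fromBigEndianHost = {0x00, 0x00, 0x00, 0x2A};    // 42, big-endian
        byte[] fromLittleEndianHost = {0x2A, 0x00, 0x00, 0x00}; // 42, little-endian
        // Both decode to 42 regardless of the local platform's endianness.
        System.out.println(readInt32(fromBigEndianHost, ByteOrder.BIG_ENDIAN));       // 42
        System.out.println(readInt32(fromLittleEndianHost, ByteOrder.LITTLE_ENDIAN)); // 42
    }
}
```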
>
> [1] Only a skim of archives; there may have been others:
> https://markmail.org/thread/7gbf5est6mcsnc6l
>
> On Sun, Aug 30, 2020 at 8:39 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Looking over the outstanding PRs, while the code isn't necessarily
>> pretty, I don't think the changes are too invasive.
>>
>> Also, it seems that Kazuaki Ishizaki is willing to add benchmarks where
>> necessary to verify the lack of performance regressions. (Please
>> correct me if I misunderstood.)
>>
>> Jacques and Liya Fan, does this address your concerns? Are there
>> further details that you would like to discuss? Are you still opposed
>> to support in Java?
>>
>> Do maintainers of other implementations have concerns (in particular,
>> Go seems to be the other language in progress)?
>>
>> Thanks,
>> Micah
>>
>> On Wed, Aug 26, 2020 at 6:57 AM Kazuaki Ishizaki <ishiz...@jp.ibm.com>
>> wrote:
>>
>>> Hi,
>>> I waited for comments regarding Java big-endian (BE) support during my
>>> one-week vacation. Thank you for the good suggestions and comments.
>>> I already responded to some questions in another mail. This mail
>>> addresses the remaining questions: use cases, the holistic strategy
>>> for BE support, and testing plans.
>>>
>>> 1. Use cases
>>> The use case of Arrow Java is in Apache Spark, which was already
>>> described on the Arrow blog [1]. This is the typical way to accelerate
>>> Spark with other languages such as Python [2] and R [3]. In Databricks
>>> notebooks, 68% of commands come from Python [4].
>>>
>>> 2. Holistic strategy for BE support across languages
>>> I have mostly completed BE support in C++. This implementation uses
>>> the following strategy:
>>> A. Write and read data in a record batch using the platform-native
>>> endianness (NE) when the data is created on a host. The endianness is
>>> stored in the endianness field of the schema.
>>> B. Send data using the IPC host's endianness among processes using
>>> IPC.
>>> C. At B, if the IPC client's endianness is the same as that of the
>>> received data, the IPC client receives the data without a data copy.
>>> D. At B, if the IPC client's endianness is different from that of the
>>> received data, the IPC client by default swaps the endianness of the
>>> received data to match its own.
>>> E. The primitive data types in memory (e.g. Decimal128 in C++ and
>>> UnsafeDirectLittleEndian in Java) are read/written using the NE.
>>>
>>> A and B-C are the typical use cases in Apache Arrow, so no endianness
>>> swap occurs in these cases and there is no performance overhead. B-D
>>> is rarely used (e.g. sending data from x86_64 to s390x), and the data
>>> swap occurs only once, at receipt; after that, no further swap occurs,
>>> for performance. For some use cases, this swap can be disabled by an
>>> option; in these cases, Arrow will not process the data.
>>> E allows us to access primitive data (e.g. int32, double, decimal128)
>>> without performance loss by using platform-native-endian loads/stores.
>>>
>>> 2-1. Implementation strategy in Java
>>> The existing primitive data structures such as
>>> UnsafeDirectLittleEndian, ArrowBuf, and ValueVector should handle the
>>> platform-native endianness for strategies A, B-C, and E without
>>> performance overhead.
>>> For the remaining strategy D, the method
>>> MessageSerializer.deserializeRecordBatch() will handle the data swap
>>> when the endianness of the host is different from that of the client,
>>> which corresponds to the PR [6] in C++.
>>>
>>> 3. Testing plan
>>> For testing strategies A, B-C, and E, it would be good to increase the
>>> test coverage regardless of endianness (e.g. increase the types of a
>>> schema to be tested in flight-core).
>>> For testing strategy D, I have already prepared data for BE and LE.
>>> When a PR enables the data swap, that PR will also enable the
>>> integration test.
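Strategy D above, a one-time endianness swap at receipt after which all reads are native, can be sketched with plain JDK code. This is a hedged illustration, not the real MessageSerializer internals; the class name and flat byte-array buffer are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of strategy D: when the schema's declared
// endianness differs from the local one, swap every int32 slot in the
// value buffer once, then read natively from then on.
public class SwapOnReceipt {

    // In-place byte swap of a buffer holding int32 values.
    static void swapInt32Buffer(byte[] data) {
        for (int i = 0; i + 3 < data.length; i += 4) {
            byte b0 = data[i];
            byte b1 = data[i + 1];
            data[i] = data[i + 3];
            data[i + 1] = data[i + 2];
            data[i + 2] = b1;
            data[i + 3] = b0;
        }
    }

    public static void main(String[] args) {
        // Two int32 values (1 and 2) as produced on a big-endian host.
        byte[] received = {0, 0, 0, 1, 0, 0, 0, 2};
        swapInt32Buffer(received); // one-time swap at receipt
        // All later reads use the local (here little-endian) order.
        ByteBuffer buf = ByteBuffer.wrap(received).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(buf.getInt() + " " + buf.getInt()); // prints: 1 2
    }
}
```

The point of doing the swap once, eagerly, is exactly what the strategy describes: every subsequent read is a plain native load with no per-access branch.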
>>>
>>> For performance testing, we can use the existing framework [7] by
>>> extending its support to other languages. We can run performance
>>> benchmarks on a little-endian platform to guard against performance
>>> regressions.
>>>
>>> [1] https://arrow.apache.org/blog/2017/07/26/spark-arrow/
>>> [2] https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
>>> [3] https://databricks.com/jp/blog/2020/06/01/vectorized-r-i-o-in-upcoming-apache-spark-3-0.html
>>> [4] https://databricks.com/jp/session_na20/wednesday-morning-keynotes
>>> [5] https://github.com/apache/arrow/pull/7507#discussion_r46819873
>>> [6] https://github.com/apache/arrow/pull/7507
>>> [7] https://github.com/apache/arrow/pull/7940#issuecomment-672690540
>>>
>>> Best regards,
>>> Kazuaki Ishizaki
>>>
>>> Wes McKinney <wesmck...@gmail.com> wrote on 2020/08/26 21:27:49:
>>>
>>>> From: Wes McKinney <wesmck...@gmail.com>
>>>> To: dev <dev@arrow.apache.org>, Micah Kornfield <emkornfi...@gmail.com>
>>>> Cc: Fan Liya <liya.fa...@gmail.com>
>>>> Date: 2020/08/26 21:28
>>>> Subject: [EXTERNAL] Re: [DISCUSS] Big Endian support in Arrow (was:
>>>> Re: [Java] Supporting Big Endian)
>>>>
>>>> hi Micah,
>>>>
>>>> I agree with your reasoning. If supporting BE in some languages (e.g.
>>>> Java) is impractical due to performance regressions on LE platforms,
>>>> then I don't think it's worth it. But if it can be handled at compile
>>>> time or without runtime overhead, and tested / maintained properly on
>>>> an ongoing basis, then it seems reasonable to me. It seems that the
>>>> number of Arrow stakeholders will only increase from here, so I would
>>>> hope that there will be more people invested in helping maintain BE
>>>> support in the future.
>>>>
>>>> - Wes
>>>>
>>>> On Tue, Aug 25, 2020 at 11:33 PM Micah Kornfield
>>>> <emkornfi...@gmail.com> wrote:
>>>>>
>>>>> I'm expanding the scope of this thread, since it looks like work has
>>>>> also started on making the Go implementation support big-endian
>>>>> architectures.
>>>>>
>>>>> I think as a community we should come to a consensus on whether we
>>>>> want to support big-endian architectures in general. I don't think
>>>>> it is a good outcome if some implementations accept PRs for
>>>>> big-endian fixes and some don't.
>>>>>
>>>>> But maybe this is OK with others?
>>>>>
>>>>> My current opinion on the matter is that we should support it under
>>>>> the following conditions:
>>>>>
>>>>> 1. There is CI in place to catch regressions (right now I think the
>>>>> CI is fairly unreliable?).
>>>>> 2. No degradation in performance for little-endian architectures
>>>>> (verified by additional micro-benchmarks).
>>>>> 3. Not a large amount of invasive code to distinguish between
>>>>> platforms.
>>>>>
>>>>> Kazuaki Ishizaki, I asked these questions previously, but could you
>>>>> give some data points around:
>>>>> 1. The current state of C++ support (how much code needed to
>>>>> change)?
>>>>> 2. How many more PRs you expect to need for Java (and their
>>>>> approximate size)?
>>>>>
>>>>> I think this would help me and others in the decision-making
>>>>> process.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield
>>>>> <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> My thoughts on the points raised so far:
>>>>>>
>>>>>> * Does supporting big endian increase the reach of Arrow by a lot?
>>>>>>
>>>>>> Probably not by a significant amount, but it does provide one more
>>>>>> avenue of adoption.
>>>>>>
>>>>>> * Does it increase code complexity?
>>>>>>
>>>>>> Yes, I agree this is a concern. The PR in question did not seem too
>>>>>> bad to me, but this is subjective.
>>>>>> I think the remaining question is how many more places need to be
>>>>>> fixed up in the code base and how invasive the changes are. In C++,
>>>>>> IIUC, it turned out to be a relatively small number of places.
>>>>>>
>>>>>> Kazuaki Ishizaki, have you been able to get the Java implementation
>>>>>> working fully locally? How many additional PRs will be needed, and
>>>>>> what do they look like (I think there are already a few more in the
>>>>>> queue)?
>>>>>>
>>>>>> * Will it introduce performance regressions?
>>>>>>
>>>>>> If done properly, I suspect not, but I think if we continue with
>>>>>> big-endian support, the places that need to be touched should have
>>>>>> benchmarks added to confirm this (including for PRs already
>>>>>> merged).
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <liya.fa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you, Kazuaki Ishizaki, for working on this.
>>>>>>> IMO, supporting big endian would be a large change, as in many
>>>>>>> places of the code base we have implicitly assumed a little-endian
>>>>>>> platform (e.g.
>>>>>>> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
>>>>>>> ).
>>>>>>> Supporting big-endian platforms may introduce branches in such
>>>>>>> places (or virtual calls), which will affect performance.
>>>>>>> So it would be helpful to evaluate the performance impact.
>>>>>>>
>>>>>>> Best,
>>>>>>> Liya Fan
>>>>>>>
>>>>>>> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <jacq...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Micah, thanks for starting the discussion.
>>>>>>>>
>>>>>>>> I just skimmed that thread, and it isn't entirely clear that
>>>>>>>> there was a conclusion that the overhead was worth it. I think
>>>>>>>> everybody agrees that it would be nice to have the code work on
>>>>>>>> both platforms. On the flip side, the code noise for a rare case
>>>>>>>> makes the cost-benefit questionable.
>>>>>>>>
>>>>>>>> In the Java code, we wrote the code to explicitly disallow
>>>>>>>> big-endian platforms and put precondition checks in. I definitely
>>>>>>>> think that if we want to support this, it should be done
>>>>>>>> holistically across the code, with an appropriate test plan (both
>>>>>>>> functional and performance).
>>>>>>>>
>>>>>>>> To me, the question is really about how many use cases are
>>>>>>>> blocked by this. I'm not sure I've heard anyone say that the
>>>>>>>> limiting factor to leveraging Arrow Java was the block on
>>>>>>>> endianness. Keep in mind that until very recently, using any
>>>>>>>> Arrow Java code would throw a precondition check before you could
>>>>>>>> even get started on big endian, and I don't think we've seen a
>>>>>>>> bunch of messages about that exception. Adding if conditions
>>>>>>>> throughout the codebase like this patch [1] isn't exactly
>>>>>>>> awesome, and it can also risk performance impacts depending on
>>>>>>>> how carefully it is done.
>>>>>>>>
>>>>>>>> If there isn't a preponderance of evidence of many users being
>>>>>>>> blocked by this capability, I don't think we should accept the
>>>>>>>> code. We already have a backlog of items that we need to address
>>>>>>>> just to ensure existing use cases work well. Expanding to new use
>>>>>>>> cases for which there is no clear demand will likely just
>>>>>>>> increase code development cost at little benefit.
>>>>>>>>
>>>>>>>> What do others think?
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
>>>>>>>>
>>>>>>>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield
>>>>>>>> <emkornfi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Kazuaki Ishizaki has started working on big-endian support in
>>>>>>>>> Java (including setting up CI for it). Thank you!
>>>>>>>>>
>>>>>>>>> We previously discussed support for big-endian architectures in
>>>>>>>>> C++ [1] and generally agreed that it was a reasonable thing to
>>>>>>>>> do.
>>>>>>>>>
>>>>>>>>> Similar to C++, I think as long as we have a working CI setup it
>>>>>>>>> is reasonable for Java to support big-endian machines.
>>>>>>>>>
>>>>>>>>> But I think there might be differing opinions, so it is worth a
>>>>>>>>> discussion to see if there are technical blockers or other
>>>>>>>>> reasons for not supporting big-endian architectures in the
>>>>>>>>> existing Java implementation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>> [1] https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
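Wes's condition that BE support be handled "at compile time or without runtime overhead" has a standard Java idiom, which also answers Fan Liya's concern about per-call branches: branch on a static final flag, which the JIT treats as a constant and folds away, leaving the dead arm with no runtime cost. A minimal sketch using only the JDK (class and method names are illustrative, not Arrow's):

```java
import java.nio.ByteOrder;

// Illustrative names, not Arrow code: a static final flag is effectively
// a compile-time constant to the JIT, which eliminates the dead branch,
// so little-endian platforms pay nothing for the big-endian path.
public class NativeOrder {

    static final boolean LITTLE_ENDIAN =
            ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;

    // Converts a native-order int to its little-endian wire value: a
    // no-op on LE hosts, a single byte reversal on BE hosts.
    static int toLittleEndian(int nativeValue) {
        return LITTLE_ENDIAN ? nativeValue : Integer.reverseBytes(nativeValue);
    }

    public static void main(String[] args) {
        // On a little-endian host this prints 12345678 unchanged;
        // on a big-endian host it prints 78563412.
        System.out.println(Integer.toHexString(toLittleEndian(0x12345678)));
    }
}
```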