Le 31/08/2020 à 07:11, Jacques Nadeau a écrit :
> I didn't realize that Ishizaki isn't just proposing BE platform support,
> he is proposing a new BE version of the format.

What do you mean? The Endianness field (a Big|Little enum) was added 4 years
ago: https://issues.apache.org/jira/browse/ARROW-245

The format doc has the following wording:

"""
Serialized Schema metadata has an endianness field indicating endianness of
RecordBatches. Typically this is the endianness of the system where the
RecordBatch was generated. [...] The reference implementation is focused on
Little Endian and provides tests for it. Eventually we may provide automatic
conversion via byte swapping.
"""

So here is how I read it:

* the format allows conveying big-endian as well as little-endian data
  (since it's "typically the endianness of the system where the RecordBatch
  was generated")
* the reference implementations may or may not choose to support reading
  foreign-endian data by byte swapping on reception

Regards

Antoine.

> In this situation, computers speaking Arrow potentially have to convert
> from one version to the other. For example, two machines communicating
> with Arrow Flight now have to do byte-order swaps if they are speaking
> different endiannesses. This makes me feel even less positive about the
> proposal.
>
> In terms of my main concern: it was around demand versus cost. I don't
> believe that has been addressed. There have been ways discussed to try
> to minimize cost, but that cost is non-zero, whereas the demand feels
> very close to zero. As far as I can tell, there was one ask for
> big-endian support, in August of 2016 (Sanjay), and none since [1]. Even
> in these threads I don't hear a resounding demand for big endian. There
> are many features that are far less invasive/expensive to support that
> have much more demand.
>
> This feels like something that could easily rot and/or just become a
> load on the core maintainers of the project. I know I don't want to have
> to go and debug on a remote BE system if some tests start failing for
> that platform...
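The "automatic conversion via byte swapping" the spec alludes to can be illustrated with the JDK alone. This is a hedged sketch, not Arrow's reader API (the class and method names here are invented for illustration): a consumer that interprets the payload in the byte order the producer declared, e.g. via the schema's endianness field, gets any needed swap from `java.nio.ByteBuffer`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not Arrow's API: shows how a consumer can honor
// the byte order declared by the producer and let java.nio do the swap.
public class EndianCheck {

    // Reads an int32 from a payload in the producer's declared byte
    // order; getInt() honors the buffer's configured order, so the
    // consumer decodes the same value on any platform.
    static int readInt32(byte[] payload, ByteOrder producerOrder) {
        return ByteBuffer.wrap(payload).order(producerOrder).getInt();
    }

    public static void main(String[] args) {
        byte[] fromBigEndianHost = {0x00, 0x00, 0x00, 0x2A};    // 42, big-endian
        byte[] fromLittleEndianHost = {0x2A, 0x00, 0x00, 0x00}; // 42, little-endian
        // Both decode to 42 regardless of the local platform's endianness.
        System.out.println(readInt32(fromBigEndianHost, ByteOrder.BIG_ENDIAN));       // 42
        System.out.println(readInt32(fromLittleEndianHost, ByteOrder.LITTLE_ENDIAN)); // 42
    }
}
```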
>
> [1] Only a skim of archives; there may have been others:
> https://markmail.org/thread/7gbf5est6mcsnc6l
>
> On Sun, Aug 30, 2020 at 8:39 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Looking over the outstanding PRs, while the code isn't necessarily
>> pretty, I don't think the changes are too invasive.
>>
>> Also, it seems that Kazuaki Ishizaki is willing to add benchmarks where
>> necessary to verify the lack of performance regressions. (Please
>> correct me if I misunderstood.)
>>
>> Jacques and Liya Fan, does this address your concerns? Are there
>> further details that you would like to discuss? Are you still opposed
>> to support in Java?
>>
>> Do maintainers of other implementations have concerns (in particular,
>> Go seems to be the other language in progress)?
>>
>> Thanks,
>> Micah
>>
>> On Wed, Aug 26, 2020 at 6:57 AM Kazuaki Ishizaki <ishiz...@jp.ibm.com>
>> wrote:
>>
>>> Hi,
>>> I waited for comments regarding Java big-endian (BE) support during my
>>> one-week vacation. Thank you for the good suggestions and comments.
>>> I already responded to some questions in another mail. This mail
>>> addresses the remaining questions: use cases, the holistic strategy
>>> for BE support, and testing plans.
>>>
>>> 1. Use cases
>>> The use case of Arrow Java is in Apache Spark, which was already
>>> described on the Arrow blog [1]. This is the typical way to accelerate
>>> Spark with other languages such as Python [2] and R [3]. In Databricks
>>> notebooks, 68% of commands come from Python [4].
>>>
>>> 2. Holistic strategy for BE support across languages
>>> I have mostly completed BE support in C++. This implementation uses
>>> the following strategy:
>>> A. Write and read data in a record batch using the platform-native
>>> endianness (NE) when the data is created on a host. The endianness is
>>> stored in the endianness field of the schema.
>>> B. Send data using the IPC host's endianness among processes using
>>> IPC.
>>> C. At B, if the IPC client's endianness is the same as that of the
>>> received data, the IPC client receives the data without a data copy.
>>> D. At B, if the IPC client's endianness is different from that of the
>>> received data, the IPC client by default swaps the endianness of the
>>> received data to match its own.
>>> E. The primitive data types in memory (e.g. Decimal128 in C++ and
>>> UnsafeDirectLittleEndian in Java) are read/written using the NE.
>>>
>>> A and B-C are the typical use cases in Apache Arrow, so no endianness
>>> swap occurs in these cases and there is no performance overhead. B-D
>>> is rarely used (e.g. sending data from x86_64 to s390x), and the data
>>> swap occurs only once, at receipt; after that, no further swap occurs,
>>> for performance. For some use cases, this swap can be disabled by an
>>> option; in these cases, Arrow will not process the data.
>>> E allows us to access primitive data (e.g. int32, double, decimal128)
>>> without performance loss by using platform-native-endian loads/stores.
>>>
>>> 2-1. Implementation strategy in Java
>>> The existing primitive data structures such as
>>> UnsafeDirectLittleEndian, ArrowBuf, and ValueVector should handle the
>>> platform-native endianness for strategies A, B-C, and E without
>>> performance overhead.
>>> For the remaining strategy D, the method
>>> MessageSerializer.deserializeRecordBatch() will handle the data swap
>>> when the endianness of the host is different from that of the client,
>>> which corresponds to the PR [6] in C++.
>>>
>>> 3. Testing plan
>>> For testing strategies A, B-C, and E, it would be good to increase the
>>> test coverage regardless of endianness (e.g. increase the types of a
>>> schema to be tested in flight-core).
>>> For testing strategy D, I have already prepared data for BE and LE.
>>> When a PR enables the data swap, that PR will also enable the
>>> integration test.
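Strategy D above, a one-time endianness swap at receipt after which all reads are native, can be sketched with plain JDK code. This is a hedged illustration, not the real MessageSerializer internals; the class name and flat byte-array buffer are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of strategy D: when the schema's declared
// endianness differs from the local one, swap every int32 slot in the
// value buffer once, then read natively from then on.
public class SwapOnReceipt {

    // In-place byte swap of a buffer holding int32 values.
    static void swapInt32Buffer(byte[] data) {
        for (int i = 0; i + 3 < data.length; i += 4) {
            byte b0 = data[i];
            byte b1 = data[i + 1];
            data[i] = data[i + 3];
            data[i + 1] = data[i + 2];
            data[i + 2] = b1;
            data[i + 3] = b0;
        }
    }

    public static void main(String[] args) {
        // Two int32 values (1 and 2) as produced on a big-endian host.
        byte[] received = {0, 0, 0, 1, 0, 0, 0, 2};
        swapInt32Buffer(received); // one-time swap at receipt
        // All later reads use the local (here little-endian) order.
        ByteBuffer buf = ByteBuffer.wrap(received).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(buf.getInt() + " " + buf.getInt()); // prints: 1 2
    }
}
```

The point of doing the swap once, eagerly, is exactly what the strategy describes: every subsequent read is a plain native load with no per-access branch.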
>>>
>>> For performance testing, we can use the existing framework [7] by
>>> extending its support to other languages. We can run performance
>>> benchmarks on a little-endian platform to guard against performance
>>> regressions.
>>>
>>> [1] https://arrow.apache.org/blog/2017/07/26/spark-arrow/
>>> [2] https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
>>> [3] https://databricks.com/jp/blog/2020/06/01/vectorized-r-i-o-in-upcoming-apache-spark-3-0.html
>>> [4] https://databricks.com/jp/session_na20/wednesday-morning-keynotes
>>> [5] https://github.com/apache/arrow/pull/7507#discussion_r46819873
>>> [6] https://github.com/apache/arrow/pull/7507
>>> [7] https://github.com/apache/arrow/pull/7940#issuecomment-672690540
>>>
>>> Best regards,
>>> Kazuaki Ishizaki
>>>
>>> Wes McKinney <wesmck...@gmail.com> wrote on 2020/08/26 21:27:49:
>>>
>>>> From: Wes McKinney <wesmck...@gmail.com>
>>>> To: dev <dev@arrow.apache.org>, Micah Kornfield <emkornfi...@gmail.com>
>>>> Cc: Fan Liya <liya.fa...@gmail.com>
>>>> Date: 2020/08/26 21:28
>>>> Subject: [EXTERNAL] Re: [DISCUSS] Big Endian support in Arrow (was:
>>>> Re: [Java] Supporting Big Endian)
>>>>
>>>> hi Micah,
>>>>
>>>> I agree with your reasoning. If supporting BE in some languages (e.g.
>>>> Java) is impractical due to performance regressions on LE platforms,
>>>> then I don't think it's worth it. But if it can be handled at compile
>>>> time or without runtime overhead, and tested / maintained properly on
>>>> an ongoing basis, then it seems reasonable to me. It seems that the
>>>> number of Arrow stakeholders will only increase from here, so I would
>>>> hope that there will be more people invested in helping maintain BE
>>>> support in the future.
>>>>
>>>> - Wes
>>>>
>>>> On Tue, Aug 25, 2020 at 11:33 PM Micah Kornfield
>>>> <emkornfi...@gmail.com> wrote:
>>>>>
>>>>> I'm expanding the scope of this thread, since it looks like work has
>>>>> also started on making the Go implementation support big-endian
>>>>> architectures.
>>>>>
>>>>> I think as a community we should come to a consensus on whether we
>>>>> want to support big-endian architectures in general. I don't think
>>>>> it is a good outcome if some implementations accept PRs for
>>>>> big-endian fixes and some don't.
>>>>>
>>>>> But maybe this is OK with others?
>>>>>
>>>>> My current opinion on the matter is that we should support it under
>>>>> the following conditions:
>>>>>
>>>>> 1. There is CI in place to catch regressions (right now I think the
>>>>> CI is fairly unreliable?).
>>>>> 2. No degradation in performance for little-endian architectures
>>>>> (verified by additional micro-benchmarks).
>>>>> 3. Not a large amount of invasive code to distinguish between
>>>>> platforms.
>>>>>
>>>>> Kazuaki Ishizaki, I asked these questions previously, but could you
>>>>> give some data points around:
>>>>> 1. The current state of C++ support (how much code needed to
>>>>> change)?
>>>>> 2. How many more PRs you expect to need for Java (and their
>>>>> approximate size)?
>>>>>
>>>>> I think this would help me and others in the decision-making
>>>>> process.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield
>>>>> <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> My thoughts on the points raised so far:
>>>>>>
>>>>>> * Does supporting big endian increase the reach of Arrow by a lot?
>>>>>>
>>>>>> Probably not by a significant amount, but it does provide one more
>>>>>> avenue of adoption.
>>>>>>
>>>>>> * Does it increase code complexity?
>>>>>>
>>>>>> Yes, I agree this is a concern. The PR in question did not seem too
>>>>>> bad to me, but this is subjective.
>>>>>> I think the remaining question is how many more places need to be
>>>>>> fixed up in the code base and how invasive the changes are. In C++,
>>>>>> IIUC, it turned out to be a relatively small number of places.
>>>>>>
>>>>>> Kazuaki Ishizaki, have you been able to get the Java implementation
>>>>>> working fully locally? How many additional PRs will be needed, and
>>>>>> what do they look like (I think there are already a few more in the
>>>>>> queue)?
>>>>>>
>>>>>> * Will it introduce performance regressions?
>>>>>>
>>>>>> If done properly, I suspect not, but I think if we continue with
>>>>>> big-endian support, the places that need to be touched should have
>>>>>> benchmarks added to confirm this (including for PRs already
>>>>>> merged).
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <liya.fa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you, Kazuaki Ishizaki, for working on this.
>>>>>>> IMO, supporting big endian would be a large change, as in many
>>>>>>> places of the code base we have implicitly assumed a little-endian
>>>>>>> platform (e.g.
>>>>>>> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
>>>>>>> ).
>>>>>>> Supporting big-endian platforms may introduce branches in such
>>>>>>> places (or virtual calls), which will affect performance.
>>>>>>> So it would be helpful to evaluate the performance impact.
>>>>>>>
>>>>>>> Best,
>>>>>>> Liya Fan
>>>>>>>
>>>>>>> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <jacq...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Micah, thanks for starting the discussion.
>>>>>>>>
>>>>>>>> I just skimmed that thread, and it isn't entirely clear that
>>>>>>>> there was a conclusion that the overhead was worth it. I think
>>>>>>>> everybody agrees that it would be nice to have the code work on
>>>>>>>> both platforms. On the flip side, the code noise for a rare case
>>>>>>>> makes the cost-benefit questionable.
>>>>>>>>
>>>>>>>> In the Java code, we wrote the code to explicitly disallow
>>>>>>>> big-endian platforms and put precondition checks in. I definitely
>>>>>>>> think that if we want to support this, it should be done
>>>>>>>> holistically across the code, with an appropriate test plan (both
>>>>>>>> functional and performance).
>>>>>>>>
>>>>>>>> To me, the question is really about how many use cases are
>>>>>>>> blocked by this. I'm not sure I've heard anyone say that the
>>>>>>>> limiting factor to leveraging Arrow Java was the block on
>>>>>>>> endianness. Keep in mind that until very recently, using any
>>>>>>>> Arrow Java code would throw a precondition check before you could
>>>>>>>> even get started on big endian, and I don't think we've seen a
>>>>>>>> bunch of messages about that exception. Adding if conditions
>>>>>>>> throughout the codebase like this patch [1] isn't exactly
>>>>>>>> awesome, and it can also risk performance impacts depending on
>>>>>>>> how carefully it is done.
>>>>>>>>
>>>>>>>> If there isn't a preponderance of evidence of many users being
>>>>>>>> blocked by this capability, I don't think we should accept the
>>>>>>>> code. We already have a backlog of items that we need to address
>>>>>>>> just to ensure existing use cases work well. Expanding to new use
>>>>>>>> cases for which there is no clear demand will likely just
>>>>>>>> increase code development cost at little benefit.
>>>>>>>>
>>>>>>>> What do others think?
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
>>>>>>>>
>>>>>>>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield
>>>>>>>> <emkornfi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Kazuaki Ishizaki has started working on big-endian support in
>>>>>>>>> Java (including setting up CI for it). Thank you!
>>>>>>>>>
>>>>>>>>> We previously discussed support for big-endian architectures in
>>>>>>>>> C++ [1] and generally agreed that it was a reasonable thing to
>>>>>>>>> do.
>>>>>>>>>
>>>>>>>>> Similar to C++, I think as long as we have a working CI setup it
>>>>>>>>> is reasonable for Java to support big-endian machines.
>>>>>>>>>
>>>>>>>>> But I think there might be differing opinions, so it is worth a
>>>>>>>>> discussion to see if there are technical blockers or other
>>>>>>>>> reasons for not supporting big-endian architectures in the
>>>>>>>>> existing Java implementation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>> [1] https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
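Wes's condition that BE support be handled "at compile time or without runtime overhead" has a standard Java idiom, which also answers Fan Liya's concern about per-call branches: branch on a static final flag, which the JIT treats as a constant and folds away, leaving the dead arm with no runtime cost. A minimal sketch using only the JDK (class and method names are illustrative, not Arrow's):

```java
import java.nio.ByteOrder;

// Illustrative names, not Arrow code: a static final flag is effectively
// a compile-time constant to the JIT, which eliminates the dead branch,
// so little-endian platforms pay nothing for the big-endian path.
public class NativeOrder {

    static final boolean LITTLE_ENDIAN =
            ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;

    // Converts a native-order int to its little-endian wire value: a
    // no-op on LE hosts, a single byte reversal on BE hosts.
    static int toLittleEndian(int nativeValue) {
        return LITTLE_ENDIAN ? nativeValue : Integer.reverseBytes(nativeValue);
    }

    public static void main(String[] args) {
        // On a little-endian host this prints 12345678 unchanged;
        // on a big-endian host it prints 78563412.
        System.out.println(Integer.toHexString(toLittleEndian(0x12345678)));
    }
}
```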