Hi all,

Apologies for the silence; I have been occupied. Here is part 3 of the investigation. This time I compared the performance of the Java and C++ readers. The benchmark reads and checksums 10 billion integers (~40 GB of data, with some null values) on a single core. You can read the full investigation here: https://github.com/animeshtrivedi/blog/blob/master/post/2018-11-22-arrow-cpp.md

The best numbers that I have are (see the table at the end of the blog for more details):

C++ (file)    : 21.5 Gbps
C++ (mmap)    : 30.4 Gbps
Java (unsafe) : 12.48 Gbps

The key insights are:

* C++ code (with file I/O) is generally 2x faster than the Java code. I have narrowed the profiling down to JIT issues, memory management, etc. Most of these issues are independent of Arrow (as far as I can tell). The standalone benchmark to debug these issues is here: https://github.com/animeshtrivedi/java-cpp-fun. The pattern of the workload is: (1) you have a large integer array and a bitmap array; (2) go over the array, check for NULL, and sum the values. I generally understand why the Java code is slow, but what I want to know is how to make it faster, or comparable to the C++ code. Any input on this is appreciated.

* The C++ code has an optimization branch for the no-null-values case on the IsValid() check. The Java code could benefit from it too, but does not implement it. I will open a pull request for this.

* The C++ bitmap code can be optimized further by using unsigned integers rather than "int64_t" for the bitmap checks, and by eliminating the kBitmap. See https://godbolt.org/z/deq0_q and compare the size of the assembly code. The performance measurements in the blog show up to 50% performance gains. Alternatively, if the signed-to-unsigned change is not possible (perhaps not in every language), then in the C++ code we should use the bitwise operations directly (`>> 3` for division by 8, and `& 0x7` for modulo 8) instead of `/` and `%`.

Thanks,
--
Animesh

On Thu, Oct 11, 2018 at 6:55 PM Animesh Trivedi <animesh.triv...@gmail.com> wrote:

> Hi Li - thanks for setting up the jiras. I'll prepare the pull requests soon.
>
> Hi Wes, I think it is a good idea to write a blog about performance, pointing to the improvements in the code base (fixes and microbenchmarks). I can prepare a first draft as we go along. This will definitely serve the large Java community well.
>
> Cheers,
> --
> Animesh
>
> On Thu, 11 Oct 2018, 17:55 Li Jin, <ice.xell...@gmail.com> wrote:
>
>> I have created these as the first step. Animesh, feel free to submit PRs for these. I will look into your micro benchmarks soon.
>>
>> 1. ARROW-3497 [Java] Add user documentation for achieving better performance <https://jira.apache.org/jira/browse/ARROW-3497>
>> 2. ARROW-3496 [Java] Add microbenchmark code to Java <https://jira.apache.org/jira/browse/ARROW-3496>
>> 3. ARROW-3495 [Java] Optimize bit operations performance <https://jira.apache.org/jira/browse/ARROW-3495>
>> 4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED <https://jira.apache.org/jira/browse/ARROW-3493>
>>
>> On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ice.xell...@gmail.com> wrote:
>>
>> > Hi Wes and Animesh,
>> >
>> > Thanks for the analysis and discussion. I am happy to look into this. I will create some Jiras soon.
>> >
>> > Li
>> >
>> > On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> hey Animesh,
>> >>
>> >> Thank you for doing this analysis. If you'd like to share some of the analysis more broadly, e.g. on the Apache Arrow blog or social media, let us know.
>> >>
>> >> Seems like there might be a few follow ups here for the Arrow Java community:
>> >>
>> >> * Documentation about achieving better performance
>> >> * Writing some microperformance benchmarks
>> >> * Making some improvements to the code to facilitate better performance
>> >>
>> >> Feel free to create some JIRA issues. Are any Java developers interested in digging a little more into these issues?
>> >>
>> >> Thanks,
>> >> Wes
>> >> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> >> >
>> >> > Hi Wes and all,
>> >> >
>> >> > Here is another round of updates:
>> >> >
>> >> > Quick recap - previously we established that for 1kB binary blobs, Arrow can deliver > 160 Gbps performance from in-memory buffers.
>> >> >
>> >> > In this round I looked at the performance of materializing "integers". In my benchmarks, I found that with careful optimizations/code-rewriting we can push the performance of integer reading from 5.42 Gbps/core to 13.61 Gbps/core (~2.5x). The peak performance with 16 cores scales up to 110+ Gbps. The key things to do are:
>> >> >
>> >> > 1) Disable memory access checks in Arrow and Netty buffers. This gave a significant performance boost. However, for such an important performance flag, it is very poorly documented ("drill.enable_unsafe_memory_access=true").
>> >> >
>> >> > 2) Materialize values from the Validity and Value direct buffers instead of calling the getInt() function on the IntVector. This is implemented as a new Unsafe reader type (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31).
>> >> >
>> >> > 3) Optimize the bitmap operation that checks whether a bit is set or not (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23).
>> >> >
>> >> > A detailed write-up of these steps is available here: https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
>> >> >
>> >> > I have 2 follow-up questions:
>> >> >
>> >> > 1) Regarding the `isSet` function, why does it have to calculate the number of bits set (https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797)? Wouldn't just checking whether the result of the AND operation is zero or not be sufficient? Like what I did: https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
>> >> >
>> >> > 2) What is the reason behind the bitmap generation optimization here: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179 ? At the point when this function is called, the bitmap vector has already been read from storage and contains the right values (either all null, all set, or whatever). Generating this mask for the special cases when the values are all NULL or all set (this was the case in my benchmark) can be slower than just returning what one has read from storage.
>> >> >
>> >> > Collectively, optimizing these two bitmap operations gives more than 1 Gbps of gains in my benchmarking code.
>> >> >
>> >> > Cheers,
>> >> > --
>> >> > Animesh
>> >> >
>> >> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >
>> >> > > See e.g.
>> >> > >
>> >> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>> >> > >
>> >> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> >> > > >
>> >> > > > Primarily, to write in C++ the same microbenchmark that I have in Java for table reading and value materialization. So just an example of code in C++ equivalent to ArrowFileReader.
>> >> > > > Unit tests are a good starting point, thanks for the tip :)
>> >> > > >
>> >> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> > > >
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can have a look at?
>> >> > > > >
>> >> > > > > What kind of code are you looking for? I would direct you to relevant unit tests that exhibit certain functionality, but it depends on what you are trying to do
>> >> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > Hi all - quick update on the performance investigation:
>> >> > > > > >
>> >> > > > > > - I spent some time looking at the performance profile for a binary blob column (1024 bytes of byte[]) and found a few favorable settings for delivering up to 168 Gbps from the in-memory reading benchmark on 16 cores. These settings (NUMA, JVM settings, Arrow holder API, batch size, etc.) are documented here: https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
>> >> > > > > > - These settings also helped to improve the last number that I reported (but not by much) for the in-memory TPC-DS store_sales table, from ~39 Gbps up to ~45-47 Gbps (note: this number is just the in-memory benchmark, i.e., w/o any networking or storage links)
>> >> > > > > >
>> >> > > > > > A few follow-up questions that I have:
>> >> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there any recommended batch sizes? In my investigation, small batch sizes help with a better cache profile but increase the number of instructions required (more looping). Larger ones do the opposite. Somehow ~10MB/thread seems to be the best performing configuration, which is also a bit counterintuitive, as for 16 threads this will lead to 160 MB of memory footprint. Maybe this is also tied to the memory management logic, which is my next question.
>> >> > > > > > 2. Arrow uses Netty's memory manager. (i) What are decent Netty memory management settings for the "io.netty.allocator.*" parameters? I can't find any decent write-up on them; (ii) Is there a provision for an ArrowBuf being re-used once a batch is consumed? As it looks for now, each read allocates a new buffer to read the whole batch.
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can have a look at?
>> >> > > > > >
>> >> > > > > > Cheers,
>> >> > > > > > --
>> >> > > > > > Animesh
>> >> > > > > >
>> >> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> >> > > > > > > >
>> >> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
>> >> > > > > > > >
>> >> > > > > > > > @Johan -
>> >> > > > > > > > 1. I also do not suspect that there is any inherent drawback in Java or C++ due to the Arrow format.
>> >> > > > > > > > I mentioned C++ because Wes pointed out that the Java routines are not the most optimized ones (yet!). And naturally one would expect better performance in a native language with all the pointer/memory/SIMD instruction optimizations that you mentioned. As far as I know, the off-heap buffers are managed in ArrowBuf, which implements an abstract netty class. But there is nothing unusual, i.e., netty specific, about these unsafe routines; they are used by many projects. There is, though, a cost associated with materializing on-heap Java values from off-heap memory regions. I need to benchmark that more carefully.
>> >> > > > > > > >
>> >> > > > > > > > 2. When you say "I've so far always been able to get similar performance numbers" - do you mean the same performance as my case 3, where 16 cores drive close to 40 Gbps, or the same performance between your C++ and Java benchmarks? Do you have some write-up? I would be interested to read up :)
>> >> > > > > > > >
>> >> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in Java" -> that is a good idea. Let me try and report back.
>> >> > > > > > > >
>> >> > > > > > > > @Wes -
>> >> > > > > > > >
>> >> > > > > > > > Is there some benchmark template for C++ routines I can have a look at?
>> >> > > > > > > > I would be happy to get some input from Java-Arrow experts on how to write these benchmarks more efficiently. I will have a closer look at the JIRA tickets that you mentioned.
>> >> > > > > > > >
>> >> > > > > > > > So, for now I am focusing on case 3, which is about establishing performance when reading from a local in-memory I/O stream that I implemented (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java). In this case I first read data from parquet files, convert them into Arrow, write them out to a MemoryIOChannel, and then read back from it. So, the performance has nothing to do with Crail or HDFS in case 3. Once I establish the base performance in this setup (which is around ~40 Gbps with 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams can take ArrowBuf as src/dst buffers. That should be doable.
>> >> > > > > > >
>> >> > > > > > > Right, in any case what you are testing is the performance of using a particular Java accessor layer to JVM off-heap Arrow memory to sum the non-null values of each column.
>> >> > > > > > > I'm not sure that a single bandwidth number produced by this benchmark is very informative for people contemplating what memory format to use in their system due to the current state of the implementation (Java) and workload measured (summing the non-null values with a naive algorithm). I would guess that a C++ version with raw pointers and a loop-unrolled, branch-free vectorized sum is going to be a lot faster.
>> >> > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > @Jacques -
>> >> > > > > > > >
>> >> > > > > > > > That is a good point that "Arrow's implementation is more focused on interacting with the structure than transporting it". However, in any distributed system one needs to move data/structure around - as far as I understand, that is another goal of the project. My investigation started within the context of Spark/SQL data processing. Spark converts incoming data into its own in-memory UnsafeRow representation for processing. So naturally the performance of this data ingestion pipeline cannot outperform the read performance of the file format used. I benchmarked Parquet, ORC, Avro, and JSON (for the specific TPC-DS store_sales table). And then, out of curiosity, I benchmarked Arrow as well, because its design choices are a better fit for the modern high-performance RDMA/NVMe/100+Gbps devices I am investigating.
>> >> > > > > > > > From this point of view, I am trying to find out whether Arrow can be the file format for the next generation of storage/networking devices (see the Apache Crail project), delivering reading/writing rates close to the hardware speed. As Wes pointed out, a C++ library implementation should be memory-IO bound, so what would it take to deliver the same performance in Java ;) (and then, from across the network).
>> >> > > > > > > >
>> >> > > > > > > > I hope this makes sense.
>> >> > > > > > > >
>> >> > > > > > > > Cheers,
>> >> > > > > > > > --
>> >> > > > > > > > Animesh
>> >> > > > > > > >
>> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <jacq...@apache.org> wrote:
>> >> > > > > > > >
>> >> > > > > > > > > My big question is what is the use case and how/what are you trying to compare? Arrow's implementation is more focused on interacting with the structure than transporting it. Generally speaking, when we're working with Arrow data we frequently are just interacting with memory locations and doing direct operations. If you have a layer that supports that type of semantic, create a movement technique that depends on that. Arrow doesn't force a particular API since the data itself is defined by its in-memory layout, so if you have a custom use or pattern, just work with the in-memory structures.
>> >> > > > > > > > >
>> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> > > > > > > > >
>> >> > > > > > > > > > hi Animesh,
>> >> > > > > > > > > >
>> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially going to be IO/memory bandwidth bound since you're interacting with raw pointers.
>> >> > > > > > > > > >
>> >> > > > > > > > > > I'm looking at your code
>> >> > > > > > > > > >
>> >> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> >> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
>> >> > > > > > > > > >     int valCount = accessor.getValueCount();
>> >> > > > > > > > > >     for(int i = 0; i < valCount; i++){
>> >> > > > > > > > > >         if(!accessor.isNull(i)){
>> >> > > > > > > > > >             float4Count+=1;
>> >> > > > > > > > > >             checksum+=accessor.get(i);
>> >> > > > > > > > > >         }
>> >> > > > > > > > > >     }
>> >> > > > > > > > > > }
>> >> > > > > > > > > >
>> >> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you on the fastest way to iterate over this data -- my understanding is that much code in Dremio interacts with the wrapped Netty ArrowBuf objects rather than going through the higher level APIs. You're also dropping performance because memory mapping is not yet implemented in Java, see https://issues.apache.org/jira/browse/ARROW-3191.
>> >> > > > > > > > > >
>> >> > > > > > > > > > Furthermore, the IPC reader class you are using could be made more efficient.
>> >> > > > > > > > > > I described the problem in https://issues.apache.org/jira/browse/ARROW-3192 -- this will be required as soon as we have the ability to do memory mapping in Java
>> >> > > > > > > > > >
>> >> > > > > > > > > > Could Crail use the Arrow data structures in its runtime rather than copying? If not, how are Crail's runtime data structures different?
>> >> > > > > > > > > >
>> >> > > > > > > > > > - Wes
>> >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI <j.w.peltenb...@tudelft.nl> wrote:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hello Animesh,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing. We have performed some similar measurements to your third case in the past for C/C++ on collections of various basic types such as primitives and strings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I can say that in terms of consuming data from the Arrow format versus language native collections in C++, I've so far always been able to get similar performance numbers (e.g. no drawbacks due to the Arrow format itself). Especially when accessing the data through Arrow's raw data pointers (and using for example std::string_view-like constructs).
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In C/C++ the fast data structures are engineered in such a way that as few pointer traversals as possible are required and they take up as small a memory footprint as possible. Thus each memory access is relatively efficient (in terms of obtaining the data of interest). The same can absolutely be said for Arrow, which may be even more efficient in some cases where object fields are of variable length.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap. This requires the JVM to interface to it through some calls to Unsafe hidden under the Netty layer (but please correct me if I'm wrong, I'm not an expert on this). Those calls are the only reason I can think of that would degrade the performance a bit compared to a pure Java case. I don't know if the Unsafe calls are inlined during JIT compilation. If they aren't, they will increase access latency to any data a little bit.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I don't have a similar machine so it's not easy to relate my numbers to yours, but if you can get that data consumed at 100 Gbps in a pure Java case, I don't see any reason (resulting from the Arrow format / off-heap storage) why you wouldn't be able to get at least really close. Can you get to 100 Gbps starting from primitive arrays in Java with your consumption functions in the first place?
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I'm interested to see your progress on this.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Kind regards,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Johan Peltenburg
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > ________________________________
>> >> > > > > > > > > > > From: Animesh Trivedi <animesh.triv...@gmail.com>
>> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
>> >> > > > > > > > > > > To: dev@arrow.apache.org; d...@crail.apache.org
>> >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hi all,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > A week ago, Wes and I had a discussion about the performance of the Arrow/Java implementation on the Apache Crail (Incubating) mailing list (http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser).
>> >> > > > > > > > > > > In a nutshell: I am investigating the performance of various file formats (including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE setups. I benchmarked how long it takes to materialize values (ints, longs, doubles) of the store_sales table, the largest table in the TPC-DS dataset, stored in different file formats. Here is a write-up on this - https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I found that between a pair of machines connected over a 100 Gbps link, Arrow (used as a file format on HDFS) delivered close to ~30 Gbps of bandwidth with all 16 cores engaged. Wes pointed out that (i) Arrow is an in-memory IPC format, and has not been optimized for storage interfaces/APIs like HDFS; (ii) the performance I am measuring is for the Java implementation.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > That brings us to this email, where I promised to follow up on the Arrow mailing list to understand and optimize the performance of the Arrow/Java implementation on high-performance devices. I wrote a small stand-alone benchmark (https://github.com/animeshtrivedi/benchmarking-arrow) with three implementations of the WritableByteChannel and SeekableByteChannel interfaces:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps performance
>> >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps performance
>> >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps performance
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I think the order makes sense. To better understand the performance of Arrow/Java we can focus on option 3.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > The key question I am trying to answer is "what would it take for Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If yes, then what is missing or misinterpreted by me? If not, then where is the performance lost?
>> >> > > > > > > > > > > Does anyone have any performance measurements for the C++ implementation, and have they seen or do they expect better numbers?
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > As a next step, I will profile the read path of Arrow/Java for option 3. I will report my findings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Any thoughts and feedback on this investigation are very welcome.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Cheers,
>> >> > > > > > > > > > > --
>> >> > > > > > > > > > > Animesh
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > PS~ Cross-posting on the d...@crail.apache.org list as well.
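
For readers following the thread, the workload pattern discussed in the latest update (walk an integer array, check a validity bitmap, sum the non-null values) can be sketched in plain Java. This is only a minimal illustration on ordinary on-heap arrays, not the Arrow API and not the benchmark code from the linked repositories: the names `ValidityBitmapSum`, `isValid`, and `sum` are hypothetical, the `nullCount` argument is an assumed input, and the bit order (least-significant bit first within each byte) is assumed to match Arrow's validity bitmaps. It shows a shift/mask validity check that only tests whether the AND result is zero, plus a fast path for batches without nulls.

```java
final class ValidityBitmapSum {
    // Returns true when bit `index` of the bitmap is set, i.e. the value is non-null.
    // index >>> 3 selects the byte (index / 8); index & 7 selects the bit (index % 8).
    static boolean isValid(byte[] validity, int index) {
        return (validity[index >>> 3] & (1 << (index & 7))) != 0;
    }

    // Sums the non-null values. When the caller knows there are no nulls
    // (nullCount == 0), the bitmap is not consulted at all.
    static long sum(int[] values, byte[] validity, int nullCount) {
        long checksum = 0;
        if (nullCount == 0) {
            for (int v : values) {
                checksum += v;
            }
            return checksum;
        }
        for (int i = 0; i < values.length; i++) {
            if (isValid(validity, i)) {
                checksum += values[i];
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        // Bits 0, 2, 4, 6 of byte 0 and bits 0, 1 of byte 1 are set,
        // so the valid values are 1, 3, 5, 7, 9 and 10.
        byte[] validity = {(byte) 0x55, (byte) 0x03};
        System.out.println(sum(values, validity, 4)); // prints 35
        System.out.println(sum(values, validity, 0)); // prints 55
    }
}
```

Whether the JIT compiles the shift/mask form any better than `/` and `%` on a given JVM is something to measure rather than assume; the sketch is only meant to make the access pattern concrete.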