Hi Li - thanks for setting up the jiras. I'll prepare the pull requests soon.
Hi Wes, I think that is a good idea to write a blog post about performance, pointing to the improvements in the code base (fixes and microbenchmarks). I can prepare a first draft as we go along. This will definitely serve the large Java community well.

Cheers,
-- Animesh

On Thu, 11 Oct 2018, 17:55 Li Jin, <ice.xell...@gmail.com> wrote:

I have created these as the first step. Animesh, feel free to submit PRs for these. I will look into your micro benchmarks soon.

1. ARROW-3497 [Java] Add user documentation for achieving better performance <https://jira.apache.org/jira/browse/ARROW-3497>
2. ARROW-3496 [Java] Add microbenchmark code to Java <https://jira.apache.org/jira/browse/ARROW-3496>
3. ARROW-3495 [Java] Optimize bit operations performance <https://jira.apache.org/jira/browse/ARROW-3495>
4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED <https://jira.apache.org/jira/browse/ARROW-3493>

On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ice.xell...@gmail.com> wrote:

Hi Wes and Animesh,

Thanks for the analysis and discussion. I am happy to look into this. I will create some Jiras soon.

Li

On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <wesmck...@gmail.com> wrote:

hey Animesh,

Thank you for doing this analysis. If you'd like to share some of the analysis more broadly, e.g. on the Apache Arrow blog or social media, let us know.

Seems like there might be a few follow-ups here for the Arrow Java community:

* Documentation about achieving better performance
* Writing some microperformance benchmarks
* Making some improvements to the code to facilitate better performance

Feel free to create some JIRA issues. Are any Java developers interested in digging a little more into these issues?

Thanks,
Wes
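For the microbenchmark item (ARROW-3496), a minimal JMH harness along the following lines could serve as a starting point. The package and class names are placeholders, not code from any of the repositories mentioned in this thread; it only shows the shape of a per-value read benchmark over an IntVector.

    package com.example.arrowbench; // hypothetical package name

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.openjdk.jmh.annotations.*;

    import java.util.concurrent.TimeUnit;

    /**
     * Minimal JMH sketch for timing per-value reads from an IntVector
     * through the accessor API (isNull()/get()).
     */
    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class IntVectorReadBench {

      private static final int VALUE_COUNT = 1 << 20;

      private RootAllocator allocator;
      private IntVector vector;

      @Setup(Level.Trial)
      public void setup() {
        allocator = new RootAllocator(Long.MAX_VALUE);
        vector = new IntVector("ints", allocator);
        vector.allocateNew(VALUE_COUNT);
        for (int i = 0; i < VALUE_COUNT; i++) {
          vector.setSafe(i, i);          // all values non-null in this sketch
        }
        vector.setValueCount(VALUE_COUNT);
      }

      @TearDown(Level.Trial)
      public void tearDown() {
        vector.close();
        allocator.close();
      }

      @Benchmark
      public long sumViaAccessor() {
        long sum = 0;
        for (int i = 0; i < VALUE_COUNT; i++) {
          if (!vector.isNull(i)) {       // per-value null check, as in the accessor API
            sum += vector.get(i);
          }
        }
        return sum;
      }
    }

Run with the standard JMH runner; the same loop can then be compared against a variant that reads the validity and data buffers directly, as discussed further down in the thread.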
On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:

Hi Wes and all,

Here is another round of updates:

Quick recap - previously we established that for 1kB binary blobs, Arrow can deliver more than 160 Gbps from in-memory buffers.

In this round I looked at the performance of materializing "integers". In my benchmarks, I found that with careful optimizations/code-rewriting we can push the performance of integer reading from 5.42 Gbps/core to 13.61 Gbps/core (~2.5x). The peak performance with 16 cores scales up to 110+ Gbps. The key things to do are:

1) Disable memory access checks in Arrow and Netty buffers. This gave a significant performance boost. However, for such an important performance flag, it is very poorly documented ("drill.enable_unsafe_memory_access=true").

2) Materialize values from the validity and value direct buffers instead of calling the getInt() function on the IntVector. This is implemented as a new Unsafe reader type (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31)

3) Optimize the bitmap operation that checks whether a bit is set or not (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23)

A detailed write-up of these steps is available here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md

I have two follow-up questions:

1) Regarding the `isSet` function, why does it have to calculate the number of bits set? (https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797). Wouldn't just checking whether the result of the AND operation is zero or not be sufficient? Like what I did here:
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28

2) What is the reason behind the bitmap generation optimization here:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
? At the point when this function is called, the bitmap vector has already been read from storage and contains the right values (either all null, all set, or whatever). Generating the mask here for the special cases when the values are all NULL or all set (which was the case in my benchmark) can be slower than just returning what one has read from storage.

Collectively, optimizing these two bitmap operations gives more than 1 Gbps of gain in my benchmarking code.

Cheers,
-- Animesh
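The changes described in the message above can be sketched as follows. This is an illustrative rewrite, not the actual ArrowReaderUnsafe.java code; it assumes bounds checking has been disabled with -Ddrill.enable_unsafe_memory_access=true as discussed in point 1, and it uses the public ArrowBuf index accessors rather than raw memory addresses.

    import org.apache.arrow.memory.ArrowBuf;   // newer releases; older versions use io.netty.buffer.ArrowBuf
    import org.apache.arrow.vector.IntVector;

    /**
     * Illustrative only: sums the non-null values of an IntVector by reading
     * the validity and data buffers directly, with a simple "is the bit
     * non-zero" test instead of counting bits.
     */
    final class DirectIntReader {

      static boolean isBitSet(ArrowBuf validity, int index) {
        final byte b = validity.getByte(index >> 3);     // byte that holds this bit
        return (b & (1 << (index & 7))) != 0;            // non-zero test, no bit counting
      }

      static long sumNonNull(IntVector vector) {
        final ArrowBuf validity = vector.getValidityBuffer();
        final ArrowBuf data = vector.getDataBuffer();
        final int valueCount = vector.getValueCount();

        long sum = 0;
        for (int i = 0; i < valueCount; i++) {
          if (isBitSet(validity, i)) {
            sum += data.getInt(i * 4);                   // 4-byte values at byte offset i * 4
          }
        }
        return sum;
      }
    }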
On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <wesmck...@gmail.com> wrote:

See e.g.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222

On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:

Primarily to write the same microbenchmark as I have in Java in C++, for table reading and value materialization - so just an example of equivalent ArrowFileReader code in C++. Unit tests are a good starting point, thanks for the tip :)

On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmck...@gmail.com> wrote:

> 3. Are there examples of Arrow C++ read/write code that I can have a look at?

What kind of code are you looking for? I would direct you to relevant unit tests that exhibit certain functionality, but it depends on what you are trying to do.

On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:

Hi all - a quick update on the performance investigation:

- I spent some time looking at the performance profile for a binary blob column (1024 bytes of byte[]) and found a few favorable settings for delivering up to 168 Gbps from the in-memory reading benchmark on 16 cores. These settings (NUMA, JVM settings, Arrow holder API, batch size, etc.) are documented here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
- These settings also help to improve the last number I reported (but not by much) for the in-memory TPC-DS store_sales table, from ~39 Gbps up to ~45-47 Gbps (note: this number is just the in-memory benchmark, i.e., without any networking or storage links).

A few follow-up questions that I have:

1. Arrow reads a batch size worth of data in one go. Are there any recommended batch sizes? In my investigation, small batch sizes help with a better cache profile but increase the number of instructions required (more looping). Larger ones do the opposite. Somehow ~10 MB/thread seems to be the best performing configuration, which is also a bit counter-intuitive, as for 16 threads this leads to a 160 MB memory footprint. Maybe this is also tied to the memory management logic, which is my next question.

2. Arrow uses Netty's memory manager. (i) What are decent Netty memory management settings for the "io.netty.allocator.*" parameters? I don't find any decent write-up on them. (ii) Is there a provision for an ArrowBuf being re-used once a batch is consumed? As it looks for now, each read allocates a new buffer for the whole batch.

3. Are there examples of Arrow C++ read/write code that I can have a look at?

Cheers,
-- Animesh
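On question 1 above: in the Java API the writer controls the record-batch size simply by how many rows it loads into the VectorSchemaRoot before each writeBatch() call, so a target like ~10 MB per batch is chosen on the write side. A rough sketch follows; the file path and row counts are arbitrary, and the ~10 MB figure is just the value from the message above, not an official recommendation.

    import java.io.FileOutputStream;
    import java.util.Collections;

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileWriter;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.Schema;

    /**
     * Sketch: each writeBatch() call emits a record batch whose size is
     * whatever row count the root currently holds.
     */
    public class BatchSizeExample {
      public static void main(String[] args) throws Exception {
        final int rowsPerBatch = 2_621_440;           // ~10 MB of 4-byte ints per batch
        final int totalRows = 10 * rowsPerBatch;

        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
          Schema schema = new Schema(Collections.singletonList(
              Field.nullable("value", new ArrowType.Int(32, true))));

          try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
               FileOutputStream out = new FileOutputStream("/tmp/ints.arrow");
               ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {

            IntVector values = (IntVector) root.getVector("value");
            writer.start();
            for (int written = 0; written < totalRows; written += rowsPerBatch) {
              values.allocateNew(rowsPerBatch);
              for (int i = 0; i < rowsPerBatch; i++) {
                values.setSafe(i, written + i);
              }
              root.setRowCount(rowsPerBatch);          // this row count becomes the batch size
              writer.writeBatch();
            }
            writer.end();
          }
        }
      }
    }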
On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>
> Hi Johan, Wes, and Jacques - many thanks for your comments:
>
> @Johan -
>
> 1. I also do not suspect that there is any inherent drawback in Java or C++ due to the Arrow format. I mentioned C++ because Wes pointed out that the Java routines are not the most optimized ones (yet!). And naturally one would expect better performance in a native language with all the pointer/memory/SIMD instruction optimizations that you mentioned. As far as I know, the off-heap buffers are managed in ArrowBuf, which implements an abstract Netty class. But there is nothing unusual, i.e., Netty-specific, about these unsafe routines; they are used by many projects. Though there is a cost associated with materializing on-heap Java values from off-heap memory regions. I need to benchmark that more carefully.
>
> 2. When you say "I've so far always been able to get similar performance numbers" - do you mean the same performance as my case 3, where 16 cores drive close to 40 Gbps, or the same performance between your C++ and Java benchmarks? Do you have some write-up? I would be interested to read up :)
>
> 3. "Can you get to 100 Gbps starting from primitive arrays in Java" -> that is a good idea. Let me try and report back.
>
> @Wes -
>
> Is there some benchmark template for C++ routines I can have a look at? I would be happy to get some input from Java-Arrow experts on how to write these benchmarks more efficiently. I will have a closer look at the JIRA tickets that you mentioned.
>
> So, for now I am focusing on case 3, which is about establishing the performance when reading from a local in-memory I/O stream that I implemented (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java). In this case I first read data from Parquet files, convert it into Arrow, write it out to a MemoryIOChannel, and then read it back from there. So the performance has nothing to do with Crail or HDFS in case 3. Once I establish the base performance in this setup (which is around ~40 Gbps with 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams can take ArrowBuf as src/dst buffers. That should be doable.

Right, in any case what you are testing is the performance of using a particular Java accessor layer to JVM off-heap Arrow memory to sum the non-null values of each column. I'm not sure that a single bandwidth number produced by this benchmark is very informative for people contemplating what memory format to use in their system, due to the current state of the implementation (Java) and the workload measured (summing the non-null values with a naive algorithm). I would guess that a C++ version with raw pointers and a loop-unrolled, branch-free vectorized sum is going to be a lot faster.
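Even without C++ SIMD, some of the branchiness Wes mentions can be taken out of a Java sum by inspecting the validity bitmap one byte (8 values) at a time, with fast paths for the all-set and all-null cases. The sketch below is illustrative only and is not taken from any of the repositories in this thread.

    import org.apache.arrow.memory.ArrowBuf;
    import org.apache.arrow.vector.Float4Vector;

    /**
     * Less branchy sum over a Float4Vector: test a whole validity byte first,
     * and only fall back to per-bit checks for mixed bytes.
     */
    final class BytewiseSum {

      static double sumNonNull(Float4Vector vector) {
        final ArrowBuf validity = vector.getValidityBuffer();
        final ArrowBuf data = vector.getDataBuffer();
        final int valueCount = vector.getValueCount();

        double sum = 0;
        int i = 0;
        for (; i + 8 <= valueCount; i += 8) {
          final byte bits = validity.getByte(i >> 3);
          if (bits == (byte) 0xFF) {                 // all 8 values non-null: no per-value test
            for (int j = 0; j < 8; j++) {
              sum += data.getFloat((i + j) * 4);
            }
          } else if (bits != 0) {                    // mixed byte: test each bit
            for (int j = 0; j < 8; j++) {
              if ((bits & (1 << j)) != 0) {
                sum += data.getFloat((i + j) * 4);
              }
            }
          }                                          // bits == 0: all 8 values null, skip
        }
        for (; i < valueCount; i++) {                // tail that doesn't fill a byte
          if (!vector.isNull(i)) {
            sum += vector.get(i);
          }
        }
        return sum;
      }
    }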
> @Jacques -
>
> That is a good point that "Arrow's implementation is more focused on interacting with the structure than transporting it". However, in any distributed system one needs to move data/structure around - as far as I understand, that is another goal of the project. My investigation started within the context of Spark/SQL data processing. Spark converts incoming data into its own in-memory UnsafeRow representation for processing, so naturally the performance of this data ingestion pipeline cannot outperform the read performance of the file format used. I benchmarked Parquet, ORC, Avro, and JSON (for the specific TPC-DS store_sales table), and then curiously benchmarked Arrow as well, because its design choices are a better fit for the modern high-performance RDMA/NVMe/100+Gbps devices I am investigating. From this point of view, I am trying to find out whether Arrow can be the file format for the next generation of storage/networking devices (see the Apache Crail project), delivering reading/writing rates close to the hardware speed. As Wes pointed out, a C++ library implementation should be memory-IO bound, so what would it take to deliver the same performance in Java ;) (and then, from across the network).
>
> I hope this makes sense.
>
> Cheers,
> -- Animesh

On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <jacq...@apache.org> wrote:

My big question is what is the use case and how/what are you trying to compare? Arrow's implementation is more focused on interacting with the structure than transporting it. Generally speaking, when we're working with Arrow data we frequently are just interacting with memory locations and doing direct operations. If you have a layer that supports that type of semantic, create a movement technique that depends on that.
Arrow doesn't force a particular API, since the data itself is defined by its in-memory layout, so if you have a custom use or pattern, just work with the in-memory structures.

On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi Animesh,

Per Johan's comments, the C++ library is essentially going to be IO/memory bandwidth bound since you're interacting with raw pointers.

I'm looking at your code

    private void consumeFloat4(FieldVector fv) {
        Float4Vector accessor = (Float4Vector) fv;
        int valCount = accessor.getValueCount();
        for (int i = 0; i < valCount; i++) {
            if (!accessor.isNull(i)) {
                float4Count += 1;
                checksum += accessor.get(i);
            }
        }
    }

You'll want to get a Java-Arrow expert from Dremio to advise you on the fastest way to iterate over this data -- my understanding is that much code in Dremio interacts with the wrapped Netty ArrowBuf objects rather than going through the higher level APIs. You're also dropping performance because memory mapping is not yet implemented in Java, see https://issues.apache.org/jira/browse/ARROW-3191.

Furthermore, the IPC reader class you are using could be made more efficient. I described the problem in https://issues.apache.org/jira/browse/ARROW-3192 -- this will be required as soon as we have the ability to do memory mapping in Java.

Could Crail use the Arrow data structures in its runtime rather than copying? If not, how are Crail's runtime data structures different?

- Wes

On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI <j.w.peltenb...@tudelft.nl> wrote:

Hello Animesh,

I browsed a bit in your sources, thanks for sharing. We have performed some similar measurements to your third case in the past for C/C++ on collections of various basic types such as primitives and strings.
I can say that in terms of consuming data from the Arrow format versus language-native collections in C++, I've so far always been able to get similar performance numbers (i.e. no drawbacks due to the Arrow format itself), especially when accessing the data through Arrow's raw data pointers (and using, for example, std::string_view-like constructs).

In C/C++ the fast data structures are engineered in such a way that as few pointer traversals as possible are required, and they take up as small a memory footprint as possible. Thus each memory access is relatively efficient (in terms of obtaining the data of interest). The same can absolutely be said for Arrow, if not even more so in some cases where object fields are of variable length.

In the JVM case, the Arrow data is stored off-heap. This requires the JVM to interface with it through some calls to Unsafe hidden under the Netty layer (but please correct me if I'm wrong, I'm not an expert on this). Those calls are the only reason I can think of that would degrade the performance a bit compared to a pure Java case. I don't know if the Unsafe calls are inlined during JIT compilation. If they aren't, they will increase access latency to any data a little bit.

I don't have a similar machine, so it's not easy to relate my numbers to yours, but if you can get that data consumed at 100 Gbps in a pure Java case, I don't see any reason (resulting from the Arrow format / off-heap storage) why you wouldn't be able to get at least really close. Can you get to 100 Gbps starting from primitive arrays in Java with your consumption functions in the first place?

I'm interested to see your progress on this.

Kind regards,

Johan Peltenburg
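Johan's baseline question - how fast a pure Java loop over a primitive array runs on the same machine - can be answered with something as simple as the following. The sizes are arbitrary, and in practice the loop should be wrapped in a JMH benchmark to avoid measuring JIT warm-up.

    import java.util.concurrent.ThreadLocalRandom;

    /**
     * On-heap baseline: sum a primitive float[] and report the effective
     * consumption bandwidth. Useful as an upper bound when comparing against
     * the Arrow off-heap accessor path.
     */
    public class PrimitiveArrayBaseline {
      public static void main(String[] args) {
        final int n = 256 * 1024 * 1024 / Float.BYTES;   // 256 MB worth of floats
        final float[] values = new float[n];
        for (int i = 0; i < n; i++) {
          values[i] = ThreadLocalRandom.current().nextFloat();
        }

        double checksum = 0;
        final long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
          checksum += values[i];
        }
        final long elapsedNs = System.nanoTime() - start;

        // bits consumed per nanosecond == Gbit/s
        final double gbits = (n * (long) Float.BYTES * 8.0) / elapsedNs;
        System.out.printf("checksum=%f, bandwidth=%.1f Gbit/s%n", checksum, gbits);
      }
    }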
________________________________
From: Animesh Trivedi <animesh.triv...@gmail.com>
Sent: Wednesday, September 19, 2018 2:08:50 PM
To: dev@arrow.apache.org; d...@crail.apache.org
Subject: [JAVA] Arrow performance measurement

Hi all,

A week ago, Wes and I had a discussion about the performance of the Arrow/Java implementation on the Apache Crail (Incubating) mailing list (http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser). In a nutshell: I am investigating the performance of various file formats (including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE setups. I benchmarked how long it takes to materialize values (ints, longs, doubles) of the store_sales table, the largest table in the TPC-DS dataset, stored in different file formats. Here is a write-up on this: https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I found that between a pair of machines connected over a 100 Gbps link, Arrow (used as a file format on HDFS) delivered close to ~30 Gbps of bandwidth with all 16 cores engaged. Wes pointed out that (i) Arrow is an in-memory IPC format and has not been optimized for storage interfaces/APIs like HDFS; (ii) the performance I am measuring is for the Java implementation.

Wes, I hope I summarized our discussion correctly.

That brings us to this email, where I promised to follow up on the Arrow mailing list to understand and optimize the performance of the Arrow/Java implementation on high-performance devices. I wrote a small stand-alone benchmark (https://github.com/animeshtrivedi/benchmarking-arrow) with three implementations of the WritableByteChannel and SeekableByteChannel interfaces:

1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps performance
2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps performance
3. Arrow data is stored in an on-heap byte[] - this gives me ~39 Gbps performance

I think the order makes sense. To better understand the performance of Arrow/Java we can focus on option 3.

The key question I am trying to answer is "what would it take for Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If yes, then what is missing or misinterpreted by me? If not, then where is the performance lost? Does anyone have any performance measurements for the C++ implementation, if they have seen or expect better numbers?

As a next step, I will profile the read path of Arrow/Java for option 3. I will report my findings.

Any thoughts and feedback on this investigation are very welcome.

Cheers,
-- Animesh

PS~ Cross-posting on the d...@crail.apache.org list as well.
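For reference, an "option 3"-style round trip - writing an Arrow file into an on-heap byte[] and reading it back through a SeekableByteChannel - can be put together roughly as follows. This sketch uses ByteArrayReadableSeekableByteChannel from Arrow's vector utilities rather than the MemoryIOChannel class from the benchmark repository above, and the row counts are deliberately tiny; a real benchmark would use much larger batches and time only the read side.

    import java.io.ByteArrayOutputStream;
    import java.nio.channels.Channels;
    import java.util.Collections;

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileReader;
    import org.apache.arrow.vector.ipc.ArrowFileWriter;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.Schema;
    import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

    /**
     * Round trip through an in-memory "file": write one record batch into a
     * byte[], then read it back and walk the values.
     */
    public class InMemoryRoundTrip {
      public static void main(String[] args) throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
          Schema schema = new Schema(Collections.singletonList(
              Field.nullable("value", new ArrowType.Int(32, true))));

          // Write one batch into an in-memory "file".
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
               ArrowFileWriter writer =
                   new ArrowFileWriter(root, null, Channels.newChannel(bytes))) {
            IntVector values = (IntVector) root.getVector("value");
            values.allocateNew(1024);
            for (int i = 0; i < 1024; i++) {
              values.setSafe(i, i);
            }
            root.setRowCount(1024);
            writer.start();
            writer.writeBatch();
            writer.end();
          }

          // Read it back from the byte[] and consume the values.
          long sum = 0;
          try (ArrowFileReader reader = new ArrowFileReader(
                   new ByteArrayReadableSeekableByteChannel(bytes.toByteArray()), allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {
              IntVector values = (IntVector) root.getVector("value");
              for (int i = 0; i < root.getRowCount(); i++) {
                if (!values.isNull(i)) {
                  sum += values.get(i);
                }
              }
            }
          }
          System.out.println("sum = " + sum);
        }
      }
    }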