By the way, I just did a quick test of using Memory instead of ByteBuffer in VectorAggregator with groupBy (hash grouping, 8-byte key, two aggregators) and measured a 14% speedup. I think the speedup would be higher with more aggregators or with array-based rather than hash-based grouping (since in those cases, relatively more time is spent doing aggregations).
On Thu, Feb 6, 2020 at 11:32 AM Gian Merlino <g...@apache.org> wrote: > We could make an interface that is a subset of Memory and use that. But I > think once the new JEP 370 stuff becomes mainstream it would be best to use > it directly, since it offers a richer API and will probably become > considered idiomatic Java over time. So we may still want to drop the > abstraction later, once we drop support for Java pre-14. > > Separately, I think if we do build an abstraction layer here, we need to > make sure the performance overhead is zero — it's important that the jvm be > able to inline the underlying calls. > > > @Gian Merlino I think i am not 100% sure about the scope you are talking > about, thought you are talking about > https://github.com/apache/druid/issues/3892 and doing the work from > bottom up (storage to processing ...) > > For now I'm only proposing using a non-BB API (whether it's Memory or our > own abstraction) for the VectorAggregator interface and the query engines > that use it. I don't see a strong motivation to tackle the storage layer at > this time. The reason is that when I profile Druid queries I usually see > the lions share of time being spend in query engines and aggregators, so I > think that's what we need to tackle first from a performance perspective. > > Btw, a side note, this talk is great: > https://www.youtube.com/watch?v=iwSCtxMbBLI > > On Wed, Feb 5, 2020 at 6:44 PM Slim Bouguerra <slim.bougue...@gmail.com> > wrote: > >> @Jad Sounds good approach was thinking the same as well how can we expose >> those using higher level API. >> @Gian Merlino <g...@apache.org> I think i am not 100% sure about the >> scope you are talking about, thought you are talking about >> https://github.com/apache/druid/issues/3892 and doing the work from >> bottom up (storage to processing ...) >> Can you please highlight the scope and would love to help if you think >> adding such abstraction will add an extra work overhead that you do want to >> avoid. >> >> >> On Wed, Feb 5, 2020 at 5:52 PM Jad Naous <jad.na...@imply.io> wrote: >> >>> But then you could never get rid of it... Ideally abstraction layers >>> should >>> be constrained to the actual uses within the system using them instead of >>> making something that may be too general and would be hard to replace. >>> >>> On Wed, Feb 5, 2020, 4:03 PM Gian Merlino <g...@apache.org> wrote: >>> >>> > I wonder if Memory itself could be that layer? >>> > >>> > On Wed, Feb 5, 2020 at 10:03 AM Jad Naous <jad.na...@imply.io> wrote: >>> > >>> > > We could build an abstraction layer on top of the memory interface >>> > provided >>> > > by DataSketches. When the JDK gets the new stuff, we can just change >>> the >>> > > implementation of the abstraction. >>> > > >>> > > On Wed, Feb 5, 2020 at 9:43 AM Gian Merlino <g...@apache.org> wrote: >>> > > >>> > > > The thing that worries me about JEP 370 is that if historical Java >>> user >>> > > > migration patterns hold up, we will need to support Java 11 for a >>> while >>> > > > (probably another 2–3 years), and we would therefore need to wait >>> that >>> > > long >>> > > > to use JEP 370. It seems like a long time and until then we would >>> be >>> > > stuck >>> > > > with a pretty inferior API. >>> > > > >>> > > > I also would prefer not having to rewrite code a bunch of times, >>> but >>> > > that's >>> > > > why I suggested starting by using Memory for the VectorAggregator >>> > > interface >>> > > > and stuff that interacts with it. There isn't that much code there >>> yet >>> > > > (only a few aggregators implement VectorAggregator). So we will >>> need to >>> > > > write most of it for the first time, and since it is fresh code, I >>> > think >>> > > > it'd be nice to use the best API currently available in Java 8 / >>> 11. >>> > From >>> > > > what I see that is Memory. >>> > > > >>> > > > On Wed, Feb 5, 2020 at 9:21 AM Slim Bouguerra <bs...@apache.org> >>> > wrote: >>> > > > >>> > > > > Hi Gian, >>> > > > > Thanks for bringing this up. >>> > > > > IMO for the long run and looking at how much code will have to >>> > change, >>> > > it >>> > > > > makes more sense to rely on JDK based API JEP 370 and have this >>> work >>> > > done >>> > > > > ONCE as oppose to multiple iteration. FYI i do not think it is >>> far >>> > > away, >>> > > > > seems like there is a good momentum around it. >>> > > > > This does not exclude or means we should not use Memory API for >>> other >>> > > > stuff >>> > > > > like sketches et al, in fact i think even for project like >>> Sketches >>> > it >>> > > > > makes more sense to move to newer API offered by the JDK rather >>> that >>> > do >>> > > > it >>> > > > > your self. >>> > > > > >>> > > > > >>> > > > > On Tue, Feb 4, 2020 at 10:12 PM Gian Merlino <g...@apache.org> >>> > wrote: >>> > > > > >>> > > > > > Hey Druids, >>> > > > > > >>> > > > > > There has generally been a lot of talk about moving away from >>> > > > ByteBuffer >>> > > > > > and towards the DataSketches Memory package ( >>> > > > > > https://datasketches.apache.org/docs/Memory/MemoryPackage.html) >>> or >>> > > > even >>> > > > > > using Unsafe directly. Much of that discussion happened on >>> > > > > > https://github.com/apache/druid/issues/3892. >>> > > > > > >>> > > > > > Recently a patch was merged that added datasketches-memory as a >>> > > > > dependency >>> > > > > > of druid-processing: https://github.com/apache/druid/pull/9308 >>> . >>> > The >>> > > > > reason >>> > > > > > was partially due to better performance and partially due to >>> nicer >>> > > API >>> > > > > > (both reasons mentioned in #3892 as well). >>> > > > > > >>> > > > > > JEP 370 is a potential long term solution but it seems a while >>> away >>> > > > from >>> > > > > > being ready: https://openjdk.java.net/jeps/370 >>> > > > > > >>> > > > > > I wanted to bring the larger discussion back up and see what >>> people >>> > > > think >>> > > > > > is a good path forward. >>> > > > > > >>> > > > > > My suggestion is that we migrate the VectorAggregator >>> interface to >>> > > use >>> > > > > > Memory, but keep BufferAggregator the way it is. That way, as >>> we >>> > > build >>> > > > > out >>> > > > > > support for vectorization (right now, only timeseries/groupby >>> > support >>> > > > it, >>> > > > > > and only a few aggregators, but we should be building this out) >>> > we'll >>> > > > be >>> > > > > > doing it with a nicer and potentially faster API. But we won't >>> need >>> > > to >>> > > > go >>> > > > > > back and redo a bunch of old code, since we'll keep the >>> > > non-vectorized >>> > > > > code >>> > > > > > paths the way they are. (And hopefully, one day, delete them >>> all >>> > > > > outright.) >>> > > > > > >>> > > > > > Gian >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> >> >> >> -- >> >> B-Slim >> _______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______ >> >