By the way, I just did a quick test of using Memory instead of ByteBuffer
in VectorAggregator with groupBy (hash grouping, 8-byte key, two
aggregators) and measured a 14% speedup. I think the speedup would be
higher with more aggregators or with array-based rather than hash-based
grouping (since in those cases, relatively more time is spent doing
aggregations).

On Thu, Feb 6, 2020 at 11:32 AM Gian Merlino <g...@apache.org> wrote:

> We could make an interface that is a subset of Memory and use that. But I
> think once the new JEP 370 stuff becomes mainstream it would be best to use
> it directly, since it offers a richer API and will probably become
> considered idiomatic Java over time. So we may still want to drop the
> abstraction later, once we drop support for Java pre-14.
>
> Separately, I think if we do build an abstraction layer here, we need to
> make sure the performance overhead is zero — it's important that the jvm be
> able to inline the underlying calls.
>
> > @Gian Merlino I think i am not 100% sure about the scope you are talking
> about, thought you are talking about
> https://github.com/apache/druid/issues/3892 and doing the work from
> bottom up (storage to processing ...)
>
> For now I'm only proposing using a non-BB API (whether it's Memory or our
> own abstraction) for the VectorAggregator interface and the query engines
> that use it. I don't see a strong motivation to tackle the storage layer at
> this time. The reason is that when I profile Druid queries I usually see
> the lions share of time being spend in query engines and aggregators, so I
> think that's what we need to tackle first from a performance perspective.
>
> Btw, a side note, this talk is great:
> https://www.youtube.com/watch?v=iwSCtxMbBLI
>
> On Wed, Feb 5, 2020 at 6:44 PM Slim Bouguerra <slim.bougue...@gmail.com>
> wrote:
>
>> @Jad Sounds good approach was thinking the same as well how can we expose
>> those using higher level API.
>> @Gian Merlino <g...@apache.org> I think i am not 100% sure about the
>> scope you are talking about, thought you are talking about
>> https://github.com/apache/druid/issues/3892 and doing the work from
>> bottom up (storage to processing ...)
>> Can you please highlight the scope and would love to help if you think
>> adding such abstraction will add an extra work overhead that you do want to
>> avoid.
>>
>>
>> On Wed, Feb 5, 2020 at 5:52 PM Jad Naous <jad.na...@imply.io> wrote:
>>
>>> But then you could never get rid of it... Ideally abstraction layers
>>> should
>>> be constrained to the actual uses within the system using them instead of
>>> making something that may be too general and would be hard to replace.
>>>
>>> On Wed, Feb 5, 2020, 4:03 PM Gian Merlino <g...@apache.org> wrote:
>>>
>>> > I wonder if Memory itself could be that layer?
>>> >
>>> > On Wed, Feb 5, 2020 at 10:03 AM Jad Naous <jad.na...@imply.io> wrote:
>>> >
>>> > > We could build an abstraction layer on top of the memory interface
>>> > provided
>>> > > by DataSketches. When the JDK gets the new stuff, we can just change
>>> the
>>> > > implementation of the abstraction.
>>> > >
>>> > > On Wed, Feb 5, 2020 at 9:43 AM Gian Merlino <g...@apache.org> wrote:
>>> > >
>>> > > > The thing that worries me about JEP 370 is that if historical Java
>>> user
>>> > > > migration patterns hold up, we will need to support Java 11 for a
>>> while
>>> > > > (probably another 2–3 years), and we would therefore need to wait
>>> that
>>> > > long
>>> > > > to use JEP 370. It seems like a long time and until then we would
>>> be
>>> > > stuck
>>> > > > with a pretty inferior API.
>>> > > >
>>> > > > I also would prefer not having to rewrite code a bunch of times,
>>> but
>>> > > that's
>>> > > > why I suggested starting by using Memory for the VectorAggregator
>>> > > interface
>>> > > > and stuff that interacts with it. There isn't that much code there
>>> yet
>>> > > > (only a few aggregators implement VectorAggregator). So we will
>>> need to
>>> > > > write most of it for the first time, and since it is fresh code, I
>>> > think
>>> > > > it'd be nice to use the best API currently available in Java 8 /
>>> 11.
>>> > From
>>> > > > what I see that is Memory.
>>> > > >
>>> > > > On Wed, Feb 5, 2020 at 9:21 AM Slim Bouguerra <bs...@apache.org>
>>> > wrote:
>>> > > >
>>> > > > > Hi Gian,
>>> > > > > Thanks for bringing this up.
>>> > > > > IMO for the long run and looking at how much code will have to
>>> > change,
>>> > > it
>>> > > > > makes more sense to rely on JDK based API JEP 370 and have this
>>> work
>>> > > done
>>> > > > > ONCE as oppose to multiple iteration. FYI i do not think it is
>>> far
>>> > > away,
>>> > > > > seems like there is a good momentum around it.
>>> > > > > This does not exclude or means we should not use Memory API for
>>> other
>>> > > > stuff
>>> > > > > like sketches et al, in fact i think even for project like
>>> Sketches
>>> > it
>>> > > > > makes more sense to move to newer API offered by the JDK rather
>>> that
>>> > do
>>> > > > it
>>> > > > > your self.
>>> > > > >
>>> > > > >
>>> > > > > On Tue, Feb 4, 2020 at 10:12 PM Gian Merlino <g...@apache.org>
>>> > wrote:
>>> > > > >
>>> > > > > > Hey Druids,
>>> > > > > >
>>> > > > > > There has generally been a lot of talk about moving away from
>>> > > > ByteBuffer
>>> > > > > > and towards the DataSketches Memory package (
>>> > > > > > https://datasketches.apache.org/docs/Memory/MemoryPackage.html)
>>> or
>>> > > > even
>>> > > > > > using Unsafe directly. Much of that discussion happened on
>>> > > > > > https://github.com/apache/druid/issues/3892.
>>> > > > > >
>>> > > > > > Recently a patch was merged that added datasketches-memory as a
>>> > > > > dependency
>>> > > > > > of druid-processing: https://github.com/apache/druid/pull/9308
>>> .
>>> > The
>>> > > > > reason
>>> > > > > > was partially due to better performance and partially due to
>>> nicer
>>> > > API
>>> > > > > > (both reasons mentioned in #3892 as well).
>>> > > > > >
>>> > > > > > JEP 370 is a potential long term solution but it seems a while
>>> away
>>> > > > from
>>> > > > > > being ready: https://openjdk.java.net/jeps/370
>>> > > > > >
>>> > > > > > I wanted to bring the larger discussion back up and see what
>>> people
>>> > > > think
>>> > > > > > is a good path forward.
>>> > > > > >
>>> > > > > > My suggestion is that we migrate the VectorAggregator
>>> interface to
>>> > > use
>>> > > > > > Memory, but keep BufferAggregator the way it is. That way, as
>>> we
>>> > > build
>>> > > > > out
>>> > > > > > support for vectorization (right now, only timeseries/groupby
>>> > support
>>> > > > it,
>>> > > > > > and only a few aggregators, but we should be building this out)
>>> > we'll
>>> > > > be
>>> > > > > > doing it with a nicer and potentially faster API. But we won't
>>> need
>>> > > to
>>> > > > go
>>> > > > > > back and redo a bunch of old code, since we'll keep the
>>> > > non-vectorized
>>> > > > > code
>>> > > > > > paths the way they are. (And hopefully, one day, delete them
>>> all
>>> > > > > outright.)
>>> > > > > >
>>> > > > > > Gian
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>> --
>>
>> B-Slim
>> _______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______
>>
>

Reply via email to