clintropolis opened a new pull request #7496: refactor druid-bloom-filter 
aggregators
URL: https://github.com/apache/incubator-druid/pull/7496
 
 
   This PR refactors the `druid-bloom-filter` extension aggregators to strictly 
use `ByteBuffer` representation of `BloomKFilter` and buffer based methods to 
manipulate them in place, allowing for combined `Aggregator` and 
`BufferAggregator` implementations and vastly simplifying the code. It also 
fixes a bug in the `combine` logic of the current implementation, resulting 
from the mixed use of `BloomKFilter` and `ByteBuffer` (the current 
implementation only can combine `BloomKFilter`, but the broker will receive 
`byte[]` from the historicals, resulting in a cast exception).
   
   Originally introduced in #6397, the PR sort of grew organically over its 41 
commit lifetime; from starting out where it only dealt with `BloomKFilter` on 
heap and suffered the overhead of a lot of serde, to where it ended up, which 
mixed usage of `BloomKFilter` and `ByteBuffer` after adding some methods to 
optimize the `BufferAggregator` implementation, done in an attempt to reduce 
the overhead of serializing and deserializing such potentially large values so 
often, and picking up a bug or two along the way.
   
   This new approach combines implementations of both types of the aggregators 
and now _always_ operates with `ByteBuffer`. The `BloomFilterAggregatorFactory` 
and `BloomFilterMergeAggregatorFactory` construct the `Aggregator` and 
`BufferAggregator` implementations with a perhaps not greatly named boolean 
parameter on the constructor, `onHeap`. If `onHeap` is set to `true`, the 
aggregator will allocate an appropriately sized `ByteBuffer` to hold a 
`BloomKFilter` allow usage as an `Aggregator`, and if not must rely on the 
`ByteBuffer` that will be passed to it's methods during it's life as a 
`BufferAggregator`. 
   
   There is probably room for improvement to make this a bit more elegant, 
maybe making the aggregator constructors private and adding static methods to 
more explicitly construct either an `Aggregator` with `onHeap = true`, or a 
`BufferAggregator` with `onHeap = false`? Regardless, this eliminates all of 
the extra serde and should be a lot more simple to reason about and 
troubleshoot.
   
   I have tested with top-n, timeseries, and group-by queries on a small test 
cluster with multiple-historical spun up on my laptop, and things look good so 
far.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to