gianm opened a new issue #8118: GroupBy array based result rows
URL: https://github.com/apache/incubator-druid/issues/8118
 
 
   ### Motivation
   
   GroupBy queries internally represent result rows as `MapBasedRow` objects, 
which have the following two fields:
   
   ```java
   private final DateTime timestamp;
   private final Map<String, Object> event;
   ```
   
   As a result, we need to do relatively expensive Map put and get operations 
(typically these are HashMaps or LinkedHashMaps) at many points: when rows are 
first generated after each segment scan, when they are merged on historicals, 
when they are serialized and deserialized, and then when they are merged again 
on the broker.
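   For illustration only (this is not Druid code), here is the same row accessed 
both ways: the map-based form pays a hash lookup per column per row, while the 
array-based form is a plain index into an `Object[]`.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   public class RowAccessDemo
   {
     public static void main(String[] args)
     {
       // Map-based representation: one hash lookup per column access.
       final Map<String, Object> event = new LinkedHashMap<>();
       event.put("dim", "a");
       event.put("cnt", 7L);
       final long cntFromMap = (long) event.get("cnt");
   
       // Array-based representation: a direct index, no hashing.
       final Object[] row = new Object[]{"a", 7L};
       final long cntFromArray = (long) row[1];
   
       System.out.println(cntFromMap == cntFromArray); // prints "true"
     }
   }
   ```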
   
   The overhead is especially noticeable when the result set of the groupBy 
query is large.
   
   See also #6389.
   
   ### Proposed changes
   
   1. Create a `ResultRow` class that simply wraps an `Object[]` and allows 
position-based access.
   2. Modify the GroupBy query implementation to use `ResultRow` throughout, 
rather than `Row` / `MapBasedRow`.
   3. Add `ObjectMapper decorateObjectMapper(ObjectMapper, QueryType)` to 
QueryToolChest, to aid in implementing the compatibility plan described in 
"Operational impact" below. QueryResource would use it so it could serialize 
results into either arrays or maps depending on the value of `resultAsArray`. 
DirectDruidClient would use it so it could deserialize results into ResultRow 
regardless of whether they originated as ResultRows or MapBasedRows.
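   A minimal sketch of what the wrapper in step 1 might look like. Method names 
here (`of`, `create`, `get`, `set`) are illustrative assumptions, not the final 
API:
   
   ```java
   import java.util.Arrays;
   
   /**
    * Sketch of a position-based result row: a thin wrapper around Object[]
    * that gives positional access plus a type-safe appearance in signatures.
    */
   public class ResultRow
   {
     private final Object[] row;
   
     private ResultRow(final Object[] row)
     {
       this.row = row;
     }
   
     /** Wrap an existing array. */
     public static ResultRow of(final Object... row)
     {
       return new ResultRow(row);
     }
   
     /** Allocate an empty row of a known width. */
     public static ResultRow create(final int size)
     {
       return new ResultRow(new Object[size]);
     }
   
     @SuppressWarnings("unchecked")
     public <T> T get(final int i)
     {
       return (T) row[i];
     }
   
     public void set(final int i, final Object value)
     {
       row[i] = value;
     }
   
     public int length()
     {
       return row.length;
     }
   
     /** Expose the backing array, e.g. for serialization. */
     public Object[] getArray()
     {
       return row;
     }
   
     @Override
     public boolean equals(final Object o)
     {
       return o instanceof ResultRow && Arrays.equals(row, ((ResultRow) o).row);
     }
   
     @Override
     public int hashCode()
     {
       return Arrays.hashCode(row);
     }
   }
   ```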
   
   ### Rationale
   
   Some other potential approaches that I considered, and did not go with, 
include:
   
   - Creating an ArrayBasedRow that implements 
`org.apache.druid.data.input.Row` (just like MapBasedRow does). The reason for 
avoiding this is that the interface is all about retrieving fields by name -- 
`getRaw(String dimension)`, etc -- and I wanted to do positional access instead.
   - Using `Object[]` instead of a wrapper `ResultRow` around the `Object[]`. 
It would have saved a little memory, but I thought the benefits of type-safety 
(it's clear what ResultRow means when it appears in method signatures) and a 
nicer API would be worth it.
   
   ### Operational impact
   
   The format of data in the query cache would not change.
   
   The wire format of groupBy results would change (this is part of the point 
of the change) but I plan to do this with no compatibility impact, by adding a 
new query context flag `resultAsArray` that defaults to false. If false, Druid 
would use the array-based result rows for in-memory operations, but then 
convert them to MapBasedRows for serialization purposes, keeping the wire 
format compatible. If true, Druid would use array-based result rows for 
serialization too.
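   The compatibility conversion could be sketched like this (illustrative only; 
class and method names are hypothetical): when `resultAsArray` is false, the 
positional row is rebuilt into a name-keyed map before serialization, so the 
wire format stays unchanged.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   
   /**
    * Hypothetical sketch of the resultAsArray=false compatibility path:
    * positional rows are mapped to and from the legacy name-keyed form
    * using the query's known column order.
    */
   public class ResultRowConversion
   {
     /** Convert a positional row to a map, given the query's column names. */
     public static Map<String, Object> toMap(final Object[] row, final List<String> columnNames)
     {
       final Map<String, Object> event = new LinkedHashMap<>();
       for (int i = 0; i < columnNames.size(); i++) {
         event.put(columnNames.get(i), row[i]);
       }
       return event;
     }
   
     /** Inverse: rebuild a positional row from a deserialized map. */
     public static Object[] fromMap(final Map<String, Object> event, final List<String> columnNames)
     {
       final Object[] row = new Object[columnNames.size()];
       for (int i = 0; i < columnNames.size(); i++) {
         row[i] = event.get(columnNames.get(i));
       }
       return row;
     }
   }
   ```
   
   The key observation is that both directions only need the query itself (for 
the column order), which is why a query-aware `decorateObjectMapper` hook on 
QueryToolChest is enough to hide the conversion from callers.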
   
   I'd have brokers always set `resultAsArray` to true on queries they send down 
to historicals. Since we tell cluster operators to update historicals first, 
that means that by the time the broker is updated, we can assume the 
historicals will know how to interpret the option. Users would also be able to 
set `resultAsArray` if they want once brokers are updated, and receive 
array-based results themselves.
   
   So, due to the above design, there should be no operational impact.
   
   ### Test plan
   
   Existing unit tests will cover a lot of this. In addition, I plan to test on 
live clusters, especially the compatibility behavior.
