leerho commented on issue #9806:
URL: https://github.com/apache/druid/issues/9806#issuecomment-622988699
@gianm
Yes and Yes. I can envision a couple of ways to do this.
The first and simplest option would be an added feature of the DST that,
given a column, would dump the data contents of each aggregation item as a
separate binary file in some user-specified output directory. This has the
advantage that the DST does not need to know any specifics of the aggregation
item. This was the first thing I tried: I simply decoded the base64 rows from
the DST output and wrote out a bin file for each one. In the segment I was
given, it produced 47 binary files. It has the disadvantage that it doesn't
communicate anything about the internal state of those binary files, which in
our case represent sketches.
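A minimal sketch of that first experiment, in Python for brevity. It assumes the base64 rows have already been pulled out of the DST output as plain strings; the function name `dump_items` and the `item_NNNNN.bin` file naming are hypothetical and not part of the DST.

```python
import base64
from pathlib import Path


def dump_items(b64_rows, out_dir):
    """Decode each base64 row and write it to its own .bin file.

    b64_rows: iterable of base64 strings, one per aggregation item
              (assumed to be already extracted from the DST output).
    out_dir:  user-specified output directory, created if missing.
    Returns the list of paths written, in input order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, row in enumerate(b64_rows, start=1):
        # validate=True rejects characters outside the base64 alphabet
        blob = base64.b64decode(row.strip(), validate=True)
        path = out / f"item_{i:05d}.bin"  # hypothetical naming scheme
        path.write_bytes(blob)
        paths.append(path)
    return paths
```

Nothing here needs to know what the bytes mean, which is the point of this option: the dump step stays generic, and interpretation is left to a later tool.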
The second option would add functionality, somewhere, that uses the
DataSketches library to read those binary files and produce human-readable
output like the following for each one. This is the output from the 6th of the
47 sketches in the segment:
```
HllSketch 6
### HLL SKETCH PREAMBLE:
Byte 0: Preamble Ints : 10
Byte 1: SerVer : 1
Byte 2: Family : HLL
Byte 3: lgK : 14
Byte 4: LgArr or Aux LgArr : 0
Byte 5: Flags: : 00001000, 8
BIG_ENDIAN_STORAGE : false
(Native Byte Order) : LITTLE_ENDIAN
READ_ONLY : false
EMPTY : false
COMPACT : true
OUT_OF_ORDER : false
REBUILD_KXQ : false
Byte 6: Cur Min : 0
Byte 7: Mode : HLL, HLL_4
HIP Accum : 6562.656349859294
KxQ0 : 12685.76611328125
KxQ1 : 0.0
Num At Cur Min : 10988
Aux Count : 0
### END HLL SKETCH PREAMBLE
### HLL SKETCH SUMMARY:
Log Config K : 14
Hll Target : HLL_4
Current Mode : HLL
Memory : false
LB : 6520.24649606101
Estimate : 6562.656349859294
UB : 6605.621511177209
OutOfOrder Flag: false
CurMin : 0
NumAtCurMin : 10988
HipAccum : 6562.656349859294
KxQ0 : 12685.76611328125
KxQ1 : 0.0
Rebuild KxQ Flg: false
```
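The first eight preamble bytes in the listing above can also be read without the DataSketches library. The following is only a rough sketch: the byte positions are taken from the listing, while the flag bit order (inferred from the order in which the flags are printed, with only COMPACT = bit 3 confirmed by the sample value 00001000) and the encoding of the mode byte are assumptions. A real tool should use the library's own reader.

```python
# Flag names in assumed bit order (bit 0 first). Only COMPACT at bit 3
# is confirmed by the sample output (flags 00001000 -> COMPACT true).
FLAG_NAMES = [
    "BIG_ENDIAN_STORAGE",
    "READ_ONLY",
    "EMPTY",
    "COMPACT",
    "OUT_OF_ORDER",
    "REBUILD_KXQ",
]
# Assumed encoding of byte 7: low 2 bits = current mode, next 2 = target type.
CUR_MODES = {0: "LIST", 1: "SET", 2: "HLL"}
TARGET_TYPES = {0: "HLL_4", 1: "HLL_6", 2: "HLL_8"}


def parse_preamble(blob):
    """Decode bytes 0..7 of a serialized HLL sketch into a dict."""
    if len(blob) < 8:
        raise ValueError("need at least 8 bytes of preamble")
    pre_ints, ser_ver, family, lg_k, lg_arr, flags, cur_min, mode = blob[:8]
    return {
        "preamble_ints": pre_ints,
        "ser_ver": ser_ver,
        "family_id": family,  # raw ID; mapping to a name is left out
        "lg_k": lg_k,
        "lg_arr": lg_arr,
        "flags": {n: bool(flags >> i & 1) for i, n in enumerate(FLAG_NAMES)},
        "cur_min": cur_min,
        "cur_mode": CUR_MODES.get(mode & 0x3, "UNKNOWN"),
        "target_type": TARGET_TYPES.get(mode >> 2 & 0x3, "UNKNOWN"),
    }
```

Fields beyond byte 7 (HIP Accum, KxQ0, KxQ1, and so on) are deliberately left out, since their offsets are not visible in the listing.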
However, for the Druid use case where we could be dealing with potentially
thousands of sketches in a large segment, reformatting the above information
into a summary matrix, where each sketch is a row, would provide a more compact
visual presentation and also allow easier visual detection of anomalous states
of the sketches.
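One way such a summary matrix could be rendered is sketched below. The function name `format_matrix` and the choice of columns are hypothetical; the single example row reuses the values printed for sketch 6 above, and the per-sketch dicts are assumed to come from whatever reads the binary files.

```python
def format_matrix(rows, columns):
    """Render a list of dicts as an aligned text table, one sketch per row."""
    cells = [[str(r.get(c, "")) for c in columns] for r in rows]
    # Each column is as wide as its header or its longest value
    widths = [
        max([len(c)] + [len(row[i]) for row in cells])
        for i, c in enumerate(columns)
    ]
    lines = ["  ".join(c.ljust(w) for c, w in zip(columns, widths))]
    for row in cells:
        lines.append("  ".join(v.ljust(w) for v, w in zip(row, widths)))
    return "\n".join(line.rstrip() for line in lines)


# Example row built from the sketch 6 output shown earlier
sketch_6 = {
    "sketch": 6,
    "lgK": 14,
    "type": "HLL_4",
    "mode": "HLL",
    "estimate": f"{6562.656349859294:.1f}",
    "curMin": 0,
    "numAtCurMin": 10988,
    "outOfOrder": False,
}
table = format_matrix([sketch_6], list(sketch_6))
```

With thousands of rows, a sketch whose mode, curMin, or flags differ from its neighbors stands out in a single column scan.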
I could easily see this as a separate tool that just processes all the files
produced by the DST in the first option. A separate, independent tool has the
advantage of not polluting the DST with dependencies on the DataSketches
library.
A tighter integration of these two options would make it possible to output
just the human-readable summary matrix without having to write out all the
individual files. This would be a nice feature, but not a critical one.
My vote would be to just change the DST to allow outputting individual files
for each aggregation item of a column. I could then provide the tool that
analyzes those files for the sketches case.