leerho commented on issue #9806:
URL: https://github.com/apache/druid/issues/9806#issuecomment-622988699


   @gianm 
   
   Yes and Yes.  I can envision a couple of ways to do this.  
   
   The first and simplest option would be an added feature of the DST that, 
given a column, dumps the data contents of each aggregation item as a 
separate binary file in a user-specified output directory.  This has the 
advantage that the DST does not need to know any specifics of the aggregation 
item.  This was the first thing I tried: I decoded the base64 rows from 
the DST output and wrote a bin file for each one.  For the segment I was 
given, this produced 47 binary files.  The disadvantage is that it doesn't 
communicate anything about the internal state of those binary files, which in 
our case represent sketches.  
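
   A minimal Python sketch of that first step, assuming the base64 strings 
for one column have already been pulled out of the DST output (how they are 
extracted is not shown here).  The function name `dump_items` and the 
`item_NNNNN.bin` file naming are hypothetical, chosen only for illustration.

```python
import base64
from pathlib import Path


def dump_items(b64_rows, out_dir, prefix="item"):
    """Decode each base64 string and write it to its own .bin file.

    b64_rows: iterable of base64 strings, one per aggregation item
              (assumed to be already extracted from the DST output).
    Returns the list of paths written, in input order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, row in enumerate(b64_rows, start=1):
        # validate=True rejects characters outside the base64 alphabet
        data = base64.b64decode(row.strip(), validate=True)
        path = out / f"{prefix}_{i:05d}.bin"
        path.write_bytes(data)
        paths.append(path)
    return paths
```

   Nothing here inspects the decoded bytes, which matches the point above: 
this step needs no knowledge of what the aggregation item is.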
   
   The second option would add functionality, somewhere, that uses the 
DataSketches library to read those binary files and produce human-readable 
output like the following for each one.  This is the output from the 6th of the 
47 sketches in the segment:
   
   ```
   HllSketch 6
   
   ### HLL SKETCH PREAMBLE:
   Byte 0: Preamble Ints         : 10
   Byte 1: SerVer                : 1
   Byte 2: Family                : HLL
   Byte 3: lgK                   : 14
   Byte 4: LgArr or Aux LgArr    : 0
   Byte 5: Flags:                : 00001000, 8
     BIG_ENDIAN_STORAGE          : false
     (Native Byte Order)         : LITTLE_ENDIAN
     READ_ONLY                   : false
     EMPTY                       : false
     COMPACT                     : true
     OUT_OF_ORDER                : false
     REBUILD_KXQ                 : false
   Byte 6: Cur Min               : 0
   Byte 7: Mode                  : HLL, HLL_4
   HIP Accum                     : 6562.656349859294
   KxQ0                          : 12685.76611328125
   KxQ1                          : 0.0
   Num At Cur Min                : 10988
   Aux Count                     : 0
   ### END HLL SKETCH PREAMBLE
   
   ### HLL SKETCH SUMMARY: 
     Log Config K   : 14
     Hll Target     : HLL_4
     Current Mode   : HLL
     Memory         : false
     LB             : 6520.24649606101
     Estimate       : 6562.656349859294
     UB             : 6605.621511177209
     OutOfOrder Flag: false
     CurMin         : 0
     NumAtCurMin    : 10988
     HipAccum       : 6562.656349859294
     KxQ0           : 12685.76611328125
     KxQ1           : 0.0
     Rebuild KxQ Flg: false
   ```
   
   However, for the Druid use case where we could be dealing with potentially 
thousands of sketches in a large segment, reformatting the above information 
into a summary matrix, where each sketch is a row, would provide a more compact 
visual presentation and also allow easier visual detection of anomalous states 
of the sketches.  
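
   As a rough illustration of the one-row-per-sketch idea, the Python sketch 
below reads only the first eight bytes of each `.bin` file and prints them as 
a tab-separated matrix.  The byte positions are taken from the preamble dump 
shown above, and the COMPACT bit (8) is inferred from that same dump; 
everything else is an assumption.  It does not use the DataSketches library, 
so values such as LB, Estimate and UB are out of its reach, and the function 
names are hypothetical.

```python
from pathlib import Path

# Byte positions 0..7, named after the preamble dump above.
FIELDS = ("pre_ints", "ser_ver", "family", "lg_k",
          "lg_arr", "flags", "cur_min", "mode")

# The dump shows flags 00001000 with only COMPACT set to true.
COMPACT_MASK = 0x08


def preamble_row(data):
    """Return the first eight preamble bytes of one sketch as a dict."""
    if len(data) < len(FIELDS):
        raise ValueError("fewer than 8 bytes; not a full preamble")
    row = dict(zip(FIELDS, data[:len(FIELDS)]))  # bytes iterate as ints
    row["compact"] = bool(row["flags"] & COMPACT_MASK)
    return row


def format_matrix(bin_dir):
    """Build a tab-separated matrix, one row per .bin file in bin_dir."""
    lines = ["\t".join(["file", *FIELDS, "compact"])]
    for path in sorted(Path(bin_dir).glob("*.bin")):
        row = preamble_row(path.read_bytes())
        cells = [path.name]
        for name in FIELDS:
            value = row[name]
            # Show flags as bits, as in the dump; other bytes as raw integers.
            cells.append(f"{value:08b}" if name == "flags" else str(value))
        cells.append(str(row["compact"]))
        lines.append("\t".join(cells))
    return "\n".join(lines)
```

   With thousands of rows, a column such as `lg_k` or `flags` that deviates 
from its neighbours stands out immediately, which is the kind of anomaly 
detection described above.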
   
   I could easily see this as a separate tool that just processes all the files 
produced by the DST in the first option.  A separate, independent tool has the 
advantage of not polluting the dependencies of the DST with dependencies on the 
DataSketches library.  
   
   A tighter integration of these two options would make it possible to output 
just the human-readable summary matrix without having to output all the 
individual files.  This would be a nice feature, but not a critical one.  
   
   My vote would be to just change the DST to allow outputting individual files 
for each aggregation item of a column.  I could then provide the tool that 
analyzes those files for the sketches case.
   
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


