gianm commented on issue #9806:
URL: https://github.com/apache/druid/issues/9806#issuecomment-625905734


   > The first and simplest option would be an added feature of the DST that, 
given a column, would dump the data contents of each aggregation item as a 
separate binary file in some user-specified output directory. This has the 
advantage that the DST does not need to know any specifics of the aggregation 
item.
   
   For this I would caution that the dump-segment tool actually does not have 
access to the sketch images from the column! It only seems like it does. What 
it actually has access to is the thing that the relevant ComplexMetricSerde 
deserializes it into. And what it does is then take that object and serialize 
it using Jackson.
   
   In the case of DataSketches HLL, the HllSketchMergeComplexMetricSerde 
returns an HllSketch object. Jackson then serializes that object using the 
HllSketchJsonSerializer that is part of the Druid DataSketches extension, which 
does the following:
   
   ```java
       jgen.writeBinary(sketch.toCompactByteArray());
   ```
   
   And what `writeBinary` does is write the bytes as a Base64 string. That's 
why the sketches show up as Base64 text in the dump rather than as raw images.
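   
   To get from that Base64 text back to the bytes of the sketch image, the 
string has to be decoded again. Here is a minimal sketch using only the JDK. It 
assumes the dumped value is plain, padded Base64 (which, as far as I know, is 
what Jackson's default variant writes). The class name `SketchImageDecoder` and 
the four placeholder bytes are made up for illustration; they are not a real 
sketch.
   
   ```java
   import java.util.Base64;

   // Hypothetical helper, not part of Druid or DataSketches.
   class SketchImageDecoder {

       // Turns the Base64 text found in the dump back into raw bytes.
       static byte[] decode(String base64) {
           return Base64.getDecoder().decode(base64);
       }

       public static void main(String[] args) {
           // Placeholder bytes standing in for sketch.toCompactByteArray().
           byte[] placeholder = {1, 2, 3, 4};

           // Roughly what ends up in the dump: the bytes as a Base64 string.
           String dumped = Base64.getEncoder().encodeToString(placeholder);

           byte[] image = decode(dumped);
           System.out.println(dumped + " decodes to " + image.length + " bytes");
       }
   }
   ```
   
   The decoded bytes should then be something a DataSketches reader can 
consume, though I have not verified that end to end here.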
   
   With regard to taking this output and splitting it up into multiple files, 
my first thought is we don't need the dump-segment tool to do that, since you 
could do it by composing a few tools in the Unix style: dump-segment to 
generate a dump of the entire segment, `jq` to extract line-by-line output 
corresponding to each sketch, and `split` to split that into multiple files. I 
suppose it wouldn't hurt to add an option to dump-segment that does this, but 
the Unix way is more flexible.
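   
   If the last step should produce binary files rather than Base64 text, the 
splitting can also be sketched in a few lines of Java. This assumes an earlier 
step (for example `jq`) has already produced a file with one Base64 string per 
line. The class name `SketchSplitter` and the `sketch-N.bin` naming scheme are 
made up for illustration.
   
   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.Paths;
   import java.util.ArrayList;
   import java.util.Base64;
   import java.util.List;

   // Hypothetical helper, not part of dump-segment.
   class SketchSplitter {

       // Decodes each non-empty Base64 line and writes it to its own file.
       static List<Path> split(List<String> base64Lines, Path outputDir) throws IOException {
           Files.createDirectories(outputDir);
           List<Path> written = new ArrayList<>();
           int index = 0;
           for (String line : base64Lines) {
               String trimmed = line.trim();
               if (trimmed.isEmpty()) {
                   continue; // skip blank lines
               }
               Path target = outputDir.resolve("sketch-" + index + ".bin");
               Files.write(target, Base64.getDecoder().decode(trimmed));
               written.add(target);
               index++;
           }
           return written;
       }

       public static void main(String[] args) throws IOException {
           // args[0]: file with one Base64 string per line; args[1]: output directory
           List<String> lines = Files.readAllLines(Paths.get(args[0]));
           List<Path> files = split(lines, Paths.get(args[1]));
           System.out.println("wrote " + files.size() + " files");
       }
   }
   ```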
   
   With regard to dumping interesting "human readable" stuff, that actually 
seems very useful as a new option to dump-segment. Perhaps we need a way for 
ComplexMetricSerde implementations to return human-readable debugging output 
for a particular object that they understand, and the dump-segment option would 
use this functionality. Or perhaps we adopt a convention that calling 
`toString()` should do this, and then we add an option to dump-segment to dump 
using `toString()` instead of the Jackson serializer.
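   
   To make the two alternatives concrete, here is a rough sketch of how the 
dump code could choose between them. Everything in it is hypothetical: neither 
the `DebugDescriber` interface nor the `DebugDump` class exists in Druid today, 
and a real version would presumably hang off the serde rather than stand alone.
   
   ```java
   // Hypothetical: a hook a serde could provide for human-readable output.
   interface DebugDescriber {
       String describe(Object value);
   }

   // Hypothetical: how a dump option might pick between the two approaches.
   class DebugDump {

       // Uses the serde-provided describer if there is one, otherwise falls
       // back to the toString() convention.
       static String dump(Object value, DebugDescriber describer) {
           if (describer != null) {
               return describer.describe(value);
           }
           return String.valueOf(value);
       }

       public static void main(String[] args) {
           // Stand-in for a deserialized complex metric object.
           Object value = new StringBuilder("placeholder sketch");

           // First alternative: the serde knows how to describe its objects.
           DebugDescriber describer = v -> "length=" + v.toString().length();
           System.out.println(dump(value, describer));

           // Second alternative: rely on toString().
           System.out.println(dump(value, null));
       }
   }
   ```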
   
   Any thoughts on the above @leerho?




