gianm commented on issue #9806:
URL: https://github.com/apache/druid/issues/9806#issuecomment-625905734
> The first and simplest option would be an added feature of the DST that,
> given a column, would dump the data contents of each aggregation item as a
> separate binary file in some user-specified output directory. This has the
> advantage that the DST does not need to know any specifics of the aggregation
> item.

For this I would caution that the dump-segment tool does not actually have
access to the sketch images from the column! It only seems like it does. What
it actually has access to is the object that the relevant ComplexMetricSerde
deserializes each image into. It then takes that object and serializes it
using Jackson.
In the case of DataSketches HLL, the HllSketchMergeComplexMetricSerde
returns an HllSketch object. That object is serialized by Jackson using the
HllSketchJsonSerializer that is part of the Druid DataSketches extension, which
does the following:
```java
jgen.writeBinary(sketch.toCompactByteArray());
```
And what writeBinary does is write a Base64 string. So, that's why you get
the behavior you get.
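To get from the dumped text back to a binary sketch image, the Base64 string
has to be decoded. Here is a minimal sketch of that round trip. It assumes
Jackson's default Base64 variant (standard alphabet, no line feeds), and the
class name and the stand-in bytes are made up for illustration; real bytes
would come from `sketch.toCompactByteArray()`.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64SketchDecoder {
    // Decodes Base64 text (as written by writeBinary) back into raw bytes.
    static byte[] decode(String base64Text) {
        return Base64.getDecoder().decode(base64Text);
    }

    public static void main(String[] args) {
        // Stand-in bytes only; not a real sketch image.
        byte[] original = {1, 2, 3, 4, 5};
        String encoded = Base64.getEncoder().encodeToString(original);
        byte[] decoded = decode(encoded);
        System.out.println(encoded);
        System.out.println(Arrays.equals(original, decoded));
    }
}
```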
With regard to taking this output and splitting it up into multiple files,
my first thought is we don't need the dump-segment tool to do that, since you
could do it by composing tools in the Unix style: dump-segment to generate
a dump of the entire segment, `jq` to extract line-by-line output corresponding
to each sketch, and `split` to split into multiple files. I suppose it
wouldn't hurt to add an option to dump-segment that does this, but the Unix way
is more flexible.
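If someone would rather do the last two steps in code, a rough equivalent
could look like the following. This is only a sketch under assumptions: the
input is taken to be one Base64 string per line (for example, what a `jq`
filter over the dump might emit), and the class name, method name, and
`sketch-N.bin` file naming are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

class SketchSplitter {
    // Writes one binary file per non-blank Base64 line and returns the paths written.
    static List<Path> split(List<String> base64Lines, Path outputDir) {
        List<Path> written = new ArrayList<>();
        try {
            Files.createDirectories(outputDir);
            int index = 0;
            for (String line : base64Lines) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                // Hypothetical naming scheme: sketch-0.bin, sketch-1.bin, ...
                Path target = outputDir.resolve("sketch-" + index + ".bin");
                Files.write(target, Base64.getDecoder().decode(trimmed));
                written.add(target);
                index++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: file with one Base64 string per line; args[1]: output directory
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        List<Path> written = split(lines, Paths.get(args[1]));
        System.out.println("Wrote " + written.size() + " files");
    }
}
```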
With regard to dumping interesting "human readable" stuff, that actually
seems very useful as a new option to dump-segment. Perhaps we need a way for
ComplexMetricSerde implementations to return human-readable debugging output
for a particular object that they understand, and the dump-segment option would
use this functionality. Or perhaps we adopt a convention that calling
`toString()` should do this, and then we add an option to dump-segment to dump
using `toString()` instead of the Jackson serializer.
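
To make the `toString()` convention concrete, here is a small sketch of what
such a switch could look like. Nothing in it is existing Druid code: the
`FakeSketch` class, the `Mode` enum, and the `render` method are hypothetical
stand-ins, and the Base64 branch only imitates what the Jackson path produces
today.

```java
import java.util.Base64;

class DumpModeSketch {
    // Hypothetical stand-in for a complex metric object such as a sketch.
    static final class FakeSketch {
        private final byte[] image;
        private final long estimate;

        FakeSketch(byte[] image, long estimate) {
            this.image = image.clone();
            this.estimate = estimate;
        }

        byte[] toByteArray() {
            return image.clone();
        }

        // Under the proposed convention, toString() gives debugging output.
        @Override
        public String toString() {
            return "FakeSketch{estimate=" + estimate + ", bytes=" + image.length + "}";
        }
    }

    // Hypothetical dump option: Base64 of the bytes, or human-readable text.
    enum Mode { BASE64, TO_STRING }

    static String render(FakeSketch sketch, Mode mode) {
        if (mode == Mode.TO_STRING) {
            return sketch.toString();
        }
        return Base64.getEncoder().encodeToString(sketch.toByteArray());
    }

    public static void main(String[] args) {
        FakeSketch sketch = new FakeSketch(new byte[] {1, 2, 3}, 42L);
        System.out.println(render(sketch, Mode.BASE64));
        System.out.println(render(sketch, Mode.TO_STRING));
    }
}
```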
Any thoughts on the above @leerho?