[ https://issues.apache.org/jira/browse/HBASE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107652#comment-17107652 ]
Michael Stack commented on HBASE-17756:
---------------------------------------
[~shahrs87] that'd be great.
I think you want to inline writing to the DataSketches data structures each
time we write an hfile (flush, compaction).
First, I'd check that writing the sketches doesn't slow us down significantly.
It should be easy enough figuring this out. There are command-line tools for
reading/writing hfiles. Could do before/after.
There is an hfile meta Map that can be freely added to. It is serialized out on
the tail of hfiles. It already has some file meta data. See the hfile pretty
printer for what is there currently.
If perf is good, then add sketching simple stuff like cell size/distribution.
Histograms. Row cardinality and/or column cardinality would be good too. The
sketch would then be added as a Map entry and then serialized out as part of
the hfile close. Could even do a sketch per attribute with an entry per
attribute in the Map so easy to extend. Then add to pretty printer the dumping
of the sketch info.
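
A minimal sketch of the one-entry-per-attribute idea, under loud assumptions: a plain TreeMap stands in for the hfile meta Map, a power-of-two bucket histogram stands in for a real DataSketches sketch, and the key name SKETCH.CELL_SIZE is made up for illustration. None of these names are existing HBase or DataSketches API.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

/** Stand-in for a real sketch: a power-of-two bucket histogram of cell sizes. */
final class CellSizeHistogram {
    static final int BUCKETS = 32;
    final long[] counts = new long[BUCKETS];

    /** Bucket i counts sizes in [2^i, 2^(i+1)); sizes below 2 land in bucket 0. */
    void update(int cellSizeBytes) {
        int bucket = cellSizeBytes <= 1 ? 0 : 31 - Integer.numberOfLeadingZeros(cellSizeBytes);
        counts[bucket]++;
    }

    /** Serialize as BUCKETS big-endian longs so it can be stored as a map value. */
    byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(BUCKETS * Long.BYTES);
        for (long c : counts) {
            buf.putLong(c);
        }
        return buf.array();
    }

    static CellSizeHistogram fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        CellSizeHistogram h = new CellSizeHistogram();
        for (int i = 0; i < BUCKETS; i++) {
            h.counts[i] = buf.getLong();
        }
        return h;
    }
}

final class FileMetaSketchDemo {
    // Hypothetical key name; one key per sketched attribute keeps this easy to extend.
    static final String CELL_SIZE_KEY = "SKETCH.CELL_SIZE";

    public static void main(String[] args) {
        // Stand-in for the hfile meta Map that is serialized at file close.
        Map<String, byte[]> fileMeta = new TreeMap<>();

        // Update the histogram inline as cells are written.
        CellSizeHistogram hist = new CellSizeHistogram();
        int[] cellSizes = {10, 100, 100, 5000};
        for (int size : cellSizes) {
            hist.update(size);
        }

        // At "close", add the serialized sketch as a single map entry.
        fileMeta.put(CELL_SIZE_KEY, hist.toBytes());

        // A pretty printer would read the entry back and dump it.
        CellSizeHistogram readBack = CellSizeHistogram.fromBytes(fileMeta.get(CELL_SIZE_KEY));
        for (int i = 0; i < CellSizeHistogram.BUCKETS; i++) {
            if (readBack.counts[i] > 0) {
                System.out.println("bucket starting at " + (1L << i) + " bytes: " + readBack.counts[i]);
            }
        }
    }
}
```

A row or column cardinality sketch would get its own key and its own serialized value in the same way.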
Later we could figure how to do a region view with a command-line tool that
just summed the sketches of the hfiles. It might even have an option for
scanning rows in Region to produce row-based sketches.
On region open, could sum all the hfile sketches and dump findings in UI in the
Region view area?
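
Summing per-hfile sketches into a region view could look roughly like the following. This assumes each hfile's sketch is a histogram held as a long[] of bucket counts with the same bucket layout in every file; the class and method names are hypothetical. Plain summing only works for additive things like histograms. A row can show up in several hfiles, so cardinality sketches need the sketch library's own union/merge operation rather than adding up estimates.

```java
import java.util.Arrays;
import java.util.List;

final class RegionSketchMerge {
    /** Sum bucket counts across files; every file must use the same bucket layout. */
    static long[] merge(List<long[]> perFileCounts) {
        if (perFileCounts.isEmpty()) {
            return new long[0];
        }
        long[] total = new long[perFileCounts.get(0).length];
        for (long[] counts : perFileCounts) {
            if (counts.length != total.length) {
                throw new IllegalArgumentException("bucket layouts differ between files");
            }
            for (int i = 0; i < total.length; i++) {
                total[i] += counts[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up counts for two hfiles in one region.
        long[] fileA = {1, 0, 2, 5};
        long[] fileB = {0, 3, 2, 1};
        System.out.println(Arrays.toString(merge(Arrays.asList(fileA, fileB))));
    }
}
```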
This is enough for a start I'd say.
Oh, be careful w/ output. Make it so it is easy to feed to something like
gnuplot... So folks can graph if they want.
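
One way the gnuplot-friendly output might look: whitespace-separated columns, one bucket per line, with a header line starting with '#' (which gnuplot treats as a comment in data files). The class name, column names and power-of-two bucket layout are assumptions for illustration only.

```java
final class GnuplotHistogramDump {
    /** Format bucket counts as "lower_bound count" lines; bucket i starts at 2^i bytes. */
    static String format(long[] counts) {
        StringBuilder sb = new StringBuilder();
        sb.append("# lower_bound_bytes count\n");
        for (int i = 0; i < counts.length; i++) {
            sb.append(1L << i).append(' ').append(counts[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up counts; a real tool would read them from the hfile meta.
        long[] counts = {0, 3, 1, 7};
        System.out.print(format(counts));
    }
}
```

Saved to a file such as hist.dat, that should plot with something along the lines of: plot "hist.dat" using 1:2 with boxes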
> We should have better introspection of HFiles
> ---------------------------------------------
>
> Key: HBASE-17756
> URL: https://issues.apache.org/jira/browse/HBASE-17756
> Project: HBase
> Issue Type: Brainstorming
> Components: HFile
> Reporter: Esteban Gutierrez
> Priority: Major
>
> [[email protected]] was suggesting to use DataSketches
> (https://datasketches.github.io) in order to write additional statistics to
> the HFiles. This could be used to improve our split decisions,
> troubleshooting or potentially do other interesting analysis without having
> to perform full table scans. The statistics could be stored as part of the
> HFile but we could initially improve the visibility of the data by adding
> some statistics to HFilePrettyPrinter.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)