[ 
https://issues.apache.org/jira/browse/HBASE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107652#comment-17107652
 ] 

Michael Stack commented on HBASE-17756:
---------------------------------------

[~shahrs87] that'd be great.

I think you want to inline the writing of the DataSketches data structures 
each time we write an hfile (flush, compaction).

First, I'd check that writing the sketches doesn't slow us down significantly.  
It should be easy enough to figure this out. There are command-line tools for 
reading/writing hfiles. Could do a before/after comparison.

There is an hfile meta Map that can be freely added to. It is serialized out on 
the tail of hfiles. It already has some file meta data. See the hfile pretty 
printer for what is there currently.

If perf is good, then add sketching of simple stuff like cell 
size/distribution. Histograms. Row cardinality and/or column cardinality would 
be good too.  The sketch would then be added as a Map entry and then serialized 
out as part of the hfile close. Could even do a sketch per attribute, with an 
entry per attribute in the Map, so it is easy to extend. Then add to the pretty 
printer the dumping of the sketch info.
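
A rough sketch of the "one Map entry per attribute" idea, just to make it 
concrete. Everything here is an assumption for illustration: the 
CellSizeHistogram class is a toy power-of-two histogram standing in for a real 
DataSketches structure, the "SKETCH." key prefix is made up, and the TreeMap 
is a plain stand-in for the hfile meta Map, not the HBase API.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a real sketch: power-of-two histogram of cell sizes. */
final class CellSizeHistogram {
  static final int BUCKETS = 32;
  private final long[] counts = new long[BUCKETS];

  /** Bucket i counts sizes in [2^i, 2^(i+1)); sizes of 1 or less land in bucket 0. */
  void update(int cellSize) {
    int bucket = cellSize <= 1 ? 0 : 31 - Integer.numberOfLeadingZeros(cellSize);
    counts[bucket]++;
  }

  long count(int bucket) {
    return counts[bucket];
  }

  /** Serializes the bucket counts as BUCKETS big-endian longs. */
  byte[] toBytes() {
    ByteBuffer buf = ByteBuffer.allocate(BUCKETS * Long.BYTES);
    for (long c : counts) {
      buf.putLong(c);
    }
    return buf.array();
  }

  /** Reads back what toBytes() wrote. */
  static CellSizeHistogram fromBytes(byte[] bytes) {
    CellSizeHistogram h = new CellSizeHistogram();
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    for (int i = 0; i < BUCKETS; i++) {
      h.counts[i] = buf.getLong();
    }
    return h;
  }
}

final class SketchMetaExample {
  /** Hypothetical key prefix; one Map entry per sketched attribute. */
  static final String PREFIX = "SKETCH.";

  /** Feeds every cell size into the histogram, then stores it under its own key. */
  static Map<String, byte[]> buildMeta(int[] cellSizes) {
    CellSizeHistogram hist = new CellSizeHistogram();
    for (int size : cellSizes) {
      hist.update(size);
    }
    Map<String, byte[]> meta = new TreeMap<>();
    meta.put(PREFIX + "cellSize", hist.toBytes());
    // Further attributes (row cardinality, etc.) would each get their own key here.
    return meta;
  }
}
```

Keeping each attribute under its own key means a reader that doesn't know a 
given sketch can skip it, and new sketches can be added without touching the 
old entries.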

Later we could figure out how to do a region view with a command-line tool 
that just summed the sketches of the hfiles.  It might even have an option for 
scanning rows in the Region to produce row-based sketches.

On region open, could sum all the hfile sketches and dump findings in UI in the 
Region view area?
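
For fixed-bucket histograms the "sum" is literally bucket-wise addition, as in 
the minimal sketch below. The RegionSketchSum class and its long[]-per-hfile 
input are assumptions for illustration only, not anything in HBase. Note that 
cardinality sketches can't be added this way, since the same row can show up 
in several hfiles; those would have to go through the sketch library's own 
merge/union operation instead.

```java
import java.util.List;

final class RegionSketchSum {
  /**
   * Sums fixed-bucket histograms, one long[] per hfile, into a single
   * region-level histogram. All inputs must use the same bucket layout.
   */
  static long[] sum(List<long[]> perFileCounts, int buckets) {
    long[] total = new long[buckets];
    for (long[] counts : perFileCounts) {
      if (counts.length != buckets) {
        throw new IllegalArgumentException(
            "expected " + buckets + " buckets but got " + counts.length);
      }
      for (int i = 0; i < buckets; i++) {
        total[i] += counts[i];
      }
    }
    return total;
  }
}
```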

This is enough for a start I'd say.

Oh, be careful w/ output. Make it so it is easy to feed to something like 
gnuplot... So folks can graph if they want.
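
One possible shape for that output, assuming a power-of-two cell size 
histogram where bucket i starts at 2^i bytes: whitespace-separated columns, 
one row per bucket, with a '#' header line, which gnuplot treats as a comment 
in data files. The GnuplotDump class and the column names are made up for 
illustration.

```java
final class GnuplotDump {
  /**
   * Formats bucket counts as two whitespace-separated columns:
   * the bucket's lower bound in bytes (2^i) and the count.
   */
  static String format(long[] counts) {
    StringBuilder sb = new StringBuilder("# bucket_lower_bound_bytes count\n");
    for (int i = 0; i < counts.length; i++) {
      // Empty buckets are kept so gaps in the distribution show up in the plot.
      sb.append(1L << i).append(' ').append(counts[i]).append('\n');
    }
    return sb.toString();
  }
}
```

If that text were saved to a file, say cellsize.dat (a hypothetical name), 
something like `plot "cellsize.dat" using 1:2 with boxes` should be enough to 
graph it.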

> We should have better introspection of HFiles
> ---------------------------------------------
>
>                 Key: HBASE-17756
>                 URL: https://issues.apache.org/jira/browse/HBASE-17756
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: HFile
>            Reporter: Esteban Gutierrez
>            Priority: Major
>
> [[email protected]] was suggesting to use DataSketches 
> (https://datasketches.github.io) in order to write additional statistics to 
> the HFiles. This could be used to improve our split decisions, 
> troubleshooting or potentially do other interesting analysis without having 
> to perform full table scans. The statistics could be stored as part of the 
> HFile but we could initially improve the visibility of the data by adding 
> some statistics to HFilePrettyPrinter.


