[ https://issues.apache.org/jira/browse/HBASE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111404#comment-17111404 ]
Michael Stack commented on HBASE-17756:
---------------------------------------

I ran the HFilePerformanceEvaluation tool... Had to disable the gz bit because it just logs getting a gz compressor all day (see below). Seems like the cost of writing the sketches is small, if any. Here are the write times before the patch:

{code}
2020-05-18 14:18:44,519 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows.
2020-05-18 14:18:45,535 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows took 634ms.
2020-05-18 14:37:03,658 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows.
2020-05-18 14:37:04,737 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows took 1027ms.
2020-05-18 15:17:58,282 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows.
2020-05-18 15:17:59,177 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows took 859ms.
2020-05-18 15:40:34,594 INFO [main] hbase.HFilePerformanceEvaluation: ntialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows took 1027ms. Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows took 859ms.
{code}

Here are the write times after the patch:

{code}
2020-05-18 17:55:59,529 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows.
2020-05-18 17:56:00,615 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows took 675ms.
2020-05-18 18:14:35,889 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows.
2020-05-18 18:14:36,986 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows took 1046ms.
2020-05-18 18:52:47,509 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows.
2020-05-18 18:52:48,426 INFO [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows took 882ms.
2020-05-18 19:15:35,734 INFO [main] hbase.HFilePerformanceEvaluation: ntialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows took 1046ms. Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows took 882ms.
{code}

Only did one run. But looking at what datasketches does on each update, it seems like too much. I think writing sketches should be optional, at least at first:
https://github.com/apache/incubator-datasketches-java/blob/db69384f4ab206b85d7d9e26bbb5e7fd0cac78e7/src/main/java/org/apache/datasketches/quantiles/DirectUpdateDoublesSketch.java#L131

Maybe the best thing to do, [~shahrs87], would be a new UI page on a RegionServer where you can ask for a report on a Region (or on all Regions of a Table on this server, or on all Regions on the server). An operator would run it when they have a 'hot' server. It would do what is in the hbase-operator-tools subtask, reading the Region row-wise so it can report the distribution of row sizes and the row counts, and because it runs in the RS context, it could also do the key/value sketch that is in the patch here. There is already a dedicated Region page in the RS UI that reports on hfiles and sizes; you could add this report to it.

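To make the row-wise part of such a report concrete, here is a rough sketch in plain Java. It is not from the patch or from hbase-operator-tools and uses no HBase APIs: `RowSizeReport`, its methods, and the power-of-two bucketing are all assumptions made for illustration. It only shows the idea of walking cells in row order, summing bytes per row, and keeping a row count plus a coarse size histogram.

```java
// Illustrative only: RowSizeReport is a made-up name and uses no HBase APIs.
class RowSizeReport {
    private long rowCount = 0;
    // buckets[i] counts rows whose total size is in [2^i, 2^(i+1)) bytes.
    private final long[] buckets = new long[64];
    private String currentRow = null;
    private long currentRowBytes = 0;

    /** Feed one cell; cells must arrive grouped by row, as in a row-wise read. */
    void add(String row, long cellSizeBytes) {
        if (currentRow != null && !currentRow.equals(row)) {
            flushRow();
        }
        currentRow = row;
        currentRowBytes += cellSizeBytes;
    }

    /** Call once after the last cell so the final row is counted. */
    void finish() {
        if (currentRow != null) {
            flushRow();
            currentRow = null;
        }
    }

    private void flushRow() {
        rowCount++;
        buckets[bucketFor(currentRowBytes)]++;
        currentRowBytes = 0;
    }

    /** floor(log2(size)) for size >= 1; anything smaller lands in bucket 0. */
    static int bucketFor(long sizeBytes) {
        return sizeBytes < 1 ? 0 : 63 - Long.numberOfLeadingZeros(sizeBytes);
    }

    long rowCount() { return rowCount; }

    long rowsInBucket(int i) { return buckets[i]; }

    public static void main(String[] args) {
        RowSizeReport report = new RowSizeReport();
        report.add("row-a", 100);
        report.add("row-a", 50);
        report.add("row-b", 1024);
        report.finish();
        System.out.println("rows=" + report.rowCount());
        for (int i = 0; i < 64; i++) {
            if (report.rowsInBucket(i) > 0) {
                System.out.println("[2^" + i + ", 2^" + (i + 1) + ") bytes: " + report.rowsInBucket(i));
            }
        }
    }
}
```

A fixed histogram like this is cheap per cell, which is the point of running it on demand; a quantiles sketch could replace the buckets where finer percentiles are wanted.
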
Here is how I turned off gz'ing in the tool:

{code}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
index 4e9a39f2b5..0360351a98 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
@@ -113,8 +113,8 @@ public class HFilePerformanceEvaluation {
     runReadBenchmark(conf, fs, mf, "none", "none");
 
     // codec=gz cipher=none
-    runWriteBenchmark(conf, fs, mf, "gz", "none");
-    runReadBenchmark(conf, fs, mf, "gz", "none");
+    // runWriteBenchmark(conf, fs, mf, "gz", "none");
+    // runReadBenchmark(conf, fs, mf, "gz", "none");
 
     // Add configuration for AES cipher
     final Configuration aesconf = new Configuration();
@@ -129,8 +129,8 @@ public class HFilePerformanceEvaluation {
     runReadBenchmark(aesconf, aesfs, aesmf, "none", "aes");
 
     // codec=gz cipher=aes
-    runWriteBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
-    runReadBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
+    //runWriteBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
+    //runReadBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
 
     // Add configuration for Commons cipher
     final Configuration cryptoconf = new Configuration();
@@ -146,8 +146,8 @@ public class HFilePerformanceEvaluation {
     runReadBenchmark(cryptoconf, cryptofs, aesmf, "none", "aes");
 
     // codec=gz cipher=aes
-    runWriteBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
-    runReadBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
+    // runWriteBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
+    // runReadBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
 
     // cleanup test files
     if (fs.exists(mf)) {
{code}

> We should have better introspection of HFiles
> ---------------------------------------------
>
> Key: HBASE-17756
> URL: https://issues.apache.org/jira/browse/HBASE-17756
> Project: HBase
> Issue Type: Brainstorming
> Components: HFile
> Reporter: Esteban Gutierrez
> Assignee: Rushabh Shah
> Priority: Major
>
> [~saint....@gmail.com] was suggesting to use DataSketches
> (https://datasketches.github.io) in order to write additional statistics to
> the HFiles. This could be used to improve our split decisions,
> troubleshooting or potentially do other interesting analysis without having
> to perform full table scans. The statistics could be stored as part of the
> HFile but we could initially improve the visibility of the data by adding
> some statistics to HFilePrettyPrinter.

--
This message was sent by Atlassian Jira (v8.3.4#803005)