[jira] [Commented] (HBASE-17756) We should have better introspection of HFiles

Michael Stack (Jira) Tue, 19 May 2020 11:13:33 -0700


    [ https://issues.apache.org/jira/browse/HBASE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111404#comment-17111404 ]


Michael Stack commented on HBASE-17756:
---------------------------------------

I ran the HFilePerformanceEvaluation tool. I had to disable the gz runs because 
the tool just logs getting a gz compressor all day (see below). The cost of 
writing the sketches seems small, if any. Here are the write times before the 
patch:
{code}
2020-05-18 14:18:44,519 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows.
2020-05-18 14:18:45,535 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows took 634ms.
2020-05-18 14:37:03,658 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows.
2020-05-18 14:37:04,737 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows took 1027ms.
2020-05-18 15:17:58,282 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows.
2020-05-18 15:17:59,177 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows took 859ms.
{code}

Here are the write times after the patch:
{code}
2020-05-18 17:55:59,529 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows.
2020-05-18 17:56:00,615 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows took 675ms.
2020-05-18 18:14:35,889 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows.
2020-05-18 18:14:36,986 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows took 1046ms.
2020-05-18 18:52:47,509 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows.
2020-05-18 18:52:48,426 INFO  [main] hbase.HFilePerformanceEvaluation: Running SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows took 882ms.
{code}

Only did one run.
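
To put the single-run numbers in perspective, the relative write overhead can be worked out from the "took" values in the two logs above. This is only a minimal sketch: the class and method names are made up for illustration, and only the millisecond values come from the logs.

```java
public class SketchOverhead {
    /** Relative change from beforeMs to afterMs, as a percentage of beforeMs. */
    public static double percentOverhead(long beforeMs, long afterMs) {
        return 100.0 * (afterMs - beforeMs) / beforeMs;
    }

    public static void main(String[] args) {
        // "took" values copied from the before/after logs (1000000 rows, codec[none]).
        String[] ciphers = {"none", "aes-default", "aes-commons"};
        long[] beforeMs = {634, 1027, 859};
        long[] afterMs = {675, 1046, 882};
        for (int i = 0; i < ciphers.length; i++) {
            System.out.printf("cipher[%s]: %dms -> %dms (+%dms, %.1f%%)%n",
                ciphers[i], beforeMs[i], afterMs[i],
                afterMs[i] - beforeMs[i], percentOverhead(beforeMs[i], afterMs[i]));
        }
    }
}
```

The differences are 41ms, 19ms, and 23ms respectively, somewhere between roughly 2% and 6.5%. With one run per configuration, differences of this size may well be within run-to-run noise.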

But looking at what datasketches does on each update, it seems like too much. I 
think writing sketches should be optional, at least at first:

https://github.com/apache/incubator-datasketches-java/blob/db69384f4ab206b85d7d9e26bbb5e7fd0cac78e7/src/main/java/org/apache/datasketches/quantiles/DirectUpdateDoublesSketch.java#L131
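
One way to make sketch writing optional is to gate it behind a switch that defaults to off, so the per-cell update is skipped unless an operator asks for it. The following is a minimal sketch of that idea, not what the patch does: `hfile.sketches.enabled`, `SizeRecorder`, and the classes below are hypothetical names that exist neither in HBase nor in the patch, and a plain in-memory list stands in for the DataSketches quantiles sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class OptionalSketchWriter {
    /** Hypothetical config key; not an actual HBase property. */
    public static final String SKETCHES_ENABLED_KEY = "hfile.sketches.enabled";

    /** Hypothetical hook a writer would call once per appended cell. */
    public interface SizeRecorder {
        void update(int valueLength);
        long count();
    }

    /** Used when sketches are off: update() does nothing. */
    public static final SizeRecorder NOOP = new SizeRecorder() {
        @Override public void update(int valueLength) { }
        @Override public long count() { return 0; }
    };

    /** Stand-in for a real quantiles sketch; it just remembers every value. */
    public static final class ListRecorder implements SizeRecorder {
        private final List<Integer> sizes = new ArrayList<>();
        @Override public void update(int valueLength) { sizes.add(valueLength); }
        @Override public long count() { return sizes.size(); }
    }

    /** Sketches stay off unless the operator explicitly turns them on. */
    public static SizeRecorder create(Properties conf) {
        boolean enabled =
            Boolean.parseBoolean(conf.getProperty(SKETCHES_ENABLED_KEY, "false"));
        return enabled ? new ListRecorder() : NOOP;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        SizeRecorder off = create(conf);   // default: disabled
        off.update(42);
        conf.setProperty(SKETCHES_ENABLED_KEY, "true");
        SizeRecorder on = create(conf);    // explicitly enabled
        on.update(42);
        System.out.println("off=" + off.count() + " on=" + on.count()); // off=0 on=1
    }
}
```

With the no-op recorder as the default, the write path pays only for an empty method call when sketches are disabled.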

Maybe the best thing to do, [~shahrs87], would be:

 A new UI page on a RegionServer where you can ask for a report on a Region (or 
on all Regions of a Table on this server, or on all Regions on the server). An 
operator would run it when they have a 'hot' server. It would do what is in the 
hbase-operator-tools subtask, reading the Region row-wise so it can produce the 
distribution of row sizes and the row counts, AND, because it runs in the RS 
context, it could also do the key/value sketch that is in the patch here. The 
RS already has a dedicated Region page that reports on hfiles and sizes, so 
this report could be added to it.
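
The row-wise part of such a report could look roughly like the following: walk the cells in row order, sum the cell sizes per row, and keep a row count plus simple size statistics. This is a simplified, self-contained sketch under stated assumptions: `Cell` here is a hypothetical stand-in holding only a row key and a size (not HBase's `Cell` interface), and the report tracks only count, min, max, and total rather than a quantiles sketch.

```java
import java.util.Arrays;
import java.util.List;

public class RowSizeReport {
    /** Hypothetical stand-in for a cell: just its row key and serialized size. */
    public static final class Cell {
        final String row;
        final long size;
        public Cell(String row, long size) { this.row = row; this.size = size; }
    }

    public long rowCount;
    public long totalSize;
    public long minRowSize = Long.MAX_VALUE;
    public long maxRowSize;

    /** Cells must arrive sorted by row, as they would when reading a Region row-wise. */
    public static RowSizeReport scan(List<Cell> cellsInRowOrder) {
        RowSizeReport report = new RowSizeReport();
        String currentRow = null;
        long currentSize = 0;
        for (Cell cell : cellsInRowOrder) {
            // A change of row key closes out the previous row.
            if (currentRow != null && !currentRow.equals(cell.row)) {
                report.addRow(currentSize);
                currentSize = 0;
            }
            currentRow = cell.row;
            currentSize += cell.size;
        }
        if (currentRow != null) {
            report.addRow(currentSize); // flush the last row
        }
        return report;
    }

    private void addRow(long rowSize) {
        rowCount++;
        totalSize += rowSize;
        minRowSize = Math.min(minRowSize, rowSize);
        maxRowSize = Math.max(maxRowSize, rowSize);
    }

    public static void main(String[] args) {
        RowSizeReport r = scan(Arrays.asList(
            new Cell("a", 10), new Cell("a", 20), new Cell("b", 5)));
        // prints: rows=2 min=5 max=30 total=35
        System.out.println("rows=" + r.rowCount + " min=" + r.minRowSize
            + " max=" + r.maxRowSize + " total=" + r.totalSize);
    }
}
```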

Here is how I turned off gz'ing in the tool:

{code}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
index 4e9a39f2b5..0360351a98 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
@@ -113,8 +113,8 @@ public class HFilePerformanceEvaluation {
     runReadBenchmark(conf, fs, mf, "none", "none");

     // codec=gz cipher=none
-    runWriteBenchmark(conf, fs, mf, "gz", "none");
-    runReadBenchmark(conf, fs, mf, "gz", "none");
+    // runWriteBenchmark(conf, fs, mf, "gz", "none");
+    // runReadBenchmark(conf, fs, mf, "gz", "none");

     // Add configuration for AES cipher
     final Configuration aesconf = new Configuration();
@@ -129,8 +129,8 @@ public class HFilePerformanceEvaluation {
     runReadBenchmark(aesconf, aesfs, aesmf, "none", "aes");

     // codec=gz cipher=aes
-    runWriteBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
-    runReadBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
+    //runWriteBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
+    //runReadBenchmark(aesconf, aesfs, aesmf, "gz", "aes");

     // Add configuration for Commons cipher
     final Configuration cryptoconf = new Configuration();
@@ -146,8 +146,8 @@ public class HFilePerformanceEvaluation {
     runReadBenchmark(cryptoconf, cryptofs, aesmf, "none", "aes");

     // codec=gz cipher=aes
-    runWriteBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
-    runReadBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
+    // runWriteBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
+    // runReadBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");

     // cleanup test files
     if (fs.exists(mf)) {
{code}

> We should have better introspection of HFiles
> ---------------------------------------------
>
>                 Key: HBASE-17756
>                 URL: https://issues.apache.org/jira/browse/HBASE-17756
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: HFile
>            Reporter: Esteban Gutierrez
>            Assignee: Rushabh Shah
>            Priority: Major
>
> [~saint....@gmail.com] was suggesting to use DataSketches 
> (https://datasketches.github.io) in order to write additional statistics to 
> the HFiles. This could be used to improve our split decisions, 
> troubleshooting or potentially do other interesting analysis without having 
> to perform full table scans. The statistics could be stored as part of the 
> HFile but we could initially improve the visibility of the data by adding 
> some statistics to HFilePrettyPrinter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
