[
https://issues.apache.org/jira/browse/HBASE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111404#comment-17111404
]
Michael Stack commented on HBASE-17756:
---------------------------------------
I ran the HFilePerformanceEvaluation tool. I had to disable the gz runs because
they just log getting a gz compressor all day (see below). It seems the cost of
writing the sketches is small, if any. Here are the write times before the patch:
{code}
2020-05-18 14:18:44,519 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows.
2020-05-18 14:18:45,535 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows took
634ms.
2020-05-18 14:37:03,658 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows.
2020-05-18 14:37:04,737 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows
took 1027ms.
2020-05-18 15:17:58,282 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows.
2020-05-18 15:17:59,177 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows
took 859ms.
{code}
Here are the write times after the patch:
{code}
2020-05-18 17:55:59,529 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows.
2020-05-18 17:56:00,615 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[none] for 1000000 rows took
675ms.
2020-05-18 18:14:35,889 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows.
2020-05-18 18:14:36,986 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-default] for 1000000 rows
took 1046ms.
2020-05-18 18:52:47,509 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows.
2020-05-18 18:52:48,426 INFO [main] hbase.HFilePerformanceEvaluation: Running
SequentialWriteBenchmark with codec[none] cipher[aes-commons] for 1000000 rows
took 882ms.
{code}
Only did one run.
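To put the two sets of timings side by side, here is a small sketch that computes the relative slowdown for each cipher setting. The class and method names are made up for illustration; the numbers are copied from the logs above, and with a single run per configuration the differences may well be noise.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WriteOverhead {
  /** Relative slowdown of afterMs versus beforeMs, as a percentage. */
  static double overheadPercent(long beforeMs, long afterMs) {
    return 100.0 * (afterMs - beforeMs) / beforeMs;
  }

  public static void main(String[] args) {
    // {before, after} SequentialWriteBenchmark times in ms, copied from the logs above.
    Map<String, long[]> runs = new LinkedHashMap<>();
    runs.put("cipher[none]", new long[] {634, 675});
    runs.put("cipher[aes-default]", new long[] {1027, 1046});
    runs.put("cipher[aes-commons]", new long[] {859, 882});
    for (Map.Entry<String, long[]> e : runs.entrySet()) {
      long[] t = e.getValue();
      System.out.printf("%s: %dms -> %dms (%+.1f%%)%n",
          e.getKey(), t[0], t[1], overheadPercent(t[0], t[1]));
    }
  }
}
```

That works out to roughly 6.5%, 1.9%, and 2.7% for the none, aes-default, and aes-commons runs respectively.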
But looking at what datasketches does on each update, it seems like too much. I
think writing sketches should be optional, at least at first. The update path:
https://github.com/apache/incubator-datasketches-java/blob/db69384f4ab206b85d7d9e26bbb5e7fd0cac78e7/src/main/java/org/apache/datasketches/quantiles/DirectUpdateDoublesSketch.java#L131
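One way "optional" could look is a switch that defaults to off and hands the writer a no-op collector unless sketches are asked for. This is only a sketch under assumptions: the property name is hypothetical (not an existing HBase key), plain `java.util.Properties` stands in for the Hadoop `Configuration`, and `CountingStats` is a placeholder for a real sketch-backed collector.

```java
import java.util.Properties;

class SketchWriterGate {
  // Hypothetical property name, not an existing HBase configuration key.
  static final String SKETCHES_ENABLED_KEY = "hfile.sketches.enabled";

  /** Sink for per-cell statistics on the write path. */
  interface CellStats {
    void update(int keyLength, int valueLength);
    long count();
  }

  /** Used when sketches are off: each update does nothing. */
  static final CellStats NOOP = new CellStats() {
    @Override public void update(int keyLength, int valueLength) { }
    @Override public long count() { return 0; }
  };

  /** Placeholder for a sketch-backed collector; here it only counts cells. */
  static final class CountingStats implements CellStats {
    private long cells;
    @Override public void update(int keyLength, int valueLength) { cells++; }
    @Override public long count() { return cells; }
  }

  /** Off by default, so writers only pay for sketches when asked to. */
  static CellStats create(Properties conf) {
    boolean enabled =
        Boolean.parseBoolean(conf.getProperty(SKETCHES_ENABLED_KEY, "false"));
    return enabled ? new CountingStats() : NOOP;
  }
}
```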
Maybe the best thing to do, [~shahrs87], would be a new UI page on a RegionServer
where you can ask for a report on a Region (or on all Regions of a Table on this
server, or on all Regions on the server). An operator would run it when they have
a 'hot' server. It would do what is in the hbase-operator-tools subtask, reading
the Region row-wise so it can report the distribution of row sizes and row
counts. Because it runs in the RS context, it could also do the key/value sketch
that is in the patch here, and there is already a dedicated Region page
reporting on hfiles and sizes that you could add this report to.
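As a rough illustration of the row-wise part of such a report, the sketch below sums cell sizes per row from a row-ordered stream and reads off exact quantiles. Everything in it is hypothetical: `Cell` is a minimal stand-in rather than the HBase `Cell` type, and the exact sort-based quantile is what a quantiles sketch would approximate in bounded memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RegionRowReport {
  /** One cell as seen by a row-ordered scan: its row key and serialized size. */
  static final class Cell {
    final String row;
    final long sizeBytes;
    Cell(String row, long sizeBytes) { this.row = row; this.sizeBytes = sizeBytes; }
  }

  /** Sums cell sizes per row, relying on cells arriving grouped by row. */
  static List<Long> rowSizes(List<Cell> cellsInRowOrder) {
    List<Long> sizes = new ArrayList<>();
    String current = null;
    long total = 0;
    for (Cell c : cellsInRowOrder) {
      if (current != null && !current.equals(c.row)) {
        sizes.add(total);   // row changed: close out the previous row
        total = 0;
      }
      current = c.row;
      total += c.sizeBytes;
    }
    if (current != null) {
      sizes.add(total);     // close out the last row
    }
    return sizes;
  }

  /** Exact nearest-rank quantile for q in (0, 1]; rowSizes must be non-empty. */
  static long quantile(List<Long> rowSizes, double q) {
    List<Long> sorted = new ArrayList<>(rowSizes);
    Collections.sort(sorted);
    int rank = (int) Math.ceil(q * sorted.size());
    return sorted.get(Math.max(rank, 1) - 1);
  }
}
```

The row count is just the size of the list returned by `rowSizes`, and `quantile(sizes, 0.5)` or `quantile(sizes, 0.99)` gives the median and tail row sizes.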
Here is how I turned off gz'ing in the tool:
{code}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
index 4e9a39f2b5..0360351a98 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
@@ -113,8 +113,8 @@ public class HFilePerformanceEvaluation {
runReadBenchmark(conf, fs, mf, "none", "none");
// codec=gz cipher=none
- runWriteBenchmark(conf, fs, mf, "gz", "none");
- runReadBenchmark(conf, fs, mf, "gz", "none");
+ // runWriteBenchmark(conf, fs, mf, "gz", "none");
+ // runReadBenchmark(conf, fs, mf, "gz", "none");
// Add configuration for AES cipher
final Configuration aesconf = new Configuration();
@@ -129,8 +129,8 @@ public class HFilePerformanceEvaluation {
runReadBenchmark(aesconf, aesfs, aesmf, "none", "aes");
// codec=gz cipher=aes
- runWriteBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
- runReadBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
+ //runWriteBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
+ //runReadBenchmark(aesconf, aesfs, aesmf, "gz", "aes");
// Add configuration for Commons cipher
final Configuration cryptoconf = new Configuration();
@@ -146,8 +146,8 @@ public class HFilePerformanceEvaluation {
runReadBenchmark(cryptoconf, cryptofs, aesmf, "none", "aes");
// codec=gz cipher=aes
- runWriteBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
- runReadBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
+ // runWriteBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
+ // runReadBenchmark(cryptoconf, aesfs, aesmf, "gz", "aes");
// cleanup test files
if (fs.exists(mf)) {
{code}
> We should have better introspection of HFiles
> ---------------------------------------------
>
> Key: HBASE-17756
> URL: https://issues.apache.org/jira/browse/HBASE-17756
> Project: HBase
> Issue Type: Brainstorming
> Components: HFile
> Reporter: Esteban Gutierrez
> Assignee: Rushabh Shah
> Priority: Major
>
> [[email protected]] was suggesting to use DataSketches
> (https://datasketches.github.io) in order to write additional statistics to
> the HFiles. This could be used to improve our split decisions,
> troubleshooting or potentially do other interesting analysis without having
> to perform full table scans. The statistics could be stored as part of the
> HFile but we could initially improve the visibility of the data by adding
> some statistics to HFilePrettyPrinter.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)