[ 
https://issues.apache.org/jira/browse/HBASE-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493360#comment-16493360
 ] 

Eshcar Hillel commented on HBASE-20628:
---------------------------------------

Thanks [~anoop.hbase] for the explanation. Now I get it. 
So actually there is no reason or intention to seek for a key that was already 
read. This is good.
Actually when reading the documentation of the reseek(cell) method this is what 
is written there:
bq. Reseek the scanner at or after the specified KeyValue. This method is 
guaranteed to seek at or after the required key *only if the key comes after 
the current position of the scanner*. Should not be used to seek to a key which 
may come before the current position.
So in this respect the implementation of CollectionBackedScanner::reseek(cell) 
is correct, while the implementation of SegmentScanner::reseek(cell) seems to 
be inefficient with no real reason.
I checked, no one changed this piece of code since I created this file 2 years 
ago, and I simply copy-pasted it from another file 
(DefaultMemStore/MemStoreScanner or else), which means this existed also in 
hbase-1.4 and hence CollectionBackedScanner allowed for improved performance.

Here is a suggestion: replace the implementation of 
SegmentScanner::reseek(cell) with the code from 
CollectionBackedScanner::reseek(cell) which does not create a new iterator and 
simply runs next until reaching the cell. 
We can then in parallel (1) run regression tests to make sure we didn't violate 
correctness, and (2) run the same benchmark to see if we get the same 
performance boost.
If we get (/) in both fronts then I think it is better to keep just the 
SegmentScanner and not use a special case scanner. 

Reasonable?


> SegmentScanner does over-comparing when one flushing
> ----------------------------------------------------
>
>                 Key: HBASE-20628
>                 URL: https://issues.apache.org/jira/browse/HBASE-20628
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Performance
>            Reporter: stack
>            Priority: Critical
>             Fix For: 2.0.1
>
>         Attachments: HBASE-20628.branch-2.0.001 (1).patch, 
> HBASE-20628.branch-2.0.001.patch, HBASE-20628.branch-2.0.001.patch, 
> HBASE-20628.branch-2.0.002.patch, Screen Shot 2018-05-25 at 9.38.00 AM.png, 
> hits-20628.png
>
>
> Flushing memstore is taking too long. It looks like we are doing a bunch of 
> comparing out of a new facility in hbase2, the Segment scanner at flush time.
> Below is a patch from [~anoop.hbase]. I had a similar more hacky version. 
> Both undo the extra comparing we were seeing in perf tests.
> [~anastas] and [~eshcar]. Need your help please.
> As I read it, we are trying to flush the memstore snapshot (default, no IMC 
> case). There is only ever going to be one Segment involved (even if IMC is 
> enabled); the snapshot Segment. But the getScanners is returning a list (of 
> one)  Scanners and the scan is via the generic SegmentScanner which is all 
> about a bunch of stuff we don't need when doing a flush so it seems to do 
> more work than is necessary. It also supports scanning backwards which is not 
> needed when trying to flush memstore.
> Do you see a problem doing a version of Anoops patch (whether IMC or not)? It 
> makes a big difference in general throughput when the below patch is in 
> place. Thanks.
> {code}
> diff --git 
> a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreSnapshot.java
>  
> b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreSnapshot.java
> index cbd60e5da3..c3dd972254 100644
> --- 
> a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreSnapshot.java
> +++ 
> b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreSnapshot.java
> @@ -40,7 +40,8 @@ public class MemStoreSnapshot implements Closeable {
>      this.cellsCount = snapshot.getCellsCount();
>      this.memStoreSize = snapshot.getMemStoreSize();
>      this.timeRangeTracker = snapshot.getTimeRangeTracker();
> -    this.scanners = snapshot.getScanners(Long.MAX_VALUE, Long.MAX_VALUE);
> +    //this.scanners = snapshot.getScanners(Long.MAX_VALUE, Long.MAX_VALUE);
> +    this.scanners = snapshot.getScannersForSnapshot();
>      this.tagsPresent = snapshot.isTagsPresent();
>    }
> diff --git 
> a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Segment.java
>  
> b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Segment.java
> index 70074bf3b4..279c4e50c8 100644
> --- 
> a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Segment.java
> +++ 
> b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Segment.java
> @@ -33,6 +33,7 @@ import org.apache.hadoop.hbase.KeyValueUtil;
>  import org.apache.hadoop.hbase.io.TimeRange;
>  import org.apache.hadoop.hbase.util.Bytes;
>  import org.apache.hadoop.hbase.util.ClassSize;
> +import org.apache.hadoop.hbase.util.CollectionBackedScanner;
>  import org.apache.yetus.audience.InterfaceAudience;
>  import org.slf4j.Logger;
>  import 
> org.apache.hbase.thirdparty.com.google.common.annotations.VisibleForTesting;
> @@ -130,6 +131,10 @@ public abstract class Segment {
>      return Collections.singletonList(new SegmentScanner(this, readPoint, 
> order));
>    }
> +  public List<KeyValueScanner> getScannersForSnapshot() {
> +    return Collections.singletonList(new 
> CollectionBackedScanner(this.cellSet.get(), comparator));
> +  }
> +
>    /**
>     * @return whether the segment has any cells
>     */
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to