Bryan Beaudreault created HBASE-26519:
-----------------------------------------

             Summary: StoreFileScanner parallel seek -- productionize or drop?
                 Key: HBASE-26519
                 URL: https://issues.apache.org/jira/browse/HBASE-26519
             Project: HBase
          Issue Type: Improvement
            Reporter: Bryan Beaudreault


hbase.storescanner.parallel.seek.enable was added a few years ago in 
https://issues.apache.org/jira/browse/HBASE-7495, but it is still disabled by 
default. Its description says "Enables StoreFileScanner 
parallel-seeking in StoreScanner, a feature which can reduce response latency 
under special conditions".

It's not clear what "special conditions" means. The comment history on that 
issue suggests the feature can help when you have "high random read, low cache 
hit rate, many store files". 

We have a bunch of clusters with this shape, and we use SSDs for all storage, 
so I figured this might help a lot. I tried setting this to true on one 
RegionServer of one of our highest QPS clusters, hoping I'd see some clear 
improvement. This very simple test was pretty much a wash, so I need to do more 
methodical testing.

One thing did become clear in the test, though: I can't tell whether the 
default thread pool size of 10 is good enough for my use-case. I can't find any 
logging or metrics around thread pool saturation. What I ended up doing was 
repeatedly refreshing the /dump endpoint of the RS, and I noticed that there 
were sometimes 1-5 tasks queued for the RS_PARALLEL_SEEK executor. This 
suggests I should scale up the thread pool, but use-cases change over time, so 
polling /dump by hand is not a great way to size it.
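
To illustrate the kind of saturation signal I was looking for, here is a 
minimal sketch on top of a plain java.util.concurrent.ThreadPoolExecutor. It is 
not the HBase executor code; the class name QueueDepthTrackingPool and its 
getters are hypothetical, and I'm assuming a fixed-size pool with an unbounded 
queue.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: a fixed-size pool that remembers the
// deepest backlog it has observed, so it could be exported as a metric.
class QueueDepthTrackingPool extends ThreadPoolExecutor {
    private final AtomicInteger maxQueued = new AtomicInteger();

    QueueDepthTrackingPool(int threads) {
        // Assumption: core == max threads, unbounded FIFO queue.
        super(threads, threads, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    public void execute(Runnable task) {
        super.execute(task);
        // Sample the backlog right after handing off the task. Under
        // concurrent submitters this is an approximation, not an exact peak.
        maxQueued.accumulateAndGet(getQueue().size(), Math::max);
    }

    // Tasks waiting for a free thread right now.
    int getQueuedNow() {
        return getQueue().size();
    }

    // Largest backlog sampled since the pool was created.
    int getMaxQueued() {
        return maxQueued.get();
    }
}
```

A gauge like getQueuedNow() or a high-water mark like getMaxQueued() would 
answer the sizing question without having to poll /dump.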

Task queuing seems like a poor fit for a feature that is aimed at reducing 
latencies. I wonder if we should consider some changes to make this easier 
to deploy in production. Here are some ideas I had:
 * Can we generate a better default value for the thread pool size, maybe based 
on number of RS handler threads or some other heuristic?
 * Should we consider eliminating queuing for this feature? Instead, if the 
thread pool is saturated, run the seek in-line in the current thread (i.e. 
revert to normal). This would be more similar to how hedged reads work in HDFS.
 * Can we expose a metric or logging to help operators know when to scale up 
the thread pool? If we implemented the second option above, we could expose a 
"seeksInCurrentThread" counter to track this, again similar to how hedged reads 
report on saturation.
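
A rough sketch of the second and third ideas combined, using only 
java.util.concurrent. This is not existing HBase code; the class name 
InlineFallbackSeekPool and the submitSeek method are hypothetical, and the 
counter name is just the one suggested above. A SynchronousQueue means a task 
is never queued: it is either handed to an idle thread or rejected, and the 
rejection handler runs it in the calling thread and bumps the counter.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration only: no queuing, inline fallback on saturation.
class InlineFallbackSeekPool {
    // Counts seeks that ran in the caller because every pool thread was busy.
    private final LongAdder seeksInCurrentThread = new LongAdder();
    private final ThreadPoolExecutor pool;

    InlineFallbackSeekPool(int threads) {
        pool = new ThreadPoolExecutor(
            threads, threads, 60L, TimeUnit.SECONDS,
            // Zero-capacity hand-off: offer() fails unless a thread is idle.
            new SynchronousQueue<>(),
            // Rejection handler: run in the submitting thread and record it.
            // (It also runs inline after shutdown; a real version might not.)
            (task, executor) -> {
                seeksInCurrentThread.increment();
                task.run();
            });
    }

    // Runs the seek on a pool thread if one is free, otherwise in-line.
    void submitSeek(Runnable seek) {
        pool.execute(seek);
    }

    long getSeeksInCurrentThread() {
        return seeksInCurrentThread.sum();
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

With this shape, a steadily growing seeksInCurrentThread would be the signal to 
scale up the pool, and saturation would never add queueing delay to a seek.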

But with all of this said, I wonder if anyone is running this in production and 
has any updated guidance on when to use this? Does it still make sense given 
the last 8 years of development in HBase? Would it ever make sense to make it 
enabled by default?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
