Bryan Beaudreault created HBASE-26519:
-----------------------------------------
Summary: StoreFileScanner parallel seek -- productionize or drop?
Key: HBASE-26519
URL: https://issues.apache.org/jira/browse/HBASE-26519
Project: HBase
Issue Type: Improvement
Reporter: Bryan Beaudreault
hbase.storescanner.parallel.seek.enable was added a few years ago in
https://issues.apache.org/jira/browse/HBASE-7495, but still defaults to
disabled. The description of that says "Enables StoreFileScanner
parallel-seeking in StoreScanner, a feature which can reduce response latency
under special conditions".
It's not very clear what "special conditions" means. The comment history on
that issue suggests it can help when you have "high random read, low cache hit
rate, many store files".
We have a bunch of clusters with this shape, and in fact we use SSDs for all
storage so I figured this might help a lot. I tried setting this to true on one
RegionServer of one of our highest QPS clusters hoping I'd see some clear
improvement. This very simple test was pretty much a wash, so I need to do more
methodical testing.
The test did raise one question, though: is the default thread pool size of
10 good enough for my use case? I have no way of knowing, as there is no
logging or metrics that I can find around thread pool saturation. What I ended
up doing was spamming refresh of the /dump endpoint of the RS, and noticed that
there were sometimes 1-5 tasks queued for the RS_PARALLEL_SEEK executor. This
suggests I should scale up the thread pool, but use cases change over time, so
manually polling /dump is not a great way to determine that.
Task queuing seems counterproductive for a feature aimed at reducing
latency. I wonder if we should consider some changes to make this easier
to deploy in production. Here are some ideas I had:
* Can we generate a better default value for the thread pool size, maybe based
on number of RS handler threads or some other heuristic?
* Should we consider eliminating queuing for this feature? Instead, if the
thread pool is saturated, run the seek inline in the current thread (i.e. revert
to normal). This would be more similar to how hedged reads work in HDFS.
* Can we expose a metric or logging to help operators know when to scale up
the thread pool? If we implemented the 2nd option above we could expose a
"seeksInCurrentThread" counter to track this, again similar to how hedged reads
report on saturation.
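
To make the 2nd and 3rd ideas concrete, here is a rough sketch using only java.util.concurrent. It is not the existing RS_PARALLEL_SEEK executor code. The class name InlineFallbackSeekPool and its methods are made up for illustration, and only the "seeksInCurrentThread" counter name comes from the ideas above. A SynchronousQueue gives the pool no queue at all, so a submit either gets a free worker or is rejected, and the rejection handler runs the task in the submitting thread and bumps the counter.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a seek pool that never queues. Names are illustrative only.
class InlineFallbackSeekPool {
    // Counts tasks that ran in the submitting thread because the pool was saturated.
    private final AtomicLong seeksInCurrentThread = new AtomicLong();
    private final ThreadPoolExecutor executor;

    InlineFallbackSeekPool(int threads) {
        // Invoked when no worker is free: run inline instead of waiting in a queue.
        RejectedExecutionHandler runInline = (task, pool) -> {
            if (pool.isShutdown()) {
                throw new RejectedExecutionException("seek pool is shut down");
            }
            seeksInCurrentThread.incrementAndGet();
            task.run();
        };
        // SynchronousQueue has zero capacity, so tasks are handed directly to an
        // idle worker or rejected; nothing ever sits waiting behind other tasks.
        executor = new ThreadPoolExecutor(threads, threads, 60L, TimeUnit.SECONDS,
            new SynchronousQueue<>(), runInline);
    }

    <T> Future<T> submit(Callable<T> seek) {
        return executor.submit(seek);
    }

    long seeksInCurrentThread() {
        return seeksInCurrentThread.get();
    }

    void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        InlineFallbackSeekPool pool = new InlineFallbackSeekPool(1);
        CountDownLatch release = new CountDownLatch(1);
        // Occupy the only worker so the next submit finds the pool saturated.
        Future<String> slow = pool.submit(() -> {
            release.await();
            return Thread.currentThread().getName();
        });
        // This one cannot be handed to a worker, so it runs in the calling thread.
        Future<String> inline = pool.submit(() -> Thread.currentThread().getName());
        System.out.println("second seek ran on: " + inline.get());
        System.out.println("seeksInCurrentThread = " + pool.seeksInCurrentThread());
        release.countDown();
        System.out.println("first seek ran on: " + slow.get());
        pool.shutdown();
    }
}
```

With something like this, a counter that keeps climbing would be the signal to scale up the pool, instead of watching /dump for queued tasks. Whether it fits how StoreScanner actually submits its seeks is something I have not checked.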
But with all of this said, I wonder if anyone is running this in production and
has any updated guidance on when to use this? Does it still make sense given
the last 8 years of development in HBase? Would it ever make sense to make it
enabled by default?