Jaeyun Noh wrote:
I wonder whether a network RPC is involved every time we call next() on
the scanner class.
It's not a pretty story. A next in the client makes for a trip over to the
server carrying the region that hosts the row the scanner is currently
stalled on. Serverside, the region has a Scanner context that has
within it a scanner on the memcache and then a scanner for each of the
storefiles present in the filesystem. The storefile scanners in turn
reduce to Hadoop MapFile#next calls so another network hop is involved
out to the particular datanode hosting the MapFile block the scanner is
currently within. The next on the serverside steps carefully through
the memcache first and then through each of the store files, respecting
order, to turn up the appropriate next result.
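To make the serverside next concrete, here is a rough sketch of merging
several sorted sources into one ordered stream. It is not HBase code: the
name merged_next, the (row, value) tuples, and the rule that the memcache
wins when the same row shows up in more than one source are all
assumptions made for illustration. The real region scanner also has to
deal with columns and timestamps, which this leaves out.

```python
import heapq


def merged_next(memcache, storefiles):
    """Yield (row, value) pairs in row order from several sorted sources.

    memcache and each entry of storefiles are lists of (row, value) pairs
    already sorted by row, with no duplicate rows inside a single source.
    Assumption for this sketch: when a row appears in more than one
    source, the earliest source wins, and the memcache comes first.
    """
    sources = [memcache] + list(storefiles)
    # Tag every entry with its source index so that ties on the row key
    # sort the memcache (index 0) ahead of the store files.
    tagged = [
        [(row, idx, value) for row, value in src]
        for idx, src in enumerate(sources)
    ]
    last_row = None
    for row, _idx, value in heapq.merge(*tagged):
        if row == last_row:
            continue  # an earlier source already supplied this row
        last_row = row
        yield row, value
```

Each call to next on the generator advances whichever underlying source
holds the smallest pending row, which is roughly the work the region does
per next.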
Also, does the scanner work by sending parallel requests to HRegions and
fetching the results into a temporary cache on the HBase client?
Well, a scanner is homed on a single row at a time, so it runs against a
single region at any one time. That said, if a row comprises many column
families, we currently proceed through each in series. I believe there
is an issue open to parallelize the requests across all the column
families in a row.
If so, we're happy to live with that.
Is the following hbase parameter related to my question?
<property>
<name>hbase.client.scanner.caching</name>
<value>30</value>
<description>Number of rows that will be fetched when calling next
on a scanner if it is not served from memory. Higher caching values
will enable faster scanners but will eat up more memory and some
calls of next may take longer and longer times when the cache is empty.
</description>
</property>
Yes. It was just added. It fetches a bunch of rows at a time rather than
one at a time as it used to. In my testing, it makes scanners 4X faster.
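As a rough illustration of why batching helps, here is a sketch of a
client-side scanner that refills a local buffer only when it runs dry. It
is not the HBase client: CachingScanner, fetch_batch, rpc_count, and drain
are hypothetical names, and fetch_batch stands in for whatever RPC brings
rows back from the region server. The caching argument plays the role of
hbase.client.scanner.caching.

```python
from collections import deque


class CachingScanner:
    """Toy scanner that fetches `caching` rows per simulated RPC."""

    def __init__(self, fetch_batch, caching=30):
        # fetch_batch(start, n) is a hypothetical stand-in for the RPC;
        # it returns up to n rows beginning at offset start.
        self._fetch_batch = fetch_batch
        self._caching = caching
        self._buffer = deque()
        self._position = 0
        self._done = False
        self.rpc_count = 0  # number of simulated round trips so far

    def next(self):
        """Return the next row, or None when the scan is exhausted."""
        if not self._buffer and not self._done:
            self.rpc_count += 1
            batch = self._fetch_batch(self._position, self._caching)
            self._position += len(batch)
            if len(batch) < self._caching:
                self._done = True  # a short batch means no rows remain
            self._buffer.extend(batch)
        if not self._buffer:
            return None
        return self._buffer.popleft()


def drain(scanner):
    """Call next() until the scanner is exhausted and collect the rows."""
    rows = []
    while True:
        row = scanner.next()
        if row is None:
            return rows
        rows.append(row)
```

With 100 rows and caching set to 30, draining the scanner costs 4
simulated round trips; with caching set to 1 it costs 101, the last being
the empty fetch that signals the end. The sketch only counts round trips,
so it says nothing about how large the real speedup is.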
St.Ack