[
https://issues.apache.org/jira/browse/HBASE-13090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Lawlor updated HBASE-13090:
------------------------------------
Attachment: HBASE-13090-v1.patch
Attached is a rough work-in-progress patch. The patch does provide a working
implementation of heartbeats, but I believe it could be refined, so I am looking
to get some feedback.
The implementation points that I wanted to highlight for discussion are below:
* We wanted to move all time tracking into RegionScanner and StoreScanner and
leave RSRpcServices unscathed. I started off with that intention, but it
gradually became clear that it may be better to simply have a timeLimit field
in the call to nextRaw from RSRpcServices. The reasoning is outlined below:
** While it is certainly possible to add a reset() or newSession() method to
the RegionScanner interface that would allow us to reset time tracking, the
issue becomes how to communicate the time limit down from the RegionScanner
into the StoreScanner (the scanner that is looping through the cells for a
particular column family).
** The StoreScanners are stored in a KeyValueHeap in the RegionScanner, so it
would be possible to loop through them all and call a similar reset/newSession
method on each of them, but that seems dirty and wasteful. It seems more
appropriate to communicate the timeLimit down to only the relevant StoreScanner
via a timeLimit field in the InternalScanner#next(List<Cell> results, ...,
timeLimit) call.
** Since the RegionScanner also implements the InternalScanner interface, that
same next method would need to be implemented in RegionScannerImpl. Because of
this, I think it makes the most sense to simply have a nextRaw(List<Cell>, ...,
timeLimit) method to specify the timeLimit from RSRpcServices rather than a
reset/newSession call.
* To avoid polluting the returned Result array with state information about
heartbeats, a new heartbeat flag has been added to the ScanResponse. Since only
the ScannerCallable ever sees the ScanResponse returned from the server, I have
exposed the method ScannerCallable#isHeartbeatMessage() to allow the
ClientScanner to check if the most recent server response was a
heartbeat/keep-alive message.
* The method postHeapNext(List<Cell>) was added to RegionScannerImpl to allow
me to insert delays in between fetches of column family cells for testing. It
didn't feel clean, so I was wondering if anyone had any ideas about alternative
approaches to emulate long-running scans on the server side.
* Since heartbeat messages have the potential to create partial results (in the
event that the timeout occurs in the middle of a row), we only allow heartbeat
messages if the client has specified that heartbeats are supported AND partial
results are also supported.
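
To make the points above concrete, here is a minimal, self-contained sketch of
the flow, not code from the patch. SketchScanner and SketchScanResponse are
hypothetical stand-ins (they are not HBase classes), cells are modeled as plain
strings, and the time limit is assumed to be an absolute System.nanoTime()
deadline passed in with every call, so no reset()/newSession() method is needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical response type: the heartbeat flag travels next to the cells
// instead of being encoded inside the returned results.
final class SketchScanResponse {
    final List<String> cells;
    final boolean heartbeat;

    SketchScanResponse(List<String> cells, boolean heartbeat) {
        this.cells = cells;
        this.heartbeat = heartbeat;
    }
}

// Hypothetical scanner: the deadline is an argument of next(), so each RPC
// supplies a fresh time limit without any per-scanner reset step.
final class SketchScanner {
    private final List<String> source;
    private int position = 0;

    SketchScanner(List<String> source) {
        this.source = source;
    }

    boolean hasMore() {
        return position < source.size();
    }

    // deadlineNanos is an absolute System.nanoTime() value (an assumption of
    // this sketch; the patch may represent the limit differently).
    SketchScanResponse next(int cellLimit, long deadlineNanos,
                            boolean heartbeatsSupported,
                            boolean partialResultsSupported) {
        // Heartbeats can cut a row short, so both capabilities are required.
        boolean allowHeartbeat = heartbeatsSupported && partialResultsSupported;
        List<String> out = new ArrayList<>();
        while (hasMore() && out.size() < cellLimit) {
            // Overflow-safe comparison of nanoTime values.
            if (allowHeartbeat && System.nanoTime() - deadlineNanos >= 0) {
                return new SketchScanResponse(out, true);
            }
            out.add(source.get(position++));
        }
        return new SketchScanResponse(out, false);
    }
}
```

A client loop built on this would keep calling next() while the response is
flagged as a heartbeat, treating such a response as "still alive, possibly
partial" rather than as the end of the scan.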
Ideas for improvement:
* As the earlier discussion indicated, the tracking of limits in RSRpcServices is
somewhat messy. When a new limit needs to be added, the RegionScanner and
InternalScanner interfaces must both be changed. The limit logic may be
simplified by defining something along the lines of a ScannerLimit object. The
object would have a field per limit and would have an associated Builder that
would allow us to specify only the limits we care about (if a limit is not set,
then it doesn't get enforced). Then, in the future, if a new limit was needed
it would only amount to adding a new field in ScannerLimit and adding the
appropriate enforcement logic (no changes to interfaces necessary). What do you
guys think? I thought this would clean things up a bit but wanted to see if
there are any objections first.
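
As a rough illustration of the ScannerLimit idea, here is a minimal sketch. The
field names (batch, size, time), the method names, and the use of -1 as the
"unset" marker are all assumptions made for the example, not part of the patch.

```java
// Hypothetical ScannerLimit: one field per limit. A limit that was never set
// on the builder is never enforced. Limits are assumed to be non-negative.
final class ScannerLimit {
    private static final long UNSET = -1L;

    private final long batchLimit;   // max number of cells
    private final long sizeLimit;    // max accumulated size in bytes
    private final long timeLimit;    // absolute deadline in milliseconds

    private ScannerLimit(Builder b) {
        this.batchLimit = b.batchLimit;
        this.sizeLimit = b.sizeLimit;
        this.timeLimit = b.timeLimit;
    }

    static Builder builder() {
        return new Builder();
    }

    boolean batchLimitReached(long cellsSoFar) {
        return reached(batchLimit, cellsSoFar);
    }

    boolean sizeLimitReached(long bytesSoFar) {
        return reached(sizeLimit, bytesSoFar);
    }

    boolean timeLimitReached(long nowMillis) {
        return reached(timeLimit, nowMillis);
    }

    // An unset limit is never reached, whatever the current value is.
    private static boolean reached(long limit, long current) {
        return limit != UNSET && current >= limit;
    }

    static final class Builder {
        private long batchLimit = UNSET;
        private long sizeLimit = UNSET;
        private long timeLimit = UNSET;

        private Builder() {
        }

        Builder setBatchLimit(long cells) {
            this.batchLimit = cells;
            return this;
        }

        Builder setSizeLimit(long bytes) {
            this.sizeLimit = bytes;
            return this;
        }

        Builder setTimeLimit(long deadlineMillis) {
            this.timeLimit = deadlineMillis;
            return this;
        }

        ScannerLimit build() {
            return new ScannerLimit(this);
        }
    }
}
```

With something like this, a call such as next(results, limit) would keep the
same signature when a new limit is introduced; only ScannerLimit and the
enforcement code would change.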
Of course, the finer implementation points can be seen in the patch itself, and
any feedback would be appreciated. I will post the patch to reviewboard.
Thanks
> Progress heartbeats for long running scanners
> ---------------------------------------------
>
> Key: HBASE-13090
> URL: https://issues.apache.org/jira/browse/HBASE-13090
> Project: HBase
> Issue Type: New Feature
> Reporter: Andrew Purtell
> Attachments: HBASE-13090-v1.patch
>
>
> It can be necessary to set very long timeouts for clients that issue scans
> over large regions when all data in the region might be filtered out
> depending on scan criteria. This is a usability concern because it can be
> hard to identify what worst case timeout to use until scans are
> occasionally/intermittently failing in production, depending on variable scan
> criteria. It would be better if the client-server scan protocol can send back
> periodic progress heartbeats to clients as long as server scanners are alive
> and making progress.
> This is related but orthogonal to streaming scan (HBASE-13071).