Re: remove scanner caching?

Dave Latham Thu, 09 Apr 2015 06:28:36 -0700

We definitely wouldn't want to remove it too soon, for compatibility
reasons.

Adding a "limitRows" notion sounds reasonable, but I'd argue it is actually
something different from what caching was.  If an app relies on the scanner
to actually limit the number of rows returned, the current caching limit
won't work for scans that cross region boundaries.  We would need to keep
the state on the client side in addition to the server side, and decrement
it as we traverse regions.
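
A minimal sketch of that client-side bookkeeping, in plain Java with no
HBase dependency.  The names (LimitRowsSketch, scanRegion, scanWithLimit)
are made up for illustration, and each region is modeled as a simple list
of row keys rather than a real region scanner:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LimitRowsSketch {

    /** Hypothetical stand-in for one region: returns at most `limit` of its rows. */
    static List<String> scanRegion(List<String> regionRows, int limit) {
        int n = Math.min(limit, regionRows.size());
        return new ArrayList<>(regionRows.subList(0, n));
    }

    /** Client-side loop that carries the remaining row budget across regions. */
    static List<String> scanWithLimit(List<List<String>> regions, int limitRows) {
        List<String> out = new ArrayList<>();
        int remaining = limitRows;
        for (List<String> region : regions) {
            if (remaining <= 0) {
                break; // limit reached, do not open the next region
            }
            // The per-region limit is whatever is left, not the original limit.
            List<String> rows = scanRegion(region, remaining);
            out.addAll(rows);
            remaining -= rows.size();
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d", "e"),
                Arrays.asList("f"));
        System.out.println(scanWithLimit(regions, 4)); // prints [a, b, c, d]
    }
}
```

Without the `remaining` counter each region would apply the full limit on
its own, so a scan over three regions could return up to three times as
many rows as requested.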

Looking into the branch-1 API, I can see there is now also
Scan.allowPartialResults in addition to Scan.batch.  For most cases, I'd
expect batching is just to avoid memory issues for wide rows, in which case
allowPartialResults could be a better, simpler interface to tell HBase not
to overflow a small buffer with wide rows.  Though it looks like at the
moment that doesn't happen.

As an app developer myself, I find the interaction between batch, caching,
maxResultSize, and allowPartialResults confusing (as well as the similarly
named cacheBlocks).  The names could be better (maxResultSize actually has
nothing to do with the maximum size of the Results returned - it's rather
the internal buffer size, batch is a maximum number of Cells per Result [I
think; does it reset between rows?], and of course caching is an internal
maxRowsPerRPC).  The documentation is limited and out of date (e.g.
https://hbase.apache.org/book.html#perf.hbase.client.caching).  Some things
could at least be more consistent (the getter is getAllowPartialResults
rather than isAllowPartialResults, unlike almost all the other boolean
properties).

As I poke around and write this out, I guess I'd argue instead that it's
time (or past time) to clean up the Scan API and document it more clearly.
Which is a scary task I know.  But for a newcomer, it's a scary API right
now.

What about something like:
Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
- though this is still confusing because it's common to go over this size)
Scan.limitRows (instead of caching - along with true client side support)
Scan.allowPartialResults (to indicate it's ok to break up rows across
Results.  It is transmitted to the server to tell it to stop adding Cells
to the buffer as soon as it fills rather than at the end of the row.  If a
client needs true pagination for Cells within a row, it can be done with a
Filter.)
Scan.cacheBlocks (less confusing without other things called "caching")
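
To illustrate the difference the proposed allowPartialResults flag would
make to a size-bounded buffer, here is a small plain-Java simulation.  It
is only a sketch of the behavior described above, not HBase's actual
server code; BufferFillSketch and fillBuffer are hypothetical names, and
cells are reduced to their sizes in bytes:

```java
public class BufferFillSketch {

    /**
     * Simulates a server filling one response buffer.
     * rows[i] holds the sizes in bytes of the cells in row i.
     * Returns the number of cells placed in the buffer.
     */
    static int fillBuffer(int[][] rows, long bufferSize, boolean allowPartialResults) {
        long used = 0;
        int cells = 0;
        for (int[] row : rows) {
            for (int cellSize : row) {
                if (allowPartialResults && used >= bufferSize) {
                    return cells; // buffer is full: stop in the middle of the row
                }
                used += cellSize;
                cells++;
            }
            if (used >= bufferSize) {
                return cells; // buffer is full: stop, but only at a row boundary
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        int[][] rows = { {10, 10, 10, 10}, {10, 10} };
        System.out.println(fillBuffer(rows, 20, false)); // prints 4
        System.out.println(fillBuffer(rows, 20, true));  // prints 2
    }
}
```

With whole rows required, the 20-byte buffer ends up holding 40 bytes,
which is the "common to go over this size" case; with partial results
allowed it stops at 20.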

Dave

On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <[email protected]> wrote:

> Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave
> it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> that you'll only look at the first N rows returned. In that case you'd set
> it to N. I thought we had renamed it from "caching" to "limit" but looking
> at the code, that is not the case.
>
> In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> backwards compatibility.
>
> -- Lars
>       From: Dave Latham <[email protected]>
>  To: [email protected]
>  Sent: Wednesday, April 8, 2015 9:09 PM
>  Subject: remove scanner caching?
>
> After debugging a scans missing data issue while migrating to 0.98 (thanks
> Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> have both caching and maxResultSize for scans.  It seems to be more client
> api complexity than it's worth.  Why would someone need to set caching when
> maxResultSize is available?  Indeed, the first patch proposed by some
> fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
> deprecate and eventually remove caching?  Is there a good case for keeping
> it in the client API surface?
>
> Dave
>
>
>
>
