Re: remove scanner caching?

Stack Thu, 09 Apr 2015 12:34:49 -0700

On Thu, Apr 9, 2015 at 6:25 AM, Dave Latham <[email protected]> wrote:


> We definitely wouldn't want to remove it too soon, for compatibility
> reasons.
>
> Adding a "limitRows" notion sounds reasonable, but I'd argue it is actually
> something different than caching was.  If an app is relying on the scanner
> to actually limit the number of rows returned, the current caching limit
> won't work for scans that cross region boundaries.  We would need to keep
> the state client side in addition to server side and decrement as we
> traverse regions.
>
> Looking into the branch-1 API, I can see there is now also
> Scan.allowPartialResults in addition to Scan.batch.  For most cases, I'd
> expect batching is just to avoid memory issues for wide rows, in which case
> allowPartialResults could be a better, simpler interface to tell HBase not
> to overflow a small buffer with wide rows.  Though it looks like at the
> moment that doesn't happen.
>
> As an app developer myself, the interaction between batch, caching,
> maxResultSize, allowPartialResults is confusing (as well as the similarly
> named cacheBlocks).  The names could be better (maxResultSize actually has
> nothing to do with the maximum size of the Results returned - it's rather
> the internal buffer size, batch is a max Cells per Result [I think.  does
> it reset between rows?], and of course caching is an internal
> maxRowsPerRPC).  The documentation is limited and out of date (e.g.
> https://hbase.apache.org/book.html#perf.hbase.client.caching).  Some things
> could at least be more consistent (getAllowPartialResults instead of
> isAllowPartialResults like almost all the other boolean properties).
>
> As I poke around and write this out, I guess I'd argue instead that it's
> time (or past time) to clean up the Scan API and document it more clearly.
> Which is a scary task I know.  But for a newcomer, it's a scary API right
> now.
>
> What about something like:
> Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
> - though this is still confusing because it's common to go over this size)
> Scan.limitRows (instead of caching - along with true client side support)
> Scan.allowPartialResults (to indicate it's ok to break up rows across
> Results. it is transmitted to the server to indicate stop adding Cells to
> the buffer as soon as it fills rather than at the end of the row.  if a
> client needs true pagination for Cells within a row it can be done with a
> Filter.)
> Scan.cacheBlocks (less confusing without other things called "caching")
>
>
I think you've identified the next logical follow-on to the work that
JonathanL, JoshE, Lars et al. have been at in Scanners recently. A Scanner
2.0 project sounds good to me (with appropriate bridging from old API
through to the new -- configs and docs too).

St.Ack




> Dave
>
>
>
>
>
> On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <[email protected]> wrote:
>
> > Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave
> > it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> > that you'll only look at the first N rows returned. In that case you'd set
> > it to N. I thought we had renamed it from "caching" to "limit" but looking
> > at the code, that is not the case.
> >
> > In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> > backwards compatibility.
> >
> > -- Lars
> >       From: Dave Latham <[email protected]>
> >  To: [email protected]
> >  Sent: Wednesday, April 8, 2015 9:09 PM
> >  Subject: remove scanner caching?
> >
> > After debugging a scans-missing-data issue while migrating to 0.98 (thanks
> > Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> > have both caching and maxResultSize for scans.  It seems to be more client
> > API complexity than it's worth.  Why would someone need to set caching when
> > maxResultSize is available?  Indeed, the first patch proposed by some
> > fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
> > deprecate and eventually remove caching?  Is there a good case for keeping
> > it in the client API surface?
> >
> > Dave
> >
> >
> >
> >
>

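For illustration only, the client-side bookkeeping that a true "limitRows" would need (as described in the thread: keep state on the client and decrement it while traversing regions) might look roughly like the sketch below. This is not HBase code. The class LimitRowsSketch, the methods nextBatch and scan, and the modelling of regions as in-memory lists of row keys are all hypothetical names and assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not HBase code: each region is modelled as an
// in-memory list of row keys, and an "RPC" is a plain method call.
final class LimitRowsSketch {

    // Simulates one scan RPC against a single region: returns at most
    // maxRows rows, starting at the given offset within that region.
    static List<String> nextBatch(List<String> region, int offset, int maxRows) {
        int end = Math.min(region.size(), offset + maxRows);
        return new ArrayList<>(region.subList(offset, end));
    }

    // Scans the regions in order. rowsPerRpc plays the role of the old
    // "caching" setting; limitRows is the overall cap tracked client side.
    static List<String> scan(List<List<String>> regions, int rowsPerRpc, int limitRows) {
        if (rowsPerRpc <= 0) {
            throw new IllegalArgumentException("rowsPerRpc must be positive");
        }
        List<String> out = new ArrayList<>();
        int remaining = limitRows; // client-side state that survives region changes
        for (List<String> region : regions) {
            int offset = 0;
            while (remaining > 0 && offset < region.size()) {
                // Never ask a region for more rows than the client still wants.
                int ask = Math.min(rowsPerRpc, remaining);
                List<String> batch = nextBatch(region, offset, ask);
                out.addAll(batch);
                offset += batch.size();
                remaining -= batch.size(); // decrement as regions are traversed
            }
            if (remaining == 0) {
                break; // limit reached: do not open a scanner on the next region
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("a1", "a2", "a3"),
                Arrays.asList("b1", "b2", "b3"),
                Arrays.asList("c1", "c2"));
        // Limit of 5 rows, 2 rows per simulated RPC.
        System.out.println(scan(regions, 2, 5)); // prints [a1, a2, a3, b1, b2]
    }
}
```

The point of the sketch is only that the remaining count lives outside the per-region loop. A limit enforced per region on the server alone would restart at each region boundary, which is the problem raised in the thread.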