We definitely wouldn't want to remove it too soon, for compatibility reasons.

Adding a "limitRows" notion sounds reasonable, but I'd argue it is actually something different from what caching was. If an app is relying on the scanner to actually limit the number of rows returned, the current caching limit won't work for scans that cross region boundaries. We would need to keep the state client side in addition to server side and decrement it as we traverse regions.

Looking into the branch-1 API, I can see there is now also Scan.allowPartialResults in addition to Scan.batch. For most cases, I'd expect batching is just to avoid memory issues for wide rows, in which case allowPartialResults could be a better, simpler interface to tell HBase not to overflow a small buffer with wide rows. Though it looks like at the moment that doesn't happen.

As an app developer myself, I find the interaction between batch, caching, maxResultSize, and allowPartialResults confusing (as well as the similarly named cacheBlocks). The names could be better: maxResultSize actually has nothing to do with the maximum size of the Results returned - it's rather the internal buffer size; batch is a max Cells per Result [I think. does it reset between rows?]; and of course caching is an internal maxRowsPerRPC. The documentation is limited and out of date (e.g. https://hbase.apache.org/book.html#perf.hbase.client.caching). Some things could at least be more consistent (e.g. the getter is named getAllowPartialResults rather than isAllowPartialResults, which would match almost all the other boolean properties).

As I poke around and write this out, I guess I'd argue instead that it's time (or past time) to clean up the Scan API and document it more clearly. Which is a scary task, I know. But for a newcomer, it's a scary API right now.
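To make the earlier point about region boundaries concrete, here is a rough sketch of why a row limit needs client-side state. This is not HBase code: regions are modeled as in-memory lists of row keys, and the class and method names (RowLimitSketch, scanRegion, scanWithClientSideLimit, scanWithPerRegionLimitOnly) are made up for illustration. The only assumption is that each region's scanner honors the limit it is given but knows nothing about rows already returned from earlier regions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not the HBase client: each region is just a list of row keys.
final class RowLimitSketch {

    // Stand-in for one region's scanner: returns at most serverSideLimit rows
    // and has no knowledge of what other regions have already returned.
    static List<String> scanRegion(List<String> regionRows, int serverSideLimit) {
        int n = Math.max(0, Math.min(serverSideLimit, regionRows.size()));
        return new ArrayList<>(regionRows.subList(0, n));
    }

    // Client keeps the remaining count and decrements it as it traverses regions.
    static List<String> scanWithClientSideLimit(List<List<String>> regions, int limitRows) {
        List<String> out = new ArrayList<>();
        int remaining = limitRows;
        for (List<String> region : regions) {
            if (remaining <= 0) {
                break; // limit reached, no need to open the next region
            }
            List<String> rows = scanRegion(region, remaining);
            out.addAll(rows);
            remaining -= rows.size();
        }
        return out;
    }

    // No client-side state: the same limit is sent to every region,
    // so the total can exceed the limit once the scan crosses a boundary.
    static List<String> scanWithPerRegionLimitOnly(List<List<String>> regions, int limitRows) {
        List<String> out = new ArrayList<>();
        for (List<String> region : regions) {
            out.addAll(scanRegion(region, limitRows));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"),
                Arrays.asList("f"));
        System.out.println(scanWithClientSideLimit(regions, 4));    // [a, b, c, d]
        System.out.println(scanWithPerRegionLimitOnly(regions, 4)); // [a, b, c, d, e, f]
    }
}
```

With a limit of 4 over three small regions, the per-region-only version returns all 6 rows because no single region ever hits the limit, while the version that carries the remaining count stops after 4.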
What about something like:

Scan.bufferSize (instead of maxResultSize, for the target over-the-wire size - though this is still confusing because it's common to go over this size)
Scan.limitRows (instead of caching - along with true client-side support)
Scan.allowPartialResults (to indicate it's ok to break up rows across Results. It is transmitted to the server to tell it to stop adding Cells to the buffer as soon as it fills rather than at the end of the row. If a client needs true pagination for Cells within a row, it can be done with a Filter.)
Scan.cacheBlocks (less confusing without other things called "caching")

Dave

On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <[email protected]> wrote:

> Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave
> it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> that you'll only look at the first N rows returned. In that case you'd set
> it to N. I thought we had renamed it from "caching" to "limit" but looking
> at the code, that is not the case.
>
> In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> backwards compatibility.
>
> -- Lars
> From: Dave Latham <[email protected]>
> To: [email protected]
> Sent: Wednesday, April 8, 2015 9:09 PM
> Subject: remove scanner caching?
>
> After debugging a scans missing data issue while migrating to 0.98 (thanks
> Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> have both caching and maxResultSize for scans. It seems to be more client
> api complexity than it's worth. Why would someone need to set caching when
> maxResultSize is available? Indeed, the first patch proposed by some
> fellow in HBASE-1996 simply replaced caching with maxResultSize. Can we
> deprecate and eventually remove caching? Is there a good case for keeping
> it in the client API surface?
>
> Dave
