On Thu, Apr 9, 2015 at 6:25 AM, Dave Latham <[email protected]> wrote:
> We definitely wouldn't want to remove it too soon, for compatibility
> reasons.
>
> Adding a "limitRows" notion sounds reasonable, but I'd argue it is actually
> something different than caching was. If an app is relying on the scanner
> to actually limit the number of rows returned, the current caching limit
> won't work for scans that cross region boundaries. We would need to keep
> the state client side in addition to server side and decrement as we
> traverse regions.
>
> Looking into the branch-1 API, I can see there is now also
> Scan.allowPartialResults in addition to Scan.batch. For most cases, I'd
> expect batching is just to avoid memory issues for wide rows, in which case
> allowPartialResults could be a better, simpler interface to tell HBase not
> to overflow a small buffer with wide rows. Though it looks like at the
> moment that doesn't happen.
>
> As an app developer myself, the interaction between batch, caching,
> maxResultSize, allowPartialResults is confusing (as well as the similarly
> named cacheBlocks). The names could be better (maxResultSize has actually
> nothing to do with the maximum size of the Results returned - it's rather
> the internal buffer size, batch is a max Cells per Result [I think. does
> it reset between rows?], and of course caching is an internal
> maxRowsPerRPC). The documentation is limited and out of date (e.g.
> https://hbase.apache.org/book.html#perf.hbase.client.caching). Some things
> could at least be more consistent (getAllowPartialResults instead of
> isAllowPartialResults like almost all the other boolean properties).
>
> As I poke around and write this out, I guess I'd argue instead that it's
> time (or past time) to clean up the Scan API and document it more clearly.
> Which is a scary task I know. But for a newcomer, it's a scary API right
> now.
>
> What about something like:
> Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
> - though this is still confusing because it's common to go over this size)
> Scan.limitRows (instead of caching - along with true client side support)
> Scan.allowPartialResults (to indicate it's ok to break up rows across
> Results. it is transmitted to the server to indicate stop adding Cells to
> the buffer as soon as it fills rather than at the end of the row. if a
> client needs true pagination for Cells within a row it can be done with a
> Filter.)
> Scan.cacheBlocks (less confusing without other things called "caching")
>

I think you've identified the next logical follow-on to the work that
JonathanL, JoshE, Lars et al. have been at in Scanners recently. A Scanner
2.0 project sounds good to me (with appropriate bridging from old API
through to the new -- configs and docs too).

St.Ack

> Dave
>
> On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <[email protected]> wrote:
>
> > Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave
> > it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> > that you'll only look at the first N rows returned. In that case you'd set
> > it to N. I thought we had renamed it from "caching" to "limit" but looking
> > at the code, that is not the case.
> >
> > In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> > backwards compatibility.
> >
> > -- Lars
> > From: Dave Latham <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, April 8, 2015 9:09 PM
> > Subject: remove scanner caching?
> >
> > After debugging a scans missing data issue while migrating to 0.98 (thanks
> > Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> > have both caching and maxResultSize for scans. It seems to be more client
> > api complexity than it's worth. Why would someone need to set caching when
> > maxResultSize is available? Indeed, the first patch proposed by some
> > fellow in HBASE-1996 simply replaced caching with maxResultSize. Can we
> > deprecate and eventually remove caching? Is there a good case for keeping
> > it in the client API surface?
> >
> > Dave
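
For anyone following the thread, the region-boundary point about a row limit can be illustrated with a small sketch. This is plain Java, not HBase code, and it rests on a stated assumption: a limit applied only on the server side is treated as restarting for each region, while a client-side counter is decremented as the scan moves from region to region. The names RowLimitSketch, scanServerSideOnly and scanWithClientLimit, and the list-of-lists stand-in for regions, are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowLimitSketch {

    // Assumed model (not HBase internals): every region applies the limit
    // on its own, so a scan crossing regions can return more than `limit` rows.
    public static List<String> scanServerSideOnly(List<List<String>> regions, int limit) {
        List<String> out = new ArrayList<>();
        for (List<String> region : regions) {
            int n = Math.min(limit, region.size());
            out.addAll(region.subList(0, n));
        }
        return out;
    }

    // The client keeps the remaining row count and decrements it as it
    // traverses regions, so the total never exceeds `limit`.
    public static List<String> scanWithClientLimit(List<List<String>> regions, int limit) {
        List<String> out = new ArrayList<>();
        int remaining = limit;
        for (List<String> region : regions) {
            if (remaining <= 0) {
                break; // limit already satisfied, skip later regions
            }
            int n = Math.min(remaining, region.size());
            out.addAll(region.subList(0, n));
            remaining -= n;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two hypothetical regions with three rows each.
        List<List<String>> regions = Arrays.asList(
            Arrays.asList("a1", "a2", "a3"),
            Arrays.asList("b1", "b2", "b3"));
        System.out.println(scanServerSideOnly(regions, 2));  // prints [a1, a2, b1, b2]
        System.out.println(scanWithClientLimit(regions, 2)); // prints [a1, a2]
    }
}
```

Under that assumption, a limit of 2 yields four rows when only the per-region limit is applied and two rows when the client tracks the remaining count, which is the behavior a "limitRows" option would presumably need.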
