It would, of course, have to be in increments of 1 row, and have a minimum of 1 row.
As they say, "patches welcome" :-)

On Fri, Nov 20, 2009 at 4:40 PM, Dave Latham <[email protected]> wrote:

Right, that's the problem with the current setting. If we change the
setting so that the buffer is measured in bytes, then I think there is a
decent 'one size fits all' setting, like 1MB. You'd still want to adjust it
in some cases, but I think it would be a lot better by default.

Dave

On Fri, Nov 20, 2009 at 4:36 PM, Ryan Rawson <[email protected]> wrote:

The problem with this setting is that there is no good 'one size fits all'
value. If we set it to 1, we do an RPC for every row, which is clearly not
efficient for small rows. If we set it to something as seemingly innocuous
as 5 or 10, then map reduces which do a significant amount of processing
on a row can cause the scanner to time out. The client code will also give
up if it's been more than 60 seconds since the scanner was last used; it's
possible this code might need to be adjusted so we can resume scanning.

I personally set it to anywhere between 1000-5000 for high performance
jobs on small rows.

The only factor is "can you process the cached chunk of rows in < 60s".
Set the value as large as possible without violating this and you'll
achieve max perf.

-ryan

On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <[email protected]> wrote:

Thanks for your thoughts. It's true you can configure the scan buffer rows
on an HTable or Scan instance, but I think there's something to be said
for trying to work as well as we can out of the box.

It would be more complicated, but not by much. To track the idea and see
what it would look like, I made an issue and attached a proposed patch.

Dave

On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <[email protected]> wrote:

And on the Scan, as I wrote in my answer, which is really, really
convenient.

Not convinced on using bytes as a value for caching... It would also be
more complicated.

J-D

On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <[email protected]> wrote:

You can set it on a per-HTable basis: HTable.setScannerCaching(int);

On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <[email protected]> wrote:

I have some tables with large rows and some tables with very small rows,
so I keep my default scanner caching at 1 row, but have to remember to set
it higher when scanning tables with smaller rows. It would be nice to have
a default that did something reasonable across tables.

Would it make sense to set scanner caching as a count of bytes rather than
a count of rows? That would make it similar to the write buffer for
batches of Puts, which gets flushed based on size rather than a fixed
number of Puts. Then there could be some default value which should
provide decent performance out of the box.

Dave

On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <[email protected]> wrote:

To set this per scan you should be able to do:

  Scan s = new Scan();
  s.setCaching(...);

(I think this works, anyway.)

The other thing that I've found useful is using a PageFilter on scans:

http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html

I believe this is applied independently on each region server (?), so you
still need to do your own counting when iterating the results, but it can
be used to early out on the server side separately from the scanner
caching value.

--gh

On Fri, Nov 20, 2009 at 3:04 PM, stack <[email protected]> wrote:

There is this in the configuration:

  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
    <description>Number of rows that will be fetched when calling next
    on a scanner if it is not served from memory. Higher caching values
    will enable faster scanners but will eat up more memory, and some
    calls of next may take longer and longer times when the cache is
    empty.
    </description>
  </property>

Being able to do it per Scan sounds like something we should add.

St.Ack

On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein <[email protected]> wrote:

Hi,
Is there a way to specify a limit on the number of returned records for a
scan? I don't see any way to do this when building the scan. If there is,
that would be great. If not, what about when iterating over the result? If
I exit the loop when I reach my limit, will that approximate this clause?
I guess my real question is about how scan is implemented in the client,
i.e. how many records are returned from HBase at a time as I iterate
through the scan result? If I want 1,000 records and 100 get returned at a
time, then I'm in good shape. On the other hand, if I want 10 records and
get 100 at a time, it's a bit wasteful, though the waste is bounded.

Thanks,
Adam
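
Pulling the thread's suggestions together, here is a minimal sketch of a
bounded scan against the 0.20-era client API. The table name "mytable",
the caching value of 100, and the 1,000-row limit are made-up
illustrations, not values from the thread:

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class BoundedScan {
    public static void main(String[] args) throws IOException {
      // Picks up hbase.client.scanner.caching (default 1) from hbase-site.xml.
      HTable table = new HTable(new HBaseConfiguration(), "mytable"); // hypothetical table

      Scan scan = new Scan();
      scan.setCaching(100); // per-Scan override: rows fetched per next() RPC

      ResultScanner scanner = table.getScanner(scan);
      try {
        int count = 0;
        for (Result row : scanner) {
          // ...process the row...
          if (++count >= 1000) {
            break; // client-side limit: just exit the loop, as Adam suggested
          }
        }
      } finally {
        scanner.close(); // releases the server-side scanner
      }
    }
  }

Calling table.setScannerCaching(100) instead would set the same value for
every scan opened through that HTable instance.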

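And a sketch of Gary's PageFilter suggestion under the same assumptions;
the firstRows helper is hypothetical, and since the filter appears to
apply independently on each region server, the client still counts:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.PageFilter;

  public class FirstRows {
    // Hypothetical helper: return at most 'limit' rows from the start of the table.
    public static List<Result> firstRows(HTable table, int limit) throws IOException {
      Scan scan = new Scan();
      scan.setFilter(new PageFilter(limit)); // each region server stops after ~limit rows
      scan.setCaching(limit);                // try to fetch the whole page in one RPC
      List<Result> rows = new ArrayList<Result>();
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          rows.add(r);
          if (rows.size() >= limit) {
            break; // PageFilter is per-region-server, so enforce the limit here too
          }
        }
      } finally {
        scanner.close();
      }
      return rows;
    }
  }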