Cool, let me know what you think of the patch here: https://issues.apache.org/jira/browse/HBASE-1996
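For context on what the patch proposes: the thread below converges on sizing the scanner buffer in bytes rather than rows, the way the write buffer batches Puts by size. Here is a rough illustration of that kind of byte-budget batching. This is my own sketch, not the code attached to HBASE-1996, and the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not the HBASE-1996 patch. Buffers rows until a
// byte budget is reached, the way a byte-based scanner cache would decide
// when to stop fetching. All names here are invented.
public class ByteBudgetBuffer {
    private final long maxBytes;   // e.g. the 1MB default floated in the thread
    private final List<byte[]> rows = new ArrayList<byte[]>();
    private long bufferedBytes = 0;

    public ByteBudgetBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Returns true once the buffer is full and should be handed to the caller.
    public boolean add(byte[] row) {
        rows.add(row);
        bufferedBytes += row.length;
        // The budget is checked after adding, so even a single row larger
        // than maxBytes still makes progress (the "minimum of 1 row" point).
        return bufferedBytes >= maxBytes;
    }

    public int rowCount() { return rows.size(); }
    public long byteCount() { return bufferedBytes; }
}
```

With a fixed byte budget, small rows batch into the thousands per RPC while large rows batch in the single digits, which is the 'one size fits all' behavior the byte-based proposal is after.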
On Fri, Nov 20, 2009 at 4:45 PM, Ryan Rawson <[email protected]> wrote:
> It would, of course, have to be in increments of 1 row, and have a
> minimum of 1 row.
>
> As they say, "patches welcome" :-)
>
> On Fri, Nov 20, 2009 at 4:40 PM, Dave Latham <[email protected]> wrote:
> > Right, that's the problem with the current setting. If we change the
> > setting so that the buffer is measured in bytes, then I think there is a
> > decent 'one size fits all' setting, like 1MB. You'd still want to adjust it
> > in some cases, but I think it would be a lot better by default.
> >
> > Dave
> >
> > On Fri, Nov 20, 2009 at 4:36 PM, Ryan Rawson <[email protected]> wrote:
> >
> >> The problem with this setting is that there is no good 'one size fits all'
> >> value. If we set it to 1, we do an RPC for every row, clearly not
> >> efficient for small rows. If we set it to something as seemingly
> >> innocuous as 5 or 10, then map reduces which do a significant amount
> >> of processing on a row can cause the scanner to time out. The client
> >> code will also give up if it's been more than 60 seconds since the
> >> scanner was last used; it's possible this code might need to be
> >> adjusted so we can resume scanning.
> >>
> >> I personally set it to anywhere between 1000-5000 for high performance
> >> jobs on small rows.
> >>
> >> The only factor is "can you process the cached chunk of rows in <
> >> 60s". Set the value as large as possible to not violate this and
> >> you'll achieve max perf.
> >>
> >> -ryan
> >>
> >> On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <[email protected]> wrote:
> >> > Thanks for your thoughts. It's true you can configure the scan buffer rows
> >> > on an HTable or Scan instance, but I think there's something to be said to
> >> > try to work as well as we can out of the box.
> >> >
> >> > It would be more complication, but not by much.
> >> > To track the idea and see what it would look like, I made an issue and
> >> > attached a proposed patch.
> >> >
> >> > Dave
> >> >
> >> > On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >
> >> >> And on the Scan, as I wrote in my answer, which is really really convenient.
> >> >>
> >> >> Not convinced on using bytes as a value for caching... It would also be
> >> >> more complicated.
> >> >>
> >> >> J-D
> >> >>
> >> >> On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <[email protected]> wrote:
> >> >> > You can set it on a per-HTable basis: HTable.setScannerCaching(int);
> >> >> >
> >> >> > On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <[email protected]> wrote:
> >> >> >> I have some tables with large rows and some tables with very small rows, so
> >> >> >> I keep my default scanner caching at 1 row, but have to remember to set it
> >> >> >> higher when scanning tables with smaller rows. It would be nice to have a
> >> >> >> default that did something reasonable across tables.
> >> >> >>
> >> >> >> Would it make sense to set scanner caching as a count of bytes rather than a
> >> >> >> count of rows? That would make it similar to the write buffer for batches
> >> >> >> of puts that get flushed based on size rather than a fixed number of Puts.
> >> >> >> Then there could be some default value which should provide decent
> >> >> >> performance out of the box.
> >> >> >>
> >> >> >> Dave
> >> >> >>
> >> >> >> On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <[email protected]> wrote:
> >> >> >>
> >> >> >>> To set this per scan you should be able to do:
> >> >> >>>
> >> >> >>> Scan s = new Scan()
> >> >> >>> s.setCaching(...)
> >> >> >>>
> >> >> >>> (I think this works anyway)
> >> >> >>>
> >> >> >>> The other thing that I've found useful is using a PageFilter on scans:
> >> >> >>>
> >> >> >>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html
> >> >> >>>
> >> >> >>> I believe this is applied independently on each region server (?) so you
> >> >> >>> still need to do your own counting in iterating the results, but it can be
> >> >> >>> used to early out on the server side separately from the scanner caching
> >> >> >>> value.
> >> >> >>>
> >> >> >>> --gh
> >> >> >>>
> >> >> >>> On Fri, Nov 20, 2009 at 3:04 PM, stack <[email protected]> wrote:
> >> >> >>>
> >> >> >>> > There is this in the configuration:
> >> >> >>> >
> >> >> >>> > <property>
> >> >> >>> >   <name>hbase.client.scanner.caching</name>
> >> >> >>> >   <value>1</value>
> >> >> >>> >   <description>Number of rows that will be fetched when calling next
> >> >> >>> >   on a scanner if it is not served from memory. Higher caching values
> >> >> >>> >   will enable faster scanners but will eat up more memory and some
> >> >> >>> >   calls of next may take longer and longer times when the cache is
> >> >> >>> >   empty.</description>
> >> >> >>> > </property>
> >> >> >>> >
> >> >> >>> > Being able to do it per Scan sounds like something we should add.
> >> >> >>> >
> >> >> >>> > St.Ack
> >> >> >>> >
> >> >> >>> > On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein <[email protected]> wrote:
> >> >> >>> >
> >> >> >>> > > Hi,
> >> >> >>> > > Is there a way to specify a limit on number of returned records for scan?
> >> >> >>> > > I don't see any way to do this when building the scan. If there is, that
> >> >> >>> > > would be great. If not, what about when iterating over the result?
> >> >> >>> > > If I exit the loop when I reach my limit, will that approximate this
> >> >> >>> > > clause? I guess my real question is about how scan is implemented in the
> >> >> >>> > > client. I.e., how many records are returned from HBase at a time as I
> >> >> >>> > > iterate through the scan result? If I want 1,000 records and 100 get
> >> >> >>> > > returned at a time, then I'm in good shape. On the other hand, if I want
> >> >> >>> > > 10 records and get 100 at a time, it's a bit wasteful, though the waste
> >> >> >>> > > is bounded.
> >> >> >>> > >
> >> >> >>> > > Thanks,
> >> >> >>> > > Adam
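Ryan's sizing rule above ("set it as large as possible while still processing each cached chunk in under 60s") reduces to simple arithmetic. A hedged sketch: the 60-second figure is the client-side scanner timeout mentioned in the thread, and the helper name is mine:

```java
// Sketch of the sizing rule discussed above: the cached chunk of rows must be
// processed before the ~60s client scanner timeout, so cap the caching value
// at timeout / per-row processing time. Helper name is invented.
public class ScannerCachingSizer {
    public static long maxCaching(long timeoutMillis, long millisPerRow) {
        return Math.max(1L, timeoutMillis / millisPerRow);   // at least 1 row per RPC
    }
}
```

At 10ms of processing per row this suggests up to 6000 rows per fetch, consistent with the 1000-5000 Ryan reports using; a heavy map-reduce job spending 5s per row is capped at 12.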

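On the original question of bailing out of the loop early: yes -- the client fetches batches of `caching` rows per RPC, so breaking out of the iteration at your limit works, and the waste is bounded by a single batch. A generic sketch of that client-side counting, with a plain iterator standing in for the ResultScanner (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Generic stand-in for "do your own counting while iterating the results":
// take at most `limit` items from an iterator, as one would from a scanner,
// breaking out of the loop once the limit is reached.
public class LimitedScan {
    public static <T> List<T> takeAtMost(Iterator<T> results, int limit) {
        List<T> out = new ArrayList<T>();
        while (out.size() < limit && results.hasNext()) {
            out.add(results.next());
        }
        return out;   // at most one extra prefetched batch is ever wasted
    }
}
```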