It would, of course, have to be in increments of 1 row, and have a
minimum of 1 row.

As they say, "patches welcome" :-)
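
In the meantime, here's roughly what a client-side approximation could look
like against the 0.20 API: sample a few rows to estimate the average row
size, then derive a row-count caching value from a byte budget, never going
below 1 row. Untested sketch -- the ByteBudgetCaching class, the
estimateCaching name, and the sample/budget parameters are just made up for
illustration:

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ByteBudgetCaching {

  // Sample a few rows to estimate average row size, then derive a row-count
  // caching value from a byte budget, with a minimum of 1 row.
  public static int estimateCaching(HTable table, Scan probe, long budgetBytes,
      int sampleRows) throws IOException {
    probe.setCaching(sampleRows);
    ResultScanner scanner = table.getScanner(probe);
    long sampledBytes = 0;
    int sampled = 0;
    try {
      for (Result r : scanner) {
        for (KeyValue kv : r.raw()) {
          sampledBytes += kv.getLength();   // approximate size of this row
        }
        if (++sampled >= sampleRows) {
          break;
        }
      }
    } finally {
      scanner.close();
    }
    if (sampled == 0 || sampledBytes == 0) {
      return 1;  // empty table or empty rows: fall back to the minimum
    }
    long avgRowBytes = Math.max(1L, sampledBytes / sampled);
    return (int) Math.min(Integer.MAX_VALUE,
        Math.max(1L, budgetBytes / avgRowBytes));
  }
}

A caller could then do something like
scan.setCaching(estimateCaching(table, new Scan(), 1L << 20, 10)) to aim for
roughly 1MB per next() call.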

On Fri, Nov 20, 2009 at 4:40 PM, Dave Latham <[email protected]> wrote:
> Right, that's the problem with the current setting.  If we change the
> setting so that the buffer is measured in bytes, then I think there is a
> decent 'one size fits all' setting, like 1MB.  You'd still want to adjust it
> in some cases, but I think it would be a lot better by default.
>
> Dave
>
> On Fri, Nov 20, 2009 at 4:36 PM, Ryan Rawson <[email protected]> wrote:
>
>> The problem with this setting is that there is no good 'one size fits
>> all' value.  If we set it to 1, we do an RPC for every row, clearly not
>> efficient for small rows.  If we set it to something as seemingly
>> innocuous as 5 or 10, then map reduce jobs which do a significant amount
>> of processing on a row can cause the scanner to time out.  The client
>> code will also give up if it's been more than 60 seconds since the
>> scanner was last used; it's possible this code might need to be
>> adjusted so we can resume scanning.
>>
>> I personally set it to anywhere between 1000-5000 for high performance
>> jobs on small rows.
>>
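>> For example, something along these lines (untested; assumes an
>> already-open HTable named table):
>>
>> Scan scan = new Scan();
>> scan.setCaching(1000);                      // rows per next() RPC
>> ResultScanner rs = table.getScanner(scan);
>> // or, for every scanner created from this HTable:
>> // table.setScannerCaching(1000);
>>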
>> The only factor is "can you process the cached chunk of rows in  <
>> 60s".  Set the value as large as possible to not violate this and
>> you'll achieve max perf.
>>
>> -ryan
>>
>> On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <[email protected]> wrote:
>> > Thanks for your thoughts.  It's true you can configure the scan buffer
>> > rows on an HTable or Scan instance, but I think there's something to be
>> > said for trying to work as well as we can out of the box.
>> >
>> > It would be more complicated, but not by much.  To track the idea and see
>> > what it would look like, I made an issue and attached a proposed patch.
>> >
>> > Dave
>> >
>> > On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> >
>> >> And on the Scan, as I wrote in my answer, which is really really
>> >> convenient.
>> >>
>> >> Not convinced on using bytes as a value for caching... It would be
>> >> also more complicated.
>> >>
>> >> J-D
>> >>
>> >> On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <[email protected]> wrote:
>> >> > You can set it on a per-HTable basis.  HTable.setScannerCaching(int);
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <[email protected]> wrote:
>> >> >> I have some tables with large rows and some tables with very small
>> >> >> rows, so I keep my default scanner caching at 1 row, but have to
>> >> >> remember to set it higher when scanning tables with smaller rows.  It
>> >> >> would be nice to have a default that did something reasonable across
>> >> >> tables.
>> >> >>
>> >> >> Would it make sense to set scanner caching as a count of bytes rather
>> >> >> than a count of rows?  That would make it similar to the write buffer
>> >> >> for batches of puts that get flushed based on size rather than a fixed
>> >> >> number of Puts.  Then there could be some default value which should
>> >> >> provide decent performance out of the box.
>> >> >>
>> >> >> Dave
>> >> >>
>> >> >> On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <[email protected]> wrote:
>> >> >>
>> >> >>> To set this per scan you should be able to do:
>> >> >>>
>> >> >>> Scan s = new Scan();
>> >> >>> s.setCaching(...);
>> >> >>>
>> >> >>> (I think this works anyway)
>> >> >>>
>> >> >>>
>> >> >>> The other thing that I've found useful is using a PageFilter on
>> >> >>> scans:
>> >> >>>
>> >> >>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html
>> >> >>>
>> >> >>> I believe this is applied independently on each region server (?) so
>> >> >>> you still need to do your own counting when iterating over the
>> >> >>> results, but it can be used to early out on the server side separately
>> >> >>> from the scanner caching value.
>> >> >>>
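>> >> >>> For example, something like this (untested; assumes an existing HTable
>> >> >>> named table and a limit of 10 rows):
>> >> >>>
>> >> >>> Scan s = new Scan();
>> >> >>> s.setCaching(10);                  // don't fetch more per RPC than needed
>> >> >>> s.setFilter(new PageFilter(10));   // lets each region server stop early
>> >> >>> ResultScanner scanner = table.getScanner(s);
>> >> >>> int count = 0;
>> >> >>> for (Result r : scanner) {
>> >> >>>   if (++count > 10) break;         // still enforce the limit client-side
>> >> >>>   // process r
>> >> >>> }
>> >> >>> scanner.close();
>> >> >>>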
>> >> >>> --gh
>> >> >>>
>> >> >>> On Fri, Nov 20, 2009 at 3:04 PM, stack <[email protected]> wrote:
>> >> >>>
>> >> >>> > There is this in the configuration:
>> >> >>> >
>> >> >>> >  <property>
>> >> >>> >    <name>hbase.client.scanner.caching</name>
>> >> >>> >    <value>1</value>
>> >> >>> >    <description>Number of rows that will be fetched when calling
>> >> >>> >    next on a scanner if it is not served from memory. Higher caching
>> >> >>> >    values will enable faster scanners but will eat up more memory
>> >> >>> >    and some calls of next may take longer and longer times when the
>> >> >>> >    cache is empty.
>> >> >>> >    </description>
>> >> >>> >  </property>
>> >> >>> >
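>> >> >>> > You could also set it in code on the client's configuration, e.g.
>> >> >>> > (untested; the table name is only an example):
>> >> >>> >
>> >> >>> > HBaseConfiguration conf = new HBaseConfiguration();
>> >> >>> > conf.setInt("hbase.client.scanner.caching", 100);
>> >> >>> > HTable table = new HTable(conf, "mytable");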
>> >> >>> >
>> >> >>> > Being able to do it per Scan sounds like something we should add.
>> >> >>> >
>> >> >>> > St.Ack
>> >> >>> >
>> >> >>> >
>> >> >>> > On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein <[email protected]> wrote:
>> >> >>> >
>> >> >>> > > Hi,
>> >> >>> > > Is there a way to specify a limit on the number of returned records
>> >> >>> > > for a scan?  I don't see any way to do this when building the scan.
>> >> >>> > > If there is, that would be great.  If not, what about when iterating
>> >> >>> > > over the result?  If I exit the loop when I reach my limit, will
>> >> >>> > > that approximate this clause?  I guess my real question is about how
>> >> >>> > > scan is implemented in the client.  I.e. how many records are
>> >> >>> > > returned from HBase at a time as I iterate through the scan result?
>> >> >>> > > If I want 1,000 records and 100 get returned at a time, then I'm in
>> >> >>> > > good shape.  On the other hand, if I want 10 records and get 100 at
>> >> >>> > > a time, it's a bit wasteful, though the waste is bounded.
>> >> >>> > >
>> >> >>> > > Thanks,
>> >> >>> > > Adam
>> >> >>> > >
>> >> >>> >
>> >> >>>
>> >> >>
>> >> >
>> >>
>> >
>>
>
