Cool, let me know what you think of the patch here: https://issues.apache.org/jira/browse/HBASE-1996
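For context on what the patch proposes: the thread below converges on sizing the scanner buffer in bytes rather than rows, the way the write buffer batches Puts by size. Here is a rough illustration of that kind of byte-budget batching. This is my own sketch, not the code attached to HBASE-1996, and the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not the HBASE-1996 patch. Buffers rows until a
// byte budget is reached, the way a byte-based scanner cache would decide
// when to stop fetching. All names here are invented.
public class ByteBudgetBuffer {
    private final long maxBytes;   // e.g. the 1MB default floated in the thread
    private final List<byte[]> rows = new ArrayList<byte[]>();
    private long bufferedBytes = 0;

    public ByteBudgetBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Returns true once the buffer is full and should be handed to the caller.
    public boolean add(byte[] row) {
        rows.add(row);
        bufferedBytes += row.length;
        // The budget is checked after adding, so even a single row larger
        // than maxBytes still makes progress (the "minimum of 1 row" point).
        return bufferedBytes >= maxBytes;
    }

    public int rowCount() { return rows.size(); }
    public long byteCount() { return bufferedBytes; }
}
```

With a fixed byte budget, small rows batch into the thousands per RPC while large rows batch in the single digits, which is the 'one size fits all' behavior the byte-based proposal is after.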
On Fri, Nov 20, 2009 at 4:45 PM, Ryan Rawson <[email protected]> wrote:
> It would, of course, have to be in increments of 1 row, and have a
> minimum of 1 row.
>
> As they say, "patches welcome" :-)
>
> On Fri, Nov 20, 2009 at 4:40 PM, Dave Latham <[email protected]> wrote:
> > Right, that's the problem with the current setting. If we change the
> > setting so that the buffer is measured in bytes, then I think there is a
> > decent 'one size fits all' setting, like 1MB. You'd still want to adjust it
> > in some cases, but I think it would be a lot better by default.
> >
> > Dave
> >
> > On Fri, Nov 20, 2009 at 4:36 PM, Ryan Rawson <[email protected]> wrote:
> >
> >> The problem with this setting is that there is no good 'one size fits all'
> >> value. If we set it to 1, we do an RPC for every row, clearly not
> >> efficient for small rows. If we set it to something as seemingly
> >> innocuous as 5 or 10, then map reduces which do a significant amount
> >> of processing on a row can cause the scanner to time out. The client
> >> code will also give up if it's been more than 60 seconds since the
> >> scanner was last used; it's possible this code might need to be
> >> adjusted so we can resume scanning.
> >>
> >> I personally set it to anywhere between 1000-5000 for high performance
> >> jobs on small rows.
> >>
> >> The only factor is "can you process the cached chunk of rows in <
> >> 60s". Set the value as large as possible to not violate this and
> >> you'll achieve max perf.
> >>
> >> -ryan
> >>
> >> On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <[email protected]> wrote:
> >> > Thanks for your thoughts. It's true you can configure the scan buffer rows
> >> > on an HTable or Scan instance, but I think there's something to be said to
> >> > try to work as well as we can out of the box.
> >> >
> >> > It would be more complication, but not by much.
> >> > To track the idea and see what it would look like, I made an issue and
> >> > attached a proposed patch.
> >> >
> >> > Dave
> >> >
> >> > On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >
> >> >> And on the Scan, as I wrote in my answer, which is really really convenient.
> >> >>
> >> >> Not convinced on using bytes as a value for caching... It would also be
> >> >> more complicated.
> >> >>
> >> >> J-D
> >> >>
> >> >> On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <[email protected]> wrote:
> >> >> > You can set it on a per-HTable basis: HTable.setScannerCaching(int);
> >> >> >
> >> >> > On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <[email protected]> wrote:
> >> >> >> I have some tables with large rows and some tables with very small rows, so
> >> >> >> I keep my default scanner caching at 1 row, but have to remember to set it
> >> >> >> higher when scanning tables with smaller rows. It would be nice to have a
> >> >> >> default that did something reasonable across tables.
> >> >> >>
> >> >> >> Would it make sense to set scanner caching as a count of bytes rather than a
> >> >> >> count of rows? That would make it similar to the write buffer for batches
> >> >> >> of puts that get flushed based on size rather than a fixed number of Puts.
> >> >> >> Then there could be some default value which should provide decent
> >> >> >> performance out of the box.
> >> >> >>
> >> >> >> Dave
> >> >> >>
> >> >> >> On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <[email protected]> wrote:
> >> >> >>
> >> >> >>> To set this per scan you should be able to do:
> >> >> >>>
> >> >> >>> Scan s = new Scan()
> >> >> >>> s.setCaching(...)
> >> >> >>>
> >> >> >>> (I think this works anyway)
> >> >> >>>
> >> >> >>> The other thing that I've found useful is using a PageFilter on scans:
> >> >> >>>
> >> >> >>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html
> >> >> >>>
> >> >> >>> I believe this is applied independently on each region server (?) so you
> >> >> >>> still need to do your own counting in iterating the results, but it can be
> >> >> >>> used to early out on the server side separately from the scanner caching
> >> >> >>> value.
> >> >> >>>
> >> >> >>> --gh
> >> >> >>>
> >> >> >>> On Fri, Nov 20, 2009 at 3:04 PM, stack <[email protected]> wrote:
> >> >> >>>
> >> >> >>> > There is this in the configuration:
> >> >> >>> >
> >> >> >>> > <property>
> >> >> >>> >   <name>hbase.client.scanner.caching</name>
> >> >> >>> >   <value>1</value>
> >> >> >>> >   <description>Number of rows that will be fetched when calling next
> >> >> >>> >   on a scanner if it is not served from memory. Higher caching values
> >> >> >>> >   will enable faster scanners but will eat up more memory and some
> >> >> >>> >   calls of next may take longer and longer times when the cache is
> >> >> >>> >   empty.</description>
> >> >> >>> > </property>
> >> >> >>> >
> >> >> >>> > Being able to do it per Scan sounds like something we should add.
> >> >> >>> >
> >> >> >>> > St.Ack
> >> >> >>> >
> >> >> >>> > On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein <[email protected]> wrote:
> >> >> >>> >
> >> >> >>> > > Hi,
> >> >> >>> > > Is there a way to specify a limit on number of returned records for scan?
> >> >> >>> > > I don't see any way to do this when building the scan. If there is, that
> >> >> >>> > > would be great. If not, what about when iterating over the result?
> >> >> >>> > > If I exit the loop when I reach my limit, will that approximate this
> >> >> >>> > > clause? I guess my real question is about how scan is implemented in the
> >> >> >>> > > client. I.e., how many records are returned from HBase at a time as I
> >> >> >>> > > iterate through the scan result? If I want 1,000 records and 100 get
> >> >> >>> > > returned at a time, then I'm in good shape. On the other hand, if I want
> >> >> >>> > > 10 records and get 100 at a time, it's a bit wasteful, though the waste
> >> >> >>> > > is bounded.
> >> >> >>> > >
> >> >> >>> > > Thanks,
> >> >> >>> > > Adam
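Ryan's sizing rule above ("set it as large as possible while still processing each cached chunk in under 60s") reduces to simple arithmetic. A hedged sketch: the 60-second figure is the client-side scanner timeout mentioned in the thread, and the helper name is mine:

```java
// Sketch of the sizing rule discussed above: the cached chunk of rows must be
// processed before the ~60s client scanner timeout, so cap the caching value
// at timeout / per-row processing time. Helper name is invented.
public class ScannerCachingSizer {
    public static long maxCaching(long timeoutMillis, long millisPerRow) {
        return Math.max(1L, timeoutMillis / millisPerRow);   // at least 1 row per RPC
    }
}
```

At 10ms of processing per row this suggests up to 6000 rows per fetch, consistent with the 1000-5000 Ryan reports using; a heavy map-reduce job spending 5s per row is capped at 12.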

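On the original question of bailing out of the loop early: yes -- the client fetches batches of `caching` rows per RPC, so breaking out of the iteration at your limit works, and the waste is bounded by a single batch. A generic sketch of that client-side counting, with a plain iterator standing in for the ResultScanner (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Generic stand-in for "do your own counting while iterating the results":
// take at most `limit` items from an iterator, as one would from a scanner,
// breaking out of the loop once the limit is reached.
public class LimitedScan {
    public static <T> List<T> takeAtMost(Iterator<T> results, int limit) {
        List<T> out = new ArrayList<T>();
        while (out.size() < limit && results.hasNext()) {
            out.add(results.next());
        }
        return out;   // at most one extra prefetched batch is ever wasted
    }
}
```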