[
https://issues.apache.org/jira/browse/HBASE-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Erik Rozendaal updated HBASE-1996:
----------------------------------
Attachment: 1996-0.20.3.patch
This patch limits the result of a single call to a scanner's next method to
1 MB. I couldn't get "minimum N rows, minimum M bytes" to work without
changing the protocol, so the limit is now "maximum N rows, maximum M bytes",
where M is hardcoded to 1 MB.
This allows me to set a scanner's caching to Integer.MAX_VALUE without getting
any OOMs on the region server. Each call to next then returns only ~1 MB of data.
Scanning performance is very high (I get 20+ MB/second on my Core2Duo 2.4 GHz
laptop going to HBase through a web server... so more like 40+ MB/second on the
HBase side).
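To illustrate the "maximum N rows, maximum M bytes" rule described above, here is a minimal, self-contained sketch. It is not the code from the attached patch: the class ScanBatchLimiter, the method nextBatch, and the modelling of a row as a plain byte[] are all assumptions made for illustration. In this sketch the byte limit is checked before each row is added, so a batch always contains at least one row and may overshoot the limit by at most one row.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of HBase: caps one scanner "next" call
// by both a row count and a byte budget.
final class ScanBatchLimiter {
    // Mirrors the hardcoded 1 MB limit mentioned in the comment above.
    static final long MAX_BATCH_BYTES = 1024L * 1024L;

    // Collects rows until maxRows rows are gathered, the byte budget is
    // reached, or the source runs out of rows, whichever comes first.
    static List<byte[]> nextBatch(Iterator<byte[]> rows, int maxRows, long maxBytes) {
        List<byte[]> batch = new ArrayList<>();
        long bytes = 0;
        while (batch.size() < maxRows && bytes < maxBytes && rows.hasNext()) {
            byte[] row = rows.next();
            batch.add(row);
            bytes += row.length; // assumption: row size is its length in bytes
        }
        return batch;
    }
}
```

With maxRows set to Integer.MAX_VALUE, only the byte budget ends a batch, which matches the setup described above where caching is effectively unlimited and memory use stays bounded.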
> Configure scanner buffer in bytes instead of number of rows
> -----------------------------------------------------------
>
> Key: HBASE-1996
> URL: https://issues.apache.org/jira/browse/HBASE-1996
> Project: Hadoop HBase
> Issue Type: Improvement
> Reporter: Dave Latham
> Assignee: Dave Latham
> Fix For: 0.21.0
>
> Attachments: 1966.patch, 1996-0.20.3.patch
>
>
> Currently, the default scanner fetches a single row at a time. This makes
> for very slow scans on tables where the rows are not large. You can change
> the setting for an HTable instance or for each Scan.
> It would be better to have a default that performs reasonably well so that
> people stop running into slow scans because they are evaluating HBase, aren't
> familiar with the setting, or simply forgot. Unfortunately, if we increase
> the value of the current setting, then we run the risk of running OOM for
> tables with large rows. Let's change the setting so that it works with a
> size in bytes, rather than in rows. This will allow us to set a reasonable
> default so that tables with small rows will scan quickly and tables with
> large rows will not run OOM.
> Note that the case is very similar to table writes as well. When disabling
> auto flush, we buffer a list of Puts to commit at once. That buffer is
> measured in bytes, so that a small number of large Puts or a lot of small
> Puts can each fit in a single flush. If that buffer were measured in number
> of Puts, it would have the same problem that we have for the scan buffer, and
> we wouldn't be able to set a good default value for tables with rows of
> different sizes. Changing the scan buffer to be configured like the write buffer
> will make it more consistent.
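
The write-buffer behaviour described in the issue can be sketched as follows. This is only an illustration of a buffer measured in bytes rather than in number of items; the class ByteBoundedWriteBuffer and its methods are hypothetical and are not the HBase client implementation. Each buffered write is modelled as a byte[].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not HBase code: a write buffer whose
// flush threshold is expressed in bytes instead of number of writes.
final class ByteBoundedWriteBuffer {
    private final long maxBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;

    ByteBoundedWriteBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Buffers one write and flushes once the buffered bytes exceed the limit.
    void add(byte[] write) {
        pending.add(write);
        pendingBytes += write.length;
        if (pendingBytes > maxBytes) {
            flush();
        }
    }

    // A real client would send the pending writes to the server here;
    // this sketch only clears the buffer and counts the flush.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        pendingBytes = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }

    int pendingCount() { return pending.size(); }
}
```

Because the threshold is in bytes, many small writes accumulate before a flush, while a single large write triggers a flush right away; one default value therefore works for both cases.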