Hey list,

Just a small tip for those who use scanners in HBase and whose processing takes more than 2-3 seconds per row: lower hbase.client.scanner.caching. When I wrote that feature, my tests showed me that a value of 30 gives the best trade-off between speed and memory consumption. 80% of the time, that's what you need. In the slow-processing case I first described, though, you will very likely hit scanner timeouts (or UnknownScannerException). Why? Some simple maths:

Default lease time: 60 secs
Example row processing time: 3 secs
Scanner prefetching value: 30

That means you will fetch 30 rows in a single batch on the first next(), take the 29 others directly from the client cache, and then go back to a region server for 30 more. Since 3*30 = 90 secs pass between those two requests and that's > 60, you get a scanner timeout. In one case recently, it was taking me more than 2 minutes per row (RSS crawling), so timeouts were inevitable.

You can set this value in hbase-site, in an HBaseConfiguration object, or with HTable.setScannerCaching.

J-D
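
P.S. If it helps, here is a rough sketch of the arithmetic above in plain Java. It does not touch the HBase API at all; the class and method names (ScannerCachingMath, secondsBetweenFetches, maxSafeCaching) are made up for illustration, and the "stay strictly under the lease" rule is my own simplifying assumption rather than anything HBase computes for you.

```java
public class ScannerCachingMath {

    // Time the client spends between two trips to the region server when
    // every cached row takes rowSeconds to process.
    static long secondsBetweenFetches(int caching, long rowSeconds) {
        return (long) caching * rowSeconds;
    }

    // Largest caching value whose whole batch is processed strictly inside
    // the lease (assumption: we want batch time < lease time).
    // Never returns less than 1; if a single row already takes longer than
    // the lease, a value of 1 will still time out.
    static int maxSafeCaching(long leaseSeconds, long rowSeconds) {
        if (leaseSeconds <= 0 || rowSeconds <= 0) {
            throw new IllegalArgumentException("lease and row time must be positive");
        }
        long rows = (leaseSeconds - 1) / rowSeconds;
        return (int) Math.max(1L, rows);
    }

    public static void main(String[] args) {
        long lease = 60;      // default lease time from the post, in seconds
        long perRow = 3;      // example row processing time, in seconds
        int caching = 30;     // scanner caching value from the post

        long gap = secondsBetweenFetches(caching, perRow);
        System.out.println("Seconds between fetches: " + gap);
        System.out.println("Exceeds lease: " + (gap > lease));
        System.out.println("Suggested caching: " + maxSafeCaching(lease, perRow));
    }
}
```

Whatever number comes out is only a starting point for hbase.client.scanner.caching or HTable.setScannerCaching; leave yourself some margin, since row processing times are rarely constant.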
