Re: Scan performance

lars hofhansl Sat, 22 Jun 2013 06:30:10 -0700

Yep, generally you should design your keys such that start/stopKey can
efficiently narrow the scope.
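
To illustrate the point about key design, here is a minimal sketch in plain Java (no HBase involved). It models a table as a sorted set of row keys and a scan as a range over that set. The class name, the data, and the alternative key order vid,event,sid are assumptions made for illustration only; whether reordering the key is acceptable depends on the other queries the table has to serve.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

final class KeyOrderSketch {
    // Rows a scan would visit: startKey inclusive, stopKey exclusive.
    static NavigableSet<String> scan(NavigableSet<String> table, String startKey, String stopKey) {
        return table.subSet(startKey, true, stopKey, false);
    }

    // Hypothetical data: one visit (v1) with numSessions sessions, three events each.
    static NavigableSet<String> build(int numSessions, boolean eventBeforeSid) {
        NavigableSet<String> table = new TreeSet<>();
        String[] events = {"Logoff", "Logon", "View"};
        for (int s = 0; s < numSessions; s++) {
            String sid = String.format("s%04d", s);
            for (String event : events) {
                table.add(eventBeforeSid
                        ? "v1," + event + "," + sid   // assumed alternative layout: vid,event,sid
                        : "v1," + sid + "," + event); // layout from the thread: vid,sid,event
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // sid is unknown, so with vid,sid,event the range has to cover the whole visit.
        int wide = scan(build(2000, false), "v1,", "v1,~").size();
        // With vid,event,sid the known event becomes part of the key prefix.
        int narrow = scan(build(2000, true), "v1,Logon,", "v1,Logon,~").size();
        System.out.println(wide + " rows in range vs " + narrow);
    }
}
```

In this toy data set the first range covers every row of the visit, while the second covers only the Logon rows, because the unknown part of the key has moved to the end.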


If that really cannot be done (and you should try hard), the second-best option
is "skip scans".

Filters in HBase allow for providing the scanner framework with hints where to 
go next.
They can skip to the next column (to avoid looking at many versions), to the 
next row (to avoid looking at many columns), or they can provide a custom seek 
hint to a specific key value. The latter is what FuzzyRowFilter does.
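
The seek-hint idea can be sketched in plain Java as a simulation over a sorted set of keys. This is not the HBase Filter API and not how FuzzyRowFilter is implemented; the class, the method names, and the data are hypothetical. It assumes keys of the form vid,sid,event with a fixed-width sid and event names that sort before '~'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class SkipScanSketch {
    static int examined; // number of keys the last scan actually looked at

    // Returns null if the key matches "vid,<any sid>,target"; otherwise the
    // smallest key that could still match, which the scanner may seek to.
    static String seekHint(String key, String target) {
        String[] p = key.split(",");                 // [vid, sid, event]
        if (p[2].equals(target)) return null;
        if (p[2].compareTo(target) < 0) return p[0] + "," + p[1] + "," + target;
        return p[0] + "," + p[1] + ",~";             // just past this sid's events
    }

    static List<String> scan(TreeSet<String> table, String start, String stop,
                             String target, boolean useHints) {
        examined = 0;
        List<String> hits = new ArrayList<>();
        String cur = table.ceiling(start);
        while (cur != null && cur.compareTo(stop) < 0) {
            examined++;
            String hint = seekHint(cur, target);
            if (hint == null) hits.add(cur);
            // With hints, jump straight to the first key >= hint; otherwise step to the next key.
            cur = (hint != null && useHints) ? table.ceiling(hint) : table.higher(cur);
        }
        return hits;
    }

    // Hypothetical data: one visit, 30 events per session, exactly one of them "Logon".
    static TreeSet<String> build(int sessions) {
        TreeSet<String> t = new TreeSet<>();
        for (int s = 0; s < sessions; s++) {
            String prefix = String.format("v1,s%04d,", s);
            for (int i = 0; i < 15; i++) t.add(prefix + String.format("Click%02d", i));
            t.add(prefix + "Logon");
            for (int i = 0; i < 14; i++) t.add(prefix + String.format("View%02d", i));
        }
        return t;
    }

    public static void main(String[] args) {
        TreeSet<String> table = build(200);
        int hits = scan(table, "v1,", "v1,~", "Logon", false).size();
        System.out.println("no hints:   " + hits + " hits, " + examined + " keys examined");
        hits = scan(table, "v1,", "v1,~", "Logon", true).size();
        System.out.println("with hints: " + hits + " hits, " + examined + " keys examined");
    }
}
```

Both scans return the same rows, but the hinted scan touches only about three keys per session instead of all thirty. The saving grows with the number of keys that can be jumped over per seek, so it helps little when there are only a few rows between matches.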


-- Lars



________________________________
 From: Anoop John <anoop.hb...@gmail.com>
To: user@hbase.apache.org 
Sent: Friday, June 21, 2013 11:58 PM
Subject: Re: Scan performance
 

Have a look at FuzzyRowFilter

-Anoop-

On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <tony.d...@sas.com> wrote:

> I understand more, but have additional questions about the internals...
>
> So, in this example I have 6000 rows X 40 columns in this table.  In this
> test my startRow and stopRow do not narrow the scan criteria; therefore all
> 6000x40 KVs must be included in the search and thus read from disk and into
> memory.
>
> The first filter that I used was:
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>  CompareFilter.CompareOp.EQUAL, value);
>
> This means that HBase must look for the qualifier column on all 6000 rows.
>  As you mention I could add certain columns to a different cf; but
> unfortunately, in my case there is no such small set of columns that will
> need to be compared (filtered on).  I could try to use indexes so that a
> complete row key can be calculated from a secondary index in order to
> perform a faster search against data in a primary table.  This requires
> additional tables and maintenance that I would like to avoid.
>
> I did try a row key filter with regex hoping that it would limit the
> number of rows that were read from disk.
> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
> RegexStringComparator(row_regexpr));
>
> My row keys are something like: vid,sid,event.  sid is not known at query
> time so I can use a regex similar to: vid,.*,Logon where Logon is the event
> that I am looking for in a particular visit.  In my test data this should
> have narrowed the scan to 1 row X 40 columns.  The best I could do for
> start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
> going to cause all 6000 rows to be scanned, but the filtering should be
> more specific with the rowKey filter.  However, I did not see any
> performance improvement.  Anything obvious?
>
> Do you have any other ideas to help out with performance when row key is:
> vid,sid,event and sid is not known at query time which leaves a gap in the
> start/stop row?  Too bad regex can't be used in start/stop row
> specification.  That's really what I need.
>
> Thanks again.
> -Tony
>
> -----Original Message-----
> From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
> Sent: Friday, June 21, 2013 8:00 PM
> To: user@hbase.apache.org; lars hofhansl
> Subject: RE: Scan performance
>
> Lars,
> I thought that a column family is the locality group, and that placing columns
> which are frequently accessed together into the same column family
> (locality group) is the obvious performance improvement tip. What are the
> "essential column families" for in this context?
>
> As for original question..  Unless you place your column into a separate
> column family in Table 2, you will need to scan (load from disk if not
> cached) ~ 40x more data for the second table (because you have 40 columns).
> This may explain why you see such a difference in execution time if all
> data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodio...@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [la...@apache.org]
> Sent: Friday, June 21, 2013 3:37 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV; a
> row is just a set of KVs that happen to have the same row key (which is the
> first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase still
> needs to skip over many KVs in order to "reach" the next row. So more disk
> and memory IO is needed.
>
> If you are using 0.94 there is a new feature, "essential column families". If
> you always search by the same column you can place that one in its own
> column family and all other columns in another column family. In that case
> your scan performance should be close to identical.
>
>
> -- Lars
> ________________________________
>
> From: Tony Dean <tony.d...@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on.   And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> // the start/stop rows should include all 6000 rows
> Scan scan = new Scan(startRow, stopRow);
> // only return the column that I am interested in
> // (should only be in 1 row and only 1 version)
> scan.addColumn(cf, qualifier);
>
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, MAX_LONG);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result: rs)
> {
>
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result.  Why is it taking so much longer
> on table 2 to get the same result?  The scan depth is the same.  The only
> difference is the column width.  But I'm filtering on a single column and
> returning only that column.
>
> Am I missing something?  As I increase the number of columns, the response
> time gets worse.  I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I'm returning only 1 column in both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704          ...
>
