Hi there,

I don't know the specifics of your environment, but ...

http://hbase.apache.org/book.html#perf.reading
11.8.2. Scan Attribute Selection

... describes paying attention to the number of columns you are returning, particularly when using HBase as a MR source. In short, returning only the columns you need means you are reducing the data transferred between the RS and the client and the number of KV's evaluated in the RS, etc.

On 9/13/12 10:12 AM, "Shengjie Min" <kelvin....@gmail.com> wrote:

>Hi,
>
>I found an interesting difference between hbase scan query.
>
>I have a hbase table which has a lot of columns in a single column family.
>eg. let's say I have a users table, then userid, username, email .... etc
>etc 15 fields all together are in the single columnFamily.
>
>if you are familiar with RDBMS,
>
>query 1: select * from users
>vs
>query 2: select userid, username from users
>
>in mysql, these two has a difference, the query 2 will be obviously faster,
>but two queries won't give you a huge difference from performance
>perspective.
>
>In Hbase, I noticed that:
>
>query 3: scan 'users', // this is basically return me all 15 fields
>vs
>query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']} // this is
>return me only two fields: userid , username
>
>query 3 here takes way longer than query 4, Given a big data set. In my
>test, I have around 1,000,000 user records. You are talking about query 3 -
>100 secs VS query 4 - a few secs.
>
>
>Can anybody explain to me, why the width of the resultset in HBASE can
>impact the performance that much?
>
>
>Shengjie Min
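
To put rough numbers on the point about data transferred, here is a minimal sketch in plain Python. It is a toy model, not the HBase client API: the `scan` function, the in-memory rows, and the twelve `fieldN` column names are all made up for illustration (only userid, username and email come from the thread), and the row count is kept small so it runs quickly. It only counts the cells and bytes a scan would hand back; it does not try to reproduce the timings reported above.

```python
# Toy model only: this is NOT the HBase client API, just in-memory dicts.
FAMILY = "cf"
# userid, username and email come from the thread; the other 12 names are hypothetical.
FIELDS = ["userid", "username", "email"] + [f"field{i}" for i in range(4, 16)]


def make_rows(n):
    """Build n fake user rows, each holding all 15 'cf:<field>' cells."""
    return [
        {f"{FAMILY}:{name}": f"{name}-{i}".encode() for name in FIELDS}
        for i in range(n)
    ]


def scan(rows, columns=None):
    """Return (results, cells, payload_bytes) for a simulated scan.

    columns=None stands in for a plain scan 'users';
    a list of names stands in for the COLUMNS=>[...] option.
    """
    results, cells, payload = [], 0, 0
    for row in rows:
        picked = {c: v for c, v in row.items() if columns is None or c in columns}
        results.append(picked)
        cells += len(picked)
        payload += sum(len(c) + len(v) for c, v in picked.items())
    return results, cells, payload


rows = make_rows(1000)
_, all_cells, all_bytes = scan(rows)
_, two_cells, two_bytes = scan(rows, ["cf:userid", "cf:username"])

print(all_cells, two_cells)  # 15 cells per row vs 2 cells per row
print(all_bytes, two_bytes)  # payload for this fake data; real values will differ
```

With 15 columns per row the unrestricted scan returns 7.5 times as many cells as the two-column scan, and the gap in bytes depends on how large the unselected values are. Scaled to around 1,000,000 rows, that is roughly 15,000,000 cells versus 2,000,000.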