I believe this is the latest status: https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
Suggest following up to [email protected] and/or [email protected].

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

>________________________________
>From: Matthew Tovbin <[email protected]>
>To: HBase User <[email protected]>
>Cc: Hbase Dev <[email protected]>
>Sent: Friday, September 30, 2011 5:49 AM
>Subject: Re: Hbase-Hive integration performance issues
>
>Hello guys,
>
>Any updates on the issue? Anyone?! ;))
>
>Best regards,
>    Matthew Tovbin =)
>
>
>
>On Tue, Sep 20, 2011 at 09:41, Matthew Tovbin <[email protected]> wrote:
>
>> Thanks Jean and Sandy.
>>
>> I have hive 0.7.1, and according to this patch
>> https://issues.apache.org/jira/browse/HIVE-1226 at least exact match
>> queries like "...where id = '12345'-123' " or partial pushdown "...where id
>> like "12345%" should work, but I didn't notice it.
>>
>> Matthew.
>>
>>
>>
>> On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <[email protected]> wrote:
>>
>>> I suffered the same let down a little while ago. I believe this is the
>>> relevant JIRA:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-1643
>>>
>>> I'd also like to see Hive be able to limit scans to particular HBase
>>> version ranges, but I don't know if that's even planned.
>>>
>>> Sandy
>>>
>>> > -----Original Message-----
>>> > From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
>>> > Sent: Monday, September 19, 2011 09:58
>>> > To: [email protected]
>>> > Subject: Re: Hbase-Hive integration performance issues
>>> >
>>> > (replying to user@, dev@ in BCC)
>>> >
>>> > AFAIK the HBase handler doesn't have the wits to understand that you are
>>> > doing a prefix scan and thus limit the scan to only the required rows.
>>> > There's a bunch of optimizations like that that need to be done.
>>> >
>>> > I'm pretty sure Pig does the same thing, but don't take my word on it.
>>> >
>>> > J-D
>>> >
>>> > On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <[email protected]> wrote:
>>> > > Hi guys,
>>> > >
>>> > > I've got a table in Hbase let's say "tbl" and I would like to query it
>>> > > using Hive. Therefore I mapped a table to hive as follows:
>>> > >
>>> > > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
>>> > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>> > > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
>>> > > TBLPROPERTIES("hbase.table.name" = "tbl");
>>> > >
>>> > > Queries like: "select * from tbl", "select id from tbl", "select id,
>>> > > data from tbl" are really fast.
>>> > > But queries like "select id from tbl where substr(id, 0, 5) = "12345""
>>> > > or "select id from tbl where data["777"] IS NOT NULL" are incredibly
>>> > > slow.
>>> > >
>>> > > On the contrary, when running from Hbase shell: "scan 'tbl', {
>>> > > COLUMNS=>'data', STARTROW=>'12345', ENDROW=>'12346'}" or "scan 'tbl', {
>>> > > COLUMNS=>'data', "FILTER" =>
>>> > > FilterList.new([qualifierFilter('777')])}"
>>> > > it is lightning fast!
>>> > >
>>> > > When I looked into the mapred job generated by hive on jobtracker I
>>> > > discovered that "map.input.records" counts ALL the items in Hbase
>>> > > table, meaning the job makes a full table scan before it even starts
>>> > > any mappers!!
>>> > > Moreover, I suspect it copies all the data from Hbase table to hdfs to
>>> > > mapper tmp input folder before execution.
>>> > >
>>> > > So, my questions are - Why does the hbase storage handler for hive not
>>> > > translate hive queries into appropriate hbase functions? Why does it
>>> > > scan all the records and then slice them using the "where" clause? How
>>> > > can it be improved? Is Pig's integration better in this case?
>>> > >
>>> > >
>>> > > Some additional information about the tables:
>>> > > Table description in Hbase:
>>> > > jruby-1.6.2 :011 > describe 'tbl'
>>> > > DESCRIPTION                                              ENABLED
>>> > > {NAME => 'users', FAMILIES => [{NAME => 'data',          true
>>> > > BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0',
>>> > > COMPRESSION => 'LZO', VERSIONS => '3',
>>> > > TTL => '2147483647', BLOCKSIZE => '65536',
>>> > > IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>>> > >
>>> > > Table description in Hive:
>>> > > hive> describe tbl;
>>> > > OK
>>> > > id      string                  from deserializer
>>> > > data    map<string,string>      from deserializer
>>> > > Time taken: 0.08 seconds
>>> > >
>>> > > Best regards,
>>> > >    Matthew Tovbin =)
>>> > >
>>>
>>
>>
>
>
>
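
For reference, the prefix-scan optimization discussed in this thread comes down to turning a row-key prefix into a start row and an exclusive stop row, which is what the HBase shell scan quoted above does by hand. The following is only a rough sketch of that arithmetic, not code from the Hive storage handler or from HIVE-1643; the function name `prefix_to_row_range` is hypothetical, and it assumes row keys are compared as raw bytes.

```python
from typing import Optional, Tuple


def prefix_to_row_range(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every row key that starts with prefix.

    Hypothetical helper for illustration only. stop_row is exclusive;
    None means "no upper bound, scan to the end of the table".
    """
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        # Empty prefix or a prefix made only of 0xFF bytes:
        # there is no finite upper bound.
        return prefix, None
    # Increment the last remaining byte to get the smallest key that
    # sorts after every key beginning with the prefix.
    stop_row = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop_row


if __name__ == "__main__":
    print(prefix_to_row_range(b"12345"))
```

For the prefix `12345` this yields the same bounds as the shell scan in the original message (start row `12345`, stop row `12346`). A storage handler would presumably have to recognize a predicate such as `id like "12345%"` before it could apply a range like this; whether and how Hive does that is what the JIRA issues above track.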
