Thanks Sandy, I'll try it too!

Best regards,
Matthew Tovbin =)
On Mon, Oct 3, 2011 at 22:36, Sandy Pratt <[email protected]> wrote:

> I've been working on this issue lately. I am beginning to deploy a
> modified version of the stock HBase serde to my own clusters. For one
> thing, it contains the code to push down scan ranges to HBase (see jira),
> and I've also adapted it to read my single-cell protobuf records via
> reflection. Once I've tested it on larger datasets, I'll see about getting
> something together that I can submit back to Hive. But for now the patch
> I posted on Jira should apply to trunk (and also cdh3u0, which I use, I
> think) and allow range scans on the rowkey to be pushed down (if it doesn't
> please let me know ;) ).
>
> Sandy
>
> > -----Original Message-----
> > From: Andrew Purtell [mailto:[email protected]]
> > Sent: Friday, September 30, 2011 09:50
> > To: [email protected]; HBase User
> > Subject: Re: Hbase-Hive integration performance issues
> >
> > I believe this is the latest status:
> >
> > https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >
> > Suggest following up to [email protected] and/or [email protected].
> >
> > Best regards,
> >
> > - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
> > >________________________________
> > >From: Matthew Tovbin <[email protected]>
> > >To: HBase User <[email protected]>
> > >Cc: Hbase Dev <[email protected]>
> > >Sent: Friday, September 30, 2011 5:49 AM
> > >Subject: Re: Hbase-Hive integration performance issues
> > >
> > >Hello guys,
> > >
> > >Any updates on the issue? Anyone?! ;))
> > >
> > >Best regards,
> > > Matthew Tovbin =)
> > >
> > >On Tue, Sep 20, 2011 at 09:41, Matthew Tovbin <[email protected]> wrote:
> > >
> > >> Thanks Jean and Sandy.
> > >>
> > >> I have hive 0.7.1, and according to this patch
> > >> https://issues.apache.org/jira/browse/HIVE-1226 at least exact match
> > >> queries like "...where id = '12345'-123' " or partial pushdown
> > >> "...where id like "12345%" should work, but I didn't notice it.
> > >>
> > >> Matthew.
> > >>
> > >> On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <[email protected]> wrote:
> > >>
> > >>> I suffered the same let down a little while ago. I believe this is
> > >>> the relevant JIRA:
> > >>>
> > >>> https://issues.apache.org/jira/browse/HIVE-1643
> > >>>
> > >>> I'd also like to see Hive be able to limit scans to particular HBase
> > >>> version ranges, but I don't know if that's even planned.
> > >>>
> > >>> Sandy
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> > >>> > Sent: Monday, September 19, 2011 09:58
> > >>> > To: [email protected]
> > >>> > Subject: Re: Hbase-Hive integration performance issues
> > >>> >
> > >>> > (replying to user@, dev@ in BCC)
> > >>> >
> > >>> > AFAIK the HBase handler doesn't have the wits to understand that
> > >>> > you are doing a prefix scan and thus limit the scan to only the
> > >>> > required rows. There's a bunch of optimizations like that that
> > >>> > need to be done.
> > >>> >
> > >>> > I'm pretty sure Pig does the same thing, but don't take my word on it.
> > >>> >
> > >>> > J-D
> > >>> >
> > >>> > On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <[email protected]> wrote:
> > >>> > > Hi guys,
> > >>> > >
> > >>> > > I've got a table in Hbase let's say "tbl" and I would like to
> > >>> > > query it using Hive.
> > >>> > > Therefore I mapped a table to hive as follows:
> > >>> > >
> > >>> > > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
> > >>> > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > >>> > > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
> > >>> > > TBLPROPERTIES("hbase.table.name" = "tbl");
> > >>> > >
> > >>> > > Queries like: "select * from tbl", "select id from tbl", "select
> > >>> > > id, data from tbl" are really fast.
> > >>> > > But queries like "select id from tbl where substr(id, 0, 5) = "12345""
> > >>> > > or "select id from tbl where data["777"] IS NOT NULL" are
> > >>> > > incredibly slow.
> > >>> > >
> > >>> > > On the contrary, when running from Hbase shell: "scan 'tbl', {
> > >>> > > COLUMNS=>'data', STARTROW=>'12345', ENDROW=>'12346'}" or "scan
> > >>> > > 'tbl', { COLUMNS=>'data', "FILTER" =>
> > >>> > > FilterList.new([qualifierFilter('777')])}"
> > >>> > > it is lightning fast!
> > >>> > >
> > >>> > > When I looked into the mapred job generated by hive on
> > >>> > > jobtracker I discovered that "map.input.records" counts ALL the
> > >>> > > items in Hbase table, meaning the job makes a full table scan
> > >>> > > before it even starts any mappers!!
> > >>> > > Moreover, I suspect it copies all the data from Hbase table to
> > >>> > > hdfs to mapper tmp input folder before execution.
> > >>> > >
> > >>> > > So, my questions are - Why hbase storage handler for hive does
> > >>> > > not translate hive queries into appropriate hbase functions? Why
> > >>> > > it scans all the records and then slices them using "where"
> > >>> > > clause? How can it be improved? Is Pig's integration better in
> > >>> > > this case?
> > >>> > >
> > >>> > > Some additional information about the tables:
> > >>> > > Table description in Hbase:
> > >>> > > jruby-1.6.2 :011 > describe 'tbl'
> > >>> > > DESCRIPTION                                                ENABLED
> > >>> > > {NAME => 'users', FAMILIES => [{NAME => 'data',            true
> > >>> > > BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0',
> > >>> > > COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
> > >>> > > BLOCKSIZE => '65536', IN_MEMORY => 'false',
> > >>> > > BLOCKCACHE => 'true'}]}
> > >>> > >
> > >>> > > Table description in Hive:
> > >>> > > hive> describe tbl;
> > >>> > > OK
> > >>> > > id      string                  from deserializer
> > >>> > > data    map<string,string>      from deserializer
> > >>> > > Time taken: 0.08 seconds
> > >>> > >
> > >>> > > Best regards,
> > >>> > > Matthew Tovbin =)
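
For reference, the fast HBase shell scan in the original message bounds the scan to the rows between '12345' and '12346', which is the range a prefix predicate on the row key would have to be turned into for pushdown to help. Below is a minimal sketch in Python of deriving that range from a prefix. Note that prefix_scan_range is a hypothetical helper written for illustration, not part of Hive or HBase, and it assumes row keys are compared as raw bytes in lexicographic order.

```python
def prefix_scan_range(prefix: bytes):
    """Return (start_row, stop_row) covering every row key that starts with prefix.

    stop_row is exclusive; None means "scan to the end of the table".
    Hypothetical helper for illustration only, not a Hive or HBase API.
    """
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        # Empty prefix, or a prefix made only of 0xFF bytes: no finite upper bound.
        return prefix, None
    # Incrementing the last remaining byte gives the smallest key that sorts
    # after every key beginning with the prefix.
    stop = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop


if __name__ == "__main__":
    # The prefix from the thread: rows starting with "12345".
    print(prefix_scan_range(b"12345"))  # (b'12345', b'12346')
```

A storage handler that recognised a prefix condition on the row key could, in principle, pass such a pair to the scan as its start and stop rows instead of reading the whole table. Whether a given Hive version does so depends on the patches discussed above.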
