Bah, you gotta be kidding me :P Something that might be a reasonable compromise is to print a warning about trailing whitespace on fields?
- Aaron On Fri, Jan 23, 2009 at 4:00 PM, Raghu Murthy <ra...@facebook.com> wrote: > We could add trim to hive load, but, unfortunately it has been patented by > IBM! > > "Method of automatically removing leading and trailing space characters > from data being entered into a database system" > > > http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1 > > &u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475 > ,086&RS=PN/7,475,086<http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475,086&RS=PN/7,475,086> > > "Retrieving data from a database system without leading and trailing > space characters" > > > http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p= > 1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=<http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=> > "20070282820".PGNR.&OS=DN/200 > 70282820&RS=DN/20070282820 > > > On 1/23/09 3:55 PM, "Aaron Kimball" <aa...@cloudera.com> wrote: > > > I solved the bug. As it turns out, there was trailing whitespace in some > of my > > input files. So Hive wasn't matching the column's contents correctly, > since > > Hive doesn't do an implicit trim() (my handwritten code did this). And of > > course, when I looked at the input files with less to verify their > contents, > > the whitespace was invisible :P > > > > - Aaron > > > > On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo <athu...@facebook.com> > wrote: > >> Aaron > >> > >> can you send the plan for this query.. > >> > >> explain extended <your query> > >> > >> This is very unexpected.. > >> > >> Thanks, > >> Ashish > >> ________________________________________ > >> From: Aaron Kimball [aa...@cloudera.com] > >> Sent: Friday, January 23, 2009 1:24 AM > >> To: hive-user@hadoop.apache.org > >> Subject: Re: equijoin with multiple columns? > >> > >> Not only can I verify that the rows are present, I broke down and coded > the > >> join in Java MapReduce by hand. Works. > >> - Aaron > >> > >> On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao > >> <zsh...@gmail.com<mailto:zsh...@gmail.com>> wrote: > >> This is unexpected. We did try similar cases before (with multiple join > >> keys). > >> > >> Can you verify by finding out the actual rows in each table that will > produce > >> a join result row? > >> > >> Zheng > >> > >> > >> On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball > >> <aa...@cloudera.com<mailto:aa...@cloudera.com>> wrote: > >> Hi Zheng, > >> > >> I was eliding some information to try to boil things down a bit -- but > here > >> goes. > >> > >> > >> To expand the white lie into the fuller debug dump: > >> table A ~ ak_filtered_ips > >> table B ~ ip_locations > >> > >> It's actually three fields ('a', 'b', and 'c') that I'm joining on, not > two. > >> > >> hive> describe extended ak_filtered_ips; > >> OK > >> a int > >> b int > >> c int > >> Detailed Table Information: > >> Table(tableName:ak_filtered_ > >> > > ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retenti>> > o > >> > n:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null), > >> FieldSchema(name:b,type:int,comment:null), > >> FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name > >> > > redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.had>> > o > >> > > op.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKey>> > T > >> > > extOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,>> > s > >> > > erializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,par>> > a > >> > > meters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),par>> > t > >> itionKeys:[],parameters:{}) > >> > >> hive> describe extended ip_locations; > >> OK > >> a int > >> b int > >> c int > >> cityid int > >> countryid int > >> Detailed Table Information: > >> > > Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116>> > , > >> > > lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,ty>> > p > >> e:int,comment:null), FieldSchema(name:b,type:int,comment:null), > >> FieldSchema(name:c,type:int,comment:null), > >> FieldSchema(name:cityid,type:int,comment:null), > >> > FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server > >> name > >> > > redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop>> > . > >> > > mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTex>> > t > >> > > OutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,ser>> > i > >> > > alizationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parame>> > t > >> > > ers:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partit>> > i > >> onKeys:[],parameters:{}) > >> > >> I actually just loaded in a super-small test case involving A, B, x, y, > and z > >> as I initially described. And that statement *does* work. > >> > >> The actual statement I was running into problems with is: > >> > >> SELECT ip_locations.cityid, ip_locations.countryid FROM > >> ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a > AND > >> ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = > ip_locations.c) > >> > >> I would eventually like to add a "count(1)" to the field list, and a > "GROUP > >> BY > >> ip_locations.cityid, ip_locations.countryid" at the end. > >> > >> This SELECT statement returns no rows. > >> > >> Thanks, > >> - Aaron > >> > >> > >> On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao > >> <zsh...@gmail.com<mailto:zsh...@gmail.com>> wrote: > >> Can you do "describe extended a;" and "describe extended b;" and pste > >> the result here? > >> > >> Zheng > >> > >> > >> On 1/22/09, Aaron Kimball <aa...@cloudera.com<mailto:aa...@cloudera.com > >> > >> wrote: > >>>> Hi all, > >>>> > >>>> I have a perplexing problem with joins in Hive. I have two tables "A" > and > >>>> "B". A contains two columns x and y. B also contains x and y, and > also z. > >>>> I > >>>> want to select the values of z from B, for every (x, y) I see in table > A. > >>>> > >>>> So I wrote this statement: > >>>> > >>>> SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y); > >>>> > >>>> This crunches away and then comes back with an empty resultset. I have > >>>> exported the same data into MySQL and ran the same query; it works > just > >>> fine > >>>> there. Does Hive support multi-column join criteria? > >>>> > >>>> Thanks, > >>>> - Aaron Kimball > >>>> > >> > >> -- > >> Sent from Gmail for mobile | mobile.google.com < > http://mobile.google.com> > >> <http://mobile.google.com> > >> > >> Yours, > >> Zheng > >> > >> > >> > >> > >> -- > >> Yours, > >> Zheng > >> > > > >