Re: equijoin with multiple columns?

Aaron Kimball Fri, 23 Jan 2009 16:11:53 -0800

Bah, you gotta be kidding me :P

Something that might be a reasonable compromise is to print a warning about
trailing whitespace on fields?


- Aaron

On Fri, Jan 23, 2009 at 4:00 PM, Raghu Murthy <ra...@facebook.com> wrote:

> We could add trim to hive load, but, unfortunately it has been patented by
> IBM!
>
> "Method of automatically removing leading and trailing space characters
> from data being entered into a database system"
>
>
> http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1
>
> &u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475
> ,086&RS=PN/7,475,086<http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475,086&RS=PN/7,475,086>
>
> "Retrieving data from a database system without leading and trailing
> space characters"
>
>
> http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=
> 1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=<http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=>
> "20070282820".PGNR.&OS=DN/200
> 70282820&RS=DN/20070282820
>
>
> On 1/23/09 3:55 PM, "Aaron Kimball" <aa...@cloudera.com> wrote:
>
> > I solved the bug. As it turns out, there was trailing whitespace in some
> of my
> > input files. So Hive wasn't matching the column's contents correctly,
> since
> > Hive doesn't do an implicit trim() (my handwritten code did this). And of
> > course, when I looked at the input files with less to verify their
> contents,
> > the whitespace was invisible :P
> >
> > - Aaron
> >
> > On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo <athu...@facebook.com>
> wrote:
> >> Aaron
> >>
> >> can you send the plan for this query..
> >>
> >> explain extended <your query>
> >>
> >> This is very unexpected..
> >>
> >> Thanks,
> >> Ashish
> >> ________________________________________
> >> From: Aaron Kimball [aa...@cloudera.com]
> >> Sent: Friday, January 23, 2009 1:24 AM
> >> To: hive-user@hadoop.apache.org
> >> Subject: Re: equijoin with multiple columns?
> >>
> >> Not only can I verify that the rows are present, I broke down and coded
> the
> >> join in Java MapReduce by hand. Works.
> >> - Aaron
> >>
> >> On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao
> >> <zsh...@gmail.com<mailto:zsh...@gmail.com>> wrote:
> >> This is unexpected. We did try similar cases before (with multiple join
> >> keys).
> >>
> >> Can you verify by finding out the actual rows in each table that will
> produce
> >> a join result row?
> >>
> >> Zheng
> >>
> >>
> >> On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball
> >> <aa...@cloudera.com<mailto:aa...@cloudera.com>> wrote:
> >> Hi Zheng,
> >>
> >> I was eliding some information to try to boil things down a bit -- but
> here
> >> goes.
> >>
> >>
> >> To expand the white lie into the fuller debug dump:
> >> table A ~ ak_filtered_ips
> >> table B ~ ip_locations
> >>
> >> It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
> two.
> >>
> >> hive> describe extended ak_filtered_ips;
> >> OK
> >> a    int
> >> b    int
> >> c    int
> >> Detailed Table Information:
> >> Table(tableName:ak_filtered_
> >>
>
> ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retenti>>
> o
> >>
> n:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
> >> FieldSchema(name:b,type:int,comment:null),
> >> FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
> >>
>
> redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.had>>
> o
> >>
>
> op.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKey>>
> T
> >>
>
> extOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,>>
> s
> >>
>
> erializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,par>>
> a
> >>
>
> meters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),par>>
> t
> >> itionKeys:[],parameters:{})
> >>
> >> hive> describe extended ip_locations;
> >> OK
> >> a    int
> >> b    int
> >> c    int
> >> cityid    int
> >> countryid    int
> >> Detailed Table Information:
> >>
>
> Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116>>
> ,
> >>
>
> lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,ty>>
> p
> >> e:int,comment:null), FieldSchema(name:b,type:int,comment:null),
> >> FieldSchema(name:c,type:int,comment:null),
> >> FieldSchema(name:cityid,type:int,comment:null),
> >>
> FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
> >> name
> >>
>
> redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop>>
> .
> >>
>
> mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTex>>
> t
> >>
>
> OutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,ser>>
> i
> >>
>
> alizationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parame>>
> t
> >>
>
> ers:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partit>>
> i
> >> onKeys:[],parameters:{})
> >>
> >> I actually just loaded in a super-small test case involving A, B, x, y,
> and z
> >> as I initially described. And that statement *does* work.
> >>
> >> The actual statement I was running into problems with is:
> >>
> >> SELECT ip_locations.cityid, ip_locations.countryid FROM
> >> ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
> AND
> >> ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
> ip_locations.c)
> >>
> >> I would eventually like to add a "count(1)" to the field list, and a
> "GROUP
> >> BY
> >> ip_locations.cityid, ip_locations.countryid" at the end.
> >>
> >> This SELECT statement returns no rows.
> >>
> >> Thanks,
> >> - Aaron
> >>
> >>
> >> On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao
> >> <zsh...@gmail.com<mailto:zsh...@gmail.com>> wrote:
> >> Can you do "describe extended a;" and "describe extended b;" and pste
> >> the result here?
> >>
> >> Zheng
> >>
> >>
> >> On 1/22/09, Aaron Kimball <aa...@cloudera.com<mailto:aa...@cloudera.com
> >>
> >> wrote:
> >>>> Hi all,
> >>>>
> >>>> I have a perplexing problem with joins in Hive. I have two tables "A"
> and
> >>>> "B".  A contains two columns x and y. B also contains x and y, and
> also z.
> >>>> I
> >>>> want to select the values of z from B, for every (x, y) I see in table
> A.
> >>>>
> >>>> So I wrote this statement:
> >>>>
> >>>> SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);
> >>>>
> >>>> This crunches away and then comes back with an empty resultset. I have
> >>>> exported the same data into MySQL and ran the same query; it works
> just
> >>> fine
> >>>> there. Does Hive support multi-column join criteria?
> >>>>
> >>>> Thanks,
> >>>> - Aaron Kimball
> >>>>
> >>
> >> --
> >> Sent from Gmail for mobile | mobile.google.com <
> http://mobile.google.com>
> >> <http://mobile.google.com>
> >>
> >> Yours,
> >> Zheng
> >>
> >>
> >>
> >>
> >> --
> >> Yours,
> >> Zheng
> >>
> >
>
>

Re: equijoin with multiple columns?

Reply via email to