Re: equijoin with multiple columns?

Aaron Kimball Fri, 23 Jan 2009 15:56:09 -0800

I solved the bug. As it turns out, there was trailing whitespace in some of
my input files. So Hive wasn't matching the column's contents correctly,
since Hive doesn't do an implicit trim() (my handwritten code did this). And
of course, when I looked at the input files with less to verify their
contents, the whitespace was invisible :P


- Aaron

On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo <athu...@facebook.com> wrote:

> Aaron
>
> can you send the plan for this query..
>
> explain extended <your query>
>
> This is very unexpected..
>
> Thanks,
> Ashish
> ________________________________________
> From: Aaron Kimball [aa...@cloudera.com]
> Sent: Friday, January 23, 2009 1:24 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: equijoin with multiple columns?
>
> Not only can I verify that the rows are present, I broke down and coded the
> join in Java MapReduce by hand. Works.
> - Aaron
>
> On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao <zsh...@gmail.com<mailto:
> zsh...@gmail.com>> wrote:
> This is unexpected. We did try similar cases before (with multiple join
> keys).
>
> Can you verify by finding out the actual rows in each table that will
> produce a join result row?
>
> Zheng
>
>
> On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball <aa...@cloudera.com<mailto:
> aa...@cloudera.com>> wrote:
> Hi Zheng,
>
> I was eliding some information to try to boil things down a bit -- but here
> goes.
>
>
> To expand the white lie into the fuller debug dump:
> table A ~ ak_filtered_ips
> table B ~ ip_locations
>
> It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
> two.
>
> hive> describe extended ak_filtered_ips;
> OK
> a    int
> b    int
> c    int
> Detailed Table Information:
> Table(tableName:ak_filtered_
> ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
> FieldSchema(name:b,type:int,comment:null),
> FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
> redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
>
> hive> describe extended ip_locations;
> OK
> a    int
> b    int
> c    int
> cityid    int
> countryid    int
> Detailed Table Information:
> Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
> FieldSchema(name:b,type:int,comment:null),
> FieldSchema(name:c,type:int,comment:null),
> FieldSchema(name:cityid,type:int,comment:null),
> FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
> name
> redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
>
> I actually just loaded in a super-small test case involving A, B, x, y, and
> z as I initially described. And that statement *does* work.
>
> The actual statement I was running into problems with is:
>
> SELECT ip_locations.cityid, ip_locations.countryid FROM
> ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
> AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
> ip_locations.c)
>
> I would eventually like to add a "count(1)" to the field list, and a "GROUP
> BY ip_locations.cityid, ip_locations.countryid" at the end.
>
> This SELECT statement returns no rows.
>
> Thanks,
> - Aaron
>
>
> On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao <zsh...@gmail.com<mailto:
> zsh...@gmail.com>> wrote:
> Can you do "describe extended a;" and "describe extended b;" and pste
> the result here?
>
> Zheng
>
>
> On 1/22/09, Aaron Kimball <aa...@cloudera.com<mailto:aa...@cloudera.com>>
> wrote:
> > Hi all,
> >
> > I have a perplexing problem with joins in Hive. I have two tables "A" and
> > "B".  A contains two columns x and y. B also contains x and y, and also
> z. I
> > want to select the values of z from B, for every (x, y) I see in table A.
> >
> > So I wrote this statement:
> >
> > SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);
> >
> > This crunches away and then comes back with an empty resultset. I have
> > exported the same data into MySQL and ran the same query; it works just
> fine
> > there. Does Hive support multi-column join criteria?
> >
> > Thanks,
> > - Aaron Kimball
> >
>
> --
> Sent from Gmail for mobile | mobile.google.com<http://mobile.google.com>
>
> Yours,
> Zheng
>
>
>
>
> --
> Yours,
> Zheng
>
>

Re: equijoin with multiple columns?

Reply via email to