I was thinking of making the UDF work like this:
select geo_lookup('databasefile' , fieldx , 'municipality' ) from table;
Alternatively, we can embed the data files inside the jar/UDF:
geo_lookup('ip' , 'municipality' );
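
A rough sketch of what the second form could look like in Java. This is
only an illustration under stated assumptions: the class name
GeoLookupSketch, the in-memory map, and the sample record are made up for
this example and are not real GeoIP data. A real Hive UDF would extend
org.apache.hadoop.hive.ql.exec.UDF and read the GeoIP data file instead;
that dependency is left out so the sketch compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the two-argument form geo_lookup(ip, field).
 * Assumption: in Hive this class would extend
 * org.apache.hadoop.hive.ql.exec.UDF, and Hive would locate
 * evaluate() by reflection.
 */
public class GeoLookupSketch {
    // Hypothetical stand-in for the GeoIP city database. A real UDF would
    // load the data file once (from the jar or a path) instead.
    private final Map<String, Map<String, String>> records = new HashMap<>();

    public GeoLookupSketch() {
        Map<String, String> rec = new HashMap<>();
        rec.put("country_code3", "USA");
        rec.put("region", "CA");
        rec.put("municipality", "Sunnyvale");
        // Documentation-range IP with a made-up record.
        records.put("192.0.2.1", rec);
    }

    /** Returns the requested field for the IP, or null if it is unknown. */
    public String evaluate(String ip, String field) {
        if (ip == null || field == null) {
            return null;
        }
        Map<String, String> rec = records.get(ip);
        return rec == null ? null : rec.get(field);
    }

    public static void main(String[] args) {
        GeoLookupSketch udf = new GeoLookupSketch();
        System.out.println(udf.evaluate("192.0.2.1", "municipality"));
        System.out.println(udf.evaluate("192.0.2.1", "region"));
        System.out.println(udf.evaluate("198.51.100.7", "region"));
    }
}
```

Passing the field name as an argument means one function covers every
piece of the record, at the cost of one lookup per requested field.
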
As for Ant: since the licensing prevents geo_ip_java from being
bundled with Apache, I am not looking to build this into Hive. I am simply
going to build a NetBeans project and use the Hive jars as libraries.
Edward
On Mon, Feb 15, 2010 at 4:58 PM, Adam O'Donnell <[email protected]> wrote:
> Edward:
>
> How can I help? I got most of the UDF built myself last night, and
> today I was sorting out Ant build issues. My main frustration is
> trying to get it to play nice in Amazon's environment.
>
> How did you solve the issue of selecting only parts of the geocity
> data on a single lookup? Did you just do multiple lookups, one for
> each piece of data?
>
> Also, did your jar run on Amazon's Elastic MapReduce?
>
> On Mon, Feb 15, 2010 at 9:02 AM, Edward Capriolo <[email protected]>
> wrote:
>> On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <[email protected]> wrote:
>>> Edward:
>>>
>>> I don't have access to the individual data nodes, so I can't install the
>>> pure Perl module. I tried distributing it via the add file command, but that
>>> mangles the file name, so Perl fails to load the module because the file
>>> name and package name don't match. Kinda frustrating, but it is really
>>> all about trying to work around an issue on Amazon's Elastic MapReduce. I
>>> love the service in general, but some issues are frustrating.
>>>
>>> On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
>>>
>>>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Carl
>>>>>
>>>>> How about this... can I run a standard hadoop streaming job against a
>>>>> hive table that is stored as a sequence file? The idea would be I
>>>>> would break my hive query into two separate tasks and do a hadoop
>>>>> streaming job in between, then pick up the hive job afterwards.
>>>>> Thoughts?
>>>>>
>>>>> Adam
>>>>>
>>>>
>>>> I actually did this with a streaming job. The UDF was tied up with
>>>> the Apache/GPL licensing issues.
>>>>
>>>> Here is how I did this. First, install geo-ip-perl on all datanodes.
>>>> Then run a query that pipes remote_ip through the Perl script further down:
>>>>
>>>> ret = qp.run(
>>>> " FROM ( "+
>>>> " FROM raw_web_data_hour "+
>>>> " SELECT transform( remote_ip ) "+
>>>> " USING 'perl geo_state.pl' "+
>>>> " AS ip, country_code3, region "+
>>>> " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
>>>> " ) a " +
>>>> " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION "+
>>>> " (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') "+
>>>> " SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
>>>> " GROUP BY a.country_code3,a.region,a.ip "
>>>> );
>>>>
>>>>
>>>> #!/usr/bin/perl
>>>> # Reads one IP per line on STDIN and writes ip, country_code3, region
>>>> # as tab-separated columns on STDOUT; diagnostics go to STDERR.
>>>> use Geo::IP;
>>>> use strict;
>>>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>>>>                        GEOIP_STANDARD);
>>>> while (<STDIN>) {
>>>>     #my $record = $gi->record_by_name("209.191.139.200");
>>>>     chomp($_);
>>>>     my $record = $gi->record_by_name($_);
>>>>     print STDERR "was sent $_ \n";
>>>>     if (defined $record) {
>>>>         print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n";
>>>>         print STDERR "return " . $record->region . "\n";
>>>>     } else {
>>>>         # No match: emit a single placeholder column.
>>>>         print "??\n";
>>>>         print STDERR "return was undefined \n";
>>>>     }
>>>> }
>>>>
>>>> Good luck.
>>>
>>
>> Sorry to hear that you're having problems. It is a fairly simple UDF
>> for those familiar with writing a UDF/GenericUDF. You could probably embed
>> the lookup data file in the jar as well. I meant to build/host this on my
>> site, but I have not gotten around to it. If you want to tag team it, I
>> am interested.
>>
>
>
>
> --
> Adam J. O'Donnell, Ph.D.
> Immunet Corporation
> Cell: +1 (267) 251-0070
>