On Tue, Feb 16, 2010 at 3:23 PM, Edward Capriolo <[email protected]> wrote:
> On Tue, Feb 16, 2010 at 2:54 PM, Eric Arenas <[email protected]> wrote:
>> Hi Ed,
>>
>> I created a similar UDF some time ago, and if I am not mistaken you
>> have to assume that your file is going to be in the same directory,
>> as in:
>>
>> path_of_dat_file = "./name_of_file";
>>
>> And it worked for me.
>>
>> Let me know if this solves your issue, and if not, I will look into
>> my old code and see how I did it.
>>
>> regards,
>> Eric Arenas
>>
>> ----- Original Message ----
>> From: Edward Capriolo <[email protected]>
>> To: [email protected]
>> Sent: Tue, February 16, 2010 7:47:30 AM
>> Subject: Re: Working UDF for GeoIP lookup?
>>
>> On Mon, Feb 15, 2010 at 12:02 PM, Edward Capriolo <[email protected]> wrote:
>>> On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <[email protected]> wrote:
>>>> Edward:
>>>>
>>>> I don't have access to the individual data nodes, so I can't
>>>> install the pure Perl module. I tried distributing it via the add
>>>> file command, but that is mangling the file name, which causes Perl
>>>> to not load the module, as the file name and package name don't
>>>> match. Kinda frustrating, but it is really all about trying to work
>>>> around an issue on Amazon's Elastic MapReduce. I love the service
>>>> in general, but some issues are frustrating.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
>>>>
>>>>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>> Carl
>>>>>>
>>>>>> How about this... can I run a standard Hadoop streaming job
>>>>>> against a Hive table that is stored as a sequence file? The idea
>>>>>> would be that I would break my Hive query into two separate
>>>>>> tasks, do a Hadoop streaming job in between, then pick up the
>>>>>> Hive job afterwards. Thoughts?
>>>>>>
>>>>>> Adam
>>>>>>
>>>>>
>>>>> I actually did do this with a streaming job. The UDF was tied up
>>>>> with the Apache/GPL issues.
>>>>>
>>>>> Here is how I did it:
>>>>>
>>>>> 1. Install the Geo-IP Perl module on all datanodes.
>>>>>
>>>>> 2. Run the query through a TRANSFORM that pipes each IP through
>>>>>    the Perl script:
>>>>>
>>>>> ret = qp.run(
>>>>>   " FROM ( " +
>>>>>   "   FROM raw_web_data_hour " +
>>>>>   "   SELECT transform( remote_ip ) " +
>>>>>   "   USING 'perl geo_state.pl' " +
>>>>>   "   AS ip, country_code3, region " +
>>>>>   "   WHERE log_date_part='" + theDate + "' and log_hour_part='" + theHour + "' " +
>>>>>   " ) a " +
>>>>>   " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION " +
>>>>>   "   (log_date_part='" + theDate + "', log_hour_part='" + theHour + "') " +
>>>>>   " SELECT a.country_code3, a.region, a.ip, count(1) as theCount " +
>>>>>   " GROUP BY a.country_code3, a.region, a.ip "
>>>>> );
>>>>>
>>>>> #!/usr/bin/perl
>>>>> # geo_state.pl: read one IP per line on stdin, emit
>>>>> # ip <TAB> country_code3 <TAB> region on stdout.
>>>>> use strict;
>>>>> use Geo::IP;
>>>>>
>>>>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>>>>>                        GEOIP_STANDARD);
>>>>> while (<STDIN>) {
>>>>>   chomp($_);
>>>>>   my $record = $gi->record_by_name($_);
>>>>>   print STDERR "was sent $_\n";
>>>>>   if (defined $record) {
>>>>>     print $_ . "\t" . $record->country_code3 . "\t"
>>>>>             . $record->region . "\n";
>>>>>     print STDERR "return " . $record->region . "\n";
>>>>>   } else {
>>>>>     print "??\n";
>>>>>     print STDERR "return was undefined\n";
>>>>>   }
>>>>> }
>>>>>
>>>>> Good luck.
>>>>
>>>
>>> Sorry to hear that you're having problems. It is a fairly simple UDF
>>> for those familiar with writing UDFs/GenericUDFs. You probably could
>>> embed the lookup data file in the jar as well. I meant to build/host
>>> this on my site, but I have not gotten around to it. If you want to
>>> tag-team it, I am interested.
>>>
>> So I started working on this. I packaged geo-ip into a jar:
>> http://www.jointhegrid.com/svn/geo-ip-java/
>> and I am building a Hive UDF:
>> http://www.jointhegrid.com/svn/hive-udf-geo-ip-jtg/
>>
>> I am running into a problem: I am trying to have the UDF work with
>> two signatures:
>>
>> geoip('209.191.139.200', 'STATE_NAME');
>> geoip('209.191.139.200', 'STATE_NAME', 'path/to/datafile');
>>
>> For the first invocation I have bundled the data into the JAR file,
>> and I have verified that I can access it:
>> http://www.jointhegrid.com/svn/geo-ip-java/trunk/src/LoadInternalData.java
>>
>> When I try to do the same thing inside my UDF, I get FileNotFound
>> exceptions. I have also tried adding the file to the distributed
>> cache:
>>
>> add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
>> add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/dist/geo-ip-java.jar;
>> add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/hive-udf-geo-ip-jtg/dist/hive-udf-geo-ip-jtg.jar;
>> create temporary function geoip as 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
>> select geoip(first, 'COUNTRY_NAME', 'GeoIP.dat') from a;
>>
>> Any hints? I did notice a JIRA about UDFs reading from the
>> distributed cache, so that may be an issue. I still wonder, though,
>> why I cannot pull the file out of the jar.
>>
>> -ed
>>
>
> './file' is not working either.
>
> My UDF does work when I specify the entire local path, but then it is
> not actually using the file shipped with 'add file'.
>
> This works:
>
> add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
> select geoip(first, 'COUNTRY_NAME',
>   '/home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat'
> ) from a;
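[To make the jar-bundled case concrete: a minimal sketch of how the two-argument form could load a data file packaged inside the UDF jar. This assumes the .dat file sits at the classpath root; the class and method names below are illustrative, not the actual hive-udf-geo-ip-jtg source. The key point is that new File("GeoIP.dat") looks at the task's working directory and throws FileNotFoundException, while getResourceAsStream reads from the classpath; since the Geo-IP API wants a real filesystem path, the resource is copied to a temp file first.]

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class JarDataFileLoader {

    /**
     * Copies a resource packaged in the jar (e.g. "/GeoIP.dat") to a
     * temporary file and returns it, so APIs that insist on a real
     * filesystem path can open it.
     */
    public static File extractResource(String resourceName) throws IOException {
        // Reads from the classpath, which works even when the file is
        // not present in the task's working directory.
        InputStream in = JarDataFileLoader.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("resource not on classpath: " + resourceName);
        }
        File tmp = File.createTempFile("geoip", ".dat");
        tmp.deleteOnExit();
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
            in.close();
        }
        return tmp;
    }
}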
My mistake! My cluster (Hadoop 0.18.3 + Hive 0.4.0-rc2) handles this fine; the relative './file' form works there. I am working off a 0.5.0 trunk build locally, so it might be a bug in trunk, or I might just need to move to the latest trunk.
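[For anyone reproducing this, a sketch of the fallback the thread converges on: try the filename argument as given, then probe the task's working directory, where Hive's 'add file' materializes files from the distributed cache (Eric's "./name_of_file" trick). Class and method names are hypothetical, not the actual GenericUDFGeoIP code.]

import java.io.File;
import java.io.FileNotFoundException;

public class AddFileResolver {

    public static File resolve(String filename) throws FileNotFoundException {
        File f = new File(filename);
        if (f.exists()) {
            // The argument is already a valid path, e.g. when the full
            // local path is passed in, as in the working query above.
            return f;
        }
        // Fall back to the task working directory, where files
        // registered with 'add file' are placed for each task.
        File local = new File(".", f.getName());
        if (local.exists()) {
            return local;
        }
        throw new FileNotFoundException(filename + " not found as given or in ./");
    }
}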
