On Tue, Feb 16, 2010 at 3:23 PM, Edward Capriolo <[email protected]> wrote:
> On Tue, Feb 16, 2010 at 2:54 PM, Eric Arenas <[email protected]> wrote:
>> Hi Ed,
>>
>> I created a similar UDF some time ago, and if I am not mistaken you
>> have to assume that your file is going to be in the same directory,
>> as in:
>>
>> path_of_dat_file = "./name_of_file";
>>
>> And it worked for me.
>>
>> Let me know if this solves your issue, and if not, I will look into
>> my old code and see how I did it.
>>
>> regards,
>> Eric Arenas
>>
>> ----- Original Message ----
>> From: Edward Capriolo <[email protected]>
>> To: [email protected]
>> Sent: Tue, February 16, 2010 7:47:30 AM
>> Subject: Re: Working UDF for GeoIP lookup?
>>
>> On Mon, Feb 15, 2010 at 12:02 PM, Edward Capriolo <[email protected]> wrote:
>>> On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <[email protected]> wrote:
>>>> Edward:
>>>>
>>>> I don't have access to the individual data nodes, so I can't
>>>> install the pure Perl module. I tried distributing it via the add
>>>> file command, but that is mangling the file name, which causes Perl
>>>> to not load the module, as the file name and package name don't
>>>> match. Kinda frustrating, but it is really all about trying to work
>>>> around an issue on Amazon's Elastic MapReduce. I love the service
>>>> in general, but some issues are frustrating.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
>>>>
>>>>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>> Carl
>>>>>>
>>>>>> How about this... can I run a standard Hadoop streaming job
>>>>>> against a Hive table that is stored as a sequence file? The idea
>>>>>> would be that I would break my Hive query into two separate
>>>>>> tasks, do a Hadoop streaming job in between, then pick up the
>>>>>> Hive job afterwards. Thoughts?
>>>>>>
>>>>>> Adam
>>>>>>
>>>>>
>>>>> I actually did do this with a streaming job. The UDF was tied up
>>>>> with the Apache/GPL issues.
>>>>>
>>>>> Here is how I did it:
>>>>>
>>>>> 1. Install the Geo-IP Perl module on all datanodes.
>>>>>
>>>>> 2. Run the query through a TRANSFORM that pipes each IP through
>>>>>    the Perl script:
>>>>>
>>>>> ret = qp.run(
>>>>>   " FROM ( " +
>>>>>   "   FROM raw_web_data_hour " +
>>>>>   "   SELECT transform( remote_ip ) " +
>>>>>   "   USING 'perl geo_state.pl' " +
>>>>>   "   AS ip, country_code3, region " +
>>>>>   "   WHERE log_date_part='" + theDate + "' and log_hour_part='" + theHour + "' " +
>>>>>   " ) a " +
>>>>>   " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION " +
>>>>>   "   (log_date_part='" + theDate + "', log_hour_part='" + theHour + "') " +
>>>>>   " SELECT a.country_code3, a.region, a.ip, count(1) as theCount " +
>>>>>   " GROUP BY a.country_code3, a.region, a.ip "
>>>>> );
>>>>>
>>>>> #!/usr/bin/perl
>>>>> # geo_state.pl: read one IP per line on stdin, emit
>>>>> # ip <TAB> country_code3 <TAB> region on stdout.
>>>>> use strict;
>>>>> use Geo::IP;
>>>>>
>>>>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>>>>>                        GEOIP_STANDARD);
>>>>> while (<STDIN>) {
>>>>>   chomp($_);
>>>>>   my $record = $gi->record_by_name($_);
>>>>>   print STDERR "was sent $_\n";
>>>>>   if (defined $record) {
>>>>>     print $_ . "\t" . $record->country_code3 . "\t"
>>>>>             . $record->region . "\n";
>>>>>     print STDERR "return " . $record->region . "\n";
>>>>>   } else {
>>>>>     print "??\n";
>>>>>     print STDERR "return was undefined\n";
>>>>>   }
>>>>> }
>>>>>
>>>>> Good luck.
>>>>
>>>
>>> Sorry to hear that you're having problems. It is a fairly simple UDF
>>> for those familiar with writing UDFs/GenericUDFs. You probably could
>>> embed the lookup data file in the jar as well. I meant to build/host
>>> this on my site, but I have not gotten around to it. If you want to
>>> tag-team it, I am interested.
>>>
>> So I started working on this. I packaged geo-ip into a jar:
>> http://www.jointhegrid.com/svn/geo-ip-java/
>> and I am building a Hive UDF:
>> http://www.jointhegrid.com/svn/hive-udf-geo-ip-jtg/
>>
>> I am running into a problem: I am trying to have the UDF work with
>> two signatures:
>>
>> geoip('209.191.139.200', 'STATE_NAME');
>> geoip('209.191.139.200', 'STATE_NAME', 'path/to/datafile');
>>
>> For the first invocation I have bundled the data into the JAR file,
>> and I have verified that I can access it:
>> http://www.jointhegrid.com/svn/geo-ip-java/trunk/src/LoadInternalData.java
>>
>> When I try to do the same thing inside my UDF, I get FileNotFound
>> exceptions. I have also tried adding the file to the distributed
>> cache:
>>
>> add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
>> add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/dist/geo-ip-java.jar;
>> add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/hive-udf-geo-ip-jtg/dist/hive-udf-geo-ip-jtg.jar;
>> create temporary function geoip as 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
>> select geoip(first, 'COUNTRY_NAME', 'GeoIP.dat') from a;
>>
>> Any hints? I did notice a JIRA about UDFs reading from the
>> distributed cache, so that may be an issue. I still wonder, though,
>> why I cannot pull the file out of the jar.
>>
>> -ed
>>
>
> './file' is not working either.
>
> My UDF does work when I specify the entire local path, but then it is
> not actually using the file shipped with 'add file'.
>
> This works:
>
> add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
> select geoip(first, 'COUNTRY_NAME',
>   '/home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat'
> ) from a;
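[To make the jar-bundled case concrete: a minimal sketch of how the two-argument form could load a data file packaged inside the UDF jar. This assumes the .dat file sits at the classpath root; the class and method names below are illustrative, not the actual hive-udf-geo-ip-jtg source. The key point is that new File("GeoIP.dat") looks at the task's working directory and throws FileNotFoundException, while getResourceAsStream reads from the classpath; since the Geo-IP API wants a real filesystem path, the resource is copied to a temp file first.]

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class JarDataFileLoader {

    /**
     * Copies a resource packaged in the jar (e.g. "/GeoIP.dat") to a
     * temporary file and returns it, so APIs that insist on a real
     * filesystem path can open it.
     */
    public static File extractResource(String resourceName) throws IOException {
        // Reads from the classpath, which works even when the file is
        // not present in the task's working directory.
        InputStream in = JarDataFileLoader.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("resource not on classpath: " + resourceName);
        }
        File tmp = File.createTempFile("geoip", ".dat");
        tmp.deleteOnExit();
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
            in.close();
        }
        return tmp;
    }
}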
My mistake! My cluster (Hadoop 0.18.3 + Hive 0.4.0-rc2) handles this fine; the relative './file' form works there. I am working off a 0.5.0 trunk build locally, so it might be a bug in trunk, or I might just need to move to the latest trunk.
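[For anyone reproducing this, a sketch of the fallback the thread converges on: try the filename argument as given, then probe the task's working directory, where Hive's 'add file' materializes files from the distributed cache (Eric's "./name_of_file" trick). Class and method names are hypothetical, not the actual GenericUDFGeoIP code.]

import java.io.File;
import java.io.FileNotFoundException;

public class AddFileResolver {

    public static File resolve(String filename) throws FileNotFoundException {
        File f = new File(filename);
        if (f.exists()) {
            // The argument is already a valid path, e.g. when the full
            // local path is passed in, as in the working query above.
            return f;
        }
        // Fall back to the task working directory, where files
        // registered with 'add file' are placed for each task.
        File local = new File(".", f.getName());
        if (local.exists()) {
            return local;
        }
        throw new FileNotFoundException(filename + " not found as given or in ./");
    }
}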
