On Tue, Feb 16, 2010 at 2:54 PM, Eric Arenas <[email protected]> wrote:
> Hi Ed,
>
> I created a similar UDF some time ago, and if I am not mistaken you have to
> assume that your file is going to be in the same directory, as in:
>
> path_of_dat_file = "./name_of_file";
>
> And it worked for me.
>
> Let me know if this solves your issue, and if not, I will look into my old
> code and see how I did it.
>
> Regards,
> Eric Arenas
>
>
> ----- Original Message ----
> From: Edward Capriolo <[email protected]>
> To: [email protected]
> Sent: Tue, February 16, 2010 7:47:30 AM
> Subject: Re: Working UDF for GeoIP lookup?
>
> On Mon, Feb 15, 2010 at 12:02 PM, Edward Capriolo <[email protected]> wrote:
>> On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <[email protected]> wrote:
>>> Edward:
>>>
>>> I don't have access to the individual data nodes, so I can't install the
>>> pure Perl module. I tried distributing it via the "add file" command, but that
>>> mangles the file name, which causes Perl to not load the module, since the
>>> file name and package name don't match. Kind of frustrating, but it is really
>>> all about trying to work around an issue on Amazon's Elastic MapReduce. I
>>> love the service in general, but some issues are frustrating.
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
>>>
>>>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Carl
>>>>>
>>>>> How about this... can I run a standard Hadoop streaming job against a
>>>>> Hive table that is stored as a sequence file? The idea would be that I
>>>>> would break my Hive query into two separate tasks, do a Hadoop
>>>>> streaming job in between, and then pick up the Hive job afterwards.
>>>>> Thoughts?
>>>>>
>>>>> Adam
>>>>>
>>>>
>>>> I actually did do this with a streaming job. The UDF was tied up with
>>>> the Apache/GPL licensing issues.
>>>>
>>>> Here is how I did it:
>>>> 1. Install Geo::IP (geo-ip-perl) on all data nodes.
>>>>
>>>> 2. Run a Hive TRANSFORM query that pipes the IP column through a
>>>>    Perl script:
>>>>
>>>> ret = qp.run(
>>>>     " FROM ( "+
>>>>     "   FROM raw_web_data_hour "+
>>>>     "   SELECT transform( remote_ip ) "+
>>>>     "   USING 'perl geo_state.pl' "+
>>>>     "   AS ip, country_code3, region "+
>>>>     "   WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' "+
>>>>     " ) a "+
>>>>     " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION "+
>>>>     "   (log_date_part='"+theDate+"', log_hour_part='"+theHour+"') "+
>>>>     " SELECT a.country_code3, a.region, a.ip, count(1) as theCount "+
>>>>     " GROUP BY a.country_code3, a.region, a.ip "
>>>> );
>>>>
>>>> And geo_state.pl:
>>>>
>>>> #!/usr/bin/perl
>>>> use strict;
>>>> use Geo::IP;
>>>>
>>>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>>>>     GEOIP_STANDARD);
>>>> while (<STDIN>) {
>>>>     # my $record = $gi->record_by_name("209.191.139.200");
>>>>     chomp($_);
>>>>     my $record = $gi->record_by_name($_);
>>>>     print STDERR "was sent $_\n";
>>>>     if (defined $record) {
>>>>         print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n";
>>>>         print STDERR "return " . $record->region . "\n";
>>>>     } else {
>>>>         print "??\n";
>>>>         print STDERR "return was undefined\n";
>>>>     }
>>>> }
>>>>
>>>> Good luck.
>>>
>>
>> Sorry to hear that you're having problems. It is a fairly simple UDF
>> for those familiar with writing UDFs/GenericUDFs. You could probably embed
>> the lookup data file in the jar as well. I meant to build/host this on my
>> site, but I have not gotten around to it. If you want to tag-team it, I
>> am interested.
>>
> So I started working on this.
> I packaged Geo-IP into a jar:
> http://www.jointhegrid.com/svn/geo-ip-java/
> And I am building a Hive UDF:
> http://www.jointhegrid.com/svn/hive-udf-geo-ip-jtg/
>
> I am running into a problem: I am trying to have the UDF work with two
> signatures:
>
> geoip('209.191.139.200', 'STATE_NAME');
> geoip('209.191.139.200', 'STATE_NAME', 'path/to/datafile');
>
> For the first invocation I have bundled the data into the JAR file.
> I have verified that I can access it:
> http://www.jointhegrid.com/svn/geo-ip-java/trunk/src/LoadInternalData.java
>
> I am trying to do the same thing inside my UDF, but I get FileNotFound
> exceptions. I have also tried adding the file to the distributed cache:
>
> add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
> add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/dist/geo-ip-java.jar;
> add jar /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/hive-udf-geo-ip-jtg/dist/hive-udf-geo-ip-jtg.jar;
> create temporary function geoip as 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
> select geoip(first, 'COUNTRY_NAME', 'GeoIP.dat') from a;
>
> Any hints? I did notice a JIRA about UDFs reading from the distributed
> cache, so that may be an issue. I still wonder, though, why I cannot pull
> the file out of the jar.
>
> -ed
>
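As an aside on the "pull the file out of the jar" question above: a data file packaged inside a jar is normally read through the class loader (getResourceAsStream) rather than opened as a java.io.File, which is a common cause of FileNotFound exceptions in this situation. A minimal sketch, not taken from the actual UDF code; the class name is made up, and the probe in main uses a resource that is always on the classpath only to show the mechanism:

```java
import java.io.InputStream;

public class JarResourceDemo {

    // Open a resource bundled on the classpath (e.g. inside the UDF's jar).
    // A leading '/' makes the path relative to the jar root.
    // Returns null if the resource is not found.
    static InputStream openResource(String path) {
        return JarResourceDemo.class.getResourceAsStream(path);
    }

    public static void main(String[] args) throws Exception {
        // In the real setup this would be something like "/GeoIP.dat"
        // (hypothetical); here we probe a resource guaranteed to exist.
        try (InputStream in = openResource("/java/lang/String.class")) {
            System.out.println(in != null ? "found" : "missing");
        }
    }
}
```

The key point is that nothing inside a jar has a filesystem path, so any library API that insists on a file name (as Geo::IP-style loaders often do) cannot read a jar-embedded resource directly; the stream must first be copied out to a temporary file.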
'./file' is not working either. My UDF does work when I specify the entire
local path, but then it is not actually using the file from 'add file'.
This works:

add file /home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat;
select geoip(first, 'COUNTRY_NAME',
  '/home/ecapriolo/encrypted-mount-ec/NetBeansProjects/geo-ip-java/src/GeoIP.dat') from a;
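One way to reconcile the two behaviors discussed in this thread is to probe candidate locations in order: the task working directory first (where distributed-cache files from 'add file' are expected to appear, per Eric's "./name_of_file" suggestion), then an explicit local path. A minimal sketch, assuming that fallback order; the helper name and candidate paths are illustrative, not from the actual UDF:

```java
import java.io.File;

public class GeoDataLocator {

    // Hypothetical helper: return the first candidate path that is
    // readable on this node, or null if none of them are.
    static String findDataFile(String... candidates) {
        for (String candidate : candidates) {
            if (new File(candidate).canRead()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Working-directory name first, absolute local install second.
        String path = findDataFile("./GeoIP.dat",
                "/usr/local/share/GeoIP/GeoIP.dat");
        System.out.println(path != null
                ? "using " + path
                : "GeoIP.dat not found");
    }
}
```

With this shape, the same UDF works both when the file is shipped via 'add file' (resolved from the working directory) and when only a full local path is available, without hard-coding one environment.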
