Edward: How can I help? I got most of the UDF built myself last night, and today I was sorting out Ant build issues. My main frustration is getting it to play nicely in Amazon's environment.
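For concreteness, here is a minimal sketch of the kind of UDF being discussed, assuming the legacy MaxMind Java API (com.maxmind.geoip.LookupService) and a GeoIPCity.dat file readable on each node; the class name and file path are illustrative, not the actual code from this thread:

    import java.io.IOException;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    import com.maxmind.geoip.Location;
    import com.maxmind.geoip.LookupService;

    public class GeoCityUDF extends UDF {
        private LookupService lookup;  // opened lazily, reused for every row in the task

        public Text evaluate(Text ip) {
            if (ip == null) {
                return null;
            }
            try {
                if (lookup == null) {
                    // Assumes the .dat file is present on each node; see the
                    // embed-in-the-jar idea later in the thread for an alternative.
                    lookup = new LookupService("/usr/local/share/GeoIP/GeoIPCity.dat",
                            LookupService.GEOIP_MEMORY_CACHE);
                }
                // One lookup returns the whole city record; emit the fields
                // together instead of doing one lookup per field.
                Location loc = lookup.getLocation(ip.toString());
                if (loc == null) {
                    return null;
                }
                return new Text(loc.countryCode + "\t" + loc.region + "\t" + loc.city);
            } catch (IOException e) {
                return null;
            }
        }
    }

Returning the fields as one delimited string keeps it to a single .dat lookup per row; a GenericUDF returning a struct would avoid the string splitting on the Hive side.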
How did you solve the issue of selecting only parts of the geocity data on a single lookup? Did you just do multiple lookups, one for each piece of data? Also, did your jar run on Amazon's Elastic MapReduce?

On Mon, Feb 15, 2010 at 9:02 AM, Edward Capriolo <[email protected]> wrote:
> On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <[email protected]> wrote:
>> Edward:
>>
>> I don't have access to the individual data nodes, so I can't install the
>> pure Perl module. I tried distributing it via the add file command, but
>> that mangles the file name, which causes Perl not to load the module,
>> since the file name and package name don't match. Kinda frustrating, but
>> it is really all about trying to work around an issue on Amazon's Elastic
>> MapReduce. I love the service in general, but some issues are frustrating.
>>
>> Sent from my iPhone
>>
>> On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
>>
>>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Carl
>>>>
>>>> How about this... can I run a standard Hadoop streaming job against a
>>>> Hive table that is stored as a sequence file? The idea would be to
>>>> break my Hive query into two separate tasks, do a Hadoop streaming
>>>> job in between, then pick up the Hive job afterwards. Thoughts?
>>>>
>>>> Adam
>>>>
>>>
>>> I actually did do this with a streaming job. The UDF was tied up with
>>> the Apache/GPL licensing issues.
>>>
>>> Here is how I did it:
>>>
>>> 1. Install the Geo::IP Perl module on all datanodes.
>>> 2. Run the transform query:
>>>
>>> ret = qp.run(
>>>     " FROM ( " +
>>>     "   FROM raw_web_data_hour " +
>>>     "   SELECT transform( remote_ip ) " +
>>>     "   USING 'perl geo_state.pl' " +
>>>     "   AS ip, country_code3, region " +
>>>     "   WHERE log_date_part='" + theDate + "' and log_hour_part='" + theHour + "' " +
>>>     " ) a " +
>>>     " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION " +
>>>     "   (log_date_part='" + theDate + "', log_hour_part='" + theHour + "') " +
>>>     " SELECT a.country_code3, a.region, a.ip, count(1) as theCount " +
>>>     " GROUP BY a.country_code3, a.region, a.ip "
>>> );
>>>
>>> 3. geo_state.pl, the transform script:
>>>
>>> #!/usr/bin/perl
>>> # Reads one IP per line on stdin; emits ip, country_code3, region.
>>> use strict;
>>> use Geo::IP;
>>>
>>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat", GEOIP_STANDARD);
>>>
>>> while (<STDIN>) {
>>>     chomp($_);
>>>     # A single lookup returns the whole city record.
>>>     my $record = $gi->record_by_name($_);
>>>     print STDERR "was sent $_\n";
>>>     if (defined $record) {
>>>         print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n";
>>>         print STDERR "return " . $record->region . "\n";
>>>     } else {
>>>         print "??\n";
>>>         print STDERR "return was undefined\n";
>>>     }
>>> }
>>>
>>> Good luck.
>>
>
> Sorry to hear that you're having problems. It is a fairly simple UDF
> for anyone familiar with writing UDFs/GenericUDFs. You could probably
> embed the lookup data file in the jar as well. I meant to build and
> host this on my site, but I have not gotten around to it. If you want
> to tag-team it, I am interested.

--
Adam J. O'Donnell, Ph.D.
Immunet Corporation
Cell: +1 (267) 251-0070
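On Edward's closing suggestion of embedding the lookup data file in the jar: a hedged sketch of one way to do it, assuming the same legacy LookupService API. The resource name and helper class are made up for illustration. The wrinkle is that LookupService wants a real file path, so the bundled resource has to be unpacked to a temp file first:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public final class GeoDataUnpacker {
        private GeoDataUnpacker() {}

        /** Copy the bundled database out of the jar and return its local path. */
        public static File unpack() throws IOException {
            // Assumes GeoIPCity.dat was packed at the root of the jar.
            InputStream in = GeoDataUnpacker.class.getResourceAsStream("/GeoIPCity.dat");
            if (in == null) {
                throw new IOException("GeoIPCity.dat not found on the classpath");
            }
            File tmp = File.createTempFile("GeoIPCity", ".dat");
            tmp.deleteOnExit();
            OutputStream out = new FileOutputStream(tmp);
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                out.close();
                in.close();
            }
            return tmp;
        }
    }

The UDF's initialization would then call new LookupService(GeoDataUnpacker.unpack().getPath(), ...) instead of a hard-coded node path, which sidesteps both the add file name-mangling and the lack of shell access on Elastic MapReduce.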
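On Adam's streaming question in the thread above: a plain Hadoop streaming job can read a Hive table stored as sequence files, since the table is just files under the warehouse directory. A hedged recipe (paths illustrative): point the streaming job's -input at the partition directory (e.g. under /user/hive/warehouse/raw_web_data_hour/) and pass -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat so the records reach the mapper as text. Note the mapper then sees key<TAB>value lines, so a script like geo_state.pl may need to skip the key column, and the job's output can be picked up again in Hive via LOAD DATA or an external table.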
