On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <[email protected]> wrote:
> Edward:
>
> I don't have access to the individual data nodes, so I can't install the
> pure perl module. I tried distributing it via the add file command, but that
> is mangling the file name, which causes perl to not load the module as the
> file name and package name dont match. Kinda frustrating, but it is really
> all about trying to work around an issue on amazon's elastic map reduce. I
> love the service in general, but some issues are frustrating.
>
> Sent from my iPhone
>
> On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
>
>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>>>>
>>>> Hope this helps.
>>>>
>>>> Carl
>>>
>>> How about this... .can I run a standard hadoop streaming job against a
>>> hive table that is stored as a sequence file? The idea would be I
>>> would break my hive query into two separate tasks and do a hadoop
>>> streaming job in between, then pick up the hive job afterwards.
>>> Thoughts?
>>>
>>> Adam
>>>
>>
>> I actually did do this with a streaming job. The UDF was tied up with
>> the apache/gpl issues.
>>
>> Here is how I did this.
>> 1 install geo-ip-perl on all datanodes
>>
>> ret = qp.run(
>>   " FROM ( "+
>>   " FROM raw_web_data_hour "+
>>   " SELECT transform( remote_ip ) "+
>>   " USING 'perl geo_state.pl' "+
>>   " AS ip, country_code3, region "+
>>   " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
>>   " ) a " +
>>   " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION
>> (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') "+
>>   " SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
>>   " GROUP BY a.country_code3,a.region,a.ip "
>> );
>>
>>
>> #!/usr/bin/perl
>> use Geo::IP;
>> use strict;
>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>> GEOIP_STANDARD);
>> while (<STDIN>){
>>   #my $record = $gi->record_by_name("209.191.139.200");
>>   chomp($_);
>>   my $record = $gi->record_by_name($_);
>>   print STDERR "was sent $_ \n" ;
>>   if (defined $record) {
>>     print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n"
>> ;
>>     print STDERR "return " . $record->region . "\n" ;
>>   } else {
>>     print "??\n";
>>     print STDERR "return was undefined \n";
>>   }
>>
>> }
>>
>> Good luck.
>
Sorry to hear that you're having problems. It is a fairly simple UDF for those familiar with writing a UDF/GenericUDF. You could probably embed the lookup data file in the jar as well. I meant to build and host this on my site, but I have not gotten around to it. If you want to tag-team it, I am interested.
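
To make the "embed the lookup data file in the jar" idea a bit more concrete, here is a rough sketch rather than the actual UDF. Everything named in it is an assumption for illustration: the class name GeoLookupSketch, the resource name /geo_table.tsv, and the toy exact-match table that stands in for a real GeoIP database (which does range lookups). It has no Hive dependency so it runs on its own; a real Hive UDF would extend Hive's UDF class and expose an evaluate method in a similar shape.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class GeoLookupSketch {
    // Hypothetical resource name: a tab-separated file (ip, country, region)
    // packed at the root of the jar. Not a real GeoIP database format.
    static final String RESOURCE = "/geo_table.tsv";

    private final Map<String, String[]> table = new HashMap<>();

    // Reads "ip<TAB>country<TAB>region" lines; malformed lines are skipped.
    public GeoLookupSketch(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t", -1);
                if (f.length == 3) {
                    table.put(f[0], new String[] {f[1], f[2]});
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads the table from the classpath, so the data travels inside the jar
    // and nothing has to be installed on the data nodes.
    public static GeoLookupSketch fromJar() {
        InputStream raw = GeoLookupSketch.class.getResourceAsStream(RESOURCE);
        if (raw == null) {
            throw new IllegalStateException("resource not on classpath: " + RESOURCE);
        }
        return new GeoLookupSketch(new InputStreamReader(raw, StandardCharsets.UTF_8));
    }

    // Always returns three tab-separated fields (ip, country, region),
    // using "??" placeholders when the address is not in the table.
    public String evaluate(String ip) {
        String key = (ip == null) ? "" : ip.trim();
        String[] hit = table.get(key);
        String country = (hit == null) ? "??" : hit[0];
        String region = (hit == null) ? "??" : hit[1];
        return key + "\t" + country + "\t" + region;
    }

    public static void main(String[] args) {
        // Made-up sample row using a documentation-only address (192.0.2.0/24).
        GeoLookupSketch g = new GeoLookupSketch(new StringReader("192.0.2.1\tUSA\tCA\n"));
        System.out.println(g.evaluate("192.0.2.1"));
        System.out.println(g.evaluate("198.51.100.7"));
    }
}
```

The sketch emits all three columns even on a miss, so every row has the same shape as the ip, country_code3, region list in the transform query. Loading through getResourceAsStream is what avoids the file-name mangling problem, since the data is addressed by its path inside the jar rather than by a distributed file name.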
