Edward:
I don't have access to the individual data nodes, so I can't install
the pure Perl module. I tried distributing it via the add file
command, but that mangles the file name, which keeps Perl from
loading the module because the file name and package name don't match.
Kinda frustrating, but it is really all about trying to work around an
issue on Amazon's Elastic MapReduce. I love the service in general, but
some issues are frustrating.
On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]>
wrote:
Hope this helps.
Carl
How about this... can I run a standard Hadoop streaming job against a
Hive table that is stored as a sequence file? The idea would be to
break my Hive query into two separate tasks, run a Hadoop streaming
job in between, then pick up the Hive job afterwards.
Thoughts?
Adam
I actually did do this with a streaming job. The UDF was tied up with
the Apache/GPL issues.
Here is how I did this. First, install geo-ip-perl on all datanodes;
the query and the script follow.
ret = qp.run(
    " FROM ( " +
    "   FROM raw_web_data_hour " +
    "   SELECT transform( remote_ip ) " +
    "   USING 'perl geo_state.pl' " +
    "   AS ip, country_code3, region " +
    "   WHERE log_date_part='" + theDate + "' and log_hour_part='" + theHour + "' " +
    " ) a " +
    " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION (log_date_part='" + theDate + "',log_hour_part='" + theHour + "') " +
    "   SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
    "   GROUP BY a.country_code3,a.region,a.ip "
);
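
For what it's worth, the concatenation above could be wrapped in a small helper so the two partition values get checked before they are spliced into the query text. This is only a sketch under my own assumptions: the class name GeoQueryBuilder, the method build, and the digits-and-dashes check are hypothetical and not part of the original job. The result would still be handed to qp.run(...) exactly as above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original job: assembles the same
// HiveQL text that the message passes to qp.run(...).
final class GeoQueryBuilder {
    // Assumption: partition values contain only digits and dashes,
    // e.g. "2010-02-15" and "06". Anything else is rejected.
    private static final Pattern SAFE = Pattern.compile("[0-9-]+");

    static String build(String theDate, String theHour) {
        if (!SAFE.matcher(theDate).matches() || !SAFE.matcher(theHour).matches()) {
            throw new IllegalArgumentException("unexpected partition value");
        }
        String partition =
            "log_date_part='" + theDate + "', log_hour_part='" + theHour + "'";
        // One clause per element, joined with single spaces.
        return String.join(" ",
            "FROM (",
            "  FROM raw_web_data_hour",
            "  SELECT transform(remote_ip)",
            "  USING 'perl geo_state.pl'",
            "  AS ip, country_code3, region",
            "  WHERE log_date_part='" + theDate + "' AND log_hour_part='" + theHour + "'",
            ") a",
            "INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION (" + partition + ")",
            "SELECT a.country_code3, a.region, a.ip, count(1) AS theCount",
            "GROUP BY a.country_code3, a.region, a.ip");
    }

    public static void main(String[] args) {
        // Print the assembled query for inspection before running it.
        System.out.println(build("2010-02-15", "06"));
    }
}
```

Keeping one clause per line makes it easier to spot a missing space between clauses, which is easy to get wrong with hand-concatenated strings.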

geo_state.pl:

#!/usr/bin/perl
use Geo::IP;
use strict;

# Open the city database once, before reading any input.
my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat", GEOIP_STANDARD);

# Hive's transform() sends one remote_ip per line on STDIN.
while (<STDIN>) {
    #my $record = $gi->record_by_name("209.191.139.200");
    chomp($_);
    my $record = $gi->record_by_name($_);
    print STDERR "was sent $_ \n";
    if (defined $record) {
        # Tab-separated output: ip, country_code3, region
        print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n";
        print STDERR "return " . $record->region . "\n";
    } else {
        # No match: emit a single placeholder field.
        print "??\n";
        print STDERR "return was undefined \n";
    }
}

Good luck.