Re: Working UDF for GeoIP lookup?

Edward Capriolo Mon, 15 Feb 2010 09:02:56 -0800

On Mon, Feb 15, 2010 at 11:27 AM, Adam J. O'Donnell <[email protected]> wrote:
> Edward:
>
> I don't have access to the individual data nodes, so I can't install the
> pure perl module. I tried distributing it via the add file command, but that
> is mangling the file name, which causes perl to not load the module as the
> file name and package name don't match.  Kinda frustrating, but it is really
> all about trying to work around an issue on amazon's elastic map reduce.  I
> love the service in general, but some issues are frustrating.
>
> Sent from my iPhone
>
> On Feb 15, 2010, at 6:05, Edward Capriolo <[email protected]> wrote:
>
>> On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>>>>
>>>> Hope this helps.
>>>>
>>>> Carl
>>>
>>> How about this... can I run a standard hadoop streaming job against a
>>> hive table that is stored as a sequence file?  The idea would be I
>>> would break my hive query into two separate tasks and do a hadoop
>>> streaming job in between, then pick up the hive job afterwards.
>>> Thoughts?
>>>
>>> Adam
>>>
>>
>> I actually did do this with a streaming job. The UDF was tied up with
>> the apache/gpl issues.
>>
>> Here is how I did this. First, install geo-ip-perl on all datanodes.
>>
>>  ret = qp.run(
>>   " FROM ( "+
>>   " FROM raw_web_data_hour "+
>>   " SELECT transform( remote_ip ) "+
>>   " USING 'perl geo_state.pl' "+
>>   " AS ip, country_code3, region "+
>>   " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
>>   " ) a " +
>>   " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION "+
>>   " (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') "+
>>   " SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
>>   " GROUP BY a.country_code3,a.region,a.ip "
>>   );
>>
>>
>> #!/usr/bin/perl
>> use Geo::IP;
>> use strict;
>> my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat",
>> GEOIP_STANDARD);
>> while (<STDIN>){
>>  #my $record = $gi->record_by_name("209.191.139.200");
>>  chomp($_);
>>  my $record = $gi->record_by_name($_);
>>  print STDERR "was sent $_ \n" ;
>>  if (defined $record) {
>>   print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n";
>>   print STDERR "return " . $record->region . "\n" ;
>>  } else {
>>   print "??\n";
>>   print STDERR "return was undefined \n";
>>  }
>>
>> }
>>
>> Good luck.
>


Sorry to hear that you're having problems. It is a fairly simple UDF
for those familiar with writing UDFs/GenericUDFs. You could probably
embed the lookup data file in the jar as well. I meant to build and
host this on my site, but I have not gotten around to it. If you want
to tag-team it, I am interested.
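
A minimal sketch of what the lookup half of such a UDF might look like,
under stated assumptions. The class name GeoLookup, the CSV layout
"startIp,endIp,country,region", and the sample rows are all hypothetical;
this does not read the binary GeoIPCity.dat format. In a real Hive UDF the
class would presumably extend org.apache.hadoop.hive.ql.exec.UDF, and the
Reader would wrap a resource bundled in the jar rather than a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: GeoLookup and its CSV layout are assumptions,
// not part of Hive or of any GeoIP library.
class GeoLookup {
    // One row of the assumed lookup file: inclusive range end plus location.
    private static final class Range {
        final long end;
        final String country;
        final String region;
        Range(long end, String country, String region) {
            this.end = end;
            this.country = country;
            this.region = region;
        }
    }

    // Ranges keyed by start address, so floorEntry() finds the candidate range.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    // Reads lines of the assumed form "startIp,endIp,country,region".
    // Inside a jar, the Reader could wrap getResourceAsStream(...) instead.
    GeoLookup(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split(",", -1);
                if (f.length < 4) continue;            // skip blank or short rows
                long start = toLong(f[0]);
                long end = toLong(f[1]);
                if (start < 0 || end < start) continue; // skip malformed ranges
                ranges.put(start, new Range(end, f[2], f[3]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Converts a dotted-quad IPv4 string to its numeric value; -1 if malformed.
    static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) return -1;
        long value = 0;
        for (String p : parts) {
            int octet;
            try {
                octet = Integer.parseInt(p);
            } catch (NumberFormatException e) {
                return -1;
            }
            if (octet < 0 || octet > 255) return -1;
            value = (value << 8) | octet;
        }
        return value;
    }

    // Mirrors the Perl script's three tab-separated columns: ip, country, region.
    // Returns null when the address is malformed or not covered by any range.
    String evaluate(String ip) {
        if (ip == null) return null;
        long n = toLong(ip);
        if (n < 0) return null;
        Map.Entry<Long, Range> e = ranges.floorEntry(n);
        if (e == null || n > e.getValue().end) return null;
        return ip + "\t" + e.getValue().country + "\t" + e.getValue().region;
    }

    public static void main(String[] args) {
        // Made-up sample rows on documentation-only address ranges.
        String data = "192.0.2.0,192.0.2.255,USA,NY\n"
                    + "198.51.100.0,198.51.100.255,CAN,ON\n";
        GeoLookup lookup = new GeoLookup(new StringReader(data));
        System.out.println(lookup.evaluate("192.0.2.17"));
    }
}
```

Returning null for unknown addresses lets Hive surface them as NULL,
which is easier to filter on than the "??" placeholder the streaming
script prints.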
