On Mon, Feb 15, 2010 at 1:29 AM, Adam O'Donnell <[email protected]> wrote:
>> Hope this helps.
>>
>> Carl
>
> How about this... can I run a standard hadoop streaming job against a
> hive table that is stored as a sequence file?  The idea would be I
> would break my hive query into two separate tasks and do a hadoop
> streaming job in between, then pick up the hive job afterwards.
> Thoughts?
>
> Adam
>

I actually did this with a streaming job. The UDF approach was held up
by the Apache/GPL licensing issues.

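Here is how I did this. First, install geo-ip-perl on all datanodes.
Then run the query below, which pipes each remote_ip through
geo_state.pl (the Perl script after the query):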

  ret = qp.run(
    " FROM ( "+
    " FROM raw_web_data_hour "+
    " SELECT transform( remote_ip ) "+
    " USING 'perl geo_state.pl' "+
    " AS ip, country_code3, region "+
    " WHERE log_date_part='"+theDate+"' and log_hour_part='"+theHour+"' " +
    " ) a " +
    " INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION (log_date_part='"+theDate+"',log_hour_part='"+theHour+"') "+
    " SELECT a.country_code3, a.region,a.ip,count(1) as theCount " +
    " GROUP BY a.country_code3,a.region,a.ip "
    );
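
The query above is assembled by string concatenation, which is easy to
break when a mail client or editor wraps a line. As a rough sketch only,
the same text could be built in a small helper. GeoQueryBuilder and
build() are hypothetical names of mine, not part of Hive, and the quote
check is an added assumption rather than something the original code did.

```java
// Sketch only: GeoQueryBuilder and build() are hypothetical helper names.
final class GeoQueryBuilder {
    private GeoQueryBuilder() {}

    /** Returns the HiveQL for one (date, hour) partition as a single line. */
    static String build(String theDate, String theHour) {
        // The values are inlined as quoted literals, so refuse embedded quotes
        // (an added safety assumption, not in the original code).
        if (theDate.contains("'") || theHour.contains("'")) {
            throw new IllegalArgumentException("partition values must not contain quotes");
        }
        String partition =
            "log_date_part='" + theDate + "', log_hour_part='" + theHour + "'";
        // Join the clauses with single spaces so no clause runs into the next.
        return String.join(" ",
            "FROM (",
            "FROM raw_web_data_hour",
            "SELECT TRANSFORM(remote_ip)",
            "USING 'perl geo_state.pl'",
            "AS ip, country_code3, region",
            "WHERE log_date_part='" + theDate + "' AND log_hour_part='" + theHour + "'",
            ") a",
            "INSERT OVERWRITE TABLE raw_web_data_hour_geo PARTITION (" + partition + ")",
            "SELECT a.country_code3, a.region, a.ip, count(1) AS theCount",
            "GROUP BY a.country_code3, a.region, a.ip");
    }
}
```

With that in place the call would become
ret = qp.run(GeoQueryBuilder.build(theDate, theHour)); where qp is the
same object used above.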


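#!/usr/bin/perl
# geo_state.pl: reads one IP per line from STDIN and writes
# ip <TAB> country_code3 <TAB> region to STDOUT for the transform above.
use Geo::IP;
use strict;
my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoIPCity.dat", GEOIP_STANDARD);
while (<STDIN>){
  #my $record = $gi->record_by_name("209.191.139.200");
  chomp($_);
  my $record = $gi->record_by_name($_);
  # Debug output on STDERR; it does not become part of the result rows.
  print STDERR "was sent $_ \n" ;
  if (defined $record) {
    print $_ . "\t" . $record->country_code3 . "\t" . $record->region . "\n"  ;
    print STDERR "return " . $record->region . "\n" ;
  } else {
    # No match: only one column is written, so "??" lands in the ip column
    # and the other two columns should come back as NULL.
    print "??\n";
    print STDERR "return was undefined \n";
  }
}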

Good luck.
