Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by breed: http://wiki.apache.org/pig/PigExercise2

New page:

For this second exercise we are going to be a bit more adventurous. We are going to generate some example data for Exercise 1 using a shell script and a UDF. We will start off with a list of names in a file:

{{{
sn = load 'singlenames';
}}}

Now we are going to write a shell script to permute the names into a list of userids with ages. We will invoke it using: (Note, this time those quotes need to be back quotes.)

{{{
users = stream sn through `randid.sh` as (user, age);
}}}

randid.sh will receive the contents of 'singlenames' on standard input. Anything written to standard output will be taken as output tuples. By default tuples are separated by \n and fields by \t.

If you'd rather skip the pain of writing the randid.sh script yourself, here is an example:

{{{
#!/bin/bash

# Pick a random name from the list; one third of the time truncate it
# to its first two characters, one third of the time to its first three.
function partName() {
    name=${list[$((RANDOM%count))]}
    seg=$((RANDOM%3))
    if [ $seg -eq 1 ]
    then
        name=${name:0:2}
    fi
    if [ $seg -eq 2 ]
    then
        name=${name:0:3}
    fi
    echo -n $name
}

# Read the names from standard input into an array.
count=0
while read name
do
    list[$count]="$name"
    count=$((count+1))
done

# Emit one "userid<TAB>age" line per iteration. The userid is the
# concatenation of two (possibly truncated) names.
iterations=$((count*count/4))
while [ $iterations -gt 0 ]
do
    partName
    partName
    age=$(((RANDOM%50)+18))
    echo -e "\t$age"
    iterations=$((iterations-1))
done
}}}

Okay, now we have our users; let's generate the pages dataset. We want to generate a bunch of page requests for each user, so we will make a UDF that takes in tuples from users and generates fake traffic:

{{{
pages = foreach users generate flatten(pig.example.GenerateClicks(*)) as (user, url);
}}}

GenerateClicks needs to extend EvalFunc<DataBag>.
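The streaming record format described above (tuples separated by \n, fields by \t) can be sketched in isolation: Pig reads each stdout line from the streaming command as one tuple and splits it on tabs. A minimal sketch in Java — the class name and the sample line are made up for illustration, not part of the exercise:

```java
import java.util.Arrays;

// Sketch of how Pig interprets one line emitted by a streaming
// command: the line is one tuple, and tab characters separate fields.
public class StreamRecord {
    // Split a single stdout line into its tuple fields.
    static String[] toTuple(String line) {
        // -1 keeps trailing empty fields rather than discarding them.
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // A hypothetical line randid.sh might emit: userid, tab, age.
        String line = "joean\t37";
        String[] fields = toTuple(line);
        System.out.println(Arrays.toString(fields)); // [joean, 37]
    }
}
```

This is why the bare userid and the age in randid.sh must be joined by a tab: the `as (user, age)` clause expects exactly two tab-separated fields per line.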
Here is an example implementation:

{{{
package pig.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Random;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataAtom;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class GenerateClicks extends EvalFunc<DataBag> {
    Random rand = new Random(System.currentTimeMillis());

    String prefixes[] = { "finance", "www", "search", "mail", "photo",
        "personal", "news", "m", "video", "music", "answers", "i", "im",
        "svcs", "web", "shop", "help", "buy", "rec", "money" };
    String sites[] = { "cnn", "msn", "yahoo", "google", "aol", "live",
        "cnet", "ask", "boop", "slashdot", "nbc", "cbs", "baidu", };
    String suffixes[] = { "com", "net", "org", "us", "ca", "ch", "sg",
        "il", "ja", "uk", };

    // Skew the otherwise uniform selection by appending extra copies
    // of a few randomly chosen entries.
    void bias(ArrayList<String> l) {
        for (int i = 0; i < 4; i++) {
            int r = rand.nextInt(l.size());
            String e = l.get(r);
            for (int j = 0; j < i * 4; j++) {
                l.add(e);
            }
        }
    }

    ArrayList<String> prefix;
    ArrayList<String> site;
    ArrayList<String> suffix;

    public GenerateClicks() {
        prefix = new ArrayList<String>();
        for (String p : prefixes) {
            prefix.add(p);
        }
        site = new ArrayList<String>();
        for (String p : sites) {
            site.add(p);
        }
        suffix = new ArrayList<String>();
        for (String p : suffixes) {
            suffix.add(p);
        }
        bias(prefix);
        bias(site);
        bias(suffix);
    }

    String generateURL() {
        int p = rand.nextInt(prefix.size());
        int m = rand.nextInt(site.size());
        int e = rand.nextInt(suffix.size());
        return "http://" + prefix.get(p) + "." + site.get(m) + "."
            + suffix.get(e);
    }

    @Override
    public void exec(Tuple in, DataBag out) throws IOException {
        int count = rand.nextInt(1000 + rand.nextInt(10000));
        for (int i = 0; i < count; i++) {
            Tuple t = new Tuple();
            t.appendField((DataAtom) in.getField(0));
            t.appendField(new DataAtom(generateURL()));
            out.add(t);
        }
    }
}
}}}

Okay, so you compiled it, but you are getting a class not found exception. Pig needs to be able to find your UDF class and ship it to Hadoop.
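The bias() trick in GenerateClicks — skewing a uniform random pick by appending duplicate entries — is worth seeing on its own. A standalone sketch (the class name and sample data are mine, not from the exercise):

```java
import java.util.ArrayList;
import java.util.Random;

// Standalone illustration of the bias() technique from GenerateClicks:
// appending extra copies of a few entries makes a uniform nextInt()
// pick favor those entries.
public class BiasDemo {
    static void bias(ArrayList<String> l, Random rand) {
        for (int i = 0; i < 4; i++) {
            String e = l.get(rand.nextInt(l.size()));
            // i*4 extra copies: 0, then 4, 8, and 12 on later rounds.
            for (int j = 0; j < i * 4; j++) {
                l.add(e);
            }
        }
    }

    public static void main(String[] args) {
        ArrayList<String> l = new ArrayList<String>();
        for (String s : new String[] { "com", "net", "org" }) {
            l.add(s);
        }
        bias(l, new Random());
        // Whichever entries were picked, bias() always adds
        // 0 + 4 + 8 + 12 = 24 copies to the list.
        System.out.println(l.size()); // 27
    }
}
```

A later uniform pick over the biased list then favors the duplicated entries, which is what makes the generated URLs look like realistic, non-uniform traffic.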
We do this using register: create a jar file containing the class and issue

{{{
register myjar.jar;
}}}

before trying to use the UDF.

Why do we need flatten? (To answer that question, try that Pig Latin with and without flatten. Use describe to see the difference.)

The only thing left is to store everything:

{{{
store pages into 'pages';
store users into 'users';
}}}
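As a hint for the flatten question: without flatten, each input row of pages carries a single bag holding all of that user's (user, url) tuples; with flatten, every tuple in the bag becomes its own output row. A rough analogy in Java using lists of lists — the class name and sample values are mine, not Pig's API:

```java
import java.util.ArrayList;
import java.util.List;

// Rough analogy for Pig's flatten: one row holding a bag of tuples
// becomes one row per tuple.
public class FlattenDemo {
    static List<String> flatten(List<List<String>> bags) {
        List<String> rows = new ArrayList<String>();
        for (List<String> bag : bags) {
            rows.addAll(bag);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> bags = new ArrayList<List<String>>();
        List<String> b1 = new ArrayList<String>();
        b1.add("anna\thttp://www.cnn.com");
        b1.add("anna\thttp://mail.yahoo.com");
        List<String> b2 = new ArrayList<String>();
        b2.add("joe\thttp://news.msn.net");
        bags.add(b1);
        bags.add(b2);
        // Unflattened: 2 rows, each a bag.
        // Flattened: 3 rows, one per (user, url) tuple.
        System.out.println(flatten(bags).size()); // 3
    }
}
```

describe on the unflattened relation shows a bag in the schema; after flatten it shows the plain (user, url) fields.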
