Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by breed: http://wiki.apache.org/pig/PigExercise2

New page:

For this second exercise we are going to be a bit more adventurous. We are going to generate some example data for Exercise 1 using a shell script and a UDF. We will start off with a list of names in a file:

{{{
sn = load 'singlenames';
}}}

Now we are going to write a shell script to permute the names into a list of userids with ages. We will invoke it using: (Note, this time those quotes need to be back quotes.)

{{{
users = stream sn through `randid.sh` as (user, age);
}}}

randid.sh will receive the contents of 'singlenames' on standard input. Anything written to standard output will be taken as output tuples. By default tuples are separated by \n and fields by \t.

If you'd rather skip the pain of writing the randid.sh script yourself, here is an example:

{{{
#!/bin/bash

# Pick a random name from the list; one third of the time truncate it
# to its first two characters, one third of the time to its first three.
function partName() {
    name=${list[$((RANDOM%count))]}
    seg=$((RANDOM%3))
    if [ $seg -eq 1 ]
    then
        name=${name:0:2}
    fi
    if [ $seg -eq 2 ]
    then
        name=${name:0:3}
    fi
    echo -n $name
}

# Read the names from standard input into an array.
count=0
while read name
do
    list[$count]="$name"
    count=$((count+1))
done

# Emit one "userid<TAB>age" line per iteration. The userid is the
# concatenation of two (possibly truncated) names.
iterations=$((count*count/4))
while [ $iterations -gt 0 ]
do
    partName
    partName
    age=$(((RANDOM%50)+18))
    echo -e "\t$age"
    iterations=$((iterations-1))
done
}}}

Okay, now we have our users; let's generate the pages dataset. We want to generate a bunch of page requests for each user, so we will make a UDF that takes in tuples from users and generates fake traffic:

{{{
pages = foreach users generate flatten(pig.example.GenerateClicks(*)) as (user, url);
}}}

GenerateClicks needs to extend EvalFunc<DataBag>.
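The streaming record format described above (tuples separated by \n, fields by \t) can be sketched in isolation: Pig reads each stdout line from the streaming command as one tuple and splits it on tabs. A minimal sketch in Java — the class name and the sample line are made up for illustration, not part of the exercise:

```java
import java.util.Arrays;

// Sketch of how Pig interprets one line emitted by a streaming
// command: the line is one tuple, and tab characters separate fields.
public class StreamRecord {
    // Split a single stdout line into its tuple fields.
    static String[] toTuple(String line) {
        // -1 keeps trailing empty fields rather than discarding them.
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // A hypothetical line randid.sh might emit: userid, tab, age.
        String line = "joean\t37";
        String[] fields = toTuple(line);
        System.out.println(Arrays.toString(fields)); // [joean, 37]
    }
}
```

This is why the bare userid and the age in randid.sh must be joined by a tab: the `as (user, age)` clause expects exactly two tab-separated fields per line.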
Here is an example implementation:

{{{
package pig.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Random;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataAtom;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class GenerateClicks extends EvalFunc<DataBag> {
    Random rand = new Random(System.currentTimeMillis());

    String prefixes[] = { "finance", "www", "search", "mail", "photo",
        "personal", "news", "m", "video", "music", "answers", "i", "im",
        "svcs", "web", "shop", "help", "buy", "rec", "money" };
    String sites[] = { "cnn", "msn", "yahoo", "google", "aol", "live",
        "cnet", "ask", "boop", "slashdot", "nbc", "cbs", "baidu", };
    String suffixes[] = { "com", "net", "org", "us", "ca", "ch", "sg",
        "il", "ja", "uk", };

    // Skew the otherwise uniform selection by appending extra copies
    // of a few randomly chosen entries.
    void bias(ArrayList<String> l) {
        for (int i = 0; i < 4; i++) {
            int r = rand.nextInt(l.size());
            String e = l.get(r);
            for (int j = 0; j < i * 4; j++) {
                l.add(e);
            }
        }
    }

    ArrayList<String> prefix;
    ArrayList<String> site;
    ArrayList<String> suffix;

    public GenerateClicks() {
        prefix = new ArrayList<String>();
        for (String p : prefixes) {
            prefix.add(p);
        }
        site = new ArrayList<String>();
        for (String p : sites) {
            site.add(p);
        }
        suffix = new ArrayList<String>();
        for (String p : suffixes) {
            suffix.add(p);
        }
        bias(prefix);
        bias(site);
        bias(suffix);
    }

    String generateURL() {
        int p = rand.nextInt(prefix.size());
        int m = rand.nextInt(site.size());
        int e = rand.nextInt(suffix.size());
        return "http://" + prefix.get(p) + "." + site.get(m) + "."
            + suffix.get(e);
    }

    @Override
    public void exec(Tuple in, DataBag out) throws IOException {
        int count = rand.nextInt(1000 + rand.nextInt(10000));
        for (int i = 0; i < count; i++) {
            Tuple t = new Tuple();
            t.appendField((DataAtom) in.getField(0));
            t.appendField(new DataAtom(generateURL()));
            out.add(t);
        }
    }
}
}}}

Okay, so you compiled it, but you are getting a class not found exception. Pig needs to be able to find your UDF class and ship it to Hadoop.
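The bias() trick in GenerateClicks — skewing a uniform random pick by appending duplicate entries — is worth seeing on its own. A standalone sketch (the class name and sample data are mine, not from the exercise):

```java
import java.util.ArrayList;
import java.util.Random;

// Standalone illustration of the bias() technique from GenerateClicks:
// appending extra copies of a few entries makes a uniform nextInt()
// pick favor those entries.
public class BiasDemo {
    static void bias(ArrayList<String> l, Random rand) {
        for (int i = 0; i < 4; i++) {
            String e = l.get(rand.nextInt(l.size()));
            // i*4 extra copies: 0, then 4, 8, and 12 on later rounds.
            for (int j = 0; j < i * 4; j++) {
                l.add(e);
            }
        }
    }

    public static void main(String[] args) {
        ArrayList<String> l = new ArrayList<String>();
        for (String s : new String[] { "com", "net", "org" }) {
            l.add(s);
        }
        bias(l, new Random());
        // Whichever entries were picked, bias() always adds
        // 0 + 4 + 8 + 12 = 24 copies to the list.
        System.out.println(l.size()); // 27
    }
}
```

A later uniform pick over the biased list then favors the duplicated entries, which is what makes the generated URLs look like realistic, non-uniform traffic.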
We do this using register: create a jar file containing the class and issue

{{{
register myjar.jar;
}}}

before trying to use the UDF.

Why do we need flatten? (To answer that question, try that Pig Latin with and without flatten. Use describe to see the difference.)

The only thing left is to store everything:

{{{
store pages into 'pages';
store users into 'users';
}}}
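As a hint for the flatten question: without flatten, each input row of pages carries a single bag holding all of that user's (user, url) tuples; with flatten, every tuple in the bag becomes its own output row. A rough analogy in Java using lists of lists — the class name and sample values are mine, not Pig's API:

```java
import java.util.ArrayList;
import java.util.List;

// Rough analogy for Pig's flatten: one row holding a bag of tuples
// becomes one row per tuple.
public class FlattenDemo {
    static List<String> flatten(List<List<String>> bags) {
        List<String> rows = new ArrayList<String>();
        for (List<String> bag : bags) {
            rows.addAll(bag);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> bags = new ArrayList<List<String>>();
        List<String> b1 = new ArrayList<String>();
        b1.add("anna\thttp://www.cnn.com");
        b1.add("anna\thttp://mail.yahoo.com");
        List<String> b2 = new ArrayList<String>();
        b2.add("joe\thttp://news.msn.net");
        bags.add(b1);
        bags.add(b2);
        // Unflattened: 2 rows, each a bag.
        // Flattened: 3 rows, one per (user, url) tuple.
        System.out.println(flatten(bags).size()); // 3
    }
}
```

describe on the unflattened relation shows a bag in the schema; after flatten it shows the plain (user, url) fields.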
