Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

------------------------------------------------------------------------------
  || script2-hadoop.pig || Pig Script 2, Temporal Query Phrase Popularity (Hadoop cluster) ||
  || excite-small.log || Log file, Excite search engine (local mode) ||
  || excite.log || Log file, Excite search engine (Hadoop cluster) ||
- || pornwords || Data file (porn keywords) ||
  
  The user-defined functions (UDFs) are described here.
  
  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
  || N!GramGenerator || Composes n-grams from the set of words. ||
- || !NonPornDetector|| Removes the record if the query field includes porn terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. ||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
  || !ToLower || Changes the query field to lowercase. ||
@@ -131, +129 @@

  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
  }}}
  
-  * Call the !NonPornDetector UDF to remove records if the query field contains porn terms.
- {{{ 
- clean3 = FILTER clean2 BY org.apache.pig.tutorial.NonPornDetector(query);
- }}}
  
   * Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the !ExtractHour UDF to extract the hour (HH) from the time field.
  {{{ 
@@ -218, +212 @@

  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
  }}}
  
- 
-  * Call the Non!PornDetector UDF to remove records if the query field contains porn terms.
- {{{
- clean3 = FILTER clean2 BY org.apache.pig.tutorial.NonPornDetector(query);
- }}}
- 
   
   * Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the !ExtractHour UDF to extract the hour from the time field.
  {{{
