Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

------------------------------------------------------------------------------
  || script2-hadoop.pig || Pig Script 2, Temporal Query Phrase Popularity (Hadoop cluster) ||
  || excite-small.log || Log file, Excite search engine (local mode) ||
  || excite.log || Log file, Excite search engine (Hadoop cluster) ||
- || pornwords || Data file (porn keywords) ||

  The user-defined functions (UDFs) are described here.

  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
  || N!GramGenerator || Composes n-grams from the set of words. ||
- || !NonPornDetector|| Removes the record if the query field includes porn terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. ||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
  || !ToLower || Changes the query field to lowercase. ||

@@ -131, +129 @@
  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
  }}}

-  * Call the !NonPornDetector UDF to remove records if the query field contains porn terms.
-  {{{
- clean3 = FILTER clean2 BY org.apache.pig.tutorial.NonPornDetector(query);
- }}}
   * Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the !ExtractHour UDF to extract the hour (HH) from the time field.
  {{{
@@ -218, +212 @@
  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
  }}}
-
-  * Call the Non!PornDetector UDF to remove records if the query field contains porn terms.
-  {{{
- clean3 = FILTER clean2 BY org.apache.pig.tutorial.NonPornDetector(query);
- }}}
-
   * Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the !ExtractHour UDF to extract the hour from the time field.
  {{{