Repository: mahout
Updated Branches:
  refs/heads/master 3eb9fdf92 -> 441460e77
MAHOUT-1564: Naive Bayes Classifier for New Text Documents closes apache/mahout#91

Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/441460e7
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/441460e7
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/441460e7

Branch: refs/heads/master
Commit: 441460e77cd38acc684cb2351dad5f0e6156c1f0
Parents: 3eb9fdf
Author: Andrew Palumbo <[email protected]>
Authored: Wed Apr 1 17:20:39 2015 -0400
Committer: Andrew Palumbo <[email protected]>
Committed: Wed Apr 1 17:21:47 2015 -0400

----------------------------------------------------------------------
 CHANGELOG                                     |   6 +-
 examples/bin/spark-document-classifier.mscala | 195 +++++++++++++++++++++
 2 files changed, 200 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/mahout/blob/441460e7/CHANGELOG
----------------------------------------------------------------------
diff --git a/CHANGELOG b/CHANGELOG
index 3c29278..45869e4 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,7 +1,11 @@
 Mahout Change Log
 Release 0.10.0 - unreleased
-
+
+  MAHOUT-1564: Naive Bayes Classifier for New Text Documents (apalumbo)
+
+  MAHOUT-1524: Script to auto-generate and view the Mahout website on a local machine (Saleem Ansari via apalumbo)
+
   MAHOUT-1589: Deprecate mahout.cmd due to lack of support
   MAHOUT-1655: Refactors mr-legacy into mahout-hdfs and mahout-mr, Spark now depends on much reduced mahout-hdfs

http://git-wip-us.apache.org/repos/asf/mahout/blob/441460e7/examples/bin/spark-document-classifier.mscala
----------------------------------------------------------------------
diff --git a/examples/bin/spark-document-classifier.mscala b/examples/bin/spark-document-classifier.mscala
new file mode 100644
index 0000000..9700253
--- /dev/null
+++ b/examples/bin/spark-document-classifier.mscala
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache
Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+*/
+
+/*
+ * Binary Naive Bayes classifier (United States, United Kingdom) example for an out-of-sample document, based
+ * on a model trained on the Wikipedia XML dump.
+ *
+ * NOTE: As of version 0.10.0 Mahout uses MapReduce seq2sparse to vectorize large text corpora.
+ *
+ * To run this example, first run:
+ *   $MAHOUT_HOME/examples/bin/classify-wikipedia.sh --> option 2
+ *
+ * then from the Mahout spark-shell:
+ *   :load $MAHOUT_HOME/examples/bin/spark-document-classifier.mscala
+*/
+
+import org.apache.mahout.classifier.naivebayes._
+import org.apache.mahout.classifier.stats._
+import org.apache.mahout.nlp.tfidf._
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.io.IntWritable
+import org.apache.hadoop.io.LongWritable
+
+val pathToData = "/tmp/mahout-work-wiki/"
+
+// read in our full set as vectorized by seq2sparse in classify-wikipedia.sh
+val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
+//val trainData = drmDfsRead(pathToData + "training")
+//val testData = drmDfsRead(pathToData + "testing")
+
+// build a standard NaiveBayes model using the full dataset (training + testing)
+val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(fullData)
+val model = NaiveBayes.train(aggregatedObservations, labelIndex, false)
+
+// self test on the full set
+val resAnalyzer = NaiveBayes.test(model, fullData, false)
+
+// display the confusion matrix
+println(resAnalyzer)
+
+// read in the dictionary and document frequency count
+val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0", classOf[Text], classOf[IntWritable])
+val documentFrequencyCount = sdc.sequenceFile(pathToData + "wikipediaVecs/df-count", classOf[IntWritable], classOf[LongWritable])
+
+// setup the dictionary and document frequency count as maps
+val dictionaryRDD = dictionary.map { case (wKey, wVal) => wKey.asInstanceOf[Text].toString() -> wVal.get() }
+val documentFrequencyCountRDD = documentFrequencyCount.map{ case (wKey, wVal) => wKey.asInstanceOf[IntWritable].get() -> wVal.get() }
+
+val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
+val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
+
+// for
this simple example, tokenize our document into unigrams using native string methods and vectorize using
+// our dictionary and document frequencies. You could also use a Lucene analyzer for bigrams, trigrams, etc.
+def vectorizeDocument(document: String,
+                      dictionaryMap: Map[String,Int],
+                      dfMap: Map[Int,Long]): Vector = {
+
+  val wordCounts = document.replaceAll("[^\\p{L}\\p{Nd}]+", " ").toLowerCase.split(" ").groupBy(identity).mapValues(_.length)
+
+  val vec = new RandomAccessSparseVector(dictionaryMap.size)
+
+  val totalDFSize = dfMap(-1)
+  val docSize = wordCounts.size
+
+  for (word <- wordCounts) {
+    val term = word._1
+    if (dictionaryMap.contains(term)) {
+      val tfidf: TFIDF = new TFIDF()
+      val termFreq = word._2
+      val dictIndex = dictionaryMap(term)
+      val docFreq = dfMap(dictIndex)
+      val currentTfIdf = tfidf.calculate(termFreq, docFreq.toInt, docSize, totalDFSize.toInt)
+      vec.setQuick(dictIndex, currentTfIdf)
+    }
+  }
+  vec
+}
+
+val labelMap = model.labelIndex
+val numLabels = model.numLabels
+val reverseLabelMap = labelMap.map(x => x._2 -> x._1)
+
+// instantiate the correct type of classifier
+val classifier = model.isComplementary match {
+  case true => new ComplementaryNBClassifier(model)
+  case _ => new StandardNBClassifier(model)
+}
+
+// the label with the highest score wins the classification for a given document
+def argmax(v: Vector): (Int, Double) = {
+  var bestIdx: Int = Integer.MIN_VALUE
+  var bestScore: Double = Integer.MIN_VALUE.asInstanceOf[Int].toDouble
+  for(i <- 0 until v.size) {
+    if(v(i) > bestScore){
+      bestScore = v(i)
+      bestIdx = i
+    }
+  }
+  (bestIdx, bestScore)
+}
+
+// our final classifier
+def classifyDocument(clvec: Vector) : String ={
+  val cvec = classifier.classifyFull(clvec)
+  val (bestIdx, bestScore) = argmax(cvec)
+  reverseLabelMap(bestIdx)
+}
+
+// A random United States football article
+// http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128
+val UStextToClassify = new
String("(Reuters) - Super Bowl security officials acknowledge the NFL championship game represents" +
+    " a high profile target on a world stage but are unaware of any specific credible threats against" +
+    " Sunday's showcase. In advance of one of the world's biggest single day sporting events, Homeland" +
+    " Security Secretary Jeh Johnson was in Glendale on Wednesday to review security preparations and" +
+    " tour University of Phoenix Stadium where the Seattle Seahawks and New England Patriots will battle." +
+    " Deadly shootings in Paris and arrest of suspects in Belgium, Greece and Germany heightened fears of" +
+    " more attacks around the world and social media accounts linked to Middle East militant groups have" +
+    " carried a number of threats to attack high-profile U.S. events. There is no specific credible" +
+    " threat, said Johnson, who has appointed a federal coordination team to work with local, state and" +
+    " federal agencies to ensure safety of fans, players and other workers associated with the Super Bowl." +
+    " I'm confident we will have a safe and secure and successful event. Sunday's game has been given a" +
+    " Special Event Assessment Rating (SEAR) 1 rating, the same as in previous years, except for the year" +
+    " after the Sept. 11, 2001 attacks, when a higher level was declared. But security will be tight and" +
+    " visible around Super Bowl-related events as well as during the game itself. All fans will pass through" +
+    " metal detectors and pat downs. Over 4,000 private security personnel will be deployed and the almost" +
+    " 3,000 member Phoenix police force will be on Super Bowl duty. Nuclear device sniffing teams will be" +
+    " deployed and a network of Bio-Watch detectors will be set up to provide a warning in the event of " +
+    " a biological attack. The Department of Homeland Security (DHS) said in a press release it had held " +
+    " special cyber-security and anti-sniper training sessions. A U.S.
official said the Transportation " +
+    " Security Administration, which is responsible for screening airline passengers, will add screeners " +
+    " and checkpoint lanes at airports. Federal air marshals, behavior detection officers and dog teams " +
+    " will help to secure transportation systems in the area. We will be ramping it (security) up on Sunday," +
+    " there is no doubt about that, said Federal Coordinator Matthew Allen, the DHS point of contact for " +
+    " planning and support. I have every confidence the public safety agencies that represented in the " +
+    " planning process are going to have their best and brightest out there this weekend and we will have" +
+    " a very safe Super Bowl.")
+
+// A random United Kingdom football article
+// http://www.reuters.com/article/2015/01/26/manchester-united-swissquote-idUSL6N0V52RZ20150126
+val UKtextToClassify = new String("(Reuters) - Manchester United have signed a sponsorship deal with online financial trading company" +
+    " Swissquote, expanding the commercial partnerships that have helped to make the English club one of" +
+    " the richest teams in world soccer. United did not give a value for the deal, the club's first in the" +
+    " sector, but said on Monday it was a multi-year agreement. The Premier League club, 20 times English" +
+    " champions, claim to have 659 million followers around the globe, making the United name attractive to" +
+    " major brands like Chevrolet cars and sportswear group Adidas. Swissquote said the global deal would" +
+    " allow it to use United's popularity in Asia to help it meet its targets for expansion in China. Among" +
+    " benefits from the deal, Swissquote's clients will have a chance to meet United players and get behind" +
+    " the scenes at the Old Trafford stadium. Swissquote is a Geneva-based online trading company that allows" +
+    " retail investors to buy and sell foreign exchange, equities, bonds and other asset classes.
Like other" +
+    " retail FX brokers, Swissquote was left nursing losses on the Swiss franc after Switzerland's central bank" +
+    " stunned markets this month by abandoning its cap on the currency. The fallout from the abrupt move put rival" +
+    " and West Ham United shirt sponsor Alpari UK into administration. Swissquote itself was forced to book a 25 " +
+    " million Swiss francs ($28 million) provision for its clients who were left out of pocket following the" +
+    " franc's surge. United's ability to grow revenues off the pitch has made them the second richest club in" +
+    " the world behind Spain's Real Madrid, despite a downturn in their playing fortunes. United Managing" +
+    " Director Richard Arnold said there was still lots of scope for United to develop sponsorships in" +
+    " other areas of business. The last quoted statistics that we had showed that of the top 25 sponsorship" +
+    " categories, we were only active in 15 of those, Arnold told Reuters. I think there is a huge potential" +
+    " still for the club, and the other thing we have seen is there is very significant growth even within" +
+    " categories. United have endured a tricky transition following the retirement of manager Alex Ferguson" +
+    " in 2013, finishing seventh in the Premier League last season and missing out on a place in the lucrative" +
+    " Champions League.
($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima Kelly;" +
+    " editing by Keith Weir)")
+
+val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
+val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)
+
+println("Classifying the news article about the Super Bowl (United States)")
+classifyDocument(usVec)
+
+println("Classifying the news article about Manchester United (United Kingdom)")
+classifyDocument(ukVec)
+
+// to classify new text, simply run this method on a string
+def classifyText(txt: String): String = {
+  val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
+  classifyDocument(v)
+}
+
+
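
Note on the vectorize/argmax steps in the script above: the flow (tokenize into unigrams, look each term up in the dictionary, weight it by TF-IDF, then pick the label with the highest score) can be tried without a Spark or Mahout session. The following is a minimal stand-alone sketch, not the committed code. The object name VectorizeSketch, the Map-based sparse vector, and the explicit numDocs parameter are hypothetical; the script itself reads the corpus size from key -1 of the document frequency map. The weighting formula sqrt(tf) * (1 + ln(numDocs / (df + 1))) is an assumption in the style of Lucene's classic similarity and may not match what Mahout's TFIDF class computes.

```scala
// Hypothetical stand-alone sketch; no Mahout or Spark classes are used.
object VectorizeSketch {

  // Lower-case the text, turn every run of non-letter/non-digit characters
  // into a single space, and count the resulting unigrams.
  def wordCounts(document: String): Map[String, Int] =
    document.replaceAll("[^\\p{L}\\p{Nd}]+", " ").toLowerCase
      .split(" ").filter(_.nonEmpty)
      .groupBy(identity).map { case (word, hits) => word -> hits.length }

  // ASSUMPTION: Lucene-classic-style weighting, not necessarily Mahout's TFIDF.
  def tfidf(tf: Int, df: Long, numDocs: Long): Double =
    math.sqrt(tf.toDouble) * (1.0 + math.log(numDocs.toDouble / (df + 1).toDouble))

  // Sparse vector as Map(dictionaryIndex -> weight).
  // Terms that are not in the dictionary are dropped.
  def vectorize(document: String,
                dictionary: Map[String, Int],
                df: Map[Int, Long],
                numDocs: Long): Map[Int, Double] =
    wordCounts(document).collect {
      case (term, tf) if dictionary.contains(term) =>
        val idx = dictionary(term)
        idx -> tfidf(tf, df.getOrElse(idx, 0L), numDocs)
    }

  // Index and value of the largest score.
  def argmax(scores: Seq[Double]): (Int, Double) = {
    require(scores.nonEmpty, "scores must not be empty")
    val (best, idx) = scores.zipWithIndex.maxBy(_._1)
    (idx, best)
  }
}
```

With a toy dictionary such as Map("bowl" -> 0, "club" -> 1), VectorizeSketch.vectorize("bowl bowl zebra", ...) keeps only index 0, because "zebra" is out of vocabulary. Starting the search from the first element, as argmax does here, also avoids relying on a sentinel minimum score.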
