Make WikipediaDatasetCreatorMapper fuzzy category match respect word boundaries
-------------------------------------------------------------------------------

                 Key: MAHOUT-271
                 URL: https://issues.apache.org/jira/browse/MAHOUT-271
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
   Affects Versions: 0.2
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel
            Priority: Trivial
             Fix For: 0.3

WikipediaDatasetCreatorDriver is useful for creating categorisation corpora out of Wikipedia. However, the category match just does a String#contains check, which can catch a lot of unrelated categories.

Checking the word boundaries with a regexp such as String.format("\\b%s\\b", theCategoryNameIAmLookingFor); should fix the issue.
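
To illustrate the difference between the two checks, here is a minimal standalone sketch. It is not the Mahout code: the class name CategoryMatch and both method names are hypothetical. The use of Pattern.quote is also an assumption added here, not part of the proposal above; it makes regex metacharacters in the category name (such as "." or "(") match literally.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Mahout.
public class CategoryMatch {

    // Substring check, as the issue describes the current behaviour.
    static boolean containsMatch(String documentCategory, String wanted) {
        return documentCategory.contains(wanted);
    }

    // Word-boundary check following the regexp suggested in the issue.
    // Pattern.quote is an addition (assumption): it escapes the category name
    // so that regex metacharacters in it are matched literally.
    static boolean wordBoundaryMatch(String documentCategory, String wanted) {
        Pattern pattern =
            Pattern.compile(String.format("\\b%s\\b", Pattern.quote(wanted)));
        return pattern.matcher(documentCategory).find();
    }

    public static void main(String[] args) {
        // "art" is a substring of "smart", so the substring check matches.
        System.out.println(containsMatch("smart cards", "art"));         // true
        // No word boundary before "art" inside "smart", so no match.
        System.out.println(wordBoundaryMatch("smart cards", "art"));     // false
        // "art" stands alone as a word here, so it matches.
        System.out.println(wordBoundaryMatch("history of art", "art"));  // true
    }
}
```

If the same category names are checked against many documents, compiling each Pattern once and reusing it would avoid recompiling the regexp on every call.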