[
https://issues.apache.org/jira/browse/MAHOUT-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913496#action_12913496
]
Olivier Grisel commented on MAHOUT-271:
---------------------------------------
Suppose you are interested in articles related to the country of India.
If you choose to disable exactMatchOnly, you will find articles in
categories that are perfectly fine, such as "History of India" and "Economy
of India". But you will also get completely unrelated articles, such as those
from the category "Politics of Indiana".
The word boundaries trick would help get rid of those false positives.
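
To make the difference concrete, here is a minimal sketch comparing the plain
substring check with a word-boundary regexp. The class and method names are
made up for illustration and are not the actual Mahout mapper code; the use of
Pattern.quote is also my own addition, so that special characters in a
category name are not interpreted as regexp syntax.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: contrasts the two matching strategies.
final class CategoryMatch {

    // Substring check, in the spirit of the String#contains test described in the issue.
    static boolean containsMatch(String articleCategory, String wanted) {
        return articleCategory.contains(wanted);
    }

    // Word-boundary check: the wanted name must not be glued to other word characters.
    static boolean wordBoundaryMatch(String articleCategory, String wanted) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(wanted) + "\\b");
        return p.matcher(articleCategory).find();
    }

    public static void main(String[] args) {
        String wanted = "India";
        for (String category : new String[] {
                "History of India", "Economy of India", "Politics of Indiana"}) {
            System.out.println(category
                + " contains=" + containsMatch(category, wanted)
                + " wordBoundary=" + wordBoundaryMatch(category, wanted));
        }
    }
}
```

With this sketch, "Politics of Indiana" passes the substring check but fails
the word-boundary check, while the two "... of India" categories pass both.
Note that \b is only meaningful when the category name starts and ends with a
word character.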
Alternatively, it is always possible to use the exact category taxonomy data
from http://download.wikimedia.org/enwiki/latest/enwiki-latest-category.sql.gz
to expand the subcategories of interest, and then use exactMatchOnly on the
expanded subcategories in Mahout processing.
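
A rough sketch of that second approach, assuming the taxonomy has already been
loaded into an in-memory parent-to-children map (the map, the class name and
the sample categories below are hypothetical; parsing the SQL dump is not
shown): collect the root category plus all of its transitive subcategories,
then test article categories for exact membership in that set.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout.
final class SubcategoryExpansion {

    // Returns the root plus every category reachable from it.
    // The "seen" set also guards against cycles in the category graph.
    static Set<String> expand(String root, Map<String, List<String>> children) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            String category = todo.pop();
            if (seen.add(category)) {
                for (String child : children.getOrDefault(category, Collections.emptyList())) {
                    todo.push(child);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Toy taxonomy for illustration only.
        Map<String, List<String>> children = new HashMap<>();
        children.put("India", Arrays.asList("History of India", "Economy of India"));
        children.put("History of India", Arrays.asList("Ancient India"));

        Set<String> wanted = expand("India", children);
        // Exact matching against the expanded set.
        System.out.println(wanted.contains("Economy of India"));
        System.out.println(wanted.contains("Politics of Indiana"));
    }
}
```

Because membership is tested by exact equality, a category such as "Politics
of Indiana" can only match if the taxonomy itself lists it under the root.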
> Make WikipediaDatasetCreatorMapper fuzzy category match respect word
> boundaries
> -------------------------------------------------------------------------------
>
> Key: MAHOUT-271
> URL: https://issues.apache.org/jira/browse/MAHOUT-271
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.2
> Reporter: Olivier Grisel
> Assignee: Olivier Grisel
> Priority: Trivial
> Fix For: 0.4
>
>
> WikipediaDatasetCreatorDriver is useful for creating categorisation corpora
> out of Wikipedia; however, the category match just does a String#contains
> check, which can catch a lot of unrelated categories.
> Checking the word boundaries with a regexp such as String.format("\\b%s\\b",
> theCategoryNameIAmLookingFor) should fix the issue.