[ 
https://issues.apache.org/jira/browse/MAHOUT-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913496#action_12913496
 ] 

Olivier Grisel commented on MAHOUT-271:
---------------------------------------

Suppose you are interested in articles related to the country of India.

If you disable exactMatchOnly, you will find articles in perfectly valid 
categories such as "History of India" and "Economy of India". But you will 
also get completely unrelated articles, such as those in the category 
"Politics of Indiana".

The word-boundary trick would get rid of those false positives. 
Alternatively, it is always possible to use the exact category taxonomy data 
from http://download.wikimedia.org/enwiki/latest/enwiki-latest-category.sql.gz 
and then use exactMatchOnly on the expanded subcategories in Mahout 
processing.
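
To make the "India" vs. "Indiana" case concrete, here is a minimal sketch 
comparing a plain substring check with a word-boundary regexp. It is not the 
actual WikipediaDatasetCreatorMapper code: the class and method names are 
made up for illustration, and the case-insensitive comparison and the use of 
Pattern.quote (to escape regexp metacharacters in the category name) are my 
own assumptions.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not part of Mahout).
final class CategoryMatchSketch {

    // Fuzzy match as a plain substring check (the String#contains approach).
    static boolean containsMatch(String articleCategory, String wanted) {
        return articleCategory.toLowerCase(Locale.ROOT)
                .contains(wanted.toLowerCase(Locale.ROOT));
    }

    // Fuzzy match that requires the wanted name to start and end on word boundaries.
    static boolean wordBoundaryMatch(String articleCategory, String wanted) {
        // Pattern.quote is an assumption: it keeps characters such as '(' or '.'
        // in a category name from being interpreted as regexp syntax.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(wanted) + "\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(articleCategory).find();
    }

    public static void main(String[] args) {
        String[] categories = {
            "History of India", "Economy of India", "Politics of Indiana"
        };
        for (String c : categories) {
            System.out.println(c
                    + " | contains=" + containsMatch(c, "India")
                    + " | wordBoundary=" + wordBoundaryMatch(c, "India"));
        }
    }
}
```

With "India" as the wanted category, both checks accept the first two 
categories, while only the substring check accepts "Politics of Indiana".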

> Make WikipediaDatasetCreatorMapper fuzzy category match respect word 
> boundaries
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-271
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-271
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>            Priority: Trivial
>             Fix For: 0.4
>
>
> WikipediaDatasetCreatorDriver is useful for creating categorisation corpora 
> out of Wikipedia; however, the category match just does a String#contains 
> check, which can catch a lot of unrelated categories.
> Checking the word boundaries with a regexp such as String.format("\\b%s\\b", 
> theCategoryNameIAmLookingFor); should fix the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
