[ 
https://issues.apache.org/jira/browse/MAHOUT-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914315#action_12914315
 ] 

Joe Prasanna Kumar commented on MAHOUT-271:
-------------------------------------------

I am not sure I completely understand your question:
-- For the case that there is no requirement for an exactMatch, why do we need 
to check if the category contains the inputCategory?
Actually Indiana should not be listed in the results when we do an inexact 
match search.

If we are doing an inexact-match search, then we match any category that 
contains the specified input category. For example, when you search for a 
category called "India", the current code iterates over the category set 
(inputCategories) and collects every category that has the string "India" in 
it. So it matches "Politics of India" and "People Of India", but also "People 
of Indiana" etc. What we are really interested in is anything that matches the 
word India (here, "Politics of India" and "People Of India").
So instead of using the contains method, we use a regular expression that 
matches only categories containing the whole word "India", thereby shortlisting 
"Politics of India" and "People Of India".
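
To illustrate the difference, here is a minimal sketch. It is not the code from 
the patch: the class and method names are hypothetical, the match is 
case-sensitive, and Pattern.quote is my own addition so that a category name 
containing regex metacharacters is treated literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Mahout.
public class CategoryMatchSketch {

  // Substring match: also accepts "People of Indiana" when searching for "India".
  public static List<String> containsMatch(List<String> categories, String input) {
    List<String> hits = new ArrayList<>();
    for (String category : categories) {
      if (category.contains(input)) {
        hits.add(category);
      }
    }
    return hits;
  }

  // Whole-word match: \b anchors the input at word boundaries on both sides.
  public static List<String> wholeWordMatch(List<String> categories, String input) {
    Pattern pattern = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");
    List<String> hits = new ArrayList<>();
    for (String category : categories) {
      if (pattern.matcher(category).find()) {
        hits.add(category);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> categories =
        Arrays.asList("Politics of India", "People Of India", "People of Indiana");
    // prints [Politics of India, People Of India, People of Indiana]
    System.out.println(containsMatch(categories, "India"));
    // prints [Politics of India, People Of India]
    System.out.println(wholeWordMatch(categories, "India"));
  }
}
```

Note that find() is used rather than matches(), since the word only has to 
appear somewhere inside the category name.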

Does that answer your question?


> Make WikipediaDatasetCreatorMapper fuzzy category match respect word 
> boundaries
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-271
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-271
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>            Priority: Trivial
>             Fix For: 0.4
>
>         Attachments: MAHOUT-271.patch
>
>
> WikipediaDatasetCreatorDriver is useful to create categorisation corpora out 
> of Wikipedia; however, the category match just does a String#contains check, 
> which can catch a lot of unrelated categories.
> Checking the word boundaries with a regexp such as String.format("\\b%s\\b", 
> theCategoryNameIAmLookingFor); should fix the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.