[ 
https://issues.apache.org/jira/browse/MAHOUT-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914295#action_12914295
 ] 

Gangadhar Nittala commented on MAHOUT-271:
------------------------------------------

Joe / Olivier,
I have a question about the exactMatchOnly check in 
WikipediaDatasetCreatorMapper.java for this issue. The current code reads:

{code:title=WikipediaDatasetCreatorMapper.java|borderStyle=solid}
if (exactMatchOnly && inputCategories.contains(category)) {
  return category;
} else if (!exactMatchOnly) {
  for (String inputCategory : inputCategories) {
    if (category.contains(inputCategory)) { // we have an inexact match
      return inputCategory;
    }
  }
}
{code}
When an exact match is not required, why do we need to check whether the 
category contains the inputCategory? Couldn't we flip the search to the 
following?
{code}
if (!exactMatchOnly) {
  if (inputCategories.contains(category)) {
    return category;
  }
}
{code}
That way, as Olivier explained, _Indiana_ would also figure in the 
category for _India_.
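
To make the comparison concrete, here is a minimal sketch that puts the checks side by side. The class CategoryMatchSketch and its method names are hypothetical and not part of Mahout, and the sketch assumes inputCategories is a Set<String>, so that inputCategories.contains(category) is an equality lookup rather than a substring test. The third method follows the word-boundary regexp suggested in the issue description, with Pattern.quote added as a precaution of mine.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not Mahout code: it only contrasts the three checks.
final class CategoryMatchSketch {

  // Substring check, as in the current inexact branch.
  static String substringMatch(Set<String> inputCategories, String category) {
    for (String inputCategory : inputCategories) {
      if (category.contains(inputCategory)) {
        return inputCategory;
      }
    }
    return null; // no match
  }

  // Set membership, as in the flipped check (assumes a Set<String>).
  static String setMembershipMatch(Set<String> inputCategories, String category) {
    return inputCategories.contains(category) ? category : null;
  }

  // Word-boundary check, following the regexp proposed in the issue description.
  static String wordBoundaryMatch(Set<String> inputCategories, String category) {
    for (String inputCategory : inputCategories) {
      // Pattern.quote guards against regex metacharacters in category names.
      Pattern pattern = Pattern.compile("\\b" + Pattern.quote(inputCategory) + "\\b");
      if (pattern.matcher(category).find()) {
        return inputCategory;
      }
    }
    return null; // no match
  }

  public static void main(String[] args) {
    Set<String> inputCategories = new LinkedHashSet<>(Arrays.asList("India"));
    for (String category : new String[] {"India", "Indiana", "History of India"}) {
      System.out.println(category
          + ": substring=" + substringMatch(inputCategories, category)
          + ", membership=" + setMembershipMatch(inputCategories, category)
          + ", wordBoundary=" + wordBoundaryMatch(inputCategories, category));
    }
  }
}
```

Under that assumption, the substring check maps _Indiana_ to _India_, set membership matches only the literal _India_, and the word-boundary check accepts _History of India_ while rejecting _Indiana_.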

Am I missing something basic here?


> Make WikipediaDatasetCreatorMapper fuzzy category match respect word 
> boundaries
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-271
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-271
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>            Priority: Trivial
>             Fix For: 0.4
>
>         Attachments: MAHOUT-271.patch
>
>
> WikipediaDatasetCreatorDriver is useful to create categorisation corpora out 
> of Wikipedia, however the category match just does a String#contains check 
> which can catch a lot of unrelated categories.
> Checking the word boundaries with a regexp such as String.format("\\b%s\\b", 
> theCategoryNameIAmLookingFor); should fix the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
