[jira] Commented: (MAHOUT-271) Make WikipediaDatasetCreatorMapper fuzzy category match respect word boundaries

Joe Prasanna Kumar (JIRA) Fri, 24 Sep 2010 16:44:58 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914708#action_12914708
 ]


Joe Prasanna Kumar commented on MAHOUT-271:
-------------------------------------------

In your patch, you have handled the situation for a word boundary when
exactMatchOnly is false (i.e. when the user says I don't want an exact
match). Shouldn't your patch address the case when the user says that
(s)he wants an exactMatch (i.e. when exactMatchOnly is true)?
Shouldn't the check for the word boundaries be in the condition where
exactMatch is true?
{color:blue}
When the user wants an exact match, we check for the exact word. For example, 
when the user wants an exact match for the category "India", we match only the 
category "India", which is done using contains() on the inputCategories Set.
{color}
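
For illustration, the exact-match case described above can be sketched as follows. This is a minimal sketch, not the code from the patch: the class name ExactMatchSketch and the method exactMatch are hypothetical, and only the name inputCategories and the use of contains() come from the discussion.

```java
import java.util.Set;

// Hypothetical sketch of the exact-match branch; not the actual Mahout code.
final class ExactMatchSketch {

    // Returns true only when the document category is, as a whole string,
    // one of the input categories. A single Set lookup is enough here, so
    // no iteration over inputCategories is needed.
    static boolean exactMatch(String documentCategory, Set<String> inputCategories) {
        return inputCategories.contains(documentCategory);
    }
}
```

With inputCategories holding only "India", this accepts the category "India" and rejects "Politics of India", because contains() compares whole strings.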
And my question was about the case when exactMatch is false. In that
case, why do we need to iterate to check each category? In this case we
just want to check if the set (inputCategories) contains the string
pattern category.
{color:blue}
When exactMatch is false, we try to match all categories that are close to the 
input category. For example, when we specify "India" in the input categories 
file, we try to match all documents in the Wikipedia data set whose categories 
contain something similar to "India", such as "Politics of India", 
"People of India", and "People from Indiana". But among the categories that 
get matched is "People from Indiana", which is not of interest to us, since we 
are searching for something similar to the whole word "India". Hence I have 
modified the code to do a word-boundary match. So now, when we search for 
"India", only the Wikipedia articles with categories such as "Politics of 
India" and "People of India" would match, which is what we want.
{color}
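
To make the difference concrete, here is a rough sketch of a substring check next to a word-boundary check. It is not the code from the attached patch: the class and method names are hypothetical, and the use of Pattern.quote and case-sensitive matching are assumptions made for the example. The "\\b...\\b" pattern follows the suggestion in the issue description. Because a document category can contain an input category anywhere inside it, each input category has to be tried in turn, which is why this case iterates instead of doing a single Set lookup.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of the fuzzy-match branch; not the actual Mahout code.
final class WordBoundaryMatchSketch {

    // Behaviour described in the issue: a plain substring check, which also
    // accepts "People from Indiana" when looking for "India".
    static String findSubstringMatch(String documentCategory, Set<String> inputCategories) {
        for (String input : inputCategories) {
            if (documentCategory.contains(input)) {
                return input;
            }
        }
        return null;
    }

    // Word-boundary check: the input category must appear as a whole word.
    // Pattern.quote (an assumption of this sketch) treats the category as a
    // literal, so regex metacharacters in a category name are harmless.
    static String findWholeWordMatch(String documentCategory, Set<String> inputCategories) {
        for (String input : inputCategories) {
            Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");
            if (wholeWord.matcher(documentCategory).find()) {
                return input;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> inputCategories = new LinkedHashSet<>(Arrays.asList("India"));
        for (String category : Arrays.asList(
                "Politics of India", "People of India", "People from Indiana")) {
            System.out.println(category
                    + " | substring: " + findSubstringMatch(category, inputCategories)
                    + " | whole word: " + findWholeWordMatch(category, inputCategories));
        }
    }
}
```

In a real mapper the patterns would more likely be compiled once up front instead of on every call; they are compiled inline here only to keep the sketch short.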

Was I able to explain my question better? Please let me know if I am
not clear and I will try to rephrase it. I hope I understood the issue
in question properly. If not, please correct me.
{color:blue}
Hope the above explanation answers your question.
I just realized a mistake that I made and have re-uploaded the patch. Could you 
please get the latest patch and check whether it works? I am planning to do 
that over the weekend, but if you could help, that would be great.
{color}
Thank you
Gangadhar

> Make WikipediaDatasetCreatorMapper fuzzy category match respect word 
> boundaries
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-271
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-271
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>            Priority: Trivial
>             Fix For: 0.4
>
>         Attachments: MAHOUT-271.patch
>
>
> WikipediaDatasetCreatorDriver is useful to create categorisation corpora out 
> of Wikipedia; however, the category match just does a String#contains check, 
> which can catch a lot of unrelated categories.
> Checking the word boundaries with a regexp such as String.format("\\b%s\\b", 
> theCategoryNameIAmLookingFor); should fix the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
