Joe,

In your patch, you handle word boundaries when exactMatchOnly is false
(i.e. when the user says they don't want an exact match). Shouldn't the
patch instead address the case where the user wants an exact match
(i.e. when exactMatchOnly is true)? In other words, shouldn't the check
for word boundaries sit in the branch where exactMatchOnly is true?

My earlier question was about the case where exactMatchOnly is false.
Why do we need to iterate over each category there? In this case, it
seems we just want to check whether the set (inputCategories) contains
the category string.
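
To make the question concrete, here is a minimal Java sketch of the
three behaviours as I understand them. The class and method names
(CategoryMatchSketch, exactMatch, substringMatch, wordBoundaryMatch)
are made up for illustration and are not the actual code in
WikipediaDatasetCreatorMapper or in the patch. The lower-cased sample
categories and the use of Pattern.quote are my own assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Mahout implementation.
public class CategoryMatchSketch {

  // Exact match: the document category must equal an input category,
  // so a single Set#contains lookup is enough.
  static String exactMatch(Set<String> inputCategories, String category) {
    return inputCategories.contains(category) ? category : null;
  }

  // Inexact match via String#contains: each input category is tested as a
  // substring of the document category, so the set has to be iterated.
  static String substringMatch(Set<String> inputCategories, String category) {
    for (String inputCategory : inputCategories) {
      if (category.contains(inputCategory)) {
        return inputCategory;
      }
    }
    return null;
  }

  // Inexact match that respects word boundaries, along the lines of the
  // "\\b%s\\b" regexp suggested in the issue description.
  // Pattern.quote (an assumption of this sketch) keeps regex metacharacters
  // in a category name from being interpreted.
  static String wordBoundaryMatch(Set<String> inputCategories, String category) {
    for (String inputCategory : inputCategories) {
      Pattern p = Pattern.compile("\\b" + Pattern.quote(inputCategory) + "\\b");
      if (p.matcher(category).find()) {
        return inputCategory;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Set<String> inputCategories = new LinkedHashSet<>(Arrays.asList("india"));

    // Set#contains only finds whole-string equality.
    System.out.println(exactMatch(inputCategories, "politics of india"));        // null
    // Substring matching also catches "indiana".
    System.out.println(substringMatch(inputCategories, "people of indiana"));    // india
    // Word-boundary matching keeps "india" but rejects "indiana".
    System.out.println(wordBoundaryMatch(inputCategories, "politics of india")); // india
    System.out.println(wordBoundaryMatch(inputCategories, "people of indiana")); // null
  }
}
```

If this sketch is right, a plain inputCategories.contains(category) call
can only ever give the exact-match behaviour, which is what my question
is about.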

Was I able to explain my question better? Please let me know if I am
not clear and I will try to rephrase it. I hope I have understood the
issue properly. If not, please correct me.

Thank you
Gangadhar

On Thu, Sep 23, 2010 at 10:05 PM, Joe Prasanna Kumar (JIRA)
<[email protected]> wrote:
>
>    [ 
> https://issues.apache.org/jira/browse/MAHOUT-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914315#action_12914315
>  ]
>
> Joe Prasanna Kumar commented on MAHOUT-271:
> -------------------------------------------
>
> I am not sure I completely understand your question:
> -- For the case where there is no requirement for an exactMatch, why do we
> need to check if the category contains the inputCategory?
> Actually, Indiana should not be listed in the results when we do an inexact
> match search.
>
> If we are doing an inexact match search, then we match any category that
> contains the specified input category. For example, when you search for a
> category called "India", the current code iterates over the category set
> (inputCategories) and returns every category that has the string "India" in
> it. So it would match "Politics of India", "People Of India" and also
> "People of Indiana" etc. But we are really interested only in categories that
> match India as a word (here, "Politics of India" and "People Of India").
> So instead of using the contains method, we use a regular expression
> that matches anything containing the whole word "India", thereby
> shortlisting "Politics of India" and "People Of India".
>
> Does that answer what you are looking for ?
>
>
>> Make WikipediaDatasetCreatorMapper fuzzy category match respect word 
>> boundaries
>> -------------------------------------------------------------------------------
>>
>>                 Key: MAHOUT-271
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-271
>>             Project: Mahout
>>          Issue Type: Improvement
>>          Components: Classification
>>    Affects Versions: 0.2
>>            Reporter: Olivier Grisel
>>            Assignee: Olivier Grisel
>>            Priority: Trivial
>>             Fix For: 0.4
>>
>>         Attachments: MAHOUT-271.patch
>>
>>
>> WikipediaDatasetCreatorDriver is useful to create categorisation corpora out
>> of Wikipedia; however, the category match just does a String#contains check,
>> which can catch a lot of unrelated categories.
>> Checking the word boundaries with a regexp such as String.format("\\b%s\\b",
>> theCategoryNameIAmLookingFor); should fix the issue.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
