[
https://issues.apache.org/jira/browse/MAHOUT-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914295#action_12914295
]
Gangadhar Nittala commented on MAHOUT-271:
------------------------------------------
Joe / Olivier,
I have a question about the exactMatchOnly check in
WikipediaDatasetCreatorMapper.java for this issue. The current code is:
{code:title=WikipediaDatasetCreatorMapper.java|borderStyle=solid}
if (exactMatchOnly && inputCategories.contains(category)) {
  return category;
} else if (!exactMatchOnly) {
  for (String inputCategory : inputCategories) {
    if (category.contains(inputCategory)) { // we have an inexact match
      return inputCategory;
    }
  }
}
{code}
When an exact match is not required, why do we need to check whether the
category contains the inputCategory? Couldn't we flip the search to the
following?
{code}
if (!exactMatchOnly) {
  if (inputCategories.contains(category))
    return category;
}
{code}
That way, as Olivier explained, _Indiana_ would also show up under the
category for _India_.
Am I missing something basic here?
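For comparison, here is a minimal, self-contained sketch of how the two checks discussed above behave on the _India_ / _Indiana_ case. The class and method names (CategoryMatchSketch, exactMatch, substringMatch) are hypothetical and not taken from the Mahout source, and lowercase category strings are assumed purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not the Mahout implementation.
class CategoryMatchSketch {

  // Set#contains compares whole strings with equals(), so this is an exact match.
  static String exactMatch(String category, Set<String> inputCategories) {
    return inputCategories.contains(category) ? category : null;
  }

  // String#contains is a substring test, as in the current non-exact branch.
  static String substringMatch(String category, Set<String> inputCategories) {
    for (String inputCategory : inputCategories) {
      if (category.contains(inputCategory)) {
        return inputCategory;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Set<String> inputCategories = new LinkedHashSet<String>(Arrays.asList("india"));
    System.out.println(exactMatch("indiana", inputCategories));     // prints "null"
    System.out.println(substringMatch("indiana", inputCategories)); // prints "india"
  }
}
```

In this sketch, a Set#contains lookup never pairs _indiana_ with _india_, while the String#contains loop does.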
> Make WikipediaDatasetCreatorMapper fuzzy category match respect word
> boundaries
> -------------------------------------------------------------------------------
>
> Key: MAHOUT-271
> URL: https://issues.apache.org/jira/browse/MAHOUT-271
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.2
> Reporter: Olivier Grisel
> Assignee: Olivier Grisel
> Priority: Trivial
> Fix For: 0.4
>
> Attachments: MAHOUT-271.patch
>
>
> WikipediaDatasetCreatorDriver is useful to create categorisation corpora out
> of Wikipedia; however, the category match just does a String#contains check,
> which can catch a lot of unrelated categories.
> Checking the word boundaries with a regexp such as String.format("\\b%s\\b",
> theCategoryNameIAmLookingFor); should fix the issue.
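The word-boundary check suggested in the description could look roughly like the sketch below. It is only an illustration under assumptions: the class name WordBoundaryMatchSketch and the helper matchesWholeWord are hypothetical, and the use of Pattern.quote is an addition not mentioned in the issue (it keeps any regex metacharacters in a category name literal). The attached patch may do this differently.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the code from MAHOUT-271.patch.
class WordBoundaryMatchSketch {

  // Returns true if inputCategory occurs in category delimited by word boundaries.
  static boolean matchesWholeWord(String category, String inputCategory) {
    // \b matches between a word character and a non-word character (or string edge).
    Pattern pattern = Pattern.compile("\\b" + Pattern.quote(inputCategory) + "\\b");
    return pattern.matcher(category).find();
  }

  public static void main(String[] args) {
    System.out.println(matchesWholeWord("history of india", "india"));    // prints "true"
    System.out.println(matchesWholeWord("indiana state parks", "india")); // prints "false"
  }
}
```

One caveat with this approach: \b only recognises boundaries next to word characters, so a category name that starts or ends with punctuation may not match as expected.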