Make Wikipedia example set maker easier to mod
----------------------------------------------

                 Key: MAHOUT-895
                 URL: https://issues.apache.org/jira/browse/MAHOUT-895
             Project: Mahout
          Issue Type: Bug
          Components: Classification, Examples
    Affects Versions: 0.6
            Reporter: tom pierce
            Priority: Minor


The WikipediaDatasetCreator uses two mechanisms to extract the text of 
articles: first, an XmlInputFormat is configured with the "text" tags as 
start/end markers (which delimit the article content); then the content inside 
the text tags is pattern-matched in the Mapper.

This means a newcomer must discover both pruning steps before modifying this 
program to create a dataset including other fields from the article.

I am attaching a patch which modifies the Driver to split on entire articles 
and changes the Mapper to accommodate the extra input without allowing spurious 
new category matches outside the text element.
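
To illustrate the idea (this is not the attached patch), here is a minimal 
sketch of what a mapper could do once each record is a whole <page> element 
rather than just the contents of the text tags. The class name PageTextSketch, 
its methods, and the regular expressions are hypothetical and chosen for 
illustration only; the real Mapper matches categories against a configured 
list, which is omitted here. The sketch assumes plain Java with no Hadoop 
dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and patterns are illustrative, not Mahout's code.
public class PageTextSketch {
    // Body of the <text ...> element inside a full <page> record.
    private static final Pattern TEXT =
        Pattern.compile("<text\\b[^>]*>(.*?)</text>", Pattern.DOTALL);
    // Another field that becomes reachable once the whole page is the record.
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);
    // [[Category:Name]] or [[Category:Name|sortkey]] links.
    private static final Pattern CATEGORY =
        Pattern.compile("\\[\\[Category:([^\\]|]+)");

    // Returns the first capture group of the pattern, or "" if absent.
    private static String firstGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : "";
    }

    static String extractText(String page) {
        return firstGroup(TEXT, page);
    }

    static String extractTitle(String page) {
        return firstGroup(TITLE, page);
    }

    // Searches for category links only inside the text element, so links
    // elsewhere in the page (for example in a revision comment) are ignored.
    static List<String> findCategories(String page) {
        List<String> found = new ArrayList<>();
        Matcher m = CATEGORY.matcher(extractText(page));
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<page><title>Foo</title><revision>"
            + "<comment>added [[Category:Spurious]]</comment>"
            + "<text xml:space=\"preserve\">Body [[Category:Physics]]</text>"
            + "</revision></page>";
        System.out.println(extractTitle(page) + " " + findCategories(page));
    }
}
```

Restricting the category search to the extracted text element is what keeps 
the link in the revision comment from counting as a match, while the title 
remains available for anyone who wants to add it to the dataset.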

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
