[ https://issues.apache.org/jira/browse/MAHOUT-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-250:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Looks nice and uncontroversial. I took the liberty of committing.

> Make WikipediaXmlSplitter able to directly use the bzip2 compressed dump as input
> ---------------------------------------------------------------------------------
>
>                 Key: MAHOUT-250
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-250
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Olivier Grisel
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-250-WikipediaXmlSplitter-BZip2.patch
>
>
> Wikipedia.org ships its dumps as large bzip2 compressed archives, so it makes sense to load the chunked XML into HDFS directly from the original file, without first uncompressing a 25GB temporary file onto the local hard drive. Reusing the Hadoop BZip2 codecs avoids introducing a new dependency.
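
For readers wondering what "reusing the Hadoop BZip2 codecs" looks like in practice, below is a minimal sketch of the standard Hadoop pattern for streaming a compressed file through a codec. This is illustrative only, not the committed MAHOUT-250 patch: the class name BZip2DumpReader and the page-splitting comment are placeholders, and it assumes a Hadoop 0.20-era classpath where org.apache.hadoop.io.compress.CompressionCodecFactory and BZip2Codec are available.

{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical reader illustrating the codec-based streaming approach.
public class BZip2DumpReader {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dump = new Path(args[0]); // e.g. enwiki-pages-articles.xml.bz2
    FileSystem fs = dump.getFileSystem(conf);

    // Pick a codec from the file extension; .bz2 resolves to BZip2Codec.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(dump);

    // Decompress on the fly instead of writing a 25GB temporary file.
    InputStream raw = fs.open(dump);
    InputStream in = (codec == null) ? raw : codec.createInputStream(raw);

    BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        // Split on <page>...</page> boundaries here and write each
        // chunk out to HDFS, as WikipediaXmlSplitter does.
      }
    } finally {
      reader.close();
    }
  }
}
{code}

Because the codec wraps the raw input stream directly, the dump is decompressed page by page as the splitter consumes it; the uncompressed XML never touches the local disk.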