[ https://issues.apache.org/jira/browse/MAHOUT-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Grisel updated MAHOUT-183:
----------------------------------
Description:
The Wikipedia XML splitter's inner loop erroneously detects the end of the
line iterator, which causes it to create chunks containing just one line of page
content instead of respecting the --chunkSize CLI option.
A simple patch to fix this will follow.
was:
The Wikipedia XML splitter inner loops erronously detects end of the line
iterators which cause it to create chunks with just one line worth of page
content instead of respecting the --chunkSize cli option.
Simple patch to fixe this will follow.
> WikipediaXmlSplitter spits one chunk per line
> ---------------------------------------------
>
> Key: MAHOUT-183
> URL: https://issues.apache.org/jira/browse/MAHOUT-183
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Affects Versions: 0.2
> Reporter: Olivier Grisel
> Fix For: 0.2
>
> Attachments: MAHOUT-183-wikipedia-xml-splitter.patch
>
>
> The Wikipedia XML splitter's inner loop erroneously detects the end of the
> line iterator, which causes it to create chunks containing just one line of
> page content instead of respecting the --chunkSize CLI option.
> A simple patch to fix this will follow.
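
For illustration only, the intended behaviour described above (accumulate whole pages until the chunk size is reached, and let the line iterator alone decide when input ends) might be sketched as follows. This is not the attached patch and not the actual WikipediaXmlSplitter code: the class name ChunkSplitterSketch, the split method, and the measurement of the chunk size in characters are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not the Mahout implementation.
final class ChunkSplitterSketch {

  /**
   * Groups whole <page>...</page> blocks into chunks. A chunk is closed only
   * at a page boundary, once it holds at least chunkSizeChars characters
   * (assumed unit; the real --chunkSize option may use a different one).
   */
  static List<String> split(Iterator<String> lines, int chunkSizeChars) {
    List<String> chunks = new ArrayList<>();
    StringBuilder chunk = new StringBuilder();
    boolean inPage = false;

    // Single loop: end of input is detected only through hasNext(),
    // never inferred from the content of an individual line.
    while (lines.hasNext()) {
      String line = lines.next();
      String trimmed = line.trim();

      if (trimmed.startsWith("<page>")) {
        inPage = true;
      }
      if (inPage) {
        chunk.append(line).append('\n');
      }
      if (trimmed.startsWith("</page>")) {
        inPage = false;
        // Flush only when the size threshold is reached, not after every line.
        if (chunk.length() >= chunkSizeChars) {
          chunks.add(chunk.toString());
          chunk.setLength(0);
        }
      }
    }

    // Emit whatever is left as a final, possibly smaller, chunk.
    if (chunk.length() > 0) {
      chunks.add(chunk.toString());
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList(
        "<mediawiki>",
        "<page>", "a", "</page>",
        "<page>", "b", "</page>",
        "<page>", "c", "</page>",
        "</mediawiki>");
    System.out.println(split(input.iterator(), 20).size() + " chunk(s)");
  }
}
```

With a large threshold, all pages end up in one chunk; with a very small one, each page becomes its own chunk, but never each line.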