[ https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-814:
------------------------------------

    Attachment: merger.patch

Patch fixing the issue, and a unit test. I will commit this shortly.

> SegmentMerger bug
> -----------------
>
>                 Key: NUTCH-814
>                 URL: https://issues.apache.org/jira/browse/NUTCH-814
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.1
>            Reporter: Dennis Kubes
>            Assignee: Andrzej Bialecki
>             Fix For: 1.1
>
>         Attachments: merger.patch
>
>
> Dennis reported:
> {quote}
> In the SegmentMerger.java file about line 150 we have this:
>     final SequenceFile.Reader reader =
>       new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(), job);
> Then about line 166 in the record reader we have this:
>     boolean res = reader.next(key, w);
> If I am reading that right, that would mean that the map task would loop
> over all records for a given file and not just a given split.
> {quote}
> Right, this should instead use SequenceFileRecordReader, which already has the
> logic to handle splits. Patch coming shortly - thanks for spotting this! This
> could be the reason for the "out of disk space" errors that many users reported.
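
For illustration only, here is a small Hadoop-free sketch of why reading the whole file from every split duplicates records. It is not the attached patch. The class SplitReadSketch, its methods, and the offsets are hypothetical, and a record is assumed to belong to a split when its start offset falls inside the split's byte range. The real SequenceFileRecordReader aligns on sync markers instead, which this model leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model with no Hadoop dependency; every name here is hypothetical.
class SplitReadSketch {

    // Pattern described in the report: the split bounds are ignored,
    // so each split's reader returns every record in the file.
    static List<String> readWholeFile(List<String> records) {
        return new ArrayList<>(records);
    }

    // Split-aware pattern: return only the records whose start offset
    // lies in the half-open byte range [start, start + length).
    static List<String> readSplit(List<String> records, long[] offsets,
                                  long start, long length) {
        long end = start + length;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            if (offsets[i] >= start && offsets[i] < end) {
                out.add(records.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four records at assumed byte offsets in a 400-byte file.
        List<String> records = Arrays.asList("a", "b", "c", "d");
        long[] offsets = {0L, 100L, 200L, 300L};

        int wholeFileTotal = 0;
        int splitAwareTotal = 0;
        // Two splits of 200 bytes each, one per simulated map task.
        for (long start = 0; start < 400; start += 200) {
            wholeFileTotal += readWholeFile(records).size();
            splitAwareTotal += readSplit(records, offsets, start, 200).size();
        }
        System.out.println("whole-file reads emitted: " + wholeFileTotal);
        System.out.println("split-aware reads emitted: " + splitAwareTotal);
    }
}
```

With two splits, the whole-file pattern emits 8 records and the split-aware pattern emits 4. In this model the duplication grows with the number of splits per file, which is consistent with the disk-space symptom mentioned above.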