[
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888530#action_12888530
]
Chris A. Mattmann commented on NUTCH-677:
-----------------------------------------
Hi Marcin,
I applied your patch and was unit testing it, all ready to commit, when I ran
into this:
{noformat}
[junit] Test org.apache.nutch.segment.TestSegmentMerger FAILED
[junit] Running org.apache.nutch.util.TestEncodingDetector
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.408 sec
[junit] Running org.apache.nutch.util.TestGZIPUtils
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.521 sec
[junit] Running org.apache.nutch.util.TestNodeWalker
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.593 sec
[junit] Running org.apache.nutch.util.TestPrefixStringMatcher
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.452 sec
[junit] Running org.apache.nutch.util.TestStringUtil
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.076 sec
[junit] Running org.apache.nutch.util.TestSuffixStringMatcher
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.321 sec
[junit] Running org.apache.nutch.util.TestURLUtil
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.009 sec
BUILD FAILED
/Users/mattmann/src/nutch/build.xml:258: Tests failed!
Total time: 8 minutes 44 seconds
[chipotle:~/src/nutch] mattmann%
{noformat}
The root cause of the SegmentMerger test error (from
build/test/TEST-org.apache.nutch.segment.TestSegmentMerger.txt) is:
{noformat}
2010-07-14 13:45:33,085 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(276)) - file:/tmp/hadoop-mattmann/merge-1279140109299/seg1/parse_text/part-00000/data:0+33554432
2010-07-14 13:45:33,445 INFO mapred.MapTask (MapTask.java:flush(1115)) - Starting flush of map output
2010-07-14 13:45:35,101 INFO mapred.MapTask (MapTask.java:sortAndSpill(1295)) - Finished spill 2
2010-07-14 13:45:35,107 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(256)) - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill0.out in any of the configured local directories
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
	at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2010-07-14 13:45:35,879 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1343)) - Job complete: job_local_0001
2010-07-14 13:45:35,883 INFO mapred.JobClient (Counters.java:log(514)) - Counters: 9
2010-07-14 13:45:35,884 INFO mapred.JobClient (Counters.java:log(516)) - FileSystemCounters
2010-07-14 13:45:35,884 INFO mapred.JobClient (Counters.java:log(518)) - FILE_BYTES_READ=68360507
2010-07-14 13:45:35,885 INFO mapred.JobClient (Counters.java:log(518)) - FILE_BYTES_WRITTEN=229824559
2010-07-14 13:45:35,885 INFO mapred.JobClient (Counters.java:log(516)) - Map-Reduce Framework
2010-07-14 13:45:35,885 INFO mapred.JobClient (Counters.java:log(518)) - Combine output records=0
2010-07-14 13:45:35,886 INFO mapred.JobClient (Counters.java:log(518)) - Map input records=703319
2010-07-14 13:45:35,886 INFO mapred.JobClient (Counters.java:log(518)) - Spilled Records=524287
2010-07-14 13:45:35,887 INFO mapred.JobClient (Counters.java:log(518)) - Map output bytes=42791349
2010-07-14 13:45:35,888 INFO mapred.JobClient (Counters.java:log(518)) - Map input bytes=0
2010-07-14 13:45:35,888 INFO mapred.JobClient (Counters.java:log(518)) - Map output records=703319
2010-07-14 13:45:35,889 INFO mapred.JobClient (Counters.java:log(518)) - Combine input records=0
------------- ---------------- ---------------
------------- Standard Error -----------------
Creating large segment 1...
- done: 1677722 records.
Creating large segment 2...
- done: 1677722 records.
------------- ---------------- ---------------
Testcase: testLargeMerge took 227.804 sec
Caused an ERROR
Job failed!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:639)
	at org.apache.nutch.segment.TestSegmentMerger.testLargeMerge(TestSegmentMerger.java:87)
{noformat}
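The DiskErrorException above is raised by Hadoop's LocalDirAllocator when it cannot find the spill file under any directory listed in mapred.local.dir (which, on 0.20-era Hadoop, defaults to ${hadoop.tmp.dir}/mapred/local, typically under /tmp). One thing worth checking is whether the local-mode scratch space ran out of room or was cleaned mid-run; a hedged configuration sketch pinning it elsewhere (the path below is purely illustrative, not a recommendation):

```xml
<!-- Hypothetical hadoop-site.xml / mapred-site.xml fragment for local-mode
     test runs: move the scratch space off /tmp to a directory with ample
     free space. Property names are standard Hadoop 0.20.x keys; the path
     is an example only. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/mattmann/tmp/hadoop</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>${hadoop.tmp.dir}/mapred/local</value>
  </property>
</configuration>
```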
Any ideas? I'd be happy to commit this, provided we can get it to pass
regression....
Cheers,
Chris
> Segment merge filtering based on segment content
> ------------------------------------------------
>
> Key: NUTCH-677
> URL: https://issues.apache.org/jira/browse/NUTCH-677
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Marcin Okraszewski
> Assignee: Chris A. Mattmann
> Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch,
> SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java,
> SegmentMergeFilters.java
>
>
> I needed segment filtering based on metadata detected during the parse phase.
> Unfortunately, the current URL-based filtering does not allow for this, so I
> have created a new SegmentMergeFilter extension, which receives the segment
> entry being merged and decides whether it should be included. Even though I
> only needed ParseData for my purposes, I made it a bit more general-purpose,
> so the filter receives all of the merged data.
> The attached patch is for version 0.9, which I use. Unfortunately I didn't
> have time to check how it fits the trunk version. Sorry :(
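The filtering idea described above can be sketched as a minimal, self-contained model: the merger hands each segment entry's parse-derived metadata to a filter, which decides whether the entry survives the merge. The interface name, method signature, and the plain metadata map below are illustrative assumptions for this sketch, not Nutch's actual extension-point API (per the description, the real filter receives all of the merged segment data, not just metadata).

```java
import java.util.Map;

// Hypothetical, simplified model of a segment-merge filter. In the real
// patch the extension point sees the full merged entry (CrawlDatum, Content,
// ParseData, etc.); here a Map<String, String> stands in for parse metadata.
public class SegmentMergeFilterSketch {

  /** Decides whether a segment entry should be kept in the merged output. */
  interface MergeFilter {
    boolean accept(String url, Map<String, String> parseMeta);
  }

  /** Example filter: keep only entries whose parse phase tagged a language. */
  static final MergeFilter LANG_FILTER =
      (url, parseMeta) -> parseMeta.containsKey("lang");

  /** Stand-in for the merge loop's per-entry check: true means "emit". */
  static boolean emit(MergeFilter filter, String url, Map<String, String> parseMeta) {
    return filter.accept(url, parseMeta);
  }

  public static void main(String[] args) {
    System.out.println(emit(LANG_FILTER, "http://example.com/a", Map.of("lang", "en")));
    System.out.println(emit(LANG_FILTER, "http://example.com/b", Map.of()));
  }
}
```

The point of routing the decision through a small interface is exactly what the description argues for: URL-based filtering cannot see parse-time metadata, while a filter invoked during the merge can use anything the parse phase recorded.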
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.