Greetings, I'm running Nutch trunk with the patch for Hadoop 0.17 from NUTCH-634 (http://issues.apache.org/jira/browse/NUTCH-634).
I've run into a problem merging segments:

$ ./bin/nutch mergesegs crawl/segments_merge -dir crawl/segments/
08/06/11 14:32:35 INFO segment.SegmentMerger: Merging 3 segments to crawl/segments_merge/20080611143235
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding hdfs://localhost:54310/user/lritter/crawl/segments/20080611135945
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding hdfs://localhost:54310/user/lritter/crawl/segments/20080611141414
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding hdfs://localhost:54310/user/lritter/crawl/segments/_logs
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: using segment data from:
java.io.IOException: No input paths specified in input
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:173)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:605)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:648)

This looks to be the same (or similar) issue as:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg10999.html

In my case, the merger seems to think that the '_logs' directory is valid fodder for merging, which is clearly not the case. I assume that underscore-prefixed names are "reserved" by Nutch. With this assumption, I can make a filter that screens these out. I have done this and attached a patch against trunk below.

While the patch fixes my immediate problem, it makes me a little nervous that I am designating underscore-prefixed names as "special" in a pretty ad hoc way. Is there a "real" way to determine whether or not a directory contains segment information?

Thanks!
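(For illustration, one alternative I considered is a positive check rather than the underscore exclusion: test whether a directory actually contains any of the parts a Nutch segment is made of. This is only a sketch with a hypothetical helper name, using plain java.io instead of Hadoop's FileSystem, and the subdirectory names are my assumption about the standard segment layout.)

```java
import java.io.File;

public class SegmentCheck {

    // Assumed standard parts of a Nutch segment directory.
    private static final String[] SEGMENT_PARTS = {
        "crawl_generate", "crawl_fetch", "crawl_parse",
        "content", "parse_data", "parse_text"
    };

    /**
     * Hypothetical helper: returns true if the directory contains at least
     * one of the known segment subdirectories, so a directory like "_logs"
     * is rejected for what it lacks, not for how it is named.
     */
    public static boolean looksLikeSegment(File dir) {
        if (!dir.isDirectory()) {
            return false;
        }
        for (String part : SEGMENT_PARTS) {
            if (new File(dir, part).isDirectory()) {
                return true;
            }
        }
        return false;
    }
}
```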
-lincoln
--
lincolnritter.com

--- PATCH ---
Index: src/java/org/apache/nutch/segment/SegmentMerger.java
===================================================================
--- src/java/org/apache/nutch/segment/SegmentMerger.java	(revision 666871)
+++ src/java/org/apache/nutch/segment/SegmentMerger.java	(working copy)
@@ -626,7 +626,7 @@
     boolean normalize = false;
     for (int i = 1; i < args.length; i++) {
       if (args[i].equals("-dir")) {
-        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassDirectoriesFilter(fs));
+        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassNormalDirectoriesFilter(fs));
         for (int j = 0; j < files.length; j++) segs.add(files[j]);
       } else if (args[i].equals("-filter")) {
Index: src/java/org/apache/nutch/util/HadoopFSUtil.java
===================================================================
--- src/java/org/apache/nutch/util/HadoopFSUtil.java	(revision 666871)
+++ src/java/org/apache/nutch/util/HadoopFSUtil.java	(working copy)
@@ -51,6 +51,23 @@
     };
   }
 
+  /**
+   * Returns PathFilter that passes directories that are not "special" through.
+   */
+  public static PathFilter getPassNormalDirectoriesFilter(final FileSystem fs) {
+    return new PathFilter() {
+      public boolean accept(final Path path) {
+        try {
+          FileStatus status = fs.getFileStatus(path);
+          return status.isDir() && !status.getPath().getName().startsWith("_");
+        } catch (IOException ioe) {
+          return false;
+        }
+      }
+
+    };
+  }
 
   /**
    * Turns an array of FileStatus into an array of Paths.
--- END PATCH ---