Hi Semyon,

thanks! Please open an issue on
  https://issues.apache.org/jira/projects/NUTCH
and, if possible, provide a patch or a pull request on GitHub
with the Jira issue id "NUTCH-XXX" in the title.

You're right, the usage of ARG_SEGMENT isn't consistent:
  single path in
    Fetcher
    ParseSegment
  list of paths in
    CrawlDb
    LinkDb
    IndexingJob

> My proposal is either to add new variable ARG_SEGMENTS(plural) or

This would break the existing API, but that is ok if the change is
properly logged and not silently ignored. That is what is bad about
the current behavior: if ARG_SEGMENT isn't an instance of ArrayList,
no error is shown.

> to replace the logic to the single path processing.

Not my favorite option, since some of the tools are made to work
with multiple segments.

Third option: the code could check the type of ARG_SEGMENT
(Path/String vs. (Array)List).

But the decision is up to you.

Thanks,
Sebastian

On 10/11/2017 03:38 PM, Semyon Semyonov wrote:
> Dear community,
>
> I would like to suggest an improvement/ask for advice about an API key
> usage (Nutch 1.13).
>
> The class metadata/Nutch.java contains two possible parameters for the
> InvertLinks job in the API:
> line 97: public static final String ARG_SEGMENTDIR = "segment_dir";
> line 99: public static final String ARG_SEGMENT = "segment";
>
> An example of usage can be found in nutch/crawl/LinkDB.java (see below),
> with the following logic:
> ARG_SEGMENTDIR - a directory of all segments,
> but Nutch.ARG_SEGMENT - an array of paths to segments instead of a single path.
> Therefore, if a string is passed instead of an array of strings, it throws
> an exception.
>
> I believe it contradicts the naming convention and misleads the users
> (Nutch.ARG_SEGMENT is a singular noun, but it is used as a plural).
> My proposal is either to add new variable ARG_SEGMENTS (plural) or to replace
> the logic to the single path processing.
>
> nutch/crawl/LinkDB.java:
> lines 383:
> if(args.containsKey(Nutch.ARG_SEGMENTDIR)) {
>   Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
>   if(segDir instanceof Path) {
>     segmentsDir = (Path) segDir;
>   }
>   else {
>     segmentsDir = new Path(segDir.toString());
>   }
>   FileStatus[] paths = fs.listStatus(segmentsDir,
>       HadoopFSUtil.getPassDirectoriesFilter(fs));
>   segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
> }
> else if(args.containsKey(Nutch.ARG_SEGMENT)) {
>   Object segments = args.get(Nutch.ARG_SEGMENT);
>   ArrayList<String> segmentList = new ArrayList<>();
>   if(segments instanceof ArrayList) {
>     segmentList = (ArrayList<String>)segments;
>   }
>   for(String segment: segmentList) {
>     segs.add(new Path(segment));
>   }
> }
>
> Semyon.
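
For illustration, the third option from the reply (checking the type of
ARG_SEGMENT) could look roughly like the sketch below. It rests on some
assumptions: SegmentArgs and toSegmentList are hypothetical names and not
part of Nutch, and paths are kept as plain strings so the example runs
without Hadoop on the classpath. In LinkDb each returned string would
still have to be wrapped in a Path. Unlike the quoted snippet, a missing
value is reported instead of silently producing an empty list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Hypothetical helper (not part of Nutch): normalizes the value of ARG_SEGMENT. */
final class SegmentArgs {

  private SegmentArgs() {
  }

  /**
   * Accepts either a single segment or a collection of segments and
   * returns the segment paths as strings.
   *
   * A single value is converted with toString(), the same fallback the
   * quoted LinkDb code uses for ARG_SEGMENTDIR, so a String or a
   * Path-like object both work. A null value or null element is rejected.
   */
  static List<String> toSegmentList(Object value) {
    if (value == null) {
      throw new IllegalArgumentException("segment argument must not be null");
    }
    List<String> segments = new ArrayList<>();
    if (value instanceof Collection) {
      // Multiple segments: ArrayList, or any other Collection.
      for (Object element : (Collection<?>) value) {
        if (element == null) {
          throw new IllegalArgumentException("segment list contains null");
        }
        segments.add(element.toString());
      }
    } else {
      // Single segment.
      segments.add(value.toString());
    }
    return segments;
  }

  public static void main(String[] args) {
    // Example segment names are made up for this demonstration.
    System.out.println(toSegmentList("crawl/segments/20171011153800"));
    System.out.println(toSegmentList(Arrays.asList("crawl/segments/a", "crawl/segments/b")));
  }
}
```

With a check like this, callers that pass a single path (as for Fetcher
and ParseSegment) and callers that pass a list (as for CrawlDb, LinkDb
and IndexingJob) could share one code path without changing the
existing API key.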