Dear community,

I would like to suggest an improvement to, or ask for advice about, the usage of an API key (Nutch 1.13).

The class metadata/Nutch.java defines two possible parameters for the InvertLinks job in the API:
    line 97: public static final String ARG_SEGMENTDIR = "segment_dir";
    line 99: public static final String ARG_SEGMENT = "segment";

An example of usage can be found in nutch/crawl/LinkDB.java (see below), with the following logic:
Nutch.ARG_SEGMENTDIR - a directory containing all segments,
Nutch.ARG_SEGMENT - an array of paths to segments rather than a single path. Therefore, if a string is passed instead of an array of strings, it throws an exception.

I believe this contradicts the naming convention and misleads users (Nutch.ARG_SEGMENT is a singular noun, but it is used as a plural).
My proposal is either to add a new variable ARG_SEGMENTS (plural) or to change the logic so that it processes a single path.
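
To make the proposal concrete, here is a minimal sketch of how both options could coexist. It is an illustration only, not Nutch code: the class SegmentArgs, the method resolve, and the key ARG_SEGMENTS = "segments" are hypothetical names, and plain strings stand in for Hadoop Path objects so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SegmentArgs {
  // Existing key from metadata/Nutch.java.
  static final String ARG_SEGMENT = "segment";
  // Hypothetical plural key proposed in this message; not part of Nutch 1.13.
  static final String ARG_SEGMENTS = "segments";

  /** Collects segment paths from either key, accepting a single value or a collection. */
  static List<String> resolve(Map<String, Object> args) {
    List<String> result = new ArrayList<>();
    // Check the plural key first, then fall back to the singular one.
    for (String key : new String[] {ARG_SEGMENTS, ARG_SEGMENT}) {
      Object value = args.get(key);
      if (value == null) {
        continue;
      }
      if (value instanceof Collection) {
        // Any collection of paths is accepted, not only ArrayList.
        for (Object item : (Collection<?>) value) {
          result.add(String.valueOf(item));
        }
      } else {
        // A single path given as a String (or any object with a path-like toString()).
        result.add(value.toString());
      }
      break; // the first key that is present wins
    }
    return result;
  }

  public static void main(String[] argv) {
    Map<String, Object> args = new HashMap<>();
    args.put(ARG_SEGMENT, "crawl/segments/20170601120000");
    System.out.println(resolve(args));
  }
}
```

With this kind of normalization, existing callers that pass a list under "segment" would keep working, while a single string would no longer be rejected.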

nutch/crawl/LinkDB.java:
line 383:
    if(args.containsKey(Nutch.ARG_SEGMENTDIR)) {
      Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
      if(segDir instanceof Path) {
        segmentsDir = (Path) segDir;
      }
      else {
        segmentsDir = new Path(segDir.toString());
      }
      FileStatus[] paths = fs.listStatus(segmentsDir,
          HadoopFSUtil.getPassDirectoriesFilter(fs));
      segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
    }
    else if(args.containsKey(Nutch.ARG_SEGMENT)) {
      Object segments = args.get(Nutch.ARG_SEGMENT);
      ArrayList<String> segmentList = new ArrayList<>();
      if(segments instanceof ArrayList) {
        segmentList = (ArrayList<String>)segments;
      }
      for(String segment: segmentList) {
        segs.add(new Path(segment));
      }
    }
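
The type check in the second branch can be reproduced in isolation. The sketch below copies only the instanceof test from the snippet above, with plain strings instead of Path; the class SegmentBranchDemo and the method collect are hypothetical names. In this isolated form a value that is not a java.util.ArrayList is simply skipped, so I assume the exception itself is raised later in the job, outside the quoted lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SegmentBranchDemo {
  /** Mirrors the instanceof check of the ARG_SEGMENT branch, using strings instead of Path. */
  @SuppressWarnings("unchecked")
  static List<String> collect(Map<String, Object> args) {
    List<String> segs = new ArrayList<>();
    Object segments = args.get("segment");
    ArrayList<String> segmentList = new ArrayList<>();
    if (segments instanceof ArrayList) {
      segmentList = (ArrayList<String>) segments;
    }
    // Anything that is not a java.util.ArrayList leaves segmentList empty.
    segs.addAll(segmentList);
    return segs;
  }

  public static void main(String[] argv) {
    Map<String, Object> asString = new HashMap<>();
    asString.put("segment", "crawl/segments/20170601120000");

    Map<String, Object> asList = new HashMap<>();
    asList.put("segment",
        new ArrayList<String>(Arrays.asList("crawl/segments/20170601120000")));

    System.out.println(collect(asString).size()); // prints 0
    System.out.println(collect(asList).size());   // prints 1
  }
}
```

Note that the list returned by Arrays.asList is not a java.util.ArrayList either, so it would also be skipped unless it is wrapped as shown.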

Semyon.
