Hi Semyon,

Thanks! Please open an issue on
  https://issues.apache.org/jira/projects/NUTCH
and, if possible, provide a patch or a pull request
on GitHub with the Jira issue ID "NUTCH-XXX" in the title.

You're right, the usage of ARG_SEGMENT isn't consistent:
 single path in
   Fetcher
   ParseSegment
 list of paths in
   CrawlDb
   LinkDb
   IndexingJob

> My proposal is either to add new variable ARG_SEGMENTS(plural) or
This would break the existing API, but that's OK if the change is properly
logged and not silently ignored. That is the problem with the current
behavior: if ARG_SEGMENT isn't an instance of ArrayList, no error is shown.

> to replace the logic with single-path processing.
Not my favorite option since some of the tools are made to work with multiple 
segments.

Third option: the code could check the type of ARG_SEGMENT (Path/String vs.
(Array)List) and handle both.
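
A minimal sketch of that type check, not the actual Nutch code: the class
name SegmentArgs and the method toSegmentList are hypothetical, and plain
Strings stand in for org.apache.hadoop.fs.Path so that the example compiles
without Hadoop. A single value and a list are both accepted, and anything
else raises an error instead of being silently ignored.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of Nutch: normalizes the value stored under
// ARG_SEGMENT into a list of segment path strings.
final class SegmentArgs {

  private SegmentArgs() {
  }

  static List<String> toSegmentList(Object value) {
    List<String> segments = new ArrayList<>();
    if (value == null) {
      // Argument not set: nothing to process.
      return segments;
    }
    if (value instanceof Collection) {
      // List of paths, as currently expected by LinkDb.
      for (Object element : (Collection<?>) value) {
        if (element == null) {
          throw new IllegalArgumentException("Null segment in list");
        }
        segments.add(element.toString());
      }
    } else if (value instanceof CharSequence) {
      // Single path passed as a string. In Nutch, a Hadoop Path instance
      // could be accepted in this branch as well (assumption).
      segments.add(value.toString());
    } else {
      // Fail loudly instead of silently ignoring the argument.
      throw new IllegalArgumentException(
          "Unsupported type for segment argument: "
              + value.getClass().getName());
    }
    return segments;
  }
}
```

A caller would then convert each returned string with new Path(segment), as
the existing loop does.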

But the decision is up to you.

Thanks,
Sebastian

On 10/11/2017 03:38 PM, Semyon Semyonov wrote:
> Dear community,
> 
> I would like to suggest an improvement and ask for advice about the usage
> of an API key (Nutch 1.13).
> 
> The class metadata/Nutch.java contains two possible parameters for the
> InvertLinks job in the API:
>     line 97: public static final String ARG_SEGMENTDIR = "segment_dir";
>     line 99: public static final String ARG_SEGMENT = "segment";
> 
> An example of usage can be found in nutch/crawl/LinkDB.java (see below),
> with the following logic:
> ARG_SEGMENTDIR is a directory containing all segments,
> but Nutch.ARG_SEGMENT is an array of paths to segments instead of a single path.
> Therefore, if a string
> is passed instead of an array of strings, it throws an exception.
> 
> I believe it contradicts the naming convention and misleads
> users (Nutch.ARG_SEGMENT is a singular
> noun, but it is used as a plural).
> My proposal is either to add a new variable ARG_SEGMENTS (plural) or to replace
> the logic with
> single-path processing.
> 
> nutch/crawl/LinkDB.java:
> line 383:
>     if(args.containsKey(Nutch.ARG_SEGMENTDIR)) {
>       Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
>       if(segDir instanceof Path) {
>         segmentsDir = (Path) segDir;
>       }
>       else {
>         segmentsDir = new Path(segDir.toString());
>       }
>       FileStatus[] paths = fs.listStatus(segmentsDir,
>           HadoopFSUtil.getPassDirectoriesFilter(fs));
>       segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
>     }
>     else if(args.containsKey(Nutch.ARG_SEGMENT)) {
>       Object segments = args.get(Nutch.ARG_SEGMENT);
>       ArrayList<String> segmentList = new ArrayList<>();
>       if(segments instanceof ArrayList) {
>         segmentList = (ArrayList<String>)segments;
>       }
>       for(String segment: segmentList) {
>         segs.add(new Path(segment));
>       }
>     }
> 
> Semyon.
