Hi Semyon,
thanks! Please open an issue on
https://issues.apache.org/jira/projects/NUTCH
and if possible provide a patch or a pull-request
on github with the Jira issue id "NUTCH-XXX" in the title.
You're right the usage of ARG_SEGMENT isn't consistent
single path in
Fetcher
ParseSegment
list of paths in
CrawlDb
LinkDb
IndexingJob
> My proposal is either to add new variable ARG_SEGMENTS(plural) or
This would break the existing API, but ok if properly logged and not
silently ignored. That's bad with the current behavior: if ARG_SEGMENT
isn't an instance of ArrayList no error is shown.
> to replace the the logic to the single path processing.
Not my favorite option since some of the tools are made to work with multiple
segments.
Third option: could check the type of ARG_SEGMENT (Path/String vs. (Array)List)
But the decision is on you.
Thanks,
Sebastian
On 10/11/2017 03:38 PM, Semyon Semyonov wrote:
> Dear community,
>
> I would like to suggest an improvement/ask an advice about an API key
> usage(Nutch 1.13)
>
> The class metadata/Nutch.java contains two possible parameters for
> InvertLinks job in api.
> line 97: public static final String ARG_SEGMENTDIR = "segment_dir"
> line 99 : public static final String ARG_SEGMENT = "segment";
>
> An example of usage can be found in nutch/crawl/LinkDB.java(see bellow),
> witch the following logic:
> The ARG_SEGMENTDIR - a directory of all segments
> but Nutch.ARG_SEGMENT - an array of paths to segments instead of single path.
> Therefore if a string
> is passed instead of array of strings it throws an exception.
>
> I belive it contradicts the naming convention and misleads the
> users(Nutch.ARG_SEGMENT is a singular
> noun, but it is used as a plural).
> My proposal is either to add new variable ARG_SEGMENTS(plural) or to replace
> the the logic to the
> single path processing.
>
> nutch/crawl/LinkDB.java:
> lines 383 :
> if(args.containsKey(Nutch.ARG_SEGMENTDIR)) {
> Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
> if(segDir instanceof Path) {
> segmentsDir = (Path) segDir;
> }
> else {
> segmentsDir = new Path(segDir.toString());
> }
> FileStatus[] paths = fs.listStatus(segmentsDir,
> HadoopFSUtil.getPassDirectoriesFilter(fs));
> segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
> }
> else if(args.containsKey(Nutch.ARG_SEGMENT)) {
> Object segments = args.get(Nutch.ARG_SEGMENT);
> ArrayList<String> segmentList = new ArrayList<>();
> if(segments instanceof ArrayList) {
> segmentList = (ArrayList<String>)segments;
> }
> for(String segment: segmentList) {
> segs.add(new Path(segment));
> }
> }
>
> Semyon.