Dear community,
I would like to suggest an improvement, or ask for advice, regarding the usage of an API argument key (Nutch 1.13).
The class metadata/Nutch.java contains two possible parameters for the InvertLinks job in the API:
line 97: public static final String ARG_SEGMENTDIR = "segment_dir";
line 99: public static final String ARG_SEGMENT = "segment";
An example of usage can be found in nutch/crawl/LinkDB.java (see below), which follows this logic:
ARG_SEGMENTDIR is a directory containing all segments,
but Nutch.ARG_SEGMENT is an array (in the code, an ArrayList<String>) of paths to segments instead of a single path. Therefore, if a string is passed instead of an array of strings, it throws an exception.
I believe this contradicts the naming convention and misleads users (Nutch.ARG_SEGMENT is a singular noun, but it is used as a plural).
My proposal is either to add a new variable ARG_SEGMENTS (plural) or to change the logic so that it processes a single path.
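To make the proposal concrete, here is a rough, stand-alone sketch of how both options could coexist. It is not Nutch code: the class SegmentArgs, the method resolveSegments, and the "segments" key are hypothetical, and plain strings stand in for Hadoop Path objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class SegmentArgs {
    // Existing key, as defined in metadata/Nutch.java.
    static final String ARG_SEGMENT = "segment";
    // Hypothetical plural key from the proposal; it does not exist in Nutch 1.13.
    static final String ARG_SEGMENTS = "segments";

    /** Collects segment paths from either key, accepting one path or many. */
    static List<String> resolveSegments(Map<String, ?> args) {
        List<String> result = new ArrayList<>();
        for (String key : new String[] {ARG_SEGMENTS, ARG_SEGMENT}) {
            Object value = args.get(key);
            if (value instanceof Collection) {
                // Several segments: take the string form of each element.
                for (Object item : (Collection<?>) value) {
                    result.add(String.valueOf(item));
                }
            } else if (value != null) {
                // A single segment passed as a String (or a Path-like object).
                result.add(value.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Single path under the singular key: prints [crawl/segments/20170101000000]
        System.out.println(resolveSegments(
            Collections.singletonMap(ARG_SEGMENT, "crawl/segments/20170101000000")));
        // Several paths under the hypothetical plural key: prints [seg1, seg2]
        System.out.println(resolveSegments(
            Collections.singletonMap(ARG_SEGMENTS, Arrays.asList("seg1", "seg2"))));
    }
}
```

Accepting any Collection under the old key as well would keep existing callers that already pass a list working, while a single string would no longer be a problem.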
nutch/crawl/LinkDB.java:
from line 383:
    if (args.containsKey(Nutch.ARG_SEGMENTDIR)) {
      Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
      if (segDir instanceof Path) {
        segmentsDir = (Path) segDir;
      }
      else {
        segmentsDir = new Path(segDir.toString());
      }
      FileStatus[] paths = fs.listStatus(segmentsDir,
          HadoopFSUtil.getPassDirectoriesFilter(fs));
      segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
    }
    else if (args.containsKey(Nutch.ARG_SEGMENT)) {
      Object segments = args.get(Nutch.ARG_SEGMENT);
      ArrayList<String> segmentList = new ArrayList<>();
      if (segments instanceof ArrayList) {
        segmentList = (ArrayList<String>) segments;
      }
      for (String segment : segmentList) {
        segs.add(new Path(segment));
      }
    }
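
As a side note, within the block quoted above a value that is not an ArrayList is not rejected on the spot: it is skipped and the segment list stays empty, so any exception presumably surfaces later in the job. The following stand-alone sketch mirrors only that branch; the class SegmentBranchDemo and the method collect are made up for illustration, and String is used in place of Hadoop's Path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SegmentBranchDemo {
    /** Mirrors the ARG_SEGMENT branch quoted above, with String in place of Path. */
    @SuppressWarnings("unchecked")
    static List<String> collect(Object segments) {
        List<String> segs = new ArrayList<>();
        ArrayList<String> segmentList = new ArrayList<>();
        if (segments instanceof ArrayList) {
            segmentList = (ArrayList<String>) segments;
        }
        for (String segment : segmentList) {
            segs.add(segment);
        }
        return segs;
    }

    public static void main(String[] args) {
        // A plain String is not an ArrayList, so nothing is collected: prints 0
        System.out.println(collect("crawl/segments/20170101000000").size());
        // A java.util.ArrayList of Strings is collected as expected: prints 2
        System.out.println(collect(new ArrayList<>(Arrays.asList("seg1", "seg2"))).size());
        // Arrays.asList returns a different List implementation, so it is skipped too: prints 0
        System.out.println(collect(Arrays.asList("seg1", "seg2")).size());
    }
}
```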
Semyon.

