pmezard opened a new pull request #533:
URL: https://github.com/apache/nutch/pull/533
- Handle Google Cloud Storage URLs as crawldb inputs in domainstats,
protocolstats and crawlcomplete commands.
- Correctly resolve numReducers in protocolstats.
- Align crawlcomplete -inputDirs behaviour with the other commands: expect
directories containing "current", not "crawldb/current".
As with the other PR, I cannot run all the tests locally, but I am not too
worried: test-core passes.
Also, these changes are not of the highest quality: ideally I would have added
a helper enforcing path manipulations that work with URLs, for all commands.
I do not think these commands deserve that much work, so:
- I just copied the "Path" manipulations used in other commands like
CrawlDBReader, which I know work both locally and in GCS. I then ran the
commands in GCP, and some of them again locally, to be sure I did not break
everything. I do not know the Java API well enough to judge whether the path
manipulations make sense; I just know they work.
- I spotted the numReducers issue while reading the code. I did not test it.
It is obviously a copy/paste from domainstats to protocolstats.
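
To illustrate the -inputDirs convention described above (each input directory
is expected to contain "current"), here is a minimal sketch of a URL-safe child
lookup. It uses only java.net.URI and is not the code in this PR, which relies
on Hadoop "Path" manipulations. The class name InputDirs, the helpers child and
currentDirs, and the bucket and directory names are all hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class InputDirs {
    // Hypothetical helper: append a child name to a directory URI while
    // keeping the scheme and authority (e.g. gs://bucket) intact.
    static URI child(URI dir, String name) {
        String base = dir.toString();
        // resolve() replaces the last path segment unless the base ends with "/"
        if (!base.endsWith("/")) {
            base += "/";
        }
        return URI.create(base).resolve(name);
    }

    // Hypothetical helper: turn a comma-separated -inputDirs value into the
    // list of "<dir>/current" locations, for local paths and URLs alike.
    static List<URI> currentDirs(String commaSeparated) {
        List<URI> out = new ArrayList<>();
        for (String dir : commaSeparated.split(",")) {
            out.add(child(URI.create(dir.trim()), "current"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed example inputs: one GCS URL and one local directory.
        for (URI u : currentDirs("gs://bucket/crawl/crawldb,/data/crawl/crawldb")) {
            System.out.println(u);
        }
    }
}
```

The point of the sketch is only that the child is resolved against the full
URL rather than against a bare path string, so the gs:// scheme and bucket
survive the manipulation.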