Hiran Chaudhuri created NUTCH-3077:
--------------------------------------
Summary: Docker container with Nutch WebApp works but missing drill-down
Key: NUTCH-3077
URL: https://issues.apache.org/jira/browse/NUTCH-3077
Project: Nutch
Issue Type: Improvement
Components: docker, web gui
Affects Versions: 1.21
Environment: Docker container built under Ubuntu 22 LTS using BUILD_MODE=2
Reporter: Hiran Chaudhuri
I used the current master branch of Nutch (commit
4a61208f492613f2c5282741e64c036acabeb71e) to build the docker container myself.
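For reference, the build invocation was along these lines (a sketch; assuming {{BUILD_MODE}} is passed as a Docker build argument, and the image tag is illustrative):
{code:bash}
# Build the image from the Nutch source tree, selecting mode 2
# (REST API + WebApp enabled).
docker build --build-arg BUILD_MODE=2 -t apache/nutch:local .
{code}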
*Hint 1:* Please make the chosen {{BUILD_MODE}} visible at runtime, e.g. as a startup message, an image tag, or an environment variable. That way, when reverse-engineering, it is easier to find out that on hub.docker.com/apache/nutch:latest the REST API and WebApp are absent by design.
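A minimal sketch of the environment-variable option (the {{NUTCH_BUILD_MODE}} name is hypothetical):
{code:bash}
# In the Dockerfile, re-export the build argument as a runtime variable:
#   ARG BUILD_MODE
#   ENV NUTCH_BUILD_MODE=${BUILD_MODE}

# Anyone could then verify how a pulled image was built:
docker inspect --format '{{ .Config.Env }}' apache/nutch:latest \
  | tr ' ' '\n' | grep NUTCH_BUILD_MODE
{code}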
When building with {{BUILD_MODE=2}}, both the REST API and the WebApp start up as expected. I configured a seed URL and started a crawl, which very quickly ran into an error condition. The UI does not offer any means of troubleshooting.
*Hint 2:* Please either add drill-down capability to the WebApp, or point out
how the errors can be investigated.
I attached to the container and ran
{{cat /var/log/supervisord/nutchserver_stdout.log}}
In this file I found an error like:
{noformat}
2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: starting
2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: segment: crawl-1/segments/20241012220155
2024-10-12 22:01:56,725 ERROR o.a.n.p.ParseSegment [pool-2-thread-4] org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:262)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:347)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)
    ... 19 more
{noformat}
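The missing input can be confirmed directly from the host (the container name {{nutch}} is hypothetical):
{code:bash}
# The segment directory exists, but the fetcher never wrote a
# 'content' subdirectory into it:
docker exec nutch ls -l /root/crawl-1/segments/20241012220155
{code}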
This points to a problem while parsing the segment. But going back in time (to find out why no content was retrieved), I saw this:
{noformat}
2024-10-12 22:01:53,296 INFO o.a.n.c.FetchScheduleFactory [LocalJobRunner Map Task Executor #0] Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] defaultInterval=2592000
2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] maxInterval=7776000
2024-10-12 22:01:53,321 INFO o.a.n.n.u.r.RegexURLNormalizer [pool-9-thread-1] can't find rules for scope 'generate_host_count', using default
2024-10-12 22:01:54,264 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: number of items rejected during selection:
2024-10-12 22:01:54,274 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: Partitioning selected urls for politeness.
2024-10-12 22:01:55,275 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: segment: crawl-1/segments/20241012220155
2024-10-12 22:01:56,434 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: finished, elapsed: 3276 ms
2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: No agents listed in 'http.agent.name' property.
2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:604)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:471)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:645)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
This means no content was fetched at all. I can only suspect that the fetcher's exit code was not sufficient for the wrapper (the nutch or crawl script) to detect the malfunction early enough.
*Hint 3:* Improve error handling: the individual Nutch commands should return meaningful exit codes, and the wrapper scripts should act on them, e.g. as sketched below.
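A minimal sketch of the kind of check meant here (the step invocations and segment path are illustrative, not the actual wrapper code):
{code:bash}
#!/bin/bash
# Abort the crawl cycle as soon as one step fails, instead of letting
# the parse step run against an incomplete segment.
SEGMENT=crawl-1/segments/20241012220155   # hypothetical

bin/nutch fetch "$SEGMENT" -threads 10
rc=$?
if [ "$rc" -ne 0 ]; then
  echo "Fetch step failed with exit code $rc; aborting crawl cycle" >&2
  exit "$rc"
fi
bin/nutch parse "$SEGMENT"
{code}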
Now that I know the {{http.agent.name}} property has to be set, I still cannot do so, since the WebApp does not offer it for editing.
*Hint 4:* Please allow adding unknown properties through the configuration
page. More plugins will require more properties in future.