Hiran Chaudhuri created NUTCH-3077:
--------------------------------------
Summary: Docker container with Nutch WebApp works but missing drill-down
Key: NUTCH-3077
URL: https://issues.apache.org/jira/browse/NUTCH-3077
Project: Nutch
Issue Type: Improvement
Components: docker, web gui
Affects Versions: 1.21
Environment: Docker container built under Ubuntu 22 LTS using BUILD_MODE=2
Reporter: Hiran Chaudhuri
I used the current master branch of Nutch (commit
4a61208f492613f2c5282741e64c036acabeb71e) to build the docker container myself.
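For reference, the build invocation was along these lines (a sketch; assuming {{BUILD_MODE}} is passed as a Docker build argument, and the image tag is illustrative):
{code:bash}
# Build the image from the Nutch source tree, selecting mode 2
# (REST API + WebApp enabled).
docker build --build-arg BUILD_MODE=2 -t apache/nutch:local .
{code}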
*Hint 1:* Please make the chosen {{BUILD_MODE}} visible at runtime, e.g. as a startup message, an image tag, or an environment variable. That way, when reverse-engineering, it is easier to find out that on hub.docker.com/apache/nutch:latest the REST API and WebApp are absent by design.
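A minimal sketch of the environment-variable option (the {{NUTCH_BUILD_MODE}} name is hypothetical):
{code:bash}
# In the Dockerfile, re-export the build argument as a runtime variable:
#   ARG BUILD_MODE
#   ENV NUTCH_BUILD_MODE=${BUILD_MODE}

# Anyone could then verify how a pulled image was built:
docker inspect --format '{{ .Config.Env }}' apache/nutch:latest \
  | tr ' ' '\n' | grep NUTCH_BUILD_MODE
{code}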
When building with {{BUILD_MODE=2}}, both the REST API and the WebApp start up as expected. I configured a seed URL and started a crawl, which very quickly ran into an error condition. The UI does not offer any means of troubleshooting.
*Hint 2:* Please either add drill-down capability to the WebApp, or point out
how the errors can be investigated.
I attached to the container and ran
{{cat /var/log/supervisord/nutchserver_stdout.log}}
In this file I found an error like:
{noformat}
2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: starting
2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: segment: crawl-1/segments/20241012220155
2024-10-12 22:01:56,725 ERROR o.a.n.p.ParseSegment [pool-2-thread-4] org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:262)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:347)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)
    ... 19 more
{noformat}
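The missing input can be confirmed directly from the host (the container name {{nutch}} is hypothetical):
{code:bash}
# The segment directory exists, but the fetcher never wrote a
# 'content' subdirectory into it:
docker exec nutch ls -l /root/crawl-1/segments/20241012220155
{code}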
This points to a problem while parsing the segment. But going back in time (to find out why no content was retrieved), I saw this:
{noformat}
2024-10-12 22:01:53,296 INFO o.a.n.c.FetchScheduleFactory [LocalJobRunner Map Task Executor #0] Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] defaultInterval=2592000
2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] maxInterval=7776000
2024-10-12 22:01:53,321 INFO o.a.n.n.u.r.RegexURLNormalizer [pool-9-thread-1] can't find rules for scope 'generate_host_count', using default
2024-10-12 22:01:54,264 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: number of items rejected during selection:
2024-10-12 22:01:54,274 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: Partitioning selected urls for politeness.
2024-10-12 22:01:55,275 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: segment: crawl-1/segments/20241012220155
2024-10-12 22:01:56,434 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: finished, elapsed: 3276 ms
2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: No agents listed in 'http.agent.name' property.
2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:604)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:471)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:645)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
This means no content was fetched at all. I can only suspect that the fetcher's exit code was not sufficient for the wrapper (the nutch or crawl script) to detect the malfunction early enough.
*Hint 3:* Improve error handling: the individual Nutch commands should return meaningful exit codes, and the wrapper scripts should act on them, e.g. as sketched below.
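A minimal sketch of the kind of check meant here (the step invocations and segment path are illustrative, not the actual wrapper code):
{code:bash}
#!/bin/bash
# Abort the crawl cycle as soon as one step fails, instead of letting
# the parse step run against an incomplete segment.
SEGMENT=crawl-1/segments/20241012220155   # hypothetical

bin/nutch fetch "$SEGMENT" -threads 10
rc=$?
if [ "$rc" -ne 0 ]; then
  echo "Fetch step failed with exit code $rc; aborting crawl cycle" >&2
  exit "$rc"
fi
bin/nutch parse "$SEGMENT"
{code}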
Now that I know the {{http.agent.name}} property has to be set, I still cannot do so, since the WebApp does not offer it for editing.
*Hint 4:* Please allow adding unknown properties through the configuration
page. More plugins will require more properties in future.