[ https://issues.apache.org/jira/browse/NUTCH-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889736#comment-17889736 ]
Hiran Chaudhuri commented on NUTCH-3077:
----------------------------------------

That would explain why it is not mentioned in any documentation and not installed by default.

> Docker container with Nutch WebApp works but missing drill-down
> ---------------------------------------------------------------
>
>                 Key: NUTCH-3077
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3077
>             Project: Nutch
>          Issue Type: Improvement
>          Components: docker, web gui
>    Affects Versions: 1.21
>         Environment: Docker container built under Ubuntu 22 LTS using BUILD_MODE=2
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> I used the current master branch of Nutch (commit 4a61208f492613f2c5282741e64c036acabeb71e) to build the Docker container myself.
>
> *Hint 1:* Please make the chosen `BUILD_MODE` visible at runtime, for example as a startup message, an image tag, or an environment variable. That way, when reverse-engineering, it is easier to find out that on hub.docker.com/apache/nutch:latest the REST API and WebApp are missing by design.
>
> When building with `BUILD_MODE=2`, both the REST API and the WebApp start up as expected. I configured a seed URL and started a crawl, which very quickly ran into an error condition. The UI does not offer any means of troubleshooting.
>
> *Hint 2:* Please either add drill-down capability to the WebApp, or document how the errors can be investigated.
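One way Hint 1 could be addressed is sketched below. It assumes a hypothetical `NUTCH_BUILD_MODE` environment variable baked into the image at build time (e.g. via `ENV NUTCH_BUILD_MODE=${BUILD_MODE}` in the Dockerfile; neither the variable nor the banner exists in the current image) and printed by the entrypoint at startup:

```shell
# Sketch: announce the build mode in the container logs at startup.
# NUTCH_BUILD_MODE is a hypothetical variable that the Dockerfile would
# set from the BUILD_MODE build argument; "unknown" covers older images.
NUTCH_BUILD_MODE="${NUTCH_BUILD_MODE:-unknown}"
echo "Apache Nutch starting with BUILD_MODE=${NUTCH_BUILD_MODE}"
```

Alternatively, the value could be recorded as an image label, so `docker inspect` reveals it without starting the container at all.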
> I attached to the container and executed
>
> cat /var/log/supervisord/nutchserver_stdout.log
>
> In this file I could find an error log like:
>
> 2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: starting
> 2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: segment: crawl-1/segments/20241012220155
> 2024-10-12 22:01:56,725 ERROR o.a.n.p.ParseSegment [pool-2-thread-4] org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:262)
>     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:347)
>     at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.io.IOException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)
>     ... 19 more
>
> This points to a problem when parsing the segment. But going back in time (to find out why no content was retrieved), I saw this:
>
> 2024-10-12 22:01:53,296 INFO o.a.n.c.FetchScheduleFactory [LocalJobRunner Map Task Executor #0] Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] defaultInterval=2592000
> 2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] maxInterval=7776000
> 2024-10-12 22:01:53,321 INFO o.a.n.n.u.r.RegexURLNormalizer [pool-9-thread-1] can't find rules for scope 'generate_host_count', using default
> 2024-10-12 22:01:54,264 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: number of items rejected during selection:
> 2024-10-12 22:01:54,274 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: Partitioning selected urls for politeness.
> 2024-10-12 22:01:55,275 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: segment: crawl-1/segments/20241012220155
> 2024-10-12 22:01:56,434 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: finished, elapsed: 3276 ms
> 2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: No agents listed in 'http.agent.name' property.
> 2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher:
> java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:604)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:471)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:645)
>     at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
>
> This means no content was fetched at all. I can only suspect that the exit code of the fetcher was insufficient for the wrapper (the nutch or the crawl script) to detect the malfunction early enough.
>
> *Hint 3:* Improve error handling in the individual Nutch commands and the wrapper scripts so that failures are reported with meaningful exit codes.
>
> Now that I know the `http.agent.name` property has to be set, I still cannot set it, since the WebApp does not offer it for editing.
>
> *Hint 4:* Please allow adding unknown properties through the configuration page. More plugins will require more properties in the future.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
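As a workaround for the fetcher error above, until the WebApp configuration page accepts arbitrary properties, `http.agent.name` can be set directly in `conf/nutch-site.xml` inside the container. The property name is standard Nutch configuration; the value below is only an example and must be replaced with your own agent string:

```xml
<!-- conf/nutch-site.xml: http.agent.name must be non-empty,
     otherwise the Fetcher aborts with the IllegalArgumentException shown above. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value> <!-- example value; choose your own -->
  </property>
</configuration>
```

After editing the file, the NutchServer process has to be restarted for the change to take effect.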