[ 
https://issues.apache.org/jira/browse/NUTCH-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889736#comment-17889736
 ] 

Hiran Chaudhuri edited comment on NUTCH-3077 at 10/15/24 3:57 PM:
------------------------------------------------------------------

That would explain why it is not mentioned in any documentation and not 
installed by default.

 

OK, let's mark this webapp as experimental, then. What about the other 
suggestions?


was (Author: hiranchaudhuri):
That would explain why it is not mentioned in any documentation and not 
installed by default.

> Docker container with Nutch WebApp works but missing drill-down
> ---------------------------------------------------------------
>
>                 Key: NUTCH-3077
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3077
>             Project: Nutch
>          Issue Type: Improvement
>          Components: docker, web gui
>    Affects Versions: 1.21
>         Environment: Docker container built under Ubuntu 22 LTS using 
> BUILD_MODE=2
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> I used the current master branch of Nutch (commit 
> 4a61208f492613f2c5282741e64c036acabeb71e) to build the docker container 
> myself.
> *Hint 1:* Please make the chosen `BUILD_MODE` visible at runtime, e.g. as a 
> startup message, an image tag or an environment variable, so that when 
> reverse-engineering it is easy to find out that on 
> hub.docker.com/apache/nutch:latest the REST API and WebApp are absent by 
> design.
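As an illustration of Hint 1, the image could record the build mode in an environment variable at build time and announce it at startup. This is only a sketch: the variable name `NUTCH_BUILD_MODE` and the `ARG`/`ENV` wiring are assumptions, not something the current Nutch Dockerfile provides.

```shell
# Sketch only: assumes the Dockerfile were extended with
#   ARG BUILD_MODE
#   ENV NUTCH_BUILD_MODE=${BUILD_MODE}
# A startup script could then announce the mode, falling back to
# "unknown" for images built before such a change:
echo "Nutch container starting (BUILD_MODE=${NUTCH_BUILD_MODE:-unknown})"
```

With that in place, `docker inspect` or a glance at the container log would immediately tell apart a `BUILD_MODE=2` image from the default one.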
>  
> When building with `BUILD_MODE=2`, both the REST API and the WebApp start up 
> as expected. I configured a seed URL and started a crawl, which very quickly 
> ran into an error condition. The UI does not offer any means of 
> troubleshooting.
> *Hint 2:* Please either add drill-down capability to the WebApp, or point out 
> how the errors can be investigated.
>  
> I attached to the container and executed
> {{cat /var/log/supervisord/nutchserver_stdout.log}}
> In this file I could find an error log like
> {noformat}
> 2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: starting
> 2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] ParseSegment: segment: crawl-1/segments/20241012220155
> 2024-10-12 22:01:56,725 ERROR o.a.n.p.ParseSegment [pool-2-thread-4] org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:262)
>     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:347)
>     at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.io.IOException: Input path does not exist: file:/root/crawl-1/segments/20241012220155/content
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)
>     ... 19 more
> {noformat}
>  
> This points to a problem parsing the segment. Going back in time (to find 
> out why no content was retrieved), I saw this:
> {noformat}
> 2024-10-12 22:01:53,296 INFO o.a.n.c.FetchScheduleFactory [LocalJobRunner Map Task Executor #0] Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] defaultInterval=2592000
> 2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner Map Task Executor #0] maxInterval=7776000
> 2024-10-12 22:01:53,321 INFO o.a.n.n.u.r.RegexURLNormalizer [pool-9-thread-1] can't find rules for scope 'generate_host_count', using default
> 2024-10-12 22:01:54,264 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: number of items rejected during selection:
> 2024-10-12 22:01:54,274 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: Partitioning selected urls for politeness.
> 2024-10-12 22:01:55,275 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: segment: crawl-1/segments/20241012220155
> 2024-10-12 22:01:56,434 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: finished, elapsed: 3276 ms
> 2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: No agents listed in 'http.agent.name' property.
> 2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:604)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:471)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:645)
>     at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
>  
> It means no content was fetched at all. I can only suspect that the fetcher's 
> exit code was not sufficient for the wrapper (the nutch or the crawl script) 
> to detect the malfunction early enough.
>  
> *Hint 3:* Please improve error handling in the individual Nutch commands and 
> the wrapper scripts by returning meaningful exit codes.
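A minimal sketch of what Hint 3 could look like in a wrapper script. The `run_phase` helper is hypothetical (not part of Nutch's scripts); the `bin/nutch` sub-commands shown in the comments are only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: run one crawl phase and abort the whole crawl
# with the phase's own exit code as soon as it fails, instead of
# silently continuing to the next phase.
run_phase() {
  phase="$1"; shift
  echo ">> running phase: $phase"
  "$@"
  rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "ERROR: phase '$phase' failed with exit code $rc" >&2
    exit "$rc"
  fi
}

# In a real wrapper the phases would be the usual Nutch steps, e.g.:
#   run_phase inject   bin/nutch inject crawl/crawldb urls
#   run_phase generate bin/nutch generate crawl/crawldb crawl/segments
#   run_phase fetch    bin/nutch fetch "$SEGMENT"
# Demonstration with a stand-in command:
run_phase demo-ok true
```

With this pattern, a failing fetch (such as the missing `http.agent.name` above) would stop the run immediately instead of letting the parse step fail later on a missing segment directory.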
>  
> Now that I know the {{http.agent.name}} property has to be set, I cannot do 
> so, since the webapp does not offer it for editing.
> *Hint 4:* Please allow adding unknown properties through the configuration 
> page. More plugins will require more properties in future.
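For the record, the usual workaround for Hint 4 today is to set the property directly in `conf/nutch-site.xml` inside the container and restart the services; the agent string below is only an example value, not anything Nutch prescribes.

```xml
<!-- conf/nutch-site.xml; the value is an example, pick your own agent name -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ExampleNutchCrawler</value>
  </property>
</configuration>
```

Being able to do the same through the WebApp's configuration page would avoid attaching to the container at all.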



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
