[ 
https://issues.apache.org/jira/browse/NUTCH-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889736#comment-17889736
 ] 

Hiran Chaudhuri commented on NUTCH-3077:
----------------------------------------

That would explain why it is not mentioned in any documentation and not 
installed by default.

> Docker container with Nutch WebApp works but missing drill-down
> ---------------------------------------------------------------
>
>                 Key: NUTCH-3077
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3077
>             Project: Nutch
>          Issue Type: Improvement
>          Components: docker, web gui
>    Affects Versions: 1.21
>         Environment: Docker container built under Ubuntu 22 LTS using 
> BUILD_MODE=2
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> I used the current master branch of Nutch (commit 
> 4a61208f492613f2c5282741e64c036acabeb71e) to build the docker container 
> myself.
> *Hint 1:* Please make the chosen `BUILD_MODE` visible at runtime, e.g. as a 
> startup message, an image tag, or an environment variable. That way, when 
> reverse-engineering, it is easier to discover that on 
> hub.docker.com/apache/nutch:latest the REST API and WebApp are absent by 
> design.
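One minimal way to do this (a sketch only; the `NUTCH_BUILD_MODE` variable name is hypothetical and not part of the current image) would be to record the build argument in the image's Dockerfile:

```dockerfile
# Hypothetical: persist the build mode chosen at image build time,
# so it is visible via `docker inspect` or inside a running container.
ARG BUILD_MODE=0
ENV NUTCH_BUILD_MODE=${BUILD_MODE}
```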
>  
> When building with `BUILD_MODE=2`, both the REST API and the WebApp start up 
> as expected. I configured a seed URL and started a crawl, which very quickly 
> ran into an error condition. The UI does not offer any means of 
> troubleshooting.
> *Hint 2:* Please either add drill-down capability to the WebApp, or point out 
> how the errors can be investigated.
>  
> I attached to the container and executed
> {{cat /var/log/supervisord/nutchserver_stdout.log}}
> In this file I found an error like:
> {{2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] 
> ParseSegment: starting}}
> {{2024-10-12 22:01:56,710 INFO o.a.n.p.ParseSegment [pool-2-thread-4] 
> ParseSegment: segment: crawl-1/segments/20241012220155}}
> {{2024-10-12 22:01:56,725 ERROR o.a.n.p.ParseSegment [pool-2-thread-4] 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: file:/root/crawl-1/segments/20241012220155/content}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
> {{    at java.base/java.security.AccessController.doPrivileged(Native 
> Method)}}
> {{    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)}}
> {{    at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
> {{    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
> {{    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
> {{    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:262)}}
> {{    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:347)}}
> {{    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)}}
> {{    at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
> {{    at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}
> {{    at java.base/java.lang.Thread.run(Thread.java:829)}}
> {{Caused by: java.io.IOException: Input path does not exist: 
> file:/root/crawl-1/segments/20241012220155/content}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
> {{    ... 19 more}}
>  
> This points to a problem while parsing the segment. But going back in the log 
> (to find out why no content was retrieved), I saw this:
> {{2024-10-12 22:01:53,296 INFO o.a.n.c.FetchScheduleFactory [LocalJobRunner 
> Map Task Executor #0] Using FetchSchedule impl: 
> org.apache.nutch.crawl.DefaultFetchSchedule}}
> {{2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner 
> Map Task Executor #0] defaultInterval=2592000}}
> {{2024-10-12 22:01:53,297 INFO o.a.n.c.AbstractFetchSchedule [LocalJobRunner 
> Map Task Executor #0] maxInterval=7776000}}
> {{2024-10-12 22:01:53,321 INFO o.a.n.n.u.r.RegexURLNormalizer 
> [pool-9-thread-1] can't find rules for scope 'generate_host_count', using 
> default}}
> {{2024-10-12 22:01:54,264 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: 
> number of items rejected during selection:}}
> {{2024-10-12 22:01:54,274 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: 
> Partitioning selected urls for politeness.}}
> {{2024-10-12 22:01:55,275 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: 
> segment: crawl-1/segments/20241012220155}}
> {{2024-10-12 22:01:56,434 INFO o.a.n.c.Generator [pool-2-thread-2] Generator: 
> finished, elapsed: 3276 ms}}
> {{2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: No 
> agents listed in 'http.agent.name' property.}}
> {{2024-10-12 22:01:56,705 ERROR o.a.n.f.Fetcher [pool-2-thread-3] Fetcher: 
> java.lang.IllegalArgumentException: Fetcher: No agents listed in 
> 'http.agent.name' property.}}
> {{    at 
> org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:604)}}
> {{    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:471)}}
> {{    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:645)}}
> {{    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:74)}}
> {{    at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
> {{    at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}
> {{    at java.base/java.lang.Thread.run(Thread.java:829)}}
>  
> This means no content was fetched at all. I can only suspect that the 
> fetcher's exit code was insufficient for the wrapper (the nutch or crawl 
> script) to detect the malfunction early enough.
>  
> *Hint 3:* Please improve error handling for the individual Nutch commands and 
> the wrapper scripts, with meaningful exit codes.
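To illustrate Hint 3, a wrapper could stop as soon as one step fails, instead of letting the next step run against a missing segment. This is a sketch only; `run_step` is a hypothetical helper, not part of the existing crawl script, and the Nutch commands in the usage comments are illustrative:

```shell
#!/bin/sh
# Hypothetical helper: run one crawl step and propagate a non-zero exit code
# so the caller can abort instead of continuing with a broken segment.
run_step() {
  "$@" || {
    status=$?
    echo "Step '$1' failed with exit code $status; aborting crawl." >&2
    return "$status"
  }
}

# Usage sketch (paths and commands illustrative):
# run_step bin/nutch generate crawl/crawldb crawl/segments || exit $?
# run_step bin/nutch fetch "$SEGMENT" || exit $?
# run_step bin/nutch parse "$SEGMENT" || exit $?
```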
>  
> Now that I know the {{http.agent.name}} property has to be set, I cannot do 
> so, since the WebApp does not offer it for editing.
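As a workaround outside the WebApp, the property can be set in `conf/nutch-site.xml` inside the container, which is the standard Nutch configuration mechanism (the value below is only an example; choose your own agent name):

```xml
<!-- conf/nutch-site.xml: identify the crawler to remote servers -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
```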
> *Hint 4:* Please allow adding unknown properties through the configuration 
> page. More plugins will require more properties in the future.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)