Hi Shashanka, All,

Thank you for your reply!

I'm using Nutch 1.19. I did the injection and segment generation using the
following commands:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments

When I run the fetch command, Nutch stops with errors about hung threads.
I've attached the fetch command output and the nutch-site.xml.

s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
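
For anyone reproducing this, the segment selection above can be written a bit
more defensively. This is only a sketch: latest_segment is a hypothetical
helper name (it is not part of Nutch), and it assumes segment directories are
named by timestamp and start with 2, as in crawl/segments/20240407224534.

```shell
#!/bin/sh
# Print the newest segment directory under the given segments directory.
# Segment names are timestamps, so the lexicographically last match is
# also the most recent one. Returns 1 if no segment directory exists.
latest_segment() {
    segments_dir=$1
    latest=""
    for d in "$segments_dir"/2*; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        if [ -d "$d" ]; then
            latest=$d
        fi
    done
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
}
```

It would be used as s1=$(latest_segment crawl/segments) && bin/nutch fetch "$s1",
so the fetch is skipped rather than run with an empty argument when no segment
has been generated yet.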

My questions are:

1) What do I need to do to get Nutch to continue fetching even if there are
hung threads?
2) Is there a way to avoid these hung threads in the first place?

Thank you
Sheham


On Fri, Apr 19, 2024 at 1:04 AM Shashanka Balakuntala <
shbalakunt...@gmail.com> wrote:

> Hi Shehamizat,
> Please feel free to post your questions in this email thread. Someone from
> the community will be glad to help.
>
> *Regards*
>   Shashanka Balakuntala Srinivasa
>
>
>
> On Fri, 19 Apr 2024 at 7:15 AM, Sheham Izat <shehami...@gmail.com> wrote:
>
> > Hi,
> >
> > I'm trying to get Nutch to work and I have issues, how can I post
> > questions on the group?
> >
> > Thank you,
> > Sheham
> >
>
[root@localhost apache-nutch-1.19]# bin/nutch fetch $s1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2024-04-07 22:46:27,222 INFO o.a.n.p.PluginManifestParser [main] Plugins: 
looking in: /opt/apache-nutch-1.19/plugins
2024-04-07 22:46:27,353 INFO o.a.n.p.PluginRepository [main] Plugin 
Auto-activation mode: [true]
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Registered Plugins:
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main]    Regex URL 
Filter (urlfilter-regex)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main]    Html Parse 
Plug-in (parse-html)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main]    HTTP Framework 
(lib-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    the nutch core 
extension points (nutch-extensionpoints)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Basic Indexing 
Filter (index-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Anchor Indexing 
Filter (index-anchor)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Tika Parser 
Plug-in (parse-tika)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Basic URL 
Normalizer (urlnormalizer-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Regex URL 
Filter Framework (lib-regex-filter)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Regex URL 
Normalizer (urlnormalizer-regex)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    URL Validator 
(urlfilter-validator)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    CyberNeko HTML 
Parser (lib-nekohtml)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    OPIC Scoring 
Plug-in (scoring-opic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Pass-through 
URL Normalizer (urlnormalizer-pass)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    Http Protocol 
Plug-in (protocol-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main]    SolrIndexWriter 
(indexer-solr)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Registered 
Extension-Points:
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch Content 
Parser)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch URL 
Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (HTML Parse 
Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch Scoring)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch URL 
Normalizer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch 
Publisher)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch 
Exchange)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch 
Protocol)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch URL 
Ignore Exemption Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch Index 
Writer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch Segment 
Merge Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main]     (Nutch 
Indexing Filter)
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: starting at 
2024-04-07 22:46:27
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: segment: 
crawl/segments/20240407224534
2024-04-07 22:46:28,109 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task 
Executor #0] Using queue mode : byHost
2024-04-07 22:46:28,110 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] Fetcher: threads: 10
2024-04-07 22:46:28,130 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] Fetcher: time-out divisor: 2
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder 
finished: total 60 records
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder 
queuing status:
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder]  60      
SUCCESSFULLY_QUEUED
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder]  0       
ERROR_CREATE_FETCH_ITEM
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder]  0       
ABOVE_EXCEPTION_THRESHOLD
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder]  0       
HIT_BY_TIMELIMIT
2024-04-07 22:46:28,160 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,177 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,178 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://ladot.lacity.org/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,188 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.proxy.host = 
null
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.proxy.port = 
8080
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] 
http.proxy.exception.list = false
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.timeout = 10000
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.content.limit 
= 1048576
2024-04-07 22:46:28,288 INFO o.a.n.p.h.Http [FetcherThread] http.agent = 
Spirawndex Nutch Spider/Nutch-1.19
2024-04-07 22:46:28,288 INFO o.a.n.p.h.Http [FetcherThread] 
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2024-04-07 22:46:28,288 INFO o.a.n.p.h.Http [FetcherThread] http.accept = 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2024-04-07 22:46:28,289 INFO o.a.n.p.h.Http [FetcherThread] 
http.enable.cookie.header = true
2024-04-07 22:46:28,291 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,293 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 61 fetching https://disneyland.disney.go.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:28,302 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,303 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,304 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://www.getapp.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,313 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,314 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,315 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 fetching https://www.kayemfoodservice.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:28,325 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,326 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,327 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://theculturetrip.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,337 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,339 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,340 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 65 fetching https://www.slideshare.net/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,350 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,352 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,353 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 fetching https://appexchange.salesforce.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:28,363 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,364 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,366 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 67 fetching https://www.lewisginter.org/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:28,376 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,377 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,378 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 fetching https://maps.google.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,388 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map 
Task Executor #0] Found 0 extensions at 
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,388 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,389 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] Fetcher: throughput threshold: -1
2024-04-07 22:46:28,390 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] Fetcher: throughput threshold retries: 5
2024-04-07 22:46:28,390 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://www.ballseed.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,928 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread] 
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:28,952 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetch of https://www.getapp.com/ failed with: Http code=403, 
url=https://www.getapp.com/
2024-04-07 22:46:28,952 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
www.getapp.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:28,953 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://www.youtube.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,955 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 fetching https://www.thefreedictionary.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:29,066 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://mathiasconradt.medium.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:29,171 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 65 fetching https://sourceforge.net/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,269 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 fetching https://bitnami.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,281 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 67 fetching https://www.hyland.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,389 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://github.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,393 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=43, 
fetchQueues.getQueueCount=60
2024-04-07 22:46:29,621 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://microstrat.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,632 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 fetching https://www.caryillinois.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:29,650 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread] 
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:29,652 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://www.wilsonappliance.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:29,688 INFO o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] 
Couldn't get robots.txt for https://bitnami.com/: java.net.SocketException: 
Socket is closed
2024-04-07 22:46:29,708 ERROR o.a.n.p.h.Http [FetcherThread] Failed to get protocol output
java.net.SocketException: Socket is closed
        at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1129) ~[?:?]
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:163) ~[?:?]
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:65) ~[?:?]
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:393) ~[?:?]
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:381) ~[apache-nutch-1.19.jar:?]
2024-04-07 22:46:29,713 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 fetch of https://bitnami.com/ failed with: 
java.net.SocketException: Socket is closed
2024-04-07 22:46:29,714 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
bitnami.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:29,714 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 fetching https://www.inwoodartworks.nyc/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:29,816 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by 
robots.txt: https://sourceforge.net/
2024-04-07 22:46:29,816 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 65 fetching https://www.carahsoft.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,990 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://www.pirch.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,139 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread] 
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:30,140 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 67 fetching https://www.lowes.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,166 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://www.benjaminmoore.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:30,394 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=35, 
fetchQueues.getQueueCount=60
2024-04-07 22:46:30,533 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 fetching https://www.jamieoliver.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:30,540 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://hub.docker.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,607 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://onsclothing.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,608 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 fetching https://www.burpeehomegardens.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:30,772 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://www.crateandbarrel.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:30,786 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread] 
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:30,787 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://kinto-usa.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,951 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 fetching https://dictionary.cambridge.org/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:30,988 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by 
robots.txt: https://onsclothing.com/
2024-04-07 22:46:30,989 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://www.stylemepretty.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:31,175 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetch of https://www.crateandbarrel.com/ failed with: Http 
code=403, url=https://www.crateandbarrel.com/
2024-04-07 22:46:31,176 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
www.crateandbarrel.com >> delayed next fetch by 5000 ms after 1 exceptions in 
queue
2024-04-07 22:46:31,177 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://www.seattlespheres.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:31,278 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 fetching https://www.efcontractflooring.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:31,394 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=25, 
fetchQueues.getQueueCount=60
2024-04-07 22:46:31,436 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 65 fetching https://www.adu.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,480 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 fetching https://en.wiktionary.org/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,502 WARN o.a.n.p.h.Http [FetcherThread] Missing or invalid HTTP status line
org.apache.nutch.protocol.http.api.HttpException: Bad status line, no HTTP response code:
        at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:571) ~[?:?]
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:275) ~[?:?]
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:65) ~[?:?]
        at org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:133) ~[?:?]
        at org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:235) ~[apache-nutch-1.19.jar:?]
        at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:766) ~[?:?]
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:319) ~[apache-nutch-1.19.jar:?]
Caused by: java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:?]
        at java.lang.Integer.parseInt(Integer.java:662) ~[?:?]
        at java.lang.Integer.parseInt(Integer.java:770) ~[?:?]
        at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:569) ~[?:?]
        ... 6 more
2024-04-07 22:46:31,504 WARN o.a.n.p.h.Http [FetcherThread] No HTTP header, assuming HTTP/0.9 for https://dictionary.cambridge.org/robots.txt
2024-04-07 22:46:31,552 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://en.wikipedia.org/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,561 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://pitchbook.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,596 WARN o.a.n.p.h.Http [FetcherThread] Missing or invalid HTTP status line
org.apache.nutch.protocol.http.api.HttpException: Bad status line, no HTTP response code:
        at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:571) ~[?:?]
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:275) ~[?:?]
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:65) ~[?:?]
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:393) ~[?:?]
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:381) ~[apache-nutch-1.19.jar:?]
Caused by: java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:?]
        at java.lang.Integer.parseInt(Integer.java:662) ~[?:?]
        at java.lang.Integer.parseInt(Integer.java:770) ~[?:?]
        at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:569) ~[?:?]
        ... 4 more
2024-04-07 22:46:31,600 WARN o.a.n.p.h.Http [FetcherThread] No HTTP header, assuming HTTP/0.9 for https://dictionary.cambridge.org/
2024-04-07 22:46:31,602 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 fetching https://www.gartner.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,748 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 fetching https://www.crunchbase.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,767 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread] 
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:31,767 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://access.redhat.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,853 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by 
robots.txt: https://kinto-usa.com/
2024-04-07 22:46:31,853 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://www.cityofsacramento.org/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:31,901 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetch of https://pitchbook.com/ failed with: Http code=403, 
url=https://pitchbook.com/
2024-04-07 22:46:31,901 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
pitchbook.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:31,901 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://twitter.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,029 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 fetch of https://www.gartner.com/ failed with: Http code=403, 
url=https://www.gartner.com/
2024-04-07 22:46:32,029 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
www.gartner.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:32,030 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 fetching https://www.aggressiveappliances.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:32,067 WARN c.r.SimpleRobotRulesParser [FetcherThread] Problem 
processing robots.txt for https://twitter.com/
2024-04-07 22:46:32,067 WARN c.r.SimpleRobotRulesParser [FetcherThread]         
 Unknown line in robots.txt file (size 1350): Noindex: /i/u
2024-04-07 22:46:32,067 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by 
robots.txt: https://twitter.com/
2024-04-07 22:46:32,067 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://www.softwareadvice.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:32,270 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://www.facebook.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,395 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=13, 
fetchQueues.getQueueCount=60
2024-04-07 22:46:32,414 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://www.g2.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,496 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://accounts.google.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:32,620 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://www.groveresortorlando.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:32,641 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://visitabingdonvirginia.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:32,690 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://alfresco-content-app.netlify.app/ (queue 
crawl delay=5000ms)
2024-04-07 22:46:32,801 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetch of https://www.g2.com/ failed with: Http code=403, 
url=https://www.g2.com/
2024-04-07 22:46:32,801 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
www.g2.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:32,801 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://www.linkedin.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,942 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://www.trustradius.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:32,956 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 fetching https://www.foodnetwork.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:32,986 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by 
robots.txt: https://www.linkedin.com/
2024-04-07 22:46:32,987 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://www.instagram.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,145 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 fetching https://www.imdb.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,148 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by 
robots.txt: https://www.instagram.com/
2024-04-07 22:46:33,149 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 fetching https://lolldesigns.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,159 WARN c.r.SimpleRobotRulesParser [FetcherThread] Problem 
processing robots.txt for https://www.trustradius.com/
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]         
 Unknown line in robots.txt file (size 1158): Noindex: /api/
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]         
 Unknown line in robots.txt file (size 1158): Noindex: /share/
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]         
 Unknown line in robots.txt file (size 1158): Noindex: /share
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]         
 Unknown line in robots.txt file (size 1158): Noindex: /search/
2024-04-07 22:46:33,271 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetch of https://www.trustradius.com/ failed with: Http 
code=403, url=https://www.trustradius.com/
2024-04-07 22:46:33,271 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
www.trustradius.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:33,272 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 62 fetching https://api.onlyoffice.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,348 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 fetching https://www.choosechicago.com/ (queue crawl 
delay=5000ms)
2024-04-07 22:46:33,364 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by 
robots.txt: https://lolldesigns.com/
2024-04-07 22:46:33,365 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 has no more work available
2024-04-07 22:46:33,365 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 64 -finishing thread FetcherThread, activeThreads=9
2024-04-07 22:46:33,383 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 has no more work available
2024-04-07 22:46:33,383 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 68 -finishing thread FetcherThread, activeThreads=8
2024-04-07 22:46:33,396 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor 
#0] -activeThreads=8, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=8
2024-04-07 22:46:33,401 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 has no more work available
2024-04-07 22:46:33,402 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 66 -finishing thread FetcherThread, activeThreads=7
2024-04-07 22:46:33,731 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 has no more work available
2024-04-07 22:46:33,731 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 69 -finishing thread FetcherThread, activeThreads=6
2024-04-07 22:46:33,997 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 has no more work available
2024-04-07 22:46:33,997 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 63 -finishing thread FetcherThread, activeThreads=5
2024-04-07 22:46:34,277 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 has no more work available
2024-04-07 22:46:34,277 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 60 -finishing thread FetcherThread, activeThreads=4
2024-04-07 22:46:34,396 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=4, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=4
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 4 hung threads.
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Thread #1 hung while processing https://disneyland.disney.go.com/
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Thread #2 hung while processing https://api.onlyoffice.com/
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Thread #5 hung while processing https://www.adu.com/
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Thread #7 hung while processing https://www.lowes.com/
2024-04-07 22:46:35,032 INFO o.a.n.f.Fetcher [main] Fetcher: finished at 2024-04-07 22:46:35, elapsed: 00:00:07

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch Spider</value>
  </property>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>1800</value>
  </property>
</configuration>
