This is an automated email from the ASF dual-hosted git repository.
rzo1 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/stormcrawler.git
The following commit(s) were added to refs/heads/main by this push:
new 47a29da2 Fix #1763 - Documentation fixes from #1714 (#1765)
47a29da2 is described below
commit 47a29da251d4d89f7d6c822c71da0270fdc706d7
Author: Richard Zowalla <[email protected]>
AuthorDate: Fri Dec 26 19:37:07 2025 +0100
Fix #1763 - Documentation fixes from #1714 (#1765)
---
docs/src/main/asciidoc/architecture.adoc | 2 +-
docs/src/main/asciidoc/configuration.adoc | 2 +-
docs/src/main/asciidoc/internals.adoc | 18 +++++++++---------
docs/src/main/asciidoc/quick-start.adoc | 2 +-
4 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/docs/src/main/asciidoc/architecture.adoc b/docs/src/main/asciidoc/architecture.adoc
index b4ba45b0..bbacfae6 100644
--- a/docs/src/main/asciidoc/architecture.adoc
+++ b/docs/src/main/asciidoc/architecture.adoc
@@ -4,7 +4,7 @@ You may not use this file except in compliance with the License.
You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
////
-
+[[architecture]]
== Understanding StormCrawler's Architecture
=== Architecture Overview
diff --git a/docs/src/main/asciidoc/configuration.adoc b/docs/src/main/asciidoc/configuration.adoc
index 02da291c..1c3a50c0 100644
--- a/docs/src/main/asciidoc/configuration.adoc
+++ b/docs/src/main/asciidoc/configuration.adoc
@@ -53,7 +53,7 @@ This is what the configuration `http.robots.agents` allows you to do. It is a co
=== Proxy
-StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal pr [...]
+StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal pr [...]
The **ProxyManager** interface can be implemented in a custom class to create
custom logic for proxy management and load balancing. The default
**ProxyManager** implementation is **SingleProxyManager**. This ensures
backwards compatibility for prior StormCrawler releases. To use
**MultiProxyManager** or custom implementations, pass the class path and name
via the config parameter `http.proxy.manager`:
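The `http.proxy.manager` parameter described above can be illustrated with a minimal configuration sketch. Note that the fully-qualified class name below is an assumption inferred from the `org.apache.stormcrawler.proxy` package shown in the links, not something stated in this commit; check the actual class name in your StormCrawler release.

```yaml
# crawler-conf.yaml (sketch): select a ProxyManager implementation.
# The FQCN below is an assumption based on the org.apache.stormcrawler.proxy
# package; omit this key to keep the default SingleProxyManager.
http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"
```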
diff --git a/docs/src/main/asciidoc/internals.adoc b/docs/src/main/asciidoc/internals.adoc
index 7f552ded..6f517adc 100644
--- a/docs/src/main/asciidoc/internals.adoc
+++ b/docs/src/main/asciidoc/internals.adoc
@@ -10,9 +10,9 @@ https://www.apache.org/licenses/LICENSE-2.0
The Apache StormCrawler components rely on two Apache Storm streams: the
_default_ one and another one called _status_.
-The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., ElasticSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.
+The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., OpenSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.
-This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be ElasticSearch, HBase, etc...), regardless of whether the crawler is recursive or not.
+This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be OpenSearch, HBase, etc...), regardless of whether the crawler is recursive or not.
Tuples are emitted on the _status_ stream by the parsing bolts for handling
outlinks but also to notify that there has been a problem with a URL (e.g.,
unparsable content). It is also used by the fetching bolts to handle
redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400).
@@ -29,7 +29,7 @@ As you can see for instance in link:https://github.com/apache/stormcrawler/blob/
The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status]
enum has the following values:
-* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs can be already known in the storage.
+* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the spouts or "injected" into the storage. The URLs can be already known in the storage.
* REDIRECTION:: set by the fetcher bolts.
* FETCH_ERROR:: set by the fetcher bolts.
* ERROR:: used by either the fetcher, parser, or indexer bolts.
@@ -41,7 +41,7 @@ The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org
The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] can be extended to handle status updates for a specific backend. It has an internal cache of URLs with a `discovered` status so that they don't get added to the backend if they already exist, which is a simple but efficient optimisation. It also uses link:https://github.com/apache/stormcrawler/blob/main/core/src/m [...]
-In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for Elasticsearch.
+In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for OpenSearch.
=== Bolts
@@ -71,14 +71,14 @@ The **FetcherBolt** has an internal set of queues where the incoming URLs are pl
Incoming tuples spend very little time in the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute] method of the **FetcherBolt** as they are put in the FetchQueues, which is why you'll find that the value of **Execute latency** in the Storm UI is pretty low. They get acked later on, after they've been fetched. The metric to watch for in the Storm UI is **Process latency**.
-The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https:/
+The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https://github.com/apache/st [...]
=== Indexer Bolts
The purpose of crawlers is often to index web pages to make them searchable.
The project contains resources for indexing with popular search solutions such
as:
-* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
-* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch]
-* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/com/digitalpebble/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
+* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java[OpenSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/org/apache/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]
All of these extend the class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java[AbstractIndexerBolt].
@@ -104,7 +104,7 @@ You can easily build your own custom indexer to integrate with other storage sys
=== Parser Bolts
==== JSoupParserBolt
-The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/src/main/java/com/digitalpebble/storm/crawler/tika[Tika-based ParserBolt] from the external modules.
+The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/tika/src/main/java/org/apache/stormcrawler/tika[Tika-based ParserBolt] from the external modules.
This parser calls the xref:urlfilters[URLFilters] and
xref:parsefilters[ParseFilters] defined in the configuration. Please note that
it calls xref:metadatatransfer[MetadataTransfer] prior to calling the
xref:parsefilters[ParseFilters]. If you create new Outlinks in your
[[ParseFilters]], you'll need to make sure that you use MetadataTransfer there
to inherit the Metadata from the parent document.
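The `store(String URL, Status status, Metadata metadata, Date nextFetch)` contract described in the internals section above can be sketched as follows. This is an illustrative skeleton only: the `Status` enum and `Metadata` class are minimal stand-ins so the sketch compiles without the StormCrawler jars, and the in-memory map is a hypothetical backend; a real implementation would extend `org.apache.stormcrawler.persistence.AbstractStatusUpdaterBolt` and write to a store such as OpenSearch or HBase.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.stormcrawler.persistence.Status (values from the docs text).
enum Status { DISCOVERED, FETCHED, REDIRECTION, FETCH_ERROR, ERROR }

// Minimal stand-in for org.apache.stormcrawler.Metadata.
class Metadata {
    private final Map<String, String> kv = new HashMap<>();
    void set(String key, String value) { kv.put(key, value); }
    String get(String key) { return kv.get(key); }
}

// Sketch of a status updater; a real one extends AbstractStatusUpdaterBolt,
// which also caches DISCOVERED URLs so known ones are not re-added.
public class InMemoryStatusUpdaterBolt {
    final Map<String, Status> backend = new HashMap<>();

    // Mirrors the method named in the docs:
    // store(String URL, Status status, Metadata metadata, Date nextFetch)
    public void store(String url, Status status, Metadata metadata, Date nextFetch) {
        backend.put(url, status);
    }

    public static void main(String[] args) {
        InMemoryStatusUpdaterBolt bolt = new InMemoryStatusUpdaterBolt();
        Metadata md = new Metadata();
        md.set("depth", "1");
        bolt.store("https://example.com/", Status.DISCOVERED, md, new Date());
        System.out.println(bolt.backend.get("https://example.com/")); // DISCOVERED
    }
}
```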
diff --git a/docs/src/main/asciidoc/quick-start.adoc b/docs/src/main/asciidoc/quick-start.adoc
index b3f6f89b..faaef3a8 100644
--- a/docs/src/main/asciidoc/quick-start.adoc
+++ b/docs/src/main/asciidoc/quick-start.adoc
@@ -63,7 +63,7 @@ The archetype will generate a fully-structured project including:
After generation, navigate into the newly created directory (named after the
`artifactId` you specified).
-TIP: You can learn more about the architecture and how each component works together if you look into link:architecture.adoc[the architecture documentation].
+TIP: You can learn more about the architecture and how each component works together if you look into xref:architecture[the architecture documentation].
By exploring that part of the documentation, you can gain a better
understanding of how StormCrawler performs crawling and how bolts, spouts, as
well as parse and URL filters, collaborate in the process.
==== Docker Compose Setup