This is an automated email from the ASF dual-hosted git repository.
rzo1 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/stormcrawler.git
The following commit(s) were added to refs/heads/main by this push:
new 47a29da2 Fix #1763 - Documentation fixes from #1714 (#1765)
47a29da2 is described below
commit 47a29da251d4d89f7d6c822c71da0270fdc706d7
Author: Richard Zowalla <[email protected]>
AuthorDate: Fri Dec 26 19:37:07 2025 +0100
Fix #1763 - Documentation fixes from #1714 (#1765)
---
docs/src/main/asciidoc/architecture.adoc | 2 +-
docs/src/main/asciidoc/configuration.adoc | 2 +-
docs/src/main/asciidoc/internals.adoc | 18 +++++++++---------
docs/src/main/asciidoc/quick-start.adoc | 2 +-
4 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/docs/src/main/asciidoc/architecture.adoc b/docs/src/main/asciidoc/architecture.adoc
index b4ba45b0..bbacfae6 100644
--- a/docs/src/main/asciidoc/architecture.adoc
+++ b/docs/src/main/asciidoc/architecture.adoc
@@ -4,7 +4,7 @@ You may not use this file except in compliance with the License.
You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
////
-
+[[architecture]]
== Understanding StormCrawler's Architecture
=== Architecture Overview
diff --git a/docs/src/main/asciidoc/configuration.adoc b/docs/src/main/asciidoc/configuration.adoc
index 02da291c..1c3a50c0 100644
--- a/docs/src/main/asciidoc/configuration.adoc
+++ b/docs/src/main/asciidoc/configuration.adoc
@@ -53,7 +53,7 @@ This is what the configuration `http.robots.agents` allows you to do. It is a co
=== Proxy
-StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal pr [...]
+StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal pr [...]
The **ProxyManager** interface can be implemented in a custom class to create
custom logic for proxy management and load balancing. The default
**ProxyManager** implementation is **SingleProxyManager**. This ensures
backwards compatibility for prior StormCrawler releases. To use
**MultiProxyManager** or custom implementations, pass the class path and name
via the config parameter `http.proxy.manager`:
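The `http.proxy.manager` parameter described above can be illustrated with a minimal configuration sketch. Note that the fully-qualified class name below is an assumption inferred from the `org.apache.stormcrawler.proxy` package shown in the links, not something stated in this commit; check the actual class name in your StormCrawler release.

```yaml
# crawler-conf.yaml (sketch): select a ProxyManager implementation.
# The FQCN below is an assumption based on the org.apache.stormcrawler.proxy
# package; omit this key to keep the default SingleProxyManager.
http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"
```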
diff --git a/docs/src/main/asciidoc/internals.adoc b/docs/src/main/asciidoc/internals.adoc
index 7f552ded..6f517adc 100644
--- a/docs/src/main/asciidoc/internals.adoc
+++ b/docs/src/main/asciidoc/internals.adoc
@@ -10,9 +10,9 @@ https://www.apache.org/licenses/LICENSE-2.0
The Apache StormCrawler components rely on two Apache Storm streams: the
_default_ one and another one called _status_.
-The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., ElasticSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.
+The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., OpenSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.
-This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be ElasticSearch, HBase, etc...), regardless of whether the crawler is recursive or not.
+This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be OpenSearch, HBase, etc...), regardless of whether the crawler is recursive or not.
Tuples are emitted on the _status_ stream by the parsing bolts for handling
outlinks but also to notify that there has been a problem with a URL (e.g.,
unparsable content). It is also used by the fetching bolts to handle
redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400).
@@ -29,7 +29,7 @@ As you can see for instance in link:https://github.com/apache/stormcrawler/blob/
The
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status]
enum has the following values:
-* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs can be already known in the storage.
+* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the spouts or "injected" into the storage. The URLs can be already known in the storage.
* REDIRECTION:: set by the fetcher bolts.
* FETCH_ERROR:: set by the fetcher bolts.
* ERROR:: used by either the fetcher, parser, or indexer bolts.
@@ -41,7 +41,7 @@ The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org
The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] can be extended to handle status updates for a specific backend. It has an internal cache of URLs with a `discovered` status so that they don't get added to the backend if they already exist, which is a simple but efficient optimisation. It also uses link:https://github.com/apache/stormcrawler/blob/main/core/src/m [...]
-In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for Elasticsearch.
+In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for OpenSearch.
=== Bolts
@@ -71,14 +71,14 @@ The **FetcherBolt** has an internal set of queues where the incoming URLs are pl
Incoming tuples spend very little time in the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute] method of the **FetcherBolt** as they are put in the FetchQueues, which is why you'll find that the value of **Execute latency** in the Storm UI is pretty low. They get acked later on, after they've been fetched. The metric to watch for in the Storm UI is **Process latency**.
-The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https:/
+The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https://github.com/apache/st [...]
=== Indexer Bolts
The purpose of crawlers is often to index web pages to make them searchable.
The project contains resources for indexing with popular search solutions such
as:
-* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
-* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch]
-* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/com/digitalpebble/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
+* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java[OpenSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/org/apache/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]
All of these extend the class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java[AbstractIndexerBolt].
@@ -104,7 +104,7 @@ You can easily build your own custom indexer to integrate with other storage sys
=== Parser Bolts
==== JSoupParserBolt
-The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/src/main/java/com/digitalpebble/storm/crawler/tika[Tika-based ParserBolt] from the external modules.
+The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/tika/src/main/java/org/apache/stormcrawler/tika[Tika-based ParserBolt] from the external modules.
This parser calls the xref:urlfilters[URLFilters] and
xref:parsefilters[ParseFilters] defined in the configuration. Please note that
it calls xref:metadatatransfer[MetadataTransfer] prior to calling the
xref:parsefilters[ParseFilters]. If you create new Outlinks in your
[[ParseFilters]], you'll need to make sure that you use MetadataTransfer there
to inherit the Metadata from the parent document.
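The `store(String URL, Status status, Metadata metadata, Date nextFetch)` contract described in the internals section above can be sketched as follows. This is an illustrative skeleton only: the `Status` enum and `Metadata` class are minimal stand-ins so the sketch compiles without the StormCrawler jars, and the in-memory map is a hypothetical backend; a real implementation would extend `org.apache.stormcrawler.persistence.AbstractStatusUpdaterBolt` and write to a store such as OpenSearch or HBase.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.stormcrawler.persistence.Status (values from the docs text).
enum Status { DISCOVERED, FETCHED, REDIRECTION, FETCH_ERROR, ERROR }

// Minimal stand-in for org.apache.stormcrawler.Metadata.
class Metadata {
    private final Map<String, String> kv = new HashMap<>();
    void set(String key, String value) { kv.put(key, value); }
    String get(String key) { return kv.get(key); }
}

// Sketch of a status updater; a real one extends AbstractStatusUpdaterBolt,
// which also caches DISCOVERED URLs so known ones are not re-added.
public class InMemoryStatusUpdaterBolt {
    final Map<String, Status> backend = new HashMap<>();

    // Mirrors the method named in the docs:
    // store(String URL, Status status, Metadata metadata, Date nextFetch)
    public void store(String url, Status status, Metadata metadata, Date nextFetch) {
        backend.put(url, status);
    }

    public static void main(String[] args) {
        InMemoryStatusUpdaterBolt bolt = new InMemoryStatusUpdaterBolt();
        Metadata md = new Metadata();
        md.set("depth", "1");
        bolt.store("https://example.com/", Status.DISCOVERED, md, new Date());
        System.out.println(bolt.backend.get("https://example.com/")); // DISCOVERED
    }
}
```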
diff --git a/docs/src/main/asciidoc/quick-start.adoc b/docs/src/main/asciidoc/quick-start.adoc
index b3f6f89b..faaef3a8 100644
--- a/docs/src/main/asciidoc/quick-start.adoc
+++ b/docs/src/main/asciidoc/quick-start.adoc
@@ -63,7 +63,7 @@ The archetype will generate a fully-structured project including:
After generation, navigate into the newly created directory (named after the
`artifactId` you specified).
-TIP: You can learn more about the architecture and how each component works together if you look into link:architecture.adoc[the architecture documentation].
+TIP: You can learn more about the architecture and how each component works together if you look into xref:architecture[the architecture documentation].
By exploring that part of the documentation, you can gain a better
understanding of how StormCrawler performs crawling and how bolts, spouts, as
well as parse and URL filters, collaborate in the process.
==== Docker Compose Setup