This is an automated email from the ASF dual-hosted git repository.

rzo1 pushed a commit to branch 1763
in repository https://gitbox.apache.org/repos/asf/stormcrawler.git
commit b70d804e5e3139823c6e95840c2f1be2e96f9cb2
Author: Richard Zowalla <[email protected]>
AuthorDate: Fri Dec 26 10:38:14 2025 +0100

    Fix #1763 - Documentation fixes from #1714
---
 docs/src/main/asciidoc/architecture.adoc  |  2 +-
 docs/src/main/asciidoc/configuration.adoc |  2 +-
 docs/src/main/asciidoc/internals.adoc     | 18 +++++++++---------
 docs/src/main/asciidoc/quick-start.adoc   |  2 +-
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/docs/src/main/asciidoc/architecture.adoc b/docs/src/main/asciidoc/architecture.adoc
index b4ba45b0..bbacfae6 100644
--- a/docs/src/main/asciidoc/architecture.adoc
+++ b/docs/src/main/asciidoc/architecture.adoc
@@ -4,7 +4,7 @@
 You may not use this file except in compliance with the License.
 You may obtain a copy of the License at:
 https://www.apache.org/licenses/LICENSE-2.0
 ////
-
+[[architecture]]
 == Understanding StormCrawler's Architecture
 === Architecture Overview
diff --git a/docs/src/main/asciidoc/configuration.adoc b/docs/src/main/asciidoc/configuration.adoc
index 02da291c..1c3a50c0 100644
--- a/docs/src/main/asciidoc/configuration.adoc
+++ b/docs/src/main/asciidoc/configuration.adoc
@@ -53,7 +53,7 @@ This is what the configuration `http.robots.agents` allows you to do. It is a co
 
 === Proxy
 
-StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal pr [...]
+StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is formatted as a **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal pr [...]
 
 The **ProxyManager** interface can be implemented in a custom class to create custom logic for proxy management and load balancing. The default **ProxyManager** implementation is **SingleProxyManager**. This ensures backwards compatibility for prior StormCrawler releases. To use **MultiProxyManager** or custom implementations, pass the class path and name via the config parameter `http.proxy.manager`:
diff --git a/docs/src/main/asciidoc/internals.adoc b/docs/src/main/asciidoc/internals.adoc
index 7f552ded..6f517adc 100644
--- a/docs/src/main/asciidoc/internals.adoc
+++ b/docs/src/main/asciidoc/internals.adoc
@@ -10,9 +10,9 @@ https://www.apache.org/licenses/LICENSE-2.0
 
 The Apache StormCrawler components rely on two Apache Storm streams: the _default_ one and another one called _status_.
 
-The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., ElasticSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.
+The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., OpenSearch, HBase, etc...), which is then used by a Spout to send new URLs down the topology.
-This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be ElasticSearch, HBase, etc...), regardless of whether the crawler is recursive or not.
+This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be OpenSearch, HBase, etc...), regardless of whether the crawler is recursive or not.
 
 Tuples are emitted on the _status_ stream by the parsing bolts for handling outlinks but also to notify that there has been a problem with a URL (e.g., unparsable content). It is also used by the fetching bolts to handle redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400).
@@ -29,7 +29,7 @@ As you can see for instance in link:https://github.com/apache/stormcrawler/blob/
 
 The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status] enum has the following values:
 
-* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs can be already known in the storage.
+* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the spouts or "injected" into the storage. The URLs can be already known in the storage.
 * REDIRECTION:: set by the fetcher bolts.
 * FETCH_ERROR:: set by the fetcher bolts.
 * ERROR:: used by either the fetcher, parser, or indexer bolts.
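[Editor's aside, between hunks: the status-persistence pattern described in the internals.adoc changes above — a bolt that stores URL statuses and skips re-writing already-known DISCOVERED URLs — can be sketched as a minimal, self-contained illustration. The enum values come from the documented `Status` enum; the class and method shapes below are hypothetical simplifications for illustration, not the real StormCrawler API.]

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Statuses named in the documentation's Status enum (subset).
enum CrawlStatus { DISCOVERED, FETCHED, REDIRECTION, FETCH_ERROR, ERROR }

// Hypothetical in-memory stand-in for a status-updater backend, showing the
// "cache of discovered URLs" optimisation described in internals.adoc.
class InMemoryStatusUpdater {
    private final Set<String> discoveredCache = new HashSet<>();
    private final Map<String, CrawlStatus> backend = new HashMap<>();

    /** Simplified analogue of the documented store(...) contract. */
    void store(String url, CrawlStatus status) {
        if (status == CrawlStatus.DISCOVERED) {
            // Skip URLs already sent to the backend as DISCOVERED.
            if (!discoveredCache.add(url)) {
                return;
            }
            // Do not downgrade a URL the backend already tracks.
            backend.putIfAbsent(url, status);
        } else {
            // Fetch outcomes (FETCHED, REDIRECTION, errors) always overwrite.
            backend.put(url, status);
        }
    }

    CrawlStatus statusOf(String url) {
        return backend.get(url);
    }
}
```

The asymmetry is the point of the design: DISCOVERED tuples arrive in large volumes from outlink parsing, so deduplicating them cheaply in memory avoids redundant writes, while genuine status transitions always reach the backend.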
@@ -41,7 +41,7 @@ The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org
 
 The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] can be extended to handle status updates for a specific backend. It has an internal cache of URLs with a `discovered` status so that they don't get added to the backend if they already exist, which is a simple but efficient optimisation. It also uses link:https://github.com/apache/stormcrawler/blob/main/core/src/m [...]
 
-In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for Elasticsearch.
+In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for OpenSearch.
 
 === Bolts
@@ -71,14 +71,14 @@ The **FetcherBolt** has an internal set of queues where the incoming URLs are pl
 
 Incoming tuples spend very little time in the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute] method of the **FetcherBolt** as they are put in the FetchQueues, which is why you'll find that the value of **Execute latency** in the Storm UI is pretty low.
 They get acked later on, after they've been fetched. The metric to watch for in the Storm UI is **Process latency**.
 
-The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https:/
+The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until it is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https://github.com/apache/st [...]
 
 === Indexer Bolts
 
 The purpose of crawlers is often to index web pages to make them searchable.
 The project contains resources for indexing with popular search solutions such as:
 
-* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
-* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch]
-* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/com/digitalpebble/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR]
+* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java[OpenSearch]
+* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/org/apache/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch]
 
 All of these extend the class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java[AbstractIndexerBolt].
@@ -104,7 +104,7 @@ You can easily build your own custom indexer to integrate with other storage sys
 
 === Parser Bolts
 
 ==== JSoupParserBolt
 
-The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/src/main/java/com/digitalpebble/storm/crawler/tika[Tika-based ParserBolt] from the external modules.
+The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata it contains. If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/tika/src/main/java/org/apache/stormcrawler/tika[Tika-based ParserBolt] from the external modules.
 
 This parser calls the xref:urlfilters[URLFilters] and xref:parsefilters[ParseFilters] defined in the configuration. Please note that it calls xref:metadatatransfer[MetadataTransfer] prior to calling the xref:parsefilters[ParseFilters]. If you create new Outlinks in your [[ParseFilters]], you'll need to make sure that you use MetadataTransfer there to inherit the Metadata from the parent document.
diff --git a/docs/src/main/asciidoc/quick-start.adoc b/docs/src/main/asciidoc/quick-start.adoc
index b3f6f89b..faaef3a8 100644
--- a/docs/src/main/asciidoc/quick-start.adoc
+++ b/docs/src/main/asciidoc/quick-start.adoc
@@ -63,7 +63,7 @@ The archetype will generate a fully-structured project including:
 
 After generation, navigate into the newly created directory (named after the `artifactId` you specified).
 
-TIP: You can learn more about the architecture and how each component works together if you look into link:architecture.adoc[the architecture documentation].
+TIP: You can learn more about the architecture and how each component works together if you look into xref:architecture[the architecture documentation].
 
 By exploring that part of the documentation, you can gain a better understanding of how StormCrawler performs crawling and how bolts, spouts, as well as parse and URL filters, collaborate in the process.
 
 ==== Docker Compose Setup
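[Editor's aside, after the patch: the configuration.adoc hunk above ends with "pass the class path and name via the config parameter `http.proxy.manager`:" but the example it introduces is truncated in this email. A sketch of what such a setting looks like in a crawler's YAML configuration is given below. The fully-qualified class name is an assumption based on the `org.apache.stormcrawler.proxy` package named in the text; verify it against the release in use.]

```yaml
# crawler-conf.yaml (sketch, not the truncated original example):
# select a ProxyManager implementation instead of the default
# SingleProxyManager. Class name below is illustrative.
config:
  http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"
```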
