This is an automated email from the ASF dual-hosted git repository.
rzo1 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-stormcrawler-site.git
The following commit(s) were added to refs/heads/main by this push:
new f3f10cf Remove incubator references
f3f10cf is described below
commit f3f10cf4c38bd3ea2308c9f0a524a00033742f50
Author: Richard Zowalla <[email protected]>
AuthorDate: Thu May 22 20:53:46 2025 +0200
Remove incubator references
---
.asf.yaml | 2 +-
NOTICE | 4 ++--
README.md | 2 +-
_config-local.yml | 2 +-
_config.yml | 4 ++--
_includes/footer.html | 6 +-----
_includes/header.html | 2 +-
contribute/index.html | 4 ++--
download/3.0/migration-guide.html | 2 +-
download/index.html | 2 +-
faq/index.html | 14 +++++++-------
getting-started/index.html | 6 +++---
index.html | 8 ++++----
support/index.html | 6 +++---
14 files changed, 30 insertions(+), 34 deletions(-)
diff --git a/.asf.yaml b/.asf.yaml
index bc023f2..5f0b48f 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -9,7 +9,7 @@ publish:
whoami: asf-site
github:
- description: "Source for the Apache StormCrawler (Incubating) web site"
+ description: "Source for the Apache StormCrawler web site"
homepage: https://stormcrawler.apache.org/
features:
# Enable wiki for documentation
diff --git a/NOTICE b/NOTICE
index 5735e11..63f8332 100644
--- a/NOTICE
+++ b/NOTICE
@@ -1,5 +1,5 @@
-Apache StormCrawler (Incubating)
-Copyright 2024 The Apache Software Foundation
+Apache StormCrawler
+Copyright 2025 The Apache Software Foundation
 This product includes software developed by The Apache Software Foundation (http://www.apache.org/).
diff --git a/README.md b/README.md
index 36476cd..a7c4d6b 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Apache StormCrawler (Incubating) Website
+# Apache StormCrawler Website
## How to build?
diff --git a/_config-local.yml b/_config-local.yml
index 0a011ee..4b53b56 100644
--- a/_config-local.yml
+++ b/_config-local.yml
@@ -9,7 +9,7 @@
title: Apache StormCrawler
email: [email protected]
description: > # this means to ignore newlines until "baseurl:"
- Apache StormCrawler (Incubating) is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+ Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
baseurl: "" # the subpath of your site, e.g. /blog
url: "http://localhost:4000" # the base hostname & protocol for your site
diff --git a/_config.yml b/_config.yml
index 39c06e6..cc84c9b 100644
--- a/_config.yml
+++ b/_config.yml
@@ -6,10 +6,10 @@
# 'jekyll serve'. If you change this file, please restart the server process.
# Site settings
-title: Apache StormCrawler (Incubating)
+title: Apache StormCrawler
email: [email protected]
description: > # this means to ignore newlines until "baseurl:"
- Apache StormCrawler (Incubating) is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+ Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
baseurl: "" # the subpath of your site, e.g. /blog
url: "https://stormcrawler.apache.org" # the base hostname & protocol for your site
diff --git a/_includes/footer.html b/_includes/footer.html
index 6218a1e..8bd1a36 100644
--- a/_includes/footer.html
+++ b/_includes/footer.html
@@ -1,9 +1,5 @@
<footer class="site-footer">
- <img src="/img/incubator_feather_egg_logo_bw_crop.png" alt="Apache Incubator Logo" width="500"><br/>
-
- Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the p [...]
-<br/> <br/>
- © 2024 <a href="https://www.apache.org/">The Apache Software Foundation</a><br/><br/>
+© 2024 <a href="https://www.apache.org/">The Apache Software Foundation</a><br/><br/>
 Licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. <br/> Apache StormCrawler, StormCrawler, the Apache feather logo are trademarks of The Apache Software Foundation. <br/> All other marks mentioned may be trademarks or registered trademarks of their respective owners. <br/><br/>
 <a href="https://privacy.apache.org/policies/privacy-policy-public.html">Privacy Policy</a> | <a href="https://www.apache.org/security/">Security</a> | <a href="https://www.apache.org/foundation/sponsorship">Sponsorship</a> | <a href="https://www.apache.org/foundation/sponsors">Sponsors</a><br/><br/>
<div class="footer-widget">
diff --git a/_includes/header.html b/_includes/header.html
index 1e9e0f9..8c683c0 100644
--- a/_includes/header.html
+++ b/_includes/header.html
@@ -1,7 +1,7 @@
<header class="site-header">
<div class="site-header__wrap">
<div class="site-header__logo">
- <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/incubator_logo.png" alt="Apache StormCrawler (Incubating)"></a>
+ <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/logo.png" alt="Apache StormCrawler"></a>
</div>
</div>
</header>
diff --git a/contribute/index.html b/contribute/index.html
index 130c767..4259b8d 100644
--- a/contribute/index.html
+++ b/contribute/index.html
@@ -1,13 +1,13 @@
---
layout: default
slug: contribute
-title: How to contribute to Apache StormCrawler (Incubating)
+title: How to contribute to Apache StormCrawler
---
<div class="row row-col">
<h1>How to Contribute</h1>
- <h2 id="the-apache-stormcrawler-community">The Apache StormCrawler (Incubating) Community</h2>
+ <h2 id="the-apache-stormcrawler-community">The Apache StormCrawler Community</h2>
 <p>If you have questions about the contribution process or want to discuss specific issues, please interact with the community using the following resources.</p>
<ul>
diff --git a/download/3.0/migration-guide.html b/download/3.0/migration-guide.html
index ea83134..45f9ff7 100644
--- a/download/3.0/migration-guide.html
+++ b/download/3.0/migration-guide.html
@@ -4,7 +4,7 @@ slug: migration-guide
title: Migration Guide
---
<div class="row row-col">
-<h1>Apache StormCrawler (Incubating) Migration Guide</h1>
+<h1>Apache StormCrawler Migration Guide</h1>
<h2>Introduction</h2>
 <p>This guide provides step-by-step instructions for migrating your project from older versions of StormCrawler to the new version under the Apache umbrella. Key changes include updates to the group and artifact IDs, as well as the removal of the Elasticsearch module.</p>
diff --git a/download/index.html b/download/index.html
index d7af197..9b519b6 100644
--- a/download/index.html
+++ b/download/index.html
@@ -25,7 +25,7 @@ title: Download
 stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</p>
<h2>Downloads</h2>
- <h3>Apache StormCrawler (Incubating) 3.3.0</h3>
+ <h3>Apache StormCrawler 3.3.0</h3>
<br/>
<ul>
<li><a
href="https://github.com/apache/incubator-stormcrawler/releases/tag/stormcrawler-3.3.0">Release
Notes</a></li>
diff --git a/faq/index.html b/faq/index.html
index 5e5c9bd..8d08919 100644
--- a/faq/index.html
+++ b/faq/index.html
@@ -10,7 +10,7 @@ slug: faq
 <p>A: Probably worth having a look at <a href="http://storm.apache.org/">Apache Storm® first. The <a href="http://storm.apache.org/releases/current/Tutorial.html">tutorial</a> and <a href="http://storm.apache.org/documentation/Concepts.html">concept</a> pages are good starting points.</p>
- <p><strong>Q: Do I need an Apache Storm® cluster to run Apache StormCrawler (Incubating)?</strong></p>
+ <p><strong>Q: Do I need an Apache Storm® cluster to run Apache StormCrawler?</strong></p>
 <p>A: No. It can run in local mode and will just use the Storm libraries as dependencies. It makes sense to install Storm in pseudo-distributed mode though so that you can use its UI to monitor the topologies.</p>
@@ -18,7 +18,7 @@ slug: faq
 <p>A: Apache Storm® is an elegant framework, with simple concepts, which provides a solid platform for distributed stream processing. It gives us fault tolerance and guaranteed data processing out of the box. The project is also very dynamic and backed by a thriving community. Last but not least it is under ASF 2.0 license.</p>
- <p id="howfast"><strong>Q: How fast is Apache StormCrawler (Incubating)?</strong></p>
+ <p id="howfast"><strong>Q: How fast is Apache StormCrawler?</strong></p>
 <p>A: This depends mainly on the diversity of hostnames as well as your politeness settings. For instance, if you have 1M URLs from the same host and have set a delay of 1 sec between request then the best you'll be able to do is 86400 pages per day. In practice this would be less than that as the time needed for fetching the content (which itself depends on your network and how large the documents are), parsing and indexing it etc... This is true of any crawler, not just StormCrawler.</p>
@@ -27,16 +27,16 @@ slug: faq
 <p>A: This <a href="http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html">tutorial</a> on using Apache Nutch® and SC for indexing with Cloudsearch give you some idea of how they compare in their methodology and performance. We also ran a comparative <a href="http://digitalpebble.blogspot.co.uk/2017/01/the-battle-of-crawlers-apache-nutch-vs.html">benchmark</a> on a larger crawl.</p>
 <p>In a nutshell (pardon the pun), Nutch proceeds by batch steps where it selects the URLs to fetch, fetches them, parses them then update it database with the new info about the URLs it just processed and adds the newly discovered URLs. The generate and update steps take longer and longer as the crawl grows and the resources are used unevenly : when fetching there is little CPU or disk used whereas when doing all the other activities, you are not fetching anything at all, which is a w [...]
- <p>Apache StormCrawler (Incubating) proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accomodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, Apache StormCrawler (Incubating) is more modern, easier to understand and build, ni [...]
- <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that Apache StormCrawler (Incubating) cannot currently do out of the box like deduplicating or advanced scoring like PageRank.</p>
+ <p>Apache StormCrawler proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accomodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, Apache StormCrawler is more modern, easier to understand and build, nicer to use, more versatile [...]
+ <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that Apache StormCrawler cannot currently do out of the box like deduplicating or advanced scoring like PageRank.</p>
 <p><strong>Q: Do I need some sort of external storage? And if so, then what?</strong></p>
 <p>A: Yes, you'll need to store the URLs to fetch somewhere. The type of the storage to use depends on the nature of your crawl. If your crawl is not recursive i.e. you just want to process specific pages and/or won't discover new pages through more than one path, then you could use messaging queues like <a href="https://www.rabbitmq.com/">RabbitMQ</a>, <a href="https://aws.amazon.com/sqs/">AWS SQS</a> or <a href="http://kafka.apache.org">Apache Kafka®</a>. All you'll need is a Spout i [...]
- <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. Apache StormCrawler (Incubating) has several resources you can use in the <a href="https://github.com/apache/incubator-stormcrawler/tree/mas [...]
- <p>The advantage of using Apache StormCrawler (Incubating) is that is it both modular and flexible. You can plug it to pretty much any storage you want.</p>
+ <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. Apache StormCrawler has several resources you can use in the <a href="https://github.com/apache/incubator-stormcrawler/tree/master/external" [...]
+ <p>The advantage of using Apache StormCrawler is that is it both modular and flexible. You can plug it to pretty much any storage you want.</p>
- <p><strong>Q: Is Apache StormCrawler (Incubating) polite?</strong></p>
+ <p><strong>Q: Is Apache StormCrawler polite?</strong></p>
 <p>A: The <a href="http://www.robotstxt.org/">robots.txt</a> protocol is supported and the fetchers are configured to have a <a href="https://github.com/apache/incubator-stormcrawler/blob/master/core/src/main/resources/crawler-default.yaml#L6">delay</a> between calls to the same hostname or domain. However like with every tool, it is down to how people use it.</p>
<p><strong>Q: When do I know when a crawl is finished?</strong></p>
diff --git a/getting-started/index.html b/getting-started/index.html
index 5d286dd..1bda4cf 100644
--- a/getting-started/index.html
+++ b/getting-started/index.html
@@ -1,7 +1,7 @@
---
layout: default
slug: getting-started
-title: Getting started with Apache StormCrawler (Incubating)
+title: Getting started with Apache StormCrawler
---
<div class="row row-col">
@@ -10,7 +10,7 @@ title: Getting started with Apache StormCrawler (Incubating)
 <p>NOTE: These instructions assume that you have <a href="https://maven.apache.org/install.html">Apache Maven®</a> installed. You will also need to install <a href="https://storm.apache.org/">Apache Storm® 2.8.0</a> to run the crawler.</p>
- <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new Apache StormCrawler (Incubating) project using:</p>
+ <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new Apache StormCrawler project using:</p>
 <p><i>mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.3.0</i></p>
@@ -24,7 +24,7 @@ title: Getting started with Apache StormCrawler (Incubating)
 <p>What this CrawlTopology does is very simple : it gets URLs to crawl from a <a href="https://urlfrontier.net">URLFrontier</a> instance and emits them on the topology. These URLs are then partitioned by hostname to enfore the politeness and then fetched. The next bolt (SiteMapParserBolt) checks whether they are sitemap files and if not passes them on to a HTML parser. The parser extracts the text from the document and passes it to a dummy indexer which simply prints a representation of [...]
- <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how Apache StormCrawler (Incubating) works. In reality, you'd use a different spout and index the documents to a proper backend. Look at the <a href="https://github.com/apache/incubator-stormcrawler/blob/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>
+ <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how Apache StormCrawler works. In reality, you'd use a different spout and index the documents to a proper backend. Look at the <a href="https://github.com/apache/incubator-stormcrawler/blob/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>
<p>You can run the topology in local mode with :</p>
diff --git a/index.html b/index.html
index ba52d24..9747555 100644
--- a/index.html
+++ b/index.html
@@ -8,8 +8,8 @@ slug: home
</div>
</div>
<div class="row row-col">
- <p><strong>Apache StormCrawler (Incubating)</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.</p>
- <p>The aim of Apache StormCrawler (Incubating) is to help build web crawlers that are :</p>
+ <p><strong>Apache StormCrawler</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.</p>
+ <p>The aim of Apache StormCrawler is to help build web crawlers that are :</p>
<ul>
<li>scalable</li>
<li>resilient</li>
@@ -17,9 +17,9 @@ slug: home
<li>easy to extend</li>
<li>polite yet efficient</li>
</ul>
- <p><strong>Apache StormCrawler (Incubating)</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward! Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
+ <p><strong>Apache StormCrawler</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward! Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
 <p>Apart from the core components, we provide some <a href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external resources</a> that you can reuse in your project, like for instance our spout and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various document formats.</p>
- <p><strong>Apache StormCrawler (Incubating)</strong> is perfectly suited to use cases where the URL to fetch and parse come as streams but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
+ <p><strong>Apache StormCrawler</strong> is perfectly suited to use cases where the URL to fetch and parse come as streams but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
 <p>The <a href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a> page contains links to some recent presentations made about this project.</p>
</div>
diff --git a/support/index.html b/support/index.html
index 8c5723d..c4fa8c8 100644
--- a/support/index.html
+++ b/support/index.html
@@ -7,16 +7,16 @@ title: Need assistance from web crawling experts?
<div class="row row-col">
<h1>Support</h1>
<br>
-<p>You can ask questions related to Apache StormCrawler (Incubating) on Github in the <a href="https://github.com/apache/incubator-stormcrawler/discussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">stackoverflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
+<p>You can ask questions related to Apache StormCrawler on Github in the <a href="https://github.com/apache/incubator-stormcrawler/discussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">stackoverflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
 <p>If you think you've found a bug, please <a href="https://github.com/apache/incubator-stormcrawler/issues">open an issue</a> on GitHub.</p>
<h1>Commercial Support</h1>
<br>
- <p>The Apache StormCrawler (Incubating) PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>
+ <p>The Apache StormCrawler PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>
<h2>Want to be added to this page? </h2>
 <p>All submitted information must be factual and informational in nature and not be a marketing statement. Statements that promote your products and services over other offerings on the page will not be tolerated and will be removed. Such marketing statements can be added to your own pages on your own site.</p>
- <p>When in doubt, email the Apache StormCrawler (Incubating) PMC and ask. We are be happy to help.</p>
+ <p>When in doubt, email the Apache StormCrawler PMC and ask. We are be happy to help.</p>
<h2>Companies</h2>
<ul>