This is an automated email from the ASF dual-hosted git repository.
rzo1 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-stormcrawler-site.git
The following commit(s) were added to refs/heads/main by this push:
new f3f10cf Remove incubator references
f3f10cf is described below
commit f3f10cf4c38bd3ea2308c9f0a524a00033742f50
Author: Richard Zowalla <[email protected]>
AuthorDate: Thu May 22 20:53:46 2025 +0200
Remove incubator references
---
.asf.yaml | 2 +-
NOTICE | 4 ++--
README.md | 2 +-
_config-local.yml | 2 +-
_config.yml | 4 ++--
_includes/footer.html | 6 +-----
_includes/header.html | 2 +-
contribute/index.html | 4 ++--
download/3.0/migration-guide.html | 2 +-
download/index.html | 2 +-
faq/index.html | 14 +++++++-------
getting-started/index.html | 6 +++---
index.html | 8 ++++----
support/index.html | 6 +++---
14 files changed, 30 insertions(+), 34 deletions(-)
diff --git a/.asf.yaml b/.asf.yaml
index bc023f2..5f0b48f 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -9,7 +9,7 @@ publish:
whoami: asf-site
github:
- description: "Source for the Apache StormCrawler (Incubating) web site"
+ description: "Source for the Apache StormCrawler web site"
homepage: https://stormcrawler.apache.org/
features:
# Enable wiki for documentation
diff --git a/NOTICE b/NOTICE
index 5735e11..63f8332 100644
--- a/NOTICE
+++ b/NOTICE
@@ -1,5 +1,5 @@
-Apache StormCrawler (Incubating)
-Copyright 2024 The Apache Software Foundation
+Apache StormCrawler
+Copyright 2025 The Apache Software Foundation
 This product includes software developed by The Apache Software Foundation (http://www.apache.org/).
diff --git a/README.md b/README.md
index 36476cd..a7c4d6b 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Apache StormCrawler (Incubating) Website
+# Apache StormCrawler Website
## How to build?
diff --git a/_config-local.yml b/_config-local.yml
index 0a011ee..4b53b56 100644
--- a/_config-local.yml
+++ b/_config-local.yml
@@ -9,7 +9,7 @@
title: Apache StormCrawler
email: [email protected]
description: > # this means to ignore newlines until "baseurl:"
- Apache StormCrawler (Incubating) is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+ Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
baseurl: "" # the subpath of your site, e.g. /blog
url: "http://localhost:4000" # the base hostname & protocol for your site
diff --git a/_config.yml b/_config.yml
index 39c06e6..cc84c9b 100644
--- a/_config.yml
+++ b/_config.yml
@@ -6,10 +6,10 @@
# 'jekyll serve'. If you change this file, please restart the server process.
# Site settings
-title: Apache StormCrawler (Incubating)
+title: Apache StormCrawler
email: [email protected]
description: > # this means to ignore newlines until "baseurl:"
- Apache StormCrawler (Incubating) is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+ Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
baseurl: "" # the subpath of your site, e.g. /blog
url: "https://stormcrawler.apache.org" # the base hostname & protocol for your site
diff --git a/_includes/footer.html b/_includes/footer.html
index 6218a1e..8bd1a36 100644
--- a/_includes/footer.html
+++ b/_includes/footer.html
@@ -1,9 +1,5 @@
<footer class="site-footer">
- <img src="/img/incubator_feather_egg_logo_bw_crop.png" alt="Apache Incubator Logo" width="500"><br/>
-
- Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the p [...]
-<br/> <br/>
- © 2024 <a href="https://www.apache.org/">The Apache Software Foundation</a><br/><br/>
+© 2024 <a href="https://www.apache.org/">The Apache Software Foundation</a><br/><br/>
 Licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. <br/> Apache StormCrawler, StormCrawler, the Apache feather logo are trademarks of The Apache Software Foundation. <br/> All other marks mentioned may be trademarks or registered trademarks of their respective owners. <br/><br/>
 <a href="https://privacy.apache.org/policies/privacy-policy-public.html">Privacy Policy</a> | <a href="https://www.apache.org/security/">Security</a> | <a href="https://www.apache.org/foundation/sponsorship">Sponsorship</a> | <a href="https://www.apache.org/foundation/sponsors">Sponsors</a><br/><br/>
<div class="footer-widget">
diff --git a/_includes/header.html b/_includes/header.html
index 1e9e0f9..8c683c0 100644
--- a/_includes/header.html
+++ b/_includes/header.html
@@ -1,7 +1,7 @@
<header class="site-header">
<div class="site-header__wrap">
<div class="site-header__logo">
- <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/incubator_logo.png" alt="Apache StormCrawler (Incubating)"></a>
+ <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/logo.png" alt="Apache StormCrawler"></a>
</div>
</div>
</header>
diff --git a/contribute/index.html b/contribute/index.html
index 130c767..4259b8d 100644
--- a/contribute/index.html
+++ b/contribute/index.html
@@ -1,13 +1,13 @@
---
layout: default
slug: contribute
-title: How to contribute to Apache StormCrawler (Incubating)
+title: How to contribute to Apache StormCrawler
---
<div class="row row-col">
<h1>How to Contribute</h1>
- <h2 id="the-apache-stormcrawler-community">The Apache StormCrawler (Incubating) Community</h2>
+ <h2 id="the-apache-stormcrawler-community">The Apache StormCrawler Community</h2>
 <p>If you have questions about the contribution process or want to discuss specific issues, please interact with the community using the following resources.</p>
<ul>
diff --git a/download/3.0/migration-guide.html b/download/3.0/migration-guide.html
index ea83134..45f9ff7 100644
--- a/download/3.0/migration-guide.html
+++ b/download/3.0/migration-guide.html
@@ -4,7 +4,7 @@ slug: migration-guide
title: Migration Guide
---
<div class="row row-col">
-<h1>Apache StormCrawler (Incubating) Migration Guide</h1>
+<h1>Apache StormCrawler Migration Guide</h1>
<h2>Introduction</h2>
 <p>This guide provides step-by-step instructions for migrating your project from older versions of StormCrawler to the new version under the Apache umbrella. Key changes include updates to the group and artifact IDs, as well as the removal of the Elasticsearch module.</p>
diff --git a/download/index.html b/download/index.html
index d7af197..9b519b6 100644
--- a/download/index.html
+++ b/download/index.html
@@ -25,7 +25,7 @@ title: Download
 stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.</p>
<h2>Downloads</h2>
- <h3>Apache StormCrawler (Incubating) 3.3.0</h3>
+ <h3>Apache StormCrawler 3.3.0</h3>
<br/>
<ul>
<li><a
href="https://github.com/apache/incubator-stormcrawler/releases/tag/stormcrawler-3.3.0">Release
Notes</a></li>
diff --git a/faq/index.html b/faq/index.html
index 5e5c9bd..8d08919 100644
--- a/faq/index.html
+++ b/faq/index.html
@@ -10,7 +10,7 @@ slug: faq
 <p>A: Probably worth having a look at <a href="http://storm.apache.org/">Apache Storm® first. The <a href="http://storm.apache.org/releases/current/Tutorial.html">tutorial</a> and <a href="http://storm.apache.org/documentation/Concepts.html">concept</a> pages are good starting points.</p>
- <p><strong>Q: Do I need an Apache Storm® cluster to run Apache StormCrawler (Incubating)?</strong></p>
+ <p><strong>Q: Do I need an Apache Storm® cluster to run Apache StormCrawler?</strong></p>
 <p>A: No. It can run in local mode and will just use the Storm libraries as dependencies. It makes sense to install Storm in pseudo-distributed mode though so that you can use its UI to monitor the topologies.</p>
@@ -18,7 +18,7 @@ slug: faq
 <p>A: Apache Storm® is an elegant framework, with simple concepts, which provides a solid platform for distributed stream processing. It gives us fault tolerance and guaranteed data processing out of the box. The project is also very dynamic and backed by a thriving community. Last but not least it is under ASF 2.0 license.</p>
- <p id="howfast"><strong>Q: How fast is Apache StormCrawler (Incubating)?</strong></p>
+ <p id="howfast"><strong>Q: How fast is Apache StormCrawler?</strong></p>
 <p>A: This depends mainly on the diversity of hostnames as well as your politeness settings. For instance, if you have 1M URLs from the same host and have set a delay of 1 sec between request then the best you'll be able to do is 86400 pages per day. In practice this would be less than that as the time needed for fetching the content (which itself depends on your network and how large the documents are), parsing and indexing it etc... This is true of any crawler, not just StormCrawler.</p>
@@ -27,16 +27,16 @@ slug: faq
 <p>A: This <a href="http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html">tutorial</a> on using Apache Nutch® and SC for indexing with Cloudsearch give you some idea of how they compare in their methodology and performance. We also ran a comparative <a href="http://digitalpebble.blogspot.co.uk/2017/01/the-battle-of-crawlers-apache-nutch-vs.html">benchmark</a> on a larger crawl.</p>
 <p>In a nutshell (pardon the pun), Nutch proceeds by batch steps where it selects the URLs to fetch, fetches them, parses them then update it database with the new info about the URLs it just processed and adds the newly discovered URLs. The generate and update steps take longer and longer as the crawl grows and the resources are used unevenly : when fetching there is little CPU or disk used whereas when doing all the other activities, you are not fetching anything at all, which is a w [...]
- <p>Apache StormCrawler (Incubating) proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accomodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, Apache StormCrawler (Incubating) is more modern, easier to understand and build, ni [...]
- <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that Apache StormCrawler (Incubating) cannot currently do out of the box like deduplicating or advanced scoring like PageRank.</p>
+ <p>Apache StormCrawler proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accomodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, Apache StormCrawler is more modern, easier to understand and build, nicer to use, more versatile [...]
+ <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that Apache StormCrawler cannot currently do out of the box like deduplicating or advanced scoring like PageRank.</p>
 <p><strong>Q: Do I need some sort of external storage? And if so, then what?</strong></p>
 <p>A: Yes, you'll need to store the URLs to fetch somewhere. The type of the storage to use depends on the nature of your crawl. If your crawl is not recursive i.e. you just want to process specific pages and/or won't discover new pages through more than one path, then you could use messaging queues like <a href="https://www.rabbitmq.com/">RabbitMQ</a>, <a href="https://aws.amazon.com/sqs/">AWS SQS</a> or <a href="http://kafka.apache.org">Apache Kafka®</a>. All you'll need is a Spout i [...]
- <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. Apache StormCrawler (Incubating) has several resources you can use in the <a href="https://github.com/apache/incubator-stormcrawler/tree/mas [...]
- <p>The advantage of using Apache StormCrawler (Incubating) is that is it both modular and flexible. You can plug it to pretty much any storage you want.</p>
+ <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. Apache StormCrawler has several resources you can use in the <a href="https://github.com/apache/incubator-stormcrawler/tree/master/external" [...]
+ <p>The advantage of using Apache StormCrawler is that is it both modular and flexible. You can plug it to pretty much any storage you want.</p>
- <p><strong>Q: Is Apache StormCrawler (Incubating) polite?</strong></p>
+ <p><strong>Q: Is Apache StormCrawler polite?</strong></p>
 <p>A: The <a href="http://www.robotstxt.org/">robots.txt</a> protocol is supported and the fetchers are configured to have a <a href="https://github.com/apache/incubator-stormcrawler/blob/master/core/src/main/resources/crawler-default.yaml#L6">delay</a> between calls to the same hostname or domain. However like with every tool, it is down to how people use it.</p>
<p><strong>Q: When do I know when a crawl is finished?</strong></p>
diff --git a/getting-started/index.html b/getting-started/index.html
index 5d286dd..1bda4cf 100644
--- a/getting-started/index.html
+++ b/getting-started/index.html
@@ -1,7 +1,7 @@
---
layout: default
slug: getting-started
-title: Getting started with Apache StormCrawler (Incubating)
+title: Getting started with Apache StormCrawler
---
<div class="row row-col">
@@ -10,7 +10,7 @@ title: Getting started with Apache StormCrawler (Incubating)
 <p>NOTE: These instructions assume that you have <a href="https://maven.apache.org/install.html">Apache Maven®</a> installed. You will also need to install <a href="https://storm.apache.org/">Apache Storm® 2.8.0</a> to run the crawler.</p>
- <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new Apache StormCrawler (Incubating) project using:</p>
+ <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new Apache StormCrawler project using:</p>
 <p><i>mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.3.0</i></p>
@@ -24,7 +24,7 @@ title: Getting started with Apache StormCrawler (Incubating)
 <p>What this CrawlTopology does is very simple : it gets URLs to crawl from a <a href="https://urlfrontier.net">URLFrontier</a> instance and emits them on the topology. These URLs are then partitioned by hostname to enfore the politeness and then fetched. The next bolt (SiteMapParserBolt) checks whether they are sitemap files and if not passes them on to a HTML parser. The parser extracts the text from the document and passes it to a dummy indexer which simply prints a representation of [...]
- <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how Apache StormCrawler (Incubating) works. In reality, you'd use a different spout and index the documents to a proper backend. Look at the <a href="https://github.com/apache/incubator-stormcrawler/blob/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>
+ <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how Apache StormCrawler works. In reality, you'd use a different spout and index the documents to a proper backend. Look at the <a href="https://github.com/apache/incubator-stormcrawler/blob/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>
<p>You can run the topology in local mode with :</p>
diff --git a/index.html b/index.html
index ba52d24..9747555 100644
--- a/index.html
+++ b/index.html
@@ -8,8 +8,8 @@ slug: home
</div>
</div>
<div class="row row-col">
- <p><strong>Apache StormCrawler (Incubating)</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.</p>
- <p>The aim of Apache StormCrawler (Incubating) is to help build web crawlers that are :</p>
+ <p><strong>Apache StormCrawler</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.</p>
+ <p>The aim of Apache StormCrawler is to help build web crawlers that are :</p>
<ul>
<li>scalable</li>
<li>resilient</li>
@@ -17,9 +17,9 @@ slug: home
<li>easy to extend</li>
<li>polite yet efficient</li>
</ul>
- <p><strong>Apache StormCrawler (Incubating)</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward! Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
+ <p><strong>Apache StormCrawler</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward! Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
 <p>Apart from the core components, we provide some <a href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external resources</a> that you can reuse in your project, like for instance our spout and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various document formats.</p>
- <p><strong>Apache StormCrawler (Incubating)</strong> is perfectly suited to use cases where the URL to fetch and parse come as streams but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
+ <p><strong>Apache StormCrawler</strong> is perfectly suited to use cases where the URL to fetch and parse come as streams but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
 <p>The <a href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a> page contains links to some recent presentations made about this project.</p>
</div>
diff --git a/support/index.html b/support/index.html
index 8c5723d..c4fa8c8 100644
--- a/support/index.html
+++ b/support/index.html
@@ -7,16 +7,16 @@ title: Need assistance from web crawling experts?
<div class="row row-col">
<h1>Support</h1>
<br>
-<p>You can ask questions related to Apache StormCrawler (Incubating) on Github in the <a href="https://github.com/apache/incubator-stormcrawler/discussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">stackoverflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
+<p>You can ask questions related to Apache StormCrawler on Github in the <a href="https://github.com/apache/incubator-stormcrawler/discussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">stackoverflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
 <p>If you think you've found a bug, please <a href="https://github.com/apache/incubator-stormcrawler/issues">open an issue</a> on GitHub.</p>
<h1>Commercial Support</h1>
<br>
- <p>The Apache StormCrawler (Incubating) PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>
+ <p>The Apache StormCrawler PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>
<h2>Want to be added to this page? </h2>
 <p>All submitted information must be factual and informational in nature and not be a marketing statement. Statements that promote your products and services over other offerings on the page will not be tolerated and will be removed. Such marketing statements can be added to your own pages on your own site.</p>
- <p>When in doubt, email the Apache StormCrawler (Incubating) PMC and ask. We are be happy to help.</p>
+ <p>When in doubt, email the Apache StormCrawler PMC and ask. We are be happy to help.</p>
<h2>Companies</h2>
<ul>