This is an automated email from the ASF dual-hosted git repository.

rzo1 pushed a commit to branch 5
in repository https://gitbox.apache.org/repos/asf/incubator-stormcrawler-site.git
commit 707b5e3b7e302a0a1e3edf38ec0697f6486bbd76
Author: Richard Zowalla <[email protected]>
AuthorDate: Mon Apr 22 11:14:51 2024 +0200

    Fixes #5

    Removes Google Analytics
    Removes GH badges (blocked by ASF policy anyway)
    Adds (Incubating) disclaimer to mentions of SC
---
 .asf.yaml                                  |  2 +-
 README.md                                  |  2 +-
 _config-local.yml                          |  2 +-
 _config.yml                                |  4 ++--
 _includes/footer.html                      | 13 +++++--------
 _includes/header.html                      |  2 +-
 _layouts/default.html                      | 10 ----------
 faq/index.html                             | 14 +++++++-------
 getting-started/index.html                 | 10 +++++-----
 img/incubator_feather_egg_logo_bw_crop.png | Bin 0 -> 56218 bytes
 index.html                                 |  4 ++--
 support/index.html                         |  8 ++++----
 12 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/.asf.yaml b/.asf.yaml
index 51d9104..470bba3 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -10,7 +10,7 @@ publish:
   whoami: asf-site
 github:
-  description: "Source for the Apache StormCrawler web site"
+  description: "Source for the Apache StormCrawler (Incubating) web site"
   homepage: https://stormcrawler.apache.org/
   features:
     # Enable wiki for documentation

diff --git a/README.md b/README.md
index 708f97e..d91a94b 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Apache StormCrawler Website
+# Apache StormCrawler (Incubating) Website

 ## How to build?

diff --git a/_config-local.yml b/_config-local.yml
index 4b53b56..0a011ee 100644
--- a/_config-local.yml
+++ b/_config-local.yml
@@ -9,7 +9,7 @@
 title: Apache StormCrawler
 email: [email protected]
 description: > # this means to ignore newlines until "baseurl:"
-  Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+  Apache StormCrawler (Incubating) is a collection of resources for building low-latency, scalable web crawlers on Apache Storm
 baseurl: "" # the subpath of your site, e.g.
/blog
 url: "http://localhost:4000" # the base hostname & protocol for your site

diff --git a/_config.yml b/_config.yml
index cc84c9b..39c06e6 100644
--- a/_config.yml
+++ b/_config.yml
@@ -6,10 +6,10 @@
 # 'jekyll serve'. If you change this file, please restart the server process.

 # Site settings
-title: Apache StormCrawler
+title: Apache StormCrawler (Incubating)
 email: [email protected]
 description: > # this means to ignore newlines until "baseurl:"
-  Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+  Apache StormCrawler (Incubating) is a collection of resources for building low-latency, scalable web crawlers on Apache Storm
 baseurl: "" # the subpath of your site, e.g. /blog
 url: "https://stormcrawler.apache.org" # the base hostname & protocol for your site

diff --git a/_includes/footer.html b/_includes/footer.html
index 99befa0..41e74ae 100644
--- a/_includes/footer.html
+++ b/_includes/footer.html
@@ -1,11 +1,8 @@
-<div class="github-info">
-  <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=incubator-stormcrawler&type=star&count=true" frameborder="0" scrolling="0" width="105px" height="20px"></iframe>
-  <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=incubator-stormcrawler&type=watch&count=true&v=2" frameborder="0" scrolling="0" width="110px" height="20px"></iframe>
-  <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=incubator-stormcrawler&type=fork&count=true" frameborder="0" scrolling="0" width="101px" height="20px"></iframe>
-</div>
-
 <footer class="site-footer">
-  © 2024 <a href="https://stormcrawler.apache.org/">The Apache Software Foundation</a>
-<p>Licensed under the Apache License, Version 2.0. Apache StormCrawler, StormCrawler, the Apache feather logo are trademarks of The Apache Software Foundation.
All other marks mentioned may be trademarks or registered trademarks of their respective owners.</p>
+  <img src="img/incubator_feather_egg_logo_bw_crop.png" alt="Apache Incubator Logo" width="500"><br/>
+  Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the p [...]
+<br/> <br/>
+  © 2024 <a href="https://stormcrawler.apache.org/">The Apache Software Foundation</a><br/><br/>
+Licensed under the Apache License, Version 2.0. <br/> Apache StormCrawler, StormCrawler, and the Apache feather logo are trademarks of The Apache Software Foundation. <br/> All other marks mentioned may be trademarks or registered trademarks of their respective owners.
 </footer>

diff --git a/_includes/header.html b/_includes/header.html
index 1878c54..e538166 100644
--- a/_includes/header.html
+++ b/_includes/header.html
@@ -1,7 +1,7 @@
 <header class="site-header">
   <div class="site-header__wrap">
     <div class="site-header__logo">
-      <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/logo.png" alt="Apache StormCrawler"></a>
+      <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/logo.png" alt="Apache StormCrawler (Incubating)"></a>
     </div>
   </div>
 </header>

diff --git a/_layouts/default.html b/_layouts/default.html
index 1d5fcdb..82b0e91 100644
--- a/_layouts/default.html
+++ b/_layouts/default.html
@@ -13,16 +13,6 @@
     {% include footer.html %}

-    <script>
-      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
-      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
-      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
-      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
-
-      ga('create', 'UA-71137732-1', 'auto');
-      ga('send', 'pageview');
-    </script>
-
   </body>
 </html>

diff --git a/faq/index.html b/faq/index.html
index fd9d072..77fd414 100644
--- a/faq/index.html
+++ b/faq/index.html
@@ -10,7 +10,7 @@ slug: faq
   <p>A: Probably worth having a look at <a href="http://storm.apache.org/">Apache Storm®</a> first. The <a href="http://storm.apache.org/releases/current/Tutorial.html">tutorial</a> and <a href="http://storm.apache.org/documentation/Concepts.html">concept</a> pages are good starting points.</p>

-  <p><strong>Q: Do I need an Apache Storm® cluster to run StormCrawler?</strong></p>
+  <p><strong>Q: Do I need an Apache Storm® cluster to run Apache StormCrawler (Incubating)?</strong></p>

   <p>A: No. It can run in local mode and will just use the Storm libraries as dependencies.
It makes sense to install Storm in pseudo-distributed mode though, so that you can use its UI to monitor the topologies.</p>

@@ -18,7 +18,7 @@ slug: faq
   <p>A: Apache Storm® is an elegant framework, with simple concepts, which provides a solid platform for distributed stream processing. It gives us fault tolerance and guaranteed data processing out of the box. The project is also very dynamic and backed by a thriving community. Last but not least, it is under the ASF 2.0 license.</p>

-  <p id="howfast"><strong>Q: How fast is StormCrawler?</strong></p>
+  <p id="howfast"><strong>Q: How fast is Apache StormCrawler (Incubating)?</strong></p>

   <p>A: This depends mainly on the diversity of hostnames as well as your politeness settings. For instance, if you have 1M URLs from the same host and have set a delay of 1 sec between requests, then the best you'll be able to do is 86400 pages per day. In practice it would be less than that, as time is also needed for fetching the content (which itself depends on your network and how large the documents are), parsing and indexing it, etc. This is true of any crawler, not just StormCrawler.</p>

@@ -27,16 +27,16 @@ slug: faq
   <p>A: This <a href="http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html">tutorial</a> on using Apache Nutch® and SC for indexing with CloudSearch gives you some idea of how they compare in their methodology and performance. We also ran a comparative <a href="http://digitalpebble.blogspot.co.uk/2017/01/the-battle-of-crawlers-apache-nutch-vs.html">benchmark</a> on a larger crawl.</p>

   <p>In a nutshell (pardon the pun), Nutch proceeds by batch steps where it selects the URLs to fetch, fetches them, parses them, then updates its database with the new info about the URLs it just processed and adds the newly discovered URLs.
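[Editor's aside: the back-of-the-envelope bound in the "how fast" answer above can be checked with a few lines. This is a hedged sketch, not StormCrawler code; the one-second delay and single-host setup come from the FAQ example.]

```python
# Upper bound on pages fetched per day from a single host, given a
# politeness delay between consecutive requests to that host.
# Ignores fetch/parse time, so real throughput will be lower.
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def max_pages_per_day(delay_seconds: float) -> int:
    """Best case pages/day when requests to one host are strictly serialised."""
    return int(SECONDS_PER_DAY / delay_seconds)

print(max_pages_per_day(1.0))  # 1 s delay -> 86400, matching the figure above
print(max_pages_per_day(5.0))  # a 5 s delay caps the host at 17280 pages/day
```

Adding more distinct hostnames raises the aggregate bound, which is why host diversity matters more than raw cluster size for polite crawls.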
The generate and update steps take longer and longer as the crawl grows, and the resources are used unevenly: when fetching there is little CPU or disk used, whereas when doing all the other activities you are not fetching anything at all, which is a w [...]
-  <p>StormCrawler proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accomodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, StormCrawler is more modern, easier to understand and build, nicer to use, more versatile and more acti [...]
-  <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that StormCrawler cannot currently do out of the box like deduplicating or advanced scoring like PageRank.</p>
+  <p>Apache StormCrawler (Incubating) proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accommodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, Apache StormCrawler (Incubating) is more modern, easier to understand and build, ni [...]
+  <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that Apache StormCrawler (Incubating) cannot currently do out of the box, like deduplicating or advanced scoring like PageRank.</p>

   <p><strong>Q: Do I need some sort of external storage? And if so, then what?</strong></p>

   <p>A: Yes, you'll need to store the URLs to fetch somewhere. The type of storage to use depends on the nature of your crawl.
If your crawl is not recursive, i.e. you just want to process specific pages and/or won't discover new pages through more than one path, then you could use messaging queues like <a href="https://www.rabbitmq.com/">RabbitMQ</a>, <a href="https://aws.amazon.com/sqs/">AWS SQS</a> or <a href="http://kafka.apache.org">Apache Kafka®</a>. All you'll need is a Spout i [...]
-  <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. StormCrawler has several resources you can use in the <a href="https://github.com/DigitalPebble/storm-crawler/tree/master/external">external [...]
-  <p>The advantage of using StormCrawler is that is it both modular and flexible. You can plug it to pretty much any storage you want.</p>
+  <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. Apache StormCrawler (Incubating) has several resources you can use in the <a href="https://github.com/DigitalPebble/storm-crawler/tree/maste [...]
+  <p>The advantage of using Apache StormCrawler (Incubating) is that it is both modular and flexible.
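[Editor's aside: the queue-versus-unique-key distinction drawn in the answer above can be illustrated with in-memory stand-ins. A hedged sketch: real crawls would use RabbitMQ/SQS/Kafka or a database, not Python containers, and the example URLs are invented.]

```python
from collections import deque

# A recursive crawl rediscovers http://example.com/a via two different paths.
discovered = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]

queue = deque()            # stand-in for a message queue
for url in discovered:
    queue.append(url)      # the duplicate is enqueued again -> fetched twice

keyed = set()              # stand-in for a storage with unique keys
for url in discovered:
    keyed.add(url)         # the rediscovery collapses onto the existing key

print(len(queue))  # 3 entries: one wasted fetch
print(len(keyed))  # 2 entries: each URL fetched once
```

This is why a keyed store (relational database, key-value store, search index) is the natural status backend for recursive crawls, while a queue suffices for one-shot URL streams.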
You can plug it into pretty much any storage you want.</p>

-  <p><strong>Q: Is StormCrawler polite?</strong></p>
+  <p><strong>Q: Is Apache StormCrawler (Incubating) polite?</strong></p>

   <p>A: The <a href="http://www.robotstxt.org/">robots.txt</a> protocol is supported and the fetchers are configured to have a <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/resources/crawler-default.yaml#L6">delay</a> between calls to the same hostname or domain. However, like with every tool, it is down to how people use it.</p>

   <p><strong>Q: When do I know when a crawl is finished?</strong></p>

diff --git a/getting-started/index.html b/getting-started/index.html
index 57be56d..fe328d4 100644
--- a/getting-started/index.html
+++ b/getting-started/index.html
@@ -1,16 +1,16 @@
 ---
 layout: default
 slug: getting-started
-title: Getting started with StormCrawler
+title: Getting started with Apache StormCrawler (Incubating)
 ---

 <div class="row row-col">
   <h1>Quickstart</h1>
   <br>
   <p>NOTE: These instructions assume that you have <a href="https://maven.apache.org/install.html">Apache Maven®</a> installed.
-  You will also need to install <a href="https://storm.apache.org/">Apache Storm®</a> to run the crawler. The version of Storm to use must match the one defined in the pom.xml file of your topology. The major version of StormCrawler mirrors the one from Apache Storm®, i.e whereas StormCrawler 1.x used Storm 1.2.3, the current version now requires Storm 2.6.0. Our <a href="https://github.com/DigitalPebble/ansible-storm">Ansible-Storm</a> repository contains resources to install Apache Sto [...]
+  You will also need to install <a href="https://storm.apache.org/">Apache Storm®</a> to run the crawler. The version of Storm to use must match the one defined in the pom.xml file of your topology.
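[Editor's aside: the version-matching requirement just mentioned would typically be expressed in the topology's pom.xml along these lines. A hypothetical fragment: the property name and the use of `storm-client` with `provided` scope are assumptions for illustration, not copied from the actual archetype.]

```xml
<!-- Hypothetical sketch: keep the Storm dependency in lockstep with the
     Storm version installed on the cluster that will run the topology. -->
<properties>
  <storm.version>2.6.0</storm.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-client</artifactId>
    <version>${storm.version}</version>
    <!-- provided: the cluster supplies the Storm jars at runtime -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```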
The major version of Apache StormCrawler (Incubating) mirrors the one from Apache Storm®, i.e. whereas StormCrawler 1.x used Storm 1.2.3, the current version now requires Storm 2.6.0. Our <a href="https://github.com/DigitalPebble/ansible-storm">Ansible-Storm</a> repository contains resources t [...]

-  <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new StormCrawler project using :</p>
+  <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new Apache StormCrawler (Incubating) project using:</p>

   <p><i>mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=2.11</i></p>

@@ -24,7 +24,7 @@ title: Getting started with StormCrawler
   <p>What this CrawlTopology does is very simple: it gets URLs to crawl from a <a href="https://urlfrontier.net">URLFrontier</a> instance and emits them on the topology. These URLs are then partitioned by hostname to enforce politeness and then fetched. The next bolt (SiteMapParserBolt) checks whether they are sitemap files and if not passes them on to an HTML parser. The parser extracts the text from the document and passes it to a dummy indexer which simply prints a representation of [...]

-  <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how StormCrawler works. In reality you'd use a different spout and index the documents to a proper backend. Look at the <a href="https://github.com/DigitalPebble/storm-crawler/tree/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>
+  <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how Apache StormCrawler (Incubating) works. In reality, you'd use a different spout and index the documents to a proper backend.
Look at the <a href="https://github.com/DigitalPebble/storm-crawler/tree/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>

   <p>You can run the topology in local mode with:</p>

@@ -36,7 +36,7 @@ title: Getting started with StormCrawler

   <br>

-  <p>If you want to use StormCrawler with Elasticsearch, the tutorial below should be a good starting point.</p>
+  <p>If you want to use Apache StormCrawler (Incubating) with Elasticsearch, the tutorial below should be a good starting point.</p>

   <iframe width="840" height="472" src="https://www.youtube.com/embed/8kpJLPdhvLw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   <br>

diff --git a/img/incubator_feather_egg_logo_bw_crop.png b/img/incubator_feather_egg_logo_bw_crop.png
new file mode 100644
index 0000000..377e4e3
Binary files /dev/null and b/img/incubator_feather_egg_logo_bw_crop.png differ

diff --git a/index.html b/index.html
index 8f86388..7166ffd 100644
--- a/index.html
+++ b/index.html
@@ -9,7 +9,7 @@ slug: home
   </div>
   <div class="row row-col">
     <p><strong>Apache StormCrawler (Incubating)</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache license v2 and consists of a collection of reusable resources and components, written mostly in Java.</p>
-    <p>The aim of StormCrawler is to help build web crawlers that are :</p>
+    <p>The aim of Apache StormCrawler (Incubating) is to help build web crawlers that are:</p>
     <ul>
       <li>scalable</li>
       <li>resilient</li>
@@ -19,7 +19,7 @@ slug: home
     </ul>
     <p><strong>Apache StormCrawler (Incubating)</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward!
Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
     <p>Apart from the core components, we provide some <a href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external resources</a> that you can reuse in your project, like for instance our spout and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various document formats.</p>
-    <p><strong>Apache StormCrawler</strong> is perfectly suited to use cases where the URL to fetch and parse come as streams but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
+    <p><strong>Apache StormCrawler (Incubating)</strong> is perfectly suited to use cases where the URLs to fetch and parse come as streams, but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
     <p>The <a href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a> page contains links to some recent presentations made about this project.</p>
   </div>

diff --git a/support/index.html b/support/index.html
index 659c2b7..8c5723d 100644
--- a/support/index.html
+++ b/support/index.html
@@ -7,16 +7,16 @@ title: Need assistance from web crawling experts?
 <div class="row row-col">
   <h1>Support</h1>
   <br>
-<p>You can ask questions related to StormCrawler on Github in the <a href="https://github.com/apache/incubator-stormcrawlerdiscussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">stackoverflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
-<p>If you think you've found a bug, please <a href="https://github.com/apache/incubator-stormcrawlerissues">open an issue</a> on GitHub.</p>
+<p>You can ask questions related to Apache StormCrawler (Incubating) on GitHub in the <a href="https://github.com/apache/incubator-stormcrawler/discussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">Stack Overflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
+<p>If you think you've found a bug, please <a href="https://github.com/apache/incubator-stormcrawler/issues">open an issue</a> on GitHub.</p>

 <h1>Commercial Support</h1>
 <br>
-  <p>The Apache StormCrawler PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>
+  <p>The Apache StormCrawler (Incubating) PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>

 <h2>Want to be added to this page? </h2>
 <p>All submitted information must be factual and informational in nature and not be a marketing statement. Statements that promote your products and services over other offerings on the page will not be tolerated and will be removed. Such marketing statements can be added to your own pages on your own site.</p>

-  <p>When in doubt, email the Apache StormCrawler PMC and ask. We are be happy to help.</p>
+  <p>When in doubt, email the Apache StormCrawler (Incubating) PMC and ask. We are happy to help.</p>

 <h2>Companies</h2>
 <ul>
