This is an automated email from the ASF dual-hosted git repository.

rzo1 pushed a commit to branch 5
in repository https://gitbox.apache.org/repos/asf/incubator-stormcrawler-site.git
commit 707b5e3b7e302a0a1e3edf38ec0697f6486bbd76
Author: Richard Zowalla <[email protected]>
AuthorDate: Mon Apr 22 11:14:51 2024 +0200

    Fixes #5

    Removes Google Analytics
    Removes GH badges (blocked by ASF policy anyway)
    Adds (Incubating) disclaimer to mentions of SC
---
 .asf.yaml                                  |  2 +-
 README.md                                  |  2 +-
 _config-local.yml                          |  2 +-
 _config.yml                                |  4 ++--
 _includes/footer.html                      | 13 +++++--------
 _includes/header.html                      |  2 +-
 _layouts/default.html                      | 10 ----------
 faq/index.html                             | 14 +++++++-------
 getting-started/index.html                 | 10 +++++-----
 img/incubator_feather_egg_logo_bw_crop.png | Bin 0 -> 56218 bytes
 index.html                                 |  4 ++--
 support/index.html                         |  8 ++++----
 12 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/.asf.yaml b/.asf.yaml
index 51d9104..470bba3 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -10,7 +10,7 @@ publish:
   whoami: asf-site
 github:
-  description: "Source for the Apache StormCrawler web site"
+  description: "Source for the Apache StormCrawler (Incubating) web site"
   homepage: https://stormcrawler.apache.org/
   features:
     # Enable wiki for documentation

diff --git a/README.md b/README.md
index 708f97e..d91a94b 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Apache StormCrawler Website
+# Apache StormCrawler (Incubating) Website

 ## How to build?

diff --git a/_config-local.yml b/_config-local.yml
index 4b53b56..0a011ee 100644
--- a/_config-local.yml
+++ b/_config-local.yml
@@ -9,7 +9,7 @@
 title: Apache StormCrawler
 email: [email protected]
 description: > # this means to ignore newlines until "baseurl:"
-  Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+  Apache StormCrawler (Incubating) is a collection of resources for building low-latency, scalable web crawlers on Apache Storm
 baseurl: "" # the subpath of your site, e.g.
/blog
 url: "http://localhost:4000" # the base hostname & protocol for your site

diff --git a/_config.yml b/_config.yml
index cc84c9b..39c06e6 100644
--- a/_config.yml
+++ b/_config.yml
@@ -6,10 +6,10 @@
 # 'jekyll serve'. If you change this file, please restart the server process.

 # Site settings
-title: Apache StormCrawler
+title: Apache StormCrawler (Incubating)
 email: [email protected]
 description: > # this means to ignore newlines until "baseurl:"
-  Apache StormCrawler is collection of resources for building low-latency, scalable web crawlers on Apache Storm
+  Apache StormCrawler (Incubating) is a collection of resources for building low-latency, scalable web crawlers on Apache Storm
 baseurl: "" # the subpath of your site, e.g. /blog
 url: "https://stormcrawler.apache.org" # the base hostname & protocol for your site

diff --git a/_includes/footer.html b/_includes/footer.html
index 99befa0..41e74ae 100644
--- a/_includes/footer.html
+++ b/_includes/footer.html
@@ -1,11 +1,8 @@
-<div class="github-info">
-  <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=incubator-stormcrawler&type=star&count=true" frameborder="0" scrolling="0" width="105px" height="20px"></iframe>
-  <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=incubator-stormcrawler&type=watch&count=true&v=2" frameborder="0" scrolling="0" width="110px" height="20px"></iframe>
-  <iframe src="https://ghbtns.com/github-btn.html?user=apache&repo=incubator-stormcrawler&type=fork&count=true" frameborder="0" scrolling="0" width="101px" height="20px"></iframe>
-</div>
-
 <footer class="site-footer">
-  © 2024 <a href="https://stormcrawler.apache.org/">The Apache Software Foundation</a>
-<p>Licensed under the Apache License, Version 2.0. Apache StormCrawler, StormCrawler, the Apache feather logo are trademarks of The Apache Software Foundation.
All other marks mentioned may be trademarks or registered trademarks of their respective owners.</p>
+  <img src="img/incubator_feather_egg_logo_bw_crop.png" alt="Apache Incubator Logo" width="500"><br/>
+  Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the p [...]
+<br/> <br/>
+  © 2024 <a href="https://stormcrawler.apache.org/">The Apache Software Foundation</a><br/><br/>
+Licensed under the Apache License, Version 2.0. <br/> Apache StormCrawler, StormCrawler, and the Apache feather logo are trademarks of The Apache Software Foundation. <br/> All other marks mentioned may be trademarks or registered trademarks of their respective owners.
 </footer>

diff --git a/_includes/header.html b/_includes/header.html
index 1878c54..e538166 100644
--- a/_includes/header.html
+++ b/_includes/header.html
@@ -1,7 +1,7 @@
 <header class="site-header">
   <div class="site-header__wrap">
     <div class="site-header__logo">
-      <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/logo.png" alt="Apache StormCrawler"></a>
+      <a href="{{ site.baseurl }}/"><img src="{{ site.baseurl }}/img/logo.png" alt="Apache StormCrawler (Incubating)"></a>
     </div>
   </div>
 </header>

diff --git a/_layouts/default.html b/_layouts/default.html
index 1d5fcdb..82b0e91 100644
--- a/_layouts/default.html
+++ b/_layouts/default.html
@@ -13,16 +13,6 @@
     {% include footer.html %}

-    <script>
-      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
-      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
-      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
-      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
-
-      ga('create', 'UA-71137732-1', 'auto');
-      ga('send', 'pageview');
-    </script>
-
   </body>
 </html>

diff --git a/faq/index.html b/faq/index.html
index fd9d072..77fd414 100644
--- a/faq/index.html
+++ b/faq/index.html
@@ -10,7 +10,7 @@ slug: faq
   <p>A: Probably worth having a look at <a href="http://storm.apache.org/">Apache Storm®</a> first. The <a href="http://storm.apache.org/releases/current/Tutorial.html">tutorial</a> and <a href="http://storm.apache.org/documentation/Concepts.html">concept</a> pages are good starting points.</p>

-  <p><strong>Q: Do I need an Apache Storm® cluster to run StormCrawler?</strong></p>
+  <p><strong>Q: Do I need an Apache Storm® cluster to run Apache StormCrawler (Incubating)?</strong></p>

   <p>A: No. It can run in local mode and will just use the Storm libraries as dependencies.
It makes sense to install Storm in pseudo-distributed mode though, so that you can use its UI to monitor the topologies.</p>

@@ -18,7 +18,7 @@ slug: faq
   <p>A: Apache Storm® is an elegant framework, with simple concepts, which provides a solid platform for distributed stream processing. It gives us fault tolerance and guaranteed data processing out of the box. The project is also very dynamic and backed by a thriving community. Last but not least, it is under the ASF 2.0 license.</p>

-  <p id="howfast"><strong>Q: How fast is StormCrawler?</strong></p>
+  <p id="howfast"><strong>Q: How fast is Apache StormCrawler (Incubating)?</strong></p>

   <p>A: This depends mainly on the diversity of hostnames as well as your politeness settings. For instance, if you have 1M URLs from the same host and have set a delay of 1 sec between requests, then the best you'll be able to do is 86400 pages per day. In practice it would be less than that, as time is also needed for fetching the content (which itself depends on your network and how large the documents are), parsing and indexing it, etc. This is true of any crawler, not just StormCrawler.</p>

@@ -27,16 +27,16 @@ slug: faq
   <p>A: This <a href="http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html">tutorial</a> on using Apache Nutch® and SC for indexing with CloudSearch gives you some idea of how they compare in their methodology and performance. We also ran a comparative <a href="http://digitalpebble.blogspot.co.uk/2017/01/the-battle-of-crawlers-apache-nutch-vs.html">benchmark</a> on a larger crawl.</p>

   <p>In a nutshell (pardon the pun), Nutch proceeds by batch steps where it selects the URLs to fetch, fetches them, parses them, then updates its database with the new info about the URLs it just processed and adds the newly discovered URLs.
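[Editor's aside: the back-of-the-envelope bound in the "how fast" answer above can be checked with a few lines. This is a hedged sketch, not StormCrawler code; the one-second delay and single-host setup come from the FAQ example.]

```python
# Upper bound on pages fetched per day from a single host, given a
# politeness delay between consecutive requests to that host.
# Ignores fetch/parse time, so real throughput will be lower.
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def max_pages_per_day(delay_seconds: float) -> int:
    """Best case pages/day when requests to one host are strictly serialised."""
    return int(SECONDS_PER_DAY / delay_seconds)

print(max_pages_per_day(1.0))  # 1 s delay -> 86400, matching the figure above
print(max_pages_per_day(5.0))  # a 5 s delay caps the host at 17280 pages/day
```

Adding more distinct hostnames raises the aggregate bound, which is why host diversity matters more than raw cluster size for polite crawls.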
The generate and update steps take longer and longer as the crawl grows, and the resources are used unevenly: when fetching there is little CPU or disk used, whereas when doing all the other activities you are not fetching anything at all, which is a w [...]
-  <p>StormCrawler proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accomodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, StormCrawler is more modern, easier to understand and build, nicer to use, more versatile and more acti [...]
-  <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that StormCrawler cannot currently do out of the box like deduplicating or advanced scoring like PageRank.</p>
+  <p>Apache StormCrawler (Incubating) proceeds differently and does everything at the same time, hence optimising the physical resources of the cluster, but can potentially accommodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note and apart from being potentially more efficient, Apache StormCrawler (Incubating) is more modern, easier to understand and build, ni [...]
+  <p>Apache Nutch® is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that Apache StormCrawler (Incubating) cannot currently do out of the box, like deduplicating or advanced scoring like PageRank.</p>

   <p><strong>Q: Do I need some sort of external storage? And if so, then what?</strong></p>

   <p>A: Yes, you'll need to store the URLs to fetch somewhere. The type of storage to use depends on the nature of your crawl.
If your crawl is not recursive, i.e. you just want to process specific pages and/or won't discover new pages through more than one path, then you could use messaging queues like <a href="https://www.rabbitmq.com/">RabbitMQ</a>, <a href="https://aws.amazon.com/sqs/">AWS SQS</a> or <a href="http://kafka.apache.org">Apache Kafka®</a>. All you'll need is a Spout i [...]
-  <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. StormCrawler has several resources you can use in the <a href="https://github.com/DigitalPebble/storm-crawler/tree/master/external">external [...]
-  <p>The advantage of using StormCrawler is that is it both modular and flexible. You can plug it to pretty much any storage you want.</p>
+  <p>If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, like for instance a relational database. Apache StormCrawler (Incubating) has several resources you can use in the <a href="https://github.com/DigitalPebble/storm-crawler/tree/maste [...]
+  <p>The advantage of using Apache StormCrawler (Incubating) is that it is both modular and flexible.
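[Editor's aside: the queue-versus-unique-key distinction drawn in the answer above can be illustrated with in-memory stand-ins. A hedged sketch: real crawls would use RabbitMQ/SQS/Kafka or a database, not Python containers, and the example URLs are invented.]

```python
from collections import deque

# A recursive crawl rediscovers http://example.com/a via two different paths.
discovered = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]

queue = deque()            # stand-in for a message queue
for url in discovered:
    queue.append(url)      # the duplicate is enqueued again -> fetched twice

keyed = set()              # stand-in for a storage with unique keys
for url in discovered:
    keyed.add(url)         # the rediscovery collapses onto the existing key

print(len(queue))  # 3 entries: one wasted fetch
print(len(keyed))  # 2 entries: each URL fetched once
```

This is why a keyed store (relational database, key-value store, search index) is the natural status backend for recursive crawls, while a queue suffices for one-shot URL streams.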
You can plug it into pretty much any storage you want.</p>

-  <p><strong>Q: Is StormCrawler polite?</strong></p>
+  <p><strong>Q: Is Apache StormCrawler (Incubating) polite?</strong></p>

   <p>A: The <a href="http://www.robotstxt.org/">robots.txt</a> protocol is supported and the fetchers are configured to have a <a href="https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/resources/crawler-default.yaml#L6">delay</a> between calls to the same hostname or domain. However, like with every tool, it is down to how people use it.</p>

   <p><strong>Q: When do I know when a crawl is finished?</strong></p>

diff --git a/getting-started/index.html b/getting-started/index.html
index 57be56d..fe328d4 100644
--- a/getting-started/index.html
+++ b/getting-started/index.html
@@ -1,16 +1,16 @@
 ---
 layout: default
 slug: getting-started
-title: Getting started with StormCrawler
+title: Getting started with Apache StormCrawler (Incubating)
 ---

 <div class="row row-col">
   <h1>Quickstart</h1>
   <br>
   <p>NOTE: These instructions assume that you have <a href="https://maven.apache.org/install.html">Apache Maven®</a> installed.
-  You will also need to install <a href="https://storm.apache.org/">Apache Storm®</a> to run the crawler. The version of Storm to use must match the one defined in the pom.xml file of your topology. The major version of StormCrawler mirrors the one from Apache Storm®, i.e whereas StormCrawler 1.x used Storm 1.2.3, the current version now requires Storm 2.6.0. Our <a href="https://github.com/DigitalPebble/ansible-storm">Ansible-Storm</a> repository contains resources to install Apache Sto [...]
+  You will also need to install <a href="https://storm.apache.org/">Apache Storm®</a> to run the crawler. The version of Storm to use must match the one defined in the pom.xml file of your topology.
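[Editor's aside: the version-matching requirement just mentioned would typically be expressed in the topology's pom.xml along these lines. A hypothetical fragment: the property name and the use of `storm-client` with `provided` scope are assumptions for illustration, not copied from the actual archetype.]

```xml
<!-- Hypothetical sketch: keep the Storm dependency in lockstep with the
     Storm version installed on the cluster that will run the topology. -->
<properties>
  <storm.version>2.6.0</storm.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-client</artifactId>
    <version>${storm.version}</version>
    <!-- provided: the cluster supplies the Storm jars at runtime -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```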
The major version of Apache StormCrawler (Incubating) mirrors the one from Apache Storm®, i.e. whereas StormCrawler 1.x used Storm 1.2.3, the current version now requires Storm 2.6.0. Our <a href="https://github.com/DigitalPebble/ansible-storm">Ansible-Storm</a> repository contains resources t [...]

-  <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new StormCrawler project using :</p>
+  <p>Once Apache Storm® is installed, the easiest way to get started is to generate a brand new Apache StormCrawler (Incubating) project using:</p>

   <p><i>mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=2.11</i></p>

@@ -24,7 +24,7 @@ title: Getting started with StormCrawler
   <p>What this CrawlTopology does is very simple: it gets URLs to crawl from a <a href="https://urlfrontier.net">URLFrontier</a> instance and emits them on the topology. These URLs are then partitioned by hostname to enforce politeness and then fetched. The next bolt (SiteMapParserBolt) checks whether they are sitemap files and if not passes them on to an HTML parser. The parser extracts the text from the document and passes it to a dummy indexer which simply prints a representation of [...]

-  <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how StormCrawler works. In reality you'd use a different spout and index the documents to a proper backend. Look at the <a href="https://github.com/DigitalPebble/storm-crawler/tree/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>
+  <p>Of course this topology is very primitive and its purpose is merely to give you an idea of how Apache StormCrawler (Incubating) works. In reality, you'd use a different spout and index the documents to a proper backend.
Look at the <a href="https://github.com/DigitalPebble/storm-crawler/tree/master/external">external modules</a> to see what's already available. Another limitation of this topology is that it will work in local mode only or on a single worker.</p>

   <p>You can run the topology in local mode with:</p>

@@ -36,7 +36,7 @@ title: Getting started with StormCrawler

   <br>

-  <p>If you want to use StormCrawler with Elasticsearch, the tutorial below should be a good starting point.</p>
+  <p>If you want to use Apache StormCrawler (Incubating) with Elasticsearch, the tutorial below should be a good starting point.</p>

   <iframe width="840" height="472" src="https://www.youtube.com/embed/8kpJLPdhvLw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   <br>

diff --git a/img/incubator_feather_egg_logo_bw_crop.png b/img/incubator_feather_egg_logo_bw_crop.png
new file mode 100644
index 0000000..377e4e3
Binary files /dev/null and b/img/incubator_feather_egg_logo_bw_crop.png differ

diff --git a/index.html b/index.html
index 8f86388..7166ffd 100644
--- a/index.html
+++ b/index.html
@@ -9,7 +9,7 @@ slug: home
   </div>
   <div class="row row-col">
     <p><strong>Apache StormCrawler (Incubating)</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache license v2 and consists of a collection of reusable resources and components, written mostly in Java.</p>
-    <p>The aim of StormCrawler is to help build web crawlers that are :</p>
+    <p>The aim of Apache StormCrawler (Incubating) is to help build web crawlers that are:</p>
     <ul>
       <li>scalable</li>
       <li>resilient</li>
@@ -19,7 +19,7 @@ slug: home
     </ul>
     <p><strong>Apache StormCrawler (Incubating)</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward!
Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
     <p>Apart from the core components, we provide some <a href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external resources</a> that you can reuse in your project, like for instance our spout and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various document formats.</p>
-    <p><strong>Apache StormCrawler</strong> is perfectly suited to use cases where the URL to fetch and parse come as streams but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
+    <p><strong>Apache StormCrawler (Incubating)</strong> is perfectly suited to use cases where the URLs to fetch and parse come as streams, but is also an appropriate solution for large scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
     <p>The <a href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a> page contains links to some recent presentations made about this project.</p>
   </div>

diff --git a/support/index.html b/support/index.html
index 659c2b7..8c5723d 100644
--- a/support/index.html
+++ b/support/index.html
@@ -7,16 +7,16 @@ title: Need assistance from web crawling experts?
 <div class="row row-col">
   <h1>Support</h1>
   <br>
-<p>You can ask questions related to StormCrawler on Github in the <a href="https://github.com/apache/incubator-stormcrawlerdiscussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">stackoverflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
-<p>If you think you've found a bug, please <a href="https://github.com/apache/incubator-stormcrawlerissues">open an issue</a> on GitHub.</p>
+<p>You can ask questions related to Apache StormCrawler (Incubating) on GitHub in the <a href="https://github.com/apache/incubator-stormcrawler/discussions">discussions section</a>, on <a href="http://stackoverflow.com/questions/tagged/stormcrawler">Stack Overflow</a> using the tag 'stormcrawler' or on <a href="https://discord.com/invite/C62MHusNnG">Discord</a>.</p>
+<p>If you think you've found a bug, please <a href="https://github.com/apache/incubator-stormcrawler/issues">open an issue</a> on GitHub.</p>

 <h1>Commercial Support</h1>
 <br>
-  <p>The Apache StormCrawler PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>
+  <p>The Apache StormCrawler (Incubating) PMC does not endorse or recommend any of the products or services on this page. We love all our supporters equally.</p>

 <h2>Want to be added to this page? </h2>
 <p>All submitted information must be factual and informational in nature and not be a marketing statement. Statements that promote your products and services over other offerings on the page will not be tolerated and will be removed. Such marketing statements can be added to your own pages on your own site.</p>

-  <p>When in doubt, email the Apache StormCrawler PMC and ask. We are be happy to help.</p>
+  <p>When in doubt, email the Apache StormCrawler (Incubating) PMC and ask. We are happy to help.</p>

 <h2>Companies</h2>
 <ul>
