This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git
The following commit(s) were added to refs/heads/asf-site by this push: new a6ce63fb9c docs: udpate third party projects (#497) a6ce63fb9c is described below commit a6ce63fb9c82dc8f25f42f377b487c0de2aff826 Author: Matthew Powers <matthewkevinpow...@gmail.com> AuthorDate: Thu Jan 25 11:18:05 2024 -0500 docs: udpate third party projects (#497) --- site/third-party-projects.html | 79 ++++++++++++++++++++++-------------------- third-party-projects.md | 77 ++++++++++++++++++++-------------------- 2 files changed, 81 insertions(+), 75 deletions(-) diff --git a/site/third-party-projects.html b/site/third-party-projects.html index ba0911b733..a0f7a953f8 100644 --- a/site/third-party-projects.html +++ b/site/third-party-projects.html @@ -141,40 +141,57 @@ <div class="col-12 col-md-9"> <p>This page tracks external software projects that supplement Apache Spark and add to its ecosystem.</p> -<p>To add a project, open a pull request against the <a href="https://github.com/apache/spark-website">spark-website</a> -repository. Add an entry to -<a href="https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md">this markdown file</a>, -then run <code class="language-plaintext highlighter-rouge">jekyll build</code> to generate the HTML too. Include -both in your pull request. 
See the README in this repo for more information.</p> +<h2 id="popular-libraries-with-pyspark-integrations">Popular libraries with PySpark integrations</h2> -<p>Note that all project and product names should follow <a href="/trademarks.html">trademark guidelines</a>.</p> +<ul> + <li><a href="https://github.com/great-expectations/great_expectations">great-expectations</a> - Always know what to expect from your data</li> + <li><a href="https://github.com/apache/airflow">Apache Airflow</a> - A platform to programmatically author, schedule, and monitor workflows</li> + <li><a href="https://github.com/dmlc/xgboost">xgboost</a> - Scalable, portable and distributed gradient boosting</li> + <li><a href="https://github.com/shap/shap">shap</a> - A game theoretic approach to explain the output of any machine learning model</li> + <li><a href="https://github.com/awslabs/python-deequ">python-deequ</a> - Measures data quality in large datasets</li> + <li><a href="https://github.com/datahub-project/datahub">datahub</a> - Metadata platform for the modern data stack</li> + <li><a href="https://github.com/dbt-labs/dbt-spark">dbt-spark</a> - Enables dbt to work with Apache Spark</li> +</ul> -<h2>spark-packages.org</h2> +<h2 id="connectors">Connectors</h2> -<p><a href="https://spark-packages.org/">spark-packages.org</a> is an external, -community-managed list of third-party libraries, add-ons, and applications that work with -Apache Spark. 
You can add a package as long as you have a GitHub repository.</p> +<ul> + <li><a href="https://github.com/spark-redshift-community/spark-redshift">spark-redshift</a> - Performant Redshift data source for Apache Spark</li> + <li><a href="https://github.com/microsoft/sql-spark-connector">spark-sql-connector</a> - Apache Spark Connector for SQL Server and Azure SQL</li> + <li><a href="https://github.com/Azure/azure-cosmosdb-spark">azure-cosmos-spark</a> - Apache Spark Connector for Azure Cosmos DB</li> + <li><a href="https://github.com/Azure/azure-event-hubs-spark">azure-event-hubs-spark</a> - Enables continuous data processing with Apache Spark and Azure Event Hubs</li> + <li><a href="https://github.com/Azure/azure-kusto-spark">azure-kusto-spark</a> - Apache Spark connector for Azure Kusto</li> + <li><a href="https://github.com/mongodb/mongo-spark">mongo-spark</a> - The MongoDB Spark connector</li> + <li><a href="https://github.com/couchbase/couchbase-spark-connector">couchbase-spark-connector</a> - The Official Couchbase Spark connector</li> + <li><a href="https://github.com/datastax/spark-cassandra-connector">spark-cassandra-connector</a> - DataStax connector for Apache Spark to Apache Cassandra</li> + <li><a href="https://github.com/elastic/elasticsearch-hadoop">elasticsearch-hadoop</a> - Elasticsearch real-time search and analytics natively integrated with Spark</li> + <li><a href="https://github.com/neo4j-contrib/neo4j-spark-connector">neo4j-spark-connector</a> - Neo4j Connector for Apache Spark</li> + <li><a href="https://github.com/StarRocks/starrocks-connector-for-apache-spark">starrocks-connector-for-apache-spark</a> - StarRocks Apache Spark connector</li> + <li><a href="https://github.com/pingcap/tispark">tispark</a> - TiSpark is built for running Apache Spark on top of TiDB/TiKV</li> +</ul> + +<h2 id="open-table-formats">Open table formats</h2> + +<ul> + <li><a href="https://delta.io">Delta Lake</a> - Storage layer that provides ACID transactions and 
scalable metadata handling for Apache Spark workloads</li> + <li><a href="https://github.com/apache/hudi">Hudi</a>: Upserts, Deletes And Incremental Processing on Big Data</li> + <li><a href="https://github.com/apache/iceberg">Iceberg</a> - Open table format for analytic datasets</li> +</ul> <h2>Infrastructure projects</h2> <ul> - <li><a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server for Apache Spark</a> - -REST interface for managing and submitting Spark jobs on the same cluster.</li> - <li><a href="http://mlbase.org/">MLbase</a> - Machine Learning research project on top of Spark</li> + <li><a href="https://github.com/apache/kyuubi">Kyuubi</a> - Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses</li> + <li><a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server for Apache Spark</a> - REST interface for managing and submitting Spark jobs on the same cluster.</li> <li><a href="https://mesos.apache.org/">Apache Mesos</a> - Cluster management system that supports running Spark</li> <li><a href="https://www.alluxio.org/">Alluxio</a> (née Tachyon) - Memory speed virtual distributed storage system that supports running Spark</li> <li><a href="https://github.com/filodb/FiloDB">FiloDB</a> - a Spark integrated analytical/columnar database, with in-memory option capable of sub-second concurrent queries</li> - <li><a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook which supports 20+ language backends, -including Apache Spark</li> - <li><a href="https://github.com/EclairJS/eclairjs-node">EclairJS</a> - enables Node.js developers to code -against Spark, and data scientists to use Javascript in Jupyter notebooks.</li> - <li><a href="https://github.com/Hydrospheredata/mist">Mist</a> - Serverless proxy for Spark cluster (spark middleware)</li> + <li><a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook which supports 20+ 
language backends, including Apache Spark</li> <li><a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator">K8S Operator for Apache Spark</a> - Kubernetes operator for specifying and managing the lifecycle of Apache Spark applications on Kubernetes.</li> <li><a href="https://developer.ibm.com/storage/products/ibm-spectrum-conductor-spark/">IBM Spectrum Conductor</a> - Cluster management software that integrates with Spark and modern computing frameworks.</li> - <li><a href="https://delta.io">Delta Lake</a> - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads.</li> <li><a href="https://mlflow.org">MLflow</a> - Open source platform to manage the machine learning lifecycle, including deploying models from diverse machine learning libraries on Apache Spark.</li> <li><a href="https://datafu.apache.org/docs/spark/getting-started.html">Apache DataFu</a> - A collection of utils and user-defined-functions for working with large scale data in Apache Spark, as well as making Scala-Python interoperability easier.</li> </ul> @@ -184,16 +201,6 @@ against Spark, and data scientists to use Javascript in Jupyter notebooks.</li> <ul> <li><a href="https://mahout.apache.org/">Apache Mahout</a> - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend</li> - <li><a href="https://wiki.apache.org/mrql/">Apache MRQL</a> - A query processing and optimization -system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark</li> - <li><a href="https://github.com/sameeragarwal/blinkdb">BlinkDB</a> - a massively parallel, approximate query engine built -on top of Shark and Spark</li> - <li><a href="https://github.com/adobe-research/spindle">Spindle</a> - Spark/Parquet-based web -analytics query engine</li> - <li><a href="https://github.com/thunderain-project/thunderain">Thunderain</a> - a framework -for combining stream processing with historical data, think Lambda 
architecture</li> - <li><a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda architecture on Apache Spark, -Apache Kafka for real-time large scale machine learning</li> <li><a href="https://github.com/bigdatagenomics/adam">ADAM</a> - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark</li> <li><a href="https://github.com/salesforce/TransmogrifAI">TransmogrifAI</a> - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning</li> @@ -204,7 +211,6 @@ transforming, and analyzing genomic data using Apache Spark</li> <h2>Performance, monitoring, and debugging tools for Spark</h2> <ul> - <li><a href="https://github.com/g1thubhub/phil_stopwatch">Performance and debugging library</a> - A library to analyze Spark and PySpark applications for improving performance and finding the cause of failures</li> <li><a href="https://www.datamechanics.co/delight">Data Mechanics Delight</a> - Delight is a free, hosted, cross-platform Spark UI alternative backed by an open-source Spark agent. 
It features new metrics and visualizations to simplify Spark monitoring and performance tuning.</li> </ul> @@ -219,16 +225,9 @@ transforming, and analyzing genomic data using Apache Spark</li> <h3>Clojure</h3> <ul> - <li><a href="https://github.com/TheClimateCorporation/clj-spark">clj-spark</a></li> <li><a href="https://github.com/zero-one-group/geni">Geni</a> - A Clojure dataframe library that runs on Apache Spark with a focus on optimizing the REPL experience.</li> </ul> -<h3>Groovy</h3> - -<ul> - <li><a href="https://github.com/bunions1/groovy-spark-example">groovy-spark-example</a></li> -</ul> - <h3>Julia</h3> <ul> @@ -241,6 +240,12 @@ transforming, and analyzing genomic data using Apache Spark</li> <li><a href="https://github.com/JetBrains/kotlin-spark-api">Kotlin for Apache Spark</a></li> </ul> +<h2 id="adding-new-projects">Adding new projects</h2> + +<p>To add a project, open a pull request against the <a href="https://github.com/apache/spark-website">spark-website</a> repository. Add an entry to <a href="https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md">this markdown file</a>, then run <code class="language-plaintext highlighter-rouge">jekyll build</code> to generate the HTML too. Include both in your pull request. See the README in this repo for more information.</p> + +<p>Note that all project and product names should follow <a href="/trademarks.html">trademark guidelines</a>.</p> + </div> <div class="col-12 col-md-3"> <div class="news" style="margin-bottom: 20px;"> diff --git a/third-party-projects.md b/third-party-projects.md index cf6f3c8102..e8b4b16c85 100644 --- a/third-party-projects.md +++ b/third-party-projects.md @@ -9,39 +9,50 @@ navigation: This page tracks external software projects that supplement Apache Spark and add to its ecosystem. -To add a project, open a pull request against the [spark-website](https://github.com/apache/spark-website) -repository. 
Add an entry to -[this markdown file](https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md), -then run `jekyll build` to generate the HTML too. Include -both in your pull request. See the README in this repo for more information. - -Note that all project and product names should follow [trademark guidelines](/trademarks.html). - -<h2>spark-packages.org</h2> - -<a href="https://spark-packages.org/">spark-packages.org</a> is an external, -community-managed list of third-party libraries, add-ons, and applications that work with -Apache Spark. You can add a package as long as you have a GitHub repository. +## Popular libraries with PySpark integrations + +- [great-expectations](https://github.com/great-expectations/great_expectations) - Always know what to expect from your data +- [Apache Airflow](https://github.com/apache/airflow) - A platform to programmatically author, schedule, and monitor workflows +- [xgboost](https://github.com/dmlc/xgboost) - Scalable, portable and distributed gradient boosting +- [shap](https://github.com/shap/shap) - A game theoretic approach to explain the output of any machine learning model +- [python-deequ](https://github.com/awslabs/python-deequ) - Measures data quality in large datasets +- [datahub](https://github.com/datahub-project/datahub) - Metadata platform for the modern data stack +- [dbt-spark](https://github.com/dbt-labs/dbt-spark) - Enables dbt to work with Apache Spark + +## Connectors + +- [spark-redshift](https://github.com/spark-redshift-community/spark-redshift) - Performant Redshift data source for Apache Spark +- [spark-sql-connector](https://github.com/microsoft/sql-spark-connector) - Apache Spark Connector for SQL Server and Azure SQL +- [azure-cosmos-spark](https://github.com/Azure/azure-cosmosdb-spark) - Apache Spark Connector for Azure Cosmos DB +- [azure-event-hubs-spark](https://github.com/Azure/azure-event-hubs-spark) - Enables continuous data processing with Apache Spark and Azure Event 
Hubs +- [azure-kusto-spark](https://github.com/Azure/azure-kusto-spark) - Apache Spark connector for Azure Kusto +- [mongo-spark](https://github.com/mongodb/mongo-spark) - The MongoDB Spark connector +- [couchbase-spark-connector](https://github.com/couchbase/couchbase-spark-connector) - The Official Couchbase Spark connector +- [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) - DataStax connector for Apache Spark to Apache Cassandra +- [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop) - Elasticsearch real-time search and analytics natively integrated with Spark +- [neo4j-spark-connector](https://github.com/neo4j-contrib/neo4j-spark-connector) - Neo4j Connector for Apache Spark +- [starrocks-connector-for-apache-spark](https://github.com/StarRocks/starrocks-connector-for-apache-spark) - StarRocks Apache Spark connector +- [tispark](https://github.com/pingcap/tispark) - TiSpark is built for running Apache Spark on top of TiDB/TiKV + +## Open table formats + +- <a href="https://delta.io">Delta Lake</a> - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads +- [Hudi](https://github.com/apache/hudi): Upserts, Deletes And Incremental Processing on Big Data +- [Iceberg](https://github.com/apache/iceberg) - Open table format for analytic datasets <h2>Infrastructure projects</h2> -- <a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server for Apache Spark</a> - -REST interface for managing and submitting Spark jobs on the same cluster. 
-- <a href="http://mlbase.org/">MLbase</a> - Machine Learning research project on top of Spark +- [Kyuubi](https://github.com/apache/kyuubi) - Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses +- <a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server for Apache Spark</a> - REST interface for managing and submitting Spark jobs on the same cluster. - <a href="https://mesos.apache.org/">Apache Mesos</a> - Cluster management system that supports running Spark - <a href="https://www.alluxio.org/">Alluxio</a> (née Tachyon) - Memory speed virtual distributed storage system that supports running Spark - <a href="https://github.com/filodb/FiloDB">FiloDB</a> - a Spark integrated analytical/columnar database, with in-memory option capable of sub-second concurrent queries -- <a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook which supports 20+ language backends, -including Apache Spark -- <a href="https://github.com/EclairJS/eclairjs-node">EclairJS</a> - enables Node.js developers to code -against Spark, and data scientists to use Javascript in Jupyter notebooks. -- <a href="https://github.com/Hydrospheredata/mist">Mist</a> - Serverless proxy for Spark cluster (spark middleware) +- <a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook which supports 20+ language backends, including Apache Spark - <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator">K8S Operator for Apache Spark</a> - Kubernetes operator for specifying and managing the lifecycle of Apache Spark applications on Kubernetes. - <a href="https://developer.ibm.com/storage/products/ibm-spectrum-conductor-spark/">IBM Spectrum Conductor</a> - Cluster management software that integrates with Spark and modern computing frameworks. 
-- <a href="https://delta.io">Delta Lake</a> - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads. - <a href="https://mlflow.org">MLflow</a> - Open source platform to manage the machine learning lifecycle, including deploying models from diverse machine learning libraries on Apache Spark. - <a href="https://datafu.apache.org/docs/spark/getting-started.html">Apache DataFu</a> - A collection of utils and user-defined-functions for working with large scale data in Apache Spark, as well as making Scala-Python interoperability easier. @@ -49,16 +60,6 @@ against Spark, and data scientists to use Javascript in Jupyter notebooks. - <a href="https://mahout.apache.org/">Apache Mahout</a> - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend -- <a href="https://wiki.apache.org/mrql/">Apache MRQL</a> - A query processing and optimization -system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark -- <a href="https://github.com/sameeragarwal/blinkdb">BlinkDB</a> - a massively parallel, approximate query engine built -on top of Shark and Spark -- <a href="https://github.com/adobe-research/spindle">Spindle</a> - Spark/Parquet-based web -analytics query engine -- <a href="https://github.com/thunderain-project/thunderain">Thunderain</a> - a framework -for combining stream processing with historical data, think Lambda architecture -- <a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda architecture on Apache Spark, -Apache Kafka for real-time large scale machine learning - <a href="https://github.com/bigdatagenomics/adam">ADAM</a> - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark - <a href="https://github.com/salesforce/TransmogrifAI">TransmogrifAI</a> - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning @@ -67,7 +68,6 @@ transforming, 
and analyzing genomic data using Apache Spark <h2>Performance, monitoring, and debugging tools for Spark</h2> -- <a href="https://github.com/g1thubhub/phil_stopwatch">Performance and debugging library</a> - A library to analyze Spark and PySpark applications for improving performance and finding the cause of failures - <a href="https://www.datamechanics.co/delight">Data Mechanics Delight</a> - Delight is a free, hosted, cross-platform Spark UI alternative backed by an open-source Spark agent. It features new metrics and visualizations to simplify Spark monitoring and performance tuning. <h2>Additional language bindings</h2> @@ -78,13 +78,8 @@ transforming, and analyzing genomic data using Apache Spark <h3>Clojure</h3> -- <a href="https://github.com/TheClimateCorporation/clj-spark">clj-spark</a> - <a href="https://github.com/zero-one-group/geni">Geni</a> - A Clojure dataframe library that runs on Apache Spark with a focus on optimizing the REPL experience. -<h3>Groovy</h3> - -- <a href="https://github.com/bunions1/groovy-spark-example">groovy-spark-example</a> - <h3>Julia</h3> - <a href="https://github.com/dfdx/Spark.jl">Spark.jl</a> @@ -92,3 +87,9 @@ transforming, and analyzing genomic data using Apache Spark <h3>Kotlin</h3> - <a href="https://github.com/JetBrains/kotlin-spark-api">Kotlin for Apache Spark</a> + +## Adding new projects + +To add a project, open a pull request against the [spark-website](https://github.com/apache/spark-website) repository. Add an entry to [this markdown file](https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md), then run `jekyll build` to generate the HTML too. Include both in your pull request. See the README in this repo for more information. + +Note that all project and product names should follow [trademark guidelines](/trademarks.html). 
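The contribution workflow described in the "Adding new projects" section of this diff can be sketched as a shell session. This is an illustrative outline, not part of the commit: the branch name, editor variable, and commit message below are made up, and `jekyll build` assumes a working local Jekyll installation as described in the spark-website repository README.

```shell
# Clone the website repository (URL from the page above; contributors
# would typically clone their own fork instead)
git clone https://github.com/apache/spark-website.git
cd spark-website

# Add the new project entry to the markdown source
$EDITOR third-party-projects.md

# Regenerate the HTML so site/third-party-projects.html stays in sync
# (assumes Jekyll is installed locally; see the repo README for setup)
jekyll build

# Include both the markdown source and the generated HTML in the PR,
# as the page instructs
git checkout -b add-my-project        # illustrative branch name
git add third-party-projects.md site/third-party-projects.html
git commit -m "docs: add <project> to third party projects"
```

Committing both files together, as this commit itself does, keeps the published `asf-site` HTML consistent with its markdown source.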