This is an automated email from the ASF dual-hosted git repository.

ckj pushed a commit to branch branch-0.6
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git
commit c4259be963a019559a4908bfbcf166cdb89699e7
Author: Kaijie Chen <[email protected]>
AuthorDate: Thu Dec 15 20:23:21 2022 +0800

    Improve README (#427)

    1. Update the introduction section in README:
       * "compute engines" is changed to "distributed compute engines".
       * "MapReduce" is changed to "Apache Hadoop MapReduce".
       * Remove the "Total lines" badge, see [OSSInsight][1] for better insight.
    2. Fix typos.

    [1]: https://ossinsight.io/analyze/apache/incubator-uniffle#lines-of-code-changed "apache/incubator-uniffle | OSSInsight"

    Improve the README.

    No.

    Preview: https://github.com/kaijchen/incubator-uniffle/tree/readme#apache-uniffle-incubating
---
 README.md | 48 +++++++++++++++++++++++++----------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index 84fd8099..c6f2a77b 100644
--- a/README.md
+++ b/README.md
@@ -17,23 +17,25 @@

 # Apache Uniffle (Incubating)

-Uniffle is a unified remote shuffle service for compute engines.
+Uniffle is a unified remote shuffle service for distributed compute engines.
 It provides the ability to aggregate and store shuffle data on remote servers,
 thus improving the performance and reliability of large jobs.
-Currently it supports [Apache Spark](https://spark.apache.org) and [MapReduce](https://hadoop.apache.org).
+Currently it supports [Apache Spark](https://spark.apache.org) and [Apache Hadoop MapReduce](https://hadoop.apache.org).

 [](https://github.com/apache/incubator-uniffle/actions/workflows/build.yml)
 [](https://codecov.io/gh/apache/incubator-uniffle)
+[](https://github.com/apache/incubator-uniffle/blob/master/LICENSE)
+[](https://github.com/apache/incubator-uniffle/releases)

 ## Architecture

-Uniffle contains coordinator cluster, shuffle server cluster and remote storage(eg, HDFS) if necessary.
+A Uniffle cluster consists of three components: a coordinator cluster, a shuffle server cluster, and an optional remote storage (e.g., HDFS).
-Coordinator will collect status of shuffle server and do the assignment for the job.
+The coordinator collects the status of shuffle servers and assigns jobs based on the configured strategy.
 Shuffle server will receive the shuffle data, merge them and write to storage.
-Depend on different situation, Uniffle supports Memory & Local, Memory & Remote Storage(eg, HDFS), Memory & Local & Remote Storage(recommendation for production environment).
+Depending on the situation, Uniffle supports Memory & Local, Memory & Remote Storage (e.g., HDFS), and Memory & Local & Remote Storage (recommended for production environments).

 ## Shuffle Process with Uniffle

@@ -50,20 +52,20 @@ Depend on different situation, Uniffle supports Memory & Local, Memory & Remote
 8. After write data, task report all blockId to shuffle server, this step is used for data validation later
 9. Store taskAttemptId in MapStatus to support Spark speculation

-* Depend on different storage type, spark task read shuffle data from shuffle server or remote storage or both of them.
+* Depending on the storage type, the Spark task will read shuffle data from the shuffle server, the remote storage, or both.

 ## Shuffle file format

-The shuffle data is stored with index file and data file. Data file has all blocks for specific partition and index file has metadata for every block.
+The shuffle data is stored as an index file and a data file. The data file has all blocks for a specific partition, and the index file has metadata for every block.

 ## Supported Spark Version

-Current support Spark 2.3.x, Spark 2.4.x, Spark3.0.x, Spark 3.1.x, Spark 3.2.x
+Currently supports Spark 2.3.x, Spark 2.4.x, Spark 3.0.x, Spark 3.1.x, Spark 3.2.x

 Note: To support dynamic allocation, the patch(which is included in client-spark/patch folder) should be applied to Spark

 ## Supported MapReduce Version

-Current support Hadoop 2.8.5's MapReduce framework.
+Currently supports the MapReduce framework of Hadoop 2.8.5

 ## Building Uniffle

 > note: currently Uniffle requires JDK 1.8 to build, adding later JDK support
 > is on our roadmap.

@@ -73,11 +75,11 @@ To build it, run:

   mvn -DskipTests clean package

-Build against profile Spark2(2.4.6)
+Build against profile Spark 2 (2.4.6)

   mvn -DskipTests clean package -Pspark2

-Build against profile Spark3(3.1.2)
+Build against profile Spark 3 (3.1.2)

   mvn -DskipTests clean package -Pspark3

@@ -108,13 +110,13 @@ rss-xxx.tgz will be generated for deployment

 ### Deploy Coordinator

 1. unzip package to RSS_HOME
-2. update RSS_HOME/bin/rss-env.sh, eg,
+2. update RSS_HOME/bin/rss-env.sh, e.g.,
 ```
   JAVA_HOME=<java_home>
   HADOOP_HOME=<hadoop home>
   XMX_SIZE="16g"
 ```
-3. update RSS_HOME/conf/coordinator.conf, eg,
+3. update RSS_HOME/conf/coordinator.conf, e.g.,
 ```
   rss.rpc.server.port 19999
   rss.jetty.http.port 19998
@@ -128,9 +130,9 @@ rss-xxx.tgz will be generated for deployment
   # config the path of excluded shuffle server
   rss.coordinator.exclude.nodes.file.path <RSS_HOME>/conf/exclude_nodes
 ```
-4. update <RSS_HOME>/conf/dynamic_client.conf, rss client will get default conf from coordinator eg,
+4. update <RSS_HOME>/conf/dynamic_client.conf, rss client will get default conf from coordinator, e.g.,
 ```
-  # MEMORY_LOCALFILE_HDFS is recommandation for production environment
+  # MEMORY_LOCALFILE_HDFS is recommended for production environment
   rss.storage.type MEMORY_LOCALFILE_HDFS
   # multiple remote storages are supported, and client will get assignment from coordinator
   rss.coordinator.remote.storage.path hdfs://cluster1/path,hdfs://cluster2/path

@@ -147,18 +149,18 @@ rss-xxx.tgz will be generated for deployment

 ### Deploy Shuffle Server

 1. unzip package to RSS_HOME
-2. update RSS_HOME/bin/rss-env.sh, eg,
+2. update RSS_HOME/bin/rss-env.sh, e.g.,
 ```
   JAVA_HOME=<java_home>
   HADOOP_HOME=<hadoop home>
   XMX_SIZE="80g"
 ```
-3. update RSS_HOME/conf/server.conf, eg,
+3. update RSS_HOME/conf/server.conf, e.g.,
 ```
   rss.rpc.server.port 19999
   rss.jetty.http.port 19998
   rss.rpc.executor.size 2000
-  # it should be configed the same as in coordinator
+  # it should be configured the same as in coordinator
   rss.storage.type MEMORY_LOCALFILE_HDFS
   rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
   # local storage path for shuffle server
@@ -176,7 +178,7 @@ rss-xxx.tgz will be generated for deployment
   rss.server.app.expired.withoutHeartbeat 120000
   # note: the default value of rss.server.flush.cold.storage.threshold.size is 64m
   # there will be no data written to DFS if set it as 100g even rss.storage.type=MEMORY_LOCALFILE_HDFS
-  # please set proper value if DFS is used, eg, 64m, 128m.
+  # please set a proper value if DFS is used, e.g., 64m, 128m.
   rss.server.flush.cold.storage.threshold.size 100g
 ```
 4. start Shuffle Server
@@ -185,13 +187,13 @@ rss-xxx.tgz will be generated for deployment
 ```

 ### Deploy Spark Client

-1. Add client jar to Spark classpath, eg, SPARK_HOME/jars/
+1. Add client jar to Spark classpath, e.g., SPARK_HOME/jars/

   The jar for Spark2 is located in <RSS_HOME>/jars/client/spark2/rss-client-XXXXX-shaded.jar

   The jar for Spark3 is located in <RSS_HOME>/jars/client/spark3/rss-client-XXXXX-shaded.jar

-2. Update Spark conf to enable Uniffle, eg,
+2. Update Spark conf to enable Uniffle, e.g.,
 ```
   spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
@@ -216,7 +218,7 @@ After apply the patch and rebuild spark, add following configuration in spark co

   The jar for MapReduce is located in <RSS_HOME>/jars/client/mr/rss-client-mr-XXXXX-shaded.jar

-2. Update MapReduce conf to enable Uniffle, eg,
+2. Update MapReduce conf to enable Uniffle, e.g.,
 ```
   -Dmapreduce.rss.coordinator.quorum=<coordinatorIp1>:19999,<coordinatorIp2>:19999
@@ -230,7 +232,7 @@ and job recovery (i.e., `yarn.app.mapreduce.am.job.recovery.enable=false`)

 ## Configuration

-The important configuration is listed as following.
+The important configuration is listed as follows.

 ### Coordinator

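An aside for readers following the deployment steps in the diff above: the `coordinator.conf` and `server.conf` examples use a plain whitespace-separated `key value` format with `#` comments. The parser below is a minimal, hypothetical sketch of reading that format (it is illustrative only and not part of Uniffle; the function name and example values are the editor's own):

```python
def parse_rss_conf(text: str) -> dict:
    """Parse the simple 'key value' format used in the README's
    coordinator.conf / server.conf examples. Lines starting with '#'
    are comments; the value is everything after the first space."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf


example = """
# it should be configured the same as in coordinator
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.rpc.server.port 19999
rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
"""

print(parse_rss_conf(example)["rss.storage.type"])  # MEMORY_LOCALFILE_HDFS
```

This mirrors the settings shown in the diff's shuffle-server example; actual value validation and defaults are handled by Uniffle itself.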