This is an automated email from the ASF dual-hosted git repository.

shaofengshi pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git
commit 8087b4b691257b5e859ddf80d1987943fa354f9b
Author: shaofengshi <shaofeng...@apache.org>
AuthorDate: Thu Jan 10 08:52:15 2019 +0800

    Update cube_spark document with KYLIN-3607
---
 website/_docs/tutorial/cube_spark.cn.md | 18 ++++++++++++++++++
 website/_docs/tutorial/cube_spark.md    | 20 +++++++++++++++++++-
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/website/_docs/tutorial/cube_spark.cn.md b/website/_docs/tutorial/cube_spark.cn.md
index 913be9f..0bc7dee 100644
--- a/website/_docs/tutorial/cube_spark.cn.md
+++ b/website/_docs/tutorial/cube_spark.cn.md
@@ -158,6 +158,24 @@ $KYLIN_HOME/spark/sbin/start-history-server.sh hdfs://sandbox.hortonworks.com:80
 Click a specific job to see its detailed runtime information, which is very helpful for troubleshooting and performance tuning.
 
+On some Hadoop releases, you may encounter the following error in the "Convert Cuboid Data to HFile" step:
+
+{% highlight Groff markup %}
+Caused by: java.lang.RuntimeException: Could not create interface org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory Is the hadoop compatibility jar on the classpath?
+	at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:73)
+	at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:31)
+	at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
+	... 15 more
+Caused by: java.util.NoSuchElementException
+	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:365)
+	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
+	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
+	at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:59)
+	... 17 more
+{% endhighlight %}
+
+The workaround is: copy `hbase-hadoop2-compat-*.jar` and `hbase-hadoop-compat-*.jar` into `$KYLIN_HOME/spark/jars` (both jar files can be found in HBase's lib folder). If you have already built the Spark assembly jar and uploaded it to HDFS, you need to re-package and re-upload it. After that, resume the failed cube job; it should now succeed. The related JIRA issue is KYLIN-3607, which will be fixed in a later release.
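For illustration, the jar-copy workaround above can be sketched as a short shell session. This is a self-contained demo in temporary directories, not the real operation: the jar version numbers are invented, and on an actual cluster `HBASE_LIB` would be HBase's lib folder and `SPARK_JARS` would be `$KYLIN_HOME/spark/jars`.

```shell
# Demo of the KYLIN-3607 workaround in a temp sandbox; on a real cluster,
# point HBASE_LIB at HBase's lib folder and SPARK_JARS at "$KYLIN_HOME/spark/jars".
# The jar file names/versions below are made up for this demo.
HBASE_LIB=$(mktemp -d)
SPARK_JARS=$(mktemp -d)
touch "$HBASE_LIB/hbase-hadoop2-compat-1.1.2.jar" \
      "$HBASE_LIB/hbase-hadoop-compat-1.1.2.jar"

# Copy both HBase compatibility jars next to Spark's own jars:
cp "$HBASE_LIB"/hbase-hadoop2-compat-*.jar "$SPARK_JARS"/
cp "$HBASE_LIB"/hbase-hadoop-compat-*.jar  "$SPARK_JARS"/

ls "$SPARK_JARS"
# If you ship a Spark assembly jar on HDFS, rebuild and re-upload it afterwards, e.g.:
#   jar cv0f spark-libs.jar -C "$KYLIN_HOME/spark/jars/" .
#   hadoop fs -put -f spark-libs.jar <your HDFS path>
```

After the copy, resuming the failed "Convert Cuboid Data to HFile" step picks up the compatibility classes from the Spark classpath.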
+
 ## Go further
 
 If you're a Kylin administrator but new to Spark, we suggest you go through the [Spark documentation](https://spark.apache.org/docs/2.1.2/), and don't forget to update the configurations accordingly. You can enable Spark's [Dynamic Resource Allocation](https://spark.apache.org/docs/2.1.2/job-scheduling.html#dynamic-resource-allocation) so that it can automatically scale up and down for different workloads. Spark's performance relies on the cluster's memory and CPU resources, and Kylin's cube build is a heavy task when a complex data model and a huge dataset are built in one pass. If your cluster resources are insufficient, the Spark executors will throw errors such as "OutOfMemory", so use it appropriately. For cubes with UHC dimensions, too many combinations (e.g., a cube with more than 12 dimensions), or memory-hungry measures (Count Distinct, Top-N), we suggest you use the MapReduce e [...]

diff --git a/website/_docs/tutorial/cube_spark.md b/website/_docs/tutorial/cube_spark.md
index 9cfe366..2ac27d7 100644
--- a/website/_docs/tutorial/cube_spark.md
+++ b/website/_docs/tutorial/cube_spark.md
@@ -29,7 +29,7 @@ To run Spark on Yarn, need specify **HADOOP_CONF_DIR** environment variable, whi
 
 ## Check Spark configuration
 
-Kylin embeds a Spark binary (v2.1.0) in $KYLIN_HOME/spark, all the Spark configurations can be managed in $KYLIN_HOME/conf/kylin.properties with prefix *"kylin.engine.spark-conf."*. These properties will be extracted and applied when runs submit Spark job; E.g, if you configure "kylin.engine.spark-conf.spark.executor.memory=4G", Kylin will use "--conf spark.executor.memory=4G" as parameter when execute "spark-submit".
+Kylin embeds a Spark binary (Spark v2.1 for Kylin 2.4 and 2.5) in $KYLIN_HOME/spark; all the Spark configurations can be managed in $KYLIN_HOME/conf/kylin.properties with the prefix *"kylin.engine.spark-conf."*. These properties are extracted and applied when Kylin submits a Spark job; e.g., if you configure "kylin.engine.spark-conf.spark.executor.memory=4G", Kylin will pass "--conf spark.executor.memory=4G" to "spark-submit".
 
 Before you run Spark cubing, we suggest you review these configurations and customize them for your cluster.
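To make the property-to-flag mapping concrete, here is a minimal shell sketch (illustrative only, not Kylin's actual implementation): it strips the `kylin.engine.spark-conf.` prefix from each matching property and emits a `--conf key=value` argument, as the paragraph above describes. The sample property values are assumptions, not recommendations.

```shell
# Illustrative only: turn "kylin.engine.spark-conf.*" properties into
# "--conf key=value" arguments for spark-submit, as the document describes.
prefix="kylin.engine.spark-conf."
args=""
while IFS='=' read -r key value; do
  case "$key" in
    "$prefix"*) args="$args --conf ${key#$prefix}=$value" ;;
  esac
done <<'EOF'
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.executor.cores=2
EOF

# prints: spark-submit --conf spark.executor.memory=4G --conf spark.executor.cores=2 ...
echo "spark-submit$args ..."
```

Properties without the prefix fall through the `case` and are ignored, which mirrors the idea that only `kylin.engine.spark-conf.*` entries reach `spark-submit`.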
Below are the recommended configurations:

@@ -152,6 +152,24 @@ In web browser, access "http://sandbox:18080" it shows the job history:
 
 Click a specific job to see its detailed runtime information, which is very helpful for troubleshooting and performance tuning.
 
+On some Hadoop releases, you may encounter the following error in the "Convert Cuboid Data to HFile" step:
+
+{% highlight Groff markup %}
+Caused by: java.lang.RuntimeException: Could not create interface org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory Is the hadoop compatibility jar on the classpath?
+	at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:73)
+	at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:31)
+	at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
+	... 15 more
+Caused by: java.util.NoSuchElementException
+	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:365)
+	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
+	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
+	at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:59)
+	... 17 more
+{% endhighlight %}
+
+The workaround is: add `hbase-hadoop2-compat-*.jar` and `hbase-hadoop-compat-*.jar` into `$KYLIN_HOME/spark/jars` (the two jar files can be found in HBase's lib folder). If you have already built the Spark assembly jar and uploaded it to HDFS, you may need to re-package and re-upload it. After that, resume the failed job; it should be successful. The related issue is KYLIN-3607, which will be fixed in a later version.
+
 ## Go further
 
 If you're a Kylin administrator but new to Spark, we suggest you go through the [Spark documentation](https://spark.apache.org/docs/2.1.0/), and don't forget to update the configurations accordingly.
You can enable Spark [Dynamic Resource Allocation](https://spark.apache.org/docs/2.1.0/job-scheduling.html#dynamic-resource-allocation) so that it can automatically scale up and down for different workloads. Spark's performance relies on the cluster's memory and CPU resources, while Kylin's cube build is a heavy task whe [...]