Author: lidong
Date: Fri May 18 03:17:10 2018
New Revision: 1831821
URL: http://svn.apache.org/viewvc?rev=1831821&view=rev
Log:
update spark doc
Modified:
kylin/site/docs23/tutorial/spark.html
kylin/site/feed.xml
Modified: kylin/site/docs23/tutorial/spark.html
URL:
http://svn.apache.org/viewvc/kylin/site/docs23/tutorial/spark.html?rev=1831821&r1=1831820&r2=1831821&view=diff
==============================================================================
--- kylin/site/docs23/tutorial/spark.html (original)
+++ kylin/site/docs23/tutorial/spark.html Fri May 18 03:17:10 2018
@@ -4372,172 +4372,64 @@
<article
class="post-content" >
<h3
id="introduction">Introduction</h3>
-<p>Kylin provides JDBC driver to query the Cube data. Spark can query SQL
databases using JDBC driver. With this, you can query Kylin's Cube from Spark
and then do the analysis.</p>
+<p>Kylin provides a JDBC driver to query the Cube data, and Spark can query
SQL databases through JDBC. With this, you can query Kylin's Cube from Spark
and then do the analysis over a very large data set.</p>
-<p>But, Kylin is an OLAP system, it is not a real database: Kylin only has
aggregated data, no raw data. If you simply load the source table into Spark as
a data frame, some operations like 'count' might be wrong if you expect to
count the raw data.</p>
+<p>But Kylin is an OLAP system, not a real database: Kylin only has
aggregated data, no raw data. If you simply load the source table into Spark as
a data frame, it may not work, as the Cube data can be very large, and some
operations like 'count' might return wrong results.</p>
-<p>Besides, the Cube data can be very huge which is different with normal
database.</p>
+<p>This document describes how to use Kylin as a data source in Apache Spark.
You need to install Kylin, build a Cube, and then put Kylin's JDBC driver
onto your Spark application's classpath.</p>
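+<p>For example, with PySpark the driver and its dependencies can be added
through SparkConf before the context is created. Below is a minimal sketch;
the jar names and versions are the ones used in this tutorial's environment,
so adjust them to yours:</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">import os
+from pyspark import SparkConf
+
+# Kylin's JDBC driver and its Jersey dependencies, located next to this script.
+jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
+wdir = os.path.dirname(os.path.realpath(__file__))
+jars_with_path = ','.join([wdir + '/' + x for x in jars])
+
+conf = SparkConf()
+# Ship the jars to the executors ...
+conf.set("spark.jars", jars_with_path)
+# ... and also put them on the driver's classpath (colon-separated).
+conf.set("spark.driver.extraClassPath", jars_with_path.replace(",", ":"))</code></pre></div>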
-<p>This document describes how to use Kylin as a data source in Apache Spark.
You need install Kylin and build a Cube as the prerequisite.</p>
+<h3 id="the-wrong-way">The wrong way</h3>
-<h3 id="the-wrong-application">The wrong application</h3>
+<p>The Python application below tries to directly load Kylin's table as a
data frame, and then expects to get the total row count with 'df.count()',
but the result is incorrect.</p>
-<p>The below Python application tries to load Kylin's table as a data frame,
and then expect to get the total row count with 'df.count()', but the
result is incorrect.</p>
-
-<div class="highlight"><pre><code class="language-groff"
data-lang="groff">#!/usr/bin/env python
-
-import os
-import sys
-import traceback
-import time
-import subprocess
-import json
-import re
-
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
-class Kap(object):
- def __init__(self):
- print 'initializing Spark context ...'
- sys.stdout.flush()
-
- conf = SparkConf()
- conf.setMaster('yarn')
- conf.setAppName('kap test')
-
- wdir = os.path.dirname(os.path.realpath(__file__))
- jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
- conf.set("spark.jars", jars_with_path)
- conf.set("spark.yarn.archive",
"hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
- conf.set("spark.driver.extraClassPath",
jars_with_path.replace(",",":"))
-
- self.sc = SparkContext(conf=conf)
- self.sqlContext = SQLContext(self.sc)
- print 'Spark context is initialized'
-
- self.df = self.sqlContext.read.format('jdbc').options(
- url='jdbc:kylin://sandbox:7070/default',
- user='ADMIN', password='KYLIN',
- dbtable='test_kylin_fact',
driver='org.apache.kylin.jdbc.Driver').load()
-
- self.df.registerTempTable("loltab")
- print self.df.count()
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+
+conf = SparkConf()
+conf.setMaster('yarn')
+conf.setAppName('Kylin jdbc example')
- def sql(self, cmd, result_tab_name='tmptable'):
- df = self.sqlContext.sql(cmd)
- if df is not None:
- df.registerTempTable(result_tab_name)
- return df
+sc = SparkContext(conf=conf)
+sqlContext = SQLContext(sc)
- def stop(self):
- self.sc.stop()
+df = sqlContext.read.format('jdbc').options(
+    url='jdbc:kylin://sandbox:7070/default',
+    user='ADMIN', password='KYLIN',
+    dbtable='kylin_sales', driver='org.apache.kylin.jdbc.Driver').load()
-kap = Kap()
-try:
- df = kap.sql(r"select count(*) from loltab")
- df.show(truncate=False)
-except:
- pass
-finally:
- kap.stop()</code></pre></div>
+print df.count()</code></pre></div>
<p>The output is:</p>
-<div class="highlight"><pre><code class="language-groff"
data-lang="groff">Spark context is initialized
-132
-+--------+
-|count(1)|
-+--------+
-|132 |
-+--------+</code></pre></div>
-
-<p>The result '132' here is not the total count of the origin table. The
reason is that, Spark sends 'select * from ' query to Kylin, Kylin
doesn't have the raw data, but will answer the query with aggregated data in
the base Cuboid. The '132' is the row number of the base Cuboid, not source
data.</p>
-
-<h3 id="the-right-code">The right code</h3>
+<div class="highlight"><pre><code class="language-groff"
data-lang="groff">132</code></pre></div>
-<p>The right behavior is to push down the aggregation to Kylin, so that the
Cube can be leveraged. Below is the correct code:</p>
+<p>The result '132' here is not the total count of the original table. The
reason is that Spark sends a 'select * ' or 'select 1 ' query to Kylin;
Kylin doesn't have the raw data, but will answer the query with the aggregated
data in the base Cuboid. The '132' is the row count of the base Cuboid,
not of the original data.</p>
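+<p>In other words, for a plain JDBC table Spark does not push the count down;
it fetches the rows Kylin returns (the base Cuboid, one pre-aggregated row per
distinct dimension combination) and counts them on the Spark side. A minimal
sketch, reusing the 'df' loaded above:</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff"># Spark fetches whatever rows Kylin answers with -- here the 132 rows
+# of the base Cuboid -- and counts them locally, so this matches df.count().
+print len(df.collect())  # 132</code></pre></div>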
-<div class="highlight"><pre><code class="language-groff"
data-lang="groff">#!/usr/bin/env python
+<h3 id="the-right-way">The right way</h3>
-import os
-import sys
-import json
+<p>The right behavior is to push down all possible aggregations to Kylin, so
that the Cube can be leveraged; the performance would be much better than
querying the source data. Below is the correct code:</p>
-os.environ["SPARK_HOME"] = "/usr/local/spark/"
-sys.path.append(os.environ["SPARK_HOME"]+"/python")
-
-from pyspark import SparkConf, SparkContext
-from pyspark.sql import SQLContext
-
-from pyspark.sql.functions import *
-from pyspark.sql.types import *
-
-jars = ["kylin-jdbc-2.3.1.jar", "jersey-client-1.9.jar", "jersey-core-1.9.jar"]
-
-
-def demo():
- # step 1: init
- print 'initializing ...',
- conf = SparkConf()
+<div class="highlight"><pre><code class="language-groff" data-lang="groff">from pyspark import SparkConf, SparkContext
+from pyspark.sql import SQLContext
+
+conf = SparkConf()
conf.setMaster('yarn')
- conf.setAppName('jdbc example')
-
- wdir = os.path.dirname(os.path.realpath(__file__))
- jars_with_path = ','.join([wdir + '/' + x for x in jars])
-
- conf.set("spark.jars", jars_with_path)
- conf.set("spark.yarn.archive",
"hdfs://sandbox.hortonworks.com:8020/kylin/spark/spark-libs.jar")
-
- conf.set("spark.driver.extraClassPath", jars_with_path.replace(",",":"))
+ conf.setAppName('Kylin jdbc example')
sc = SparkContext(conf=conf)
sql_ctx = SQLContext(sc)
- print 'done'
-
+
url='jdbc:kylin://sandbox:7070/default'
- tab_name = '(select count(*) as total from test_kylin_fact) the_alias'
+ tab_name = '(select count(*) as total from kylin_sales) the_alias'
- # step 2: initiate the sql
df = sql_ctx.read.format('jdbc').options(
url=url, user='ADMIN', password='KYLIN',
driver='org.apache.kylin.jdbc.Driver',
dbtable=tab_name).load()
- # many ways to obtain the results
- df.show()
+ df.show()</code></pre></div>
- print "df.count()", df.count() # must be 1, as there is only one row
+<p>Here is the output; the result is correct, as Spark pushes down the
aggregation to Kylin:</p>
- for record in df.toJSON().collect():
- # this loop has only one iteration
- # reach record is a string; need to be decoded to JSON
- print 'the total column: ', json.loads(record)['TOTAL']
-
- sc.stop()
-
-demo()</code></pre></div>
-
-<p>Here is the output, which is expected:</p>
-
-<div class="highlight"><pre><code class="language-groff"
data-lang="groff">initializing ... done
-+-----+
+<div class="highlight"><pre><code class="language-groff"
data-lang="groff">+-----+
|TOTAL|
+-----+
| 2000|
-+-----+
-
-df.count() 1
-the total column: 2000</code></pre></div>
++-----+</code></pre></div>
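+<p>Once the result is in the data frame, you can consume it like any other
Spark data frame. Below is a minimal sketch: it reads the single aggregated
value (the 'TOTAL' alias comes from the pushdown query above), and shows a
hypothetical group-by pushdown whose column names assume the sample
'kylin_sales' table, so adjust them to your own Cube:</p>
+
+<div class="highlight"><pre><code class="language-groff" data-lang="groff"># The pushdown query returns exactly one row; read its 'TOTAL' column.
+total = df.collect()[0]['TOTAL']
+print total  # 2000
+
+# A group-by aggregation can be pushed down the same way: wrap the query
+# in parentheses, give it an alias, and pass it as the 'dbtable' option.
+# (Hypothetical column names; adjust to your own Cube.)
+tab_name = '(select part_dt, sum(price) as gmv from kylin_sales group by part_dt) the_alias'</code></pre></div>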
<p>Thanks for the input and sample code from Shuxin Yang
([email protected]).</p>
Modified: kylin/site/feed.xml
URL:
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1831821&r1=1831820&r2=1831821&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Fri May 18 03:17:10 2018
@@ -19,8 +19,8 @@
<description>Apache Kylin Home</description>
<link>http://kylin.apache.org/</link>
<atom:link href="http://kylin.apache.org/feed.xml" rel="self"
type="application/rss+xml"/>
- <pubDate>Thu, 17 May 2018 06:59:24 -0700</pubDate>
- <lastBuildDate>Thu, 17 May 2018 06:59:24 -0700</lastBuildDate>
+ <pubDate>Thu, 17 May 2018 20:11:39 -0700</pubDate>
+ <lastBuildDate>Thu, 17 May 2018 20:11:39 -0700</lastBuildDate>
<generator>Jekyll v2.5.3</generator>
<item>