Modified: kylin/site/feed.xml
URL: 
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1851725&r1=1851724&r2=1851725&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Mon Jan 21 04:20:52 2019
@@ -19,11 +19,168 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
    <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Thu, 17 Jan 2019 23:11:23 -0800</pubDate>
-    <lastBuildDate>Thu, 17 Jan 2019 23:11:23 -0800</lastBuildDate>
+    <pubDate>Sun, 20 Jan 2019 20:11:59 -0800</pubDate>
+    <lastBuildDate>Sun, 20 Jan 2019 20:11:59 -0800</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>Apache Kylin v2.6.0 Release Announcement</title>
+        <description>&lt;p&gt;The Apache Kylin community is pleased to 
announce the release of Apache Kylin v2.6.0.&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin is an open source Distributed Analytics Engine designed 
to provide SQL interface and multi-dimensional analysis (OLAP) on Big Data 
supporting extremely large datasets.&lt;/p&gt;
+
+&lt;p&gt;This is a major release after 2.5.0, with many enhancements. All of the changes can be found in the &lt;a href=&quot;https://kylin.apache.org/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt;. Here we highlight the major ones:&lt;/p&gt;
+
+&lt;h3 id=&quot;sdk-for-jdbc-sources&quot;&gt;SDK for JDBC sources&lt;/h3&gt;
+&lt;p&gt;Apache Kylin already supports several data sources, such as Amazon Redshift and SQL Server, through JDBC. &lt;br /&gt;
+To help developers handle SQL dialect differences and easily implement a new JDBC data source, Kylin provides a new data source SDK with APIs for:&lt;br /&gt;
+* Synchronizing metadata and data from the JDBC source&lt;br /&gt;
+* Building cubes from the JDBC source&lt;br /&gt;
+* Pushing queries down to the JDBC source engine when no cube matches&lt;/p&gt;
+
+&lt;p&gt;Check KYLIN-3552 for more.&lt;/p&gt;
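For illustration, the kind of dialect adapter such an SDK asks a developer to supply can be sketched in a few lines (hypothetical Python; the function and mappings are illustrative only, not part of Kylin's actual SDK, whose API is defined in KYLIN-3552):

```python
# Hypothetical sketch of a SQL-dialect converter like the one a JDBC
# data source adapter must provide. The dialect names and mappings are
# made up for the example; Kylin's real SDK is Java-based.
FUNCTION_MAP = {
    "redshift": {"INTEGER": "INT4", "SUBSTR": "SUBSTRING"},
    "sqlserver": {"SUBSTR": "SUBSTRING", "LIMIT": "TOP"},
}

def convert_function(dialect: str, name: str) -> str:
    """Map a standard function/type name to the target dialect's spelling."""
    return FUNCTION_MAP.get(dialect, {}).get(name.upper(), name.upper())
```

An unknown dialect or name simply falls through unchanged, which is the usual default for such adapters.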
+
+&lt;h3 id=&quot;memcached-as-distributed-cache&quot;&gt;Memcached as 
distributed cache&lt;/h3&gt;
+&lt;p&gt;In the past, query caches were not used efficiently in Kylin, for two reasons: an aggressive cache expiration strategy and local-only caching. &lt;br /&gt;
+Because of the aggressive expiration strategy, useful cache entries were often cleaned up unnecessarily. &lt;br /&gt;
+Because query caches were stored on local servers, they could not be shared between servers. &lt;br /&gt;
+And because of the size limit of the local cache, not all useful query results could be cached.&lt;/p&gt;
+
+&lt;p&gt;To address these shortcomings, we changed the query cache expiration strategy to signature checking and introduced Memcached as Kylin's distributed cache, so that Kylin servers can share cache entries with each other. &lt;br /&gt;
+It is also easy to add Memcached servers to scale out the distributed cache; with enough Memcached servers, we can cache as much as possible. &lt;br /&gt;
+We also introduced a segment-level query cache, which not only speeds up queries but also reduces the RPCs to HBase. &lt;br /&gt;
+The related tasks are KYLIN-2895, KYLIN-2894, KYLIN-2896, KYLIN-2897, KYLIN-2898, KYLIN-2899.&lt;/p&gt;
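The signature-checking idea can be sketched as follows (a minimal Python illustration, assuming the relevant metadata state can be summarized by a version string; this is not Kylin's actual implementation):

```python
import hashlib

def signature(metadata_version: str, sql: str) -> str:
    """Digest of the metadata state a cached result was computed under."""
    return hashlib.sha256(f"{metadata_version}:{sql}".encode()).hexdigest()

cache = {}  # sql -> (signature, result); a dict standing in for Memcached

def put(meta: str, sql: str, result):
    cache[sql] = (signature(meta, sql), result)

def get(meta: str, sql: str):
    entry = cache.get(sql)
    if entry is None:
        return None
    sig, result = entry
    if sig != signature(meta, sql):  # metadata changed: entry is stale
        del cache[sql]
        return None
    return result
```

Instead of eagerly flushing the cache on every metadata change, stale entries are detected lazily at read time and evicted only then.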
+
+&lt;h3 id=&quot;forkjoinpool-for-fast-cubing&quot;&gt;ForkJoinPool for fast 
cubing&lt;/h3&gt;
+&lt;p&gt;In the past, fast cubing used split threads, task threads, and a main thread to do the cube building, with complex join and error-handling logic.&lt;/p&gt;
+
+&lt;p&gt;The new implementation leverages the ForkJoinPool from the JDK: the event split logic is handled in the main thread, while cuboid tasks and sub-tasks are handled in the fork/join pool. Cube results are collected asynchronously and can be written to the output earlier. Check KYLIN-2932 for more.&lt;/p&gt;
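A minimal analogue of this pattern, with Python's concurrent.futures standing in for the JDK's ForkJoinPool and a toy aggregation standing in for cuboid building:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_cuboid(rows):
    """Toy stand-in for building one cuboid: aggregate a chunk of rows."""
    return sum(rows)

def fast_cube(rows, chunk=4):
    # Split logic stays in the main thread, as in KYLIN-2932.
    chunks = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    results = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(build_cuboid, c) for c in chunks]
        # Results arrive asynchronously, so a finished cuboid could be
        # written to the output before the whole build completes.
        for fut in as_completed(futures):
            results.append(fut.result())
    return sum(results)
```

The point of the pattern is that error handling collapses to the pool's future semantics instead of hand-written thread coordination.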
+
+&lt;h3 id=&quot;improve-hllcounter-performance&quot;&gt;Improve HLLCounter 
performance&lt;/h3&gt;
+&lt;p&gt;In the past, the way HLLCounter was created and the way the harmonic mean was computed were not efficient.&lt;/p&gt;
+
+&lt;p&gt;The new implementation improves HLLCounter creation by copying the registers from another HLLCounter instead of merging. To compute the harmonic mean in the HLLCSnapshot, it adds the following enhancements: &lt;br /&gt;
+* using a lookup table to cache all values of 1/2^r instead of computing them on the fly&lt;br /&gt;
+* replacing floating-point addition with integer addition in the hot loop&lt;br /&gt;
+* removing branches, e.g. no longer checking whether registers[i] is zero, although this is a minor improvement&lt;/p&gt;
+
+&lt;p&gt;Check KYLIN-3656 for more.&lt;/p&gt;
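The lookup-table idea can be sketched like this (illustrative Python; the register range is an assumption of the sketch, not Kylin's actual constant):

```python
# Cache 1/2^r for every possible register value r. HyperLogLog registers
# hold small integers (assumed here to fit well within 64), so the whole
# table is tiny and is built once instead of calling pow() per register.
INV_POW2 = [1.0 / (1 << r) for r in range(64)]

def harmonic_denominator(registers):
    """Sum of 1/2^r over all registers, via table lookup instead of pow()."""
    return sum(INV_POW2[r] for r in registers)
```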
+
+&lt;h3 id=&quot;improve-cuboid-recommendation-algorithm&quot;&gt;Improve 
Cuboid Recommendation Algorithm&lt;/h3&gt;
+&lt;p&gt;In the past, to add cuboids that were not prebuilt, the cube planner relied on mandatory cuboids, which were selected when their rollup row count exceeded some threshold. &lt;br /&gt;
+This had two shortcomings:&lt;br /&gt;
+* The way the rollup row count was estimated was not good&lt;br /&gt;
+* It was hard to determine the rollup row count threshold for recommending mandatory cuboids&lt;/p&gt;
+
+&lt;p&gt;The new implementation improves the row count estimation of un-prebuilt cuboids by using a rollup ratio rather than an exact rollup row count. &lt;br /&gt;
+With better estimated row counts for un-prebuilt cuboids, the cost-based cube planner algorithm decides which cuboids to build, and the threshold for mandatory cuboids is no longer needed. &lt;br /&gt;
+With this improvement, mandatory cuboids can only be set manually and are no longer recommended automatically. Check KYLIN-3540 for more.&lt;/p&gt;
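The rollup-ratio estimate can be illustrated as follows (hypothetical Python; the function name and the numbers are made up for the example, not taken from KYLIN-3540):

```python
def estimate_child_rows(parent_rows: float, rollup_ratio: float) -> float:
    """Estimate an un-prebuilt cuboid's row count from a built parent.

    rollup_ratio is the observed fraction of rows that survive rolling
    up a dimension on comparable cuboids (an assumption of this sketch).
    """
    return parent_rows * rollup_ratio

# A built parent cuboid with 1,000,000 rows and an observed rollup ratio
# of 0.25 yields an estimated 250,000 rows for the un-prebuilt child.
```

A ratio generalizes across cuboids far better than an absolute rollup row count, which is why a static threshold is no longer needed.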
+
+&lt;p&gt;&lt;strong&gt;Download&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To download Apache Kylin v2.6.0 source code or binary package, visit 
the &lt;a 
href=&quot;http://kylin.apache.org/download&quot;&gt;download&lt;/a&gt; 
page.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Upgrade&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Follow the &lt;a 
href=&quot;/docs/howto/howto_upgrade.html&quot;&gt;upgrade 
guide&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Feedback&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;If you have any issues or questions, please send mail to the Apache Kylin dev or user mailing list: [email protected], [email protected]; before sending, please make sure you have subscribed to the mailing list by dropping an email to [email protected] or [email protected].&lt;/p&gt;
+
+&lt;p&gt;&lt;em&gt;Great thanks to everyone who 
contributed!&lt;/em&gt;&lt;/p&gt;
+</description>
+        <pubDate>Fri, 18 Jan 2019 12:00:00 -0800</pubDate>
+        <link>http://kylin.apache.org/blog/2019/01/18/release-v2.6.0/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2019/01/18/release-v2.6.0/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
+        <title>Apache Kylin v2.6.0 Official Release</title>
+        <description>&lt;p&gt;The Apache Kylin community is pleased to announce the official release of Apache Kylin v2.6.0.&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on extremely large datasets.&lt;/p&gt;
+
+&lt;p&gt;This is a feature release following 2.5.0 that introduces many valuable improvements. For the complete list of changes, see the &lt;a href=&quot;https://kylin.apache.org/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt;; the highlights are described here:&lt;/p&gt;
+
+&lt;h3 id=&quot;jdbcsdk&quot;&gt;SDK for JDBC data sources&lt;/h3&gt;
+&lt;p&gt;Kylin already supports multiple data sources through JDBC, including Amazon Redshift and SQL Server.&lt;br /&gt;
+To help developers handle the differences between SQL dialects and develop new JDBC data sources more easily, Kylin provides a corresponding SDK with a unified API for:&lt;br /&gt;
+* Synchronizing metadata and data&lt;br /&gt;
+* Building cubes&lt;br /&gt;
+* Pushing a query down to the data source when no cube can answer it&lt;/p&gt;
+
+&lt;p&gt;See KYLIN-3552 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;memcached--kylin-&quot;&gt;Memcached as Kylin's distributed cache&lt;/h3&gt;
+&lt;p&gt;In the past, Kylin's caching of query results was not very efficient, mainly for two reasons:&lt;br /&gt;
+First, whenever Kylin's metadata changed, it blindly evicted large numbers of cache entries, so the cache was refreshed frequently and its utilization dropped.&lt;br /&gt;
+Second, because only local caches were used, Kylin servers could not share cache entries with each other, which lowered the query cache hit rate.&lt;br /&gt;
+A further drawback of the local cache is that its size is limited and cannot scale horizontally the way a distributed cache can, which limits how many query results can be cached.&lt;/p&gt;
+
+&lt;p&gt;To address these flaws, we changed the cache invalidation mechanism: instead of actively evicting entries, we adopt the following scheme:&lt;br /&gt;
+1. Before a query result is put into the cache, a digital signature is computed from the current metadata and stored together with the result&lt;br /&gt;
+2. After a query result is fetched from the cache, a signature is computed from the current metadata and compared with the stored one; if they match, the cache entry is valid, otherwise it is invalid and deleted&lt;/p&gt;
+
+&lt;p&gt;We also introduced Memcached as Kylin's distributed cache, so Kylin servers can share the query result cache. Because Memcached servers are independent of one another, they are easy to scale out horizontally, which helps cache more data.&lt;br /&gt;
+The related tasks are KYLIN-2895, KYLIN-2894, KYLIN-2896, KYLIN-2897, KYLIN-2898, KYLIN-2899.&lt;/p&gt;
+
+&lt;h3 id=&quot;forkjoinpool--fast-cubing-&quot;&gt;ForkJoinPool simplifies the fast cubing thread model&lt;/h3&gt;
+&lt;p&gt;In the past, fast cubing used a set of Kylin-defined threads, such as split threads, task threads, and a main thread, to build cubes concurrently.&lt;br /&gt;
+In that thread model, the relationships between threads were very complex, and exception handling was error-prone.&lt;/p&gt;
+
+&lt;p&gt;We have now introduced ForkJoinPool: the split logic is handled in the main thread, while cuboid build tasks and their sub-tasks run in the fork/join pool. Cuboid build results can be collected asynchronously and output earlier to the downstream merge step. See KYLIN-2932 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;hllcounter-&quot;&gt;Improved HLLCounter performance&lt;/h3&gt;
+&lt;p&gt;For HLLCounter, we made improvements in two areas: how an HLLCounter is constructed and how the harmonic mean is computed.&lt;br /&gt;
+1. For construction, instead of merging, we now directly copy the registers from another HLLCounter&lt;br /&gt;
+2. For computing the harmonic mean in HLLCSnapshot, three improvements were made:&lt;br /&gt;
+* cache all values of 1/2^r&lt;br /&gt;
+* use integer addition instead of floating-point addition&lt;br /&gt;
+* remove conditional branches, e.g. no need to check whether registers[i] is zero&lt;/p&gt;
+
+&lt;p&gt;See KYLIN-3656 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;cube-planner-&quot;&gt;Improved Cube Planner algorithm&lt;/h3&gt;
+&lt;p&gt;In the past, phase two of the cube planner could only add cuboids that had not been precomputed by marking them as mandatory cuboids. A cuboid became mandatory in one of two ways: it was set manually, or its rollup row count at query time was large enough. Judging by the rollup row count had two major flaws:&lt;br /&gt;
+* The algorithm for estimating the rollup row count was not very good&lt;br /&gt;
+* It was hard to set a static threshold for the decision&lt;/p&gt;
+
+&lt;p&gt;We no longer look at the problem from the angle of rollup row counts. Everything is now viewed in terms of cuboid row counts, which unifies it with the cost-based cube planner algorithm.&lt;br /&gt;
+To this end, we improved the row count estimation of unbuilt cuboids by using rollup ratios, and let the cost-based cube planner algorithm decide which unbuilt cuboids should be built and which should be discarded.&lt;br /&gt;
+With this improvement, there is no need to set a static threshold to recommend mandatory cuboids; mandatory cuboids can only be set manually and are no longer recommended. See KYLIN-3540 for more.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Download&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To download the Apache Kylin v2.6.0 source code or binary package, please visit the &lt;a href=&quot;http://kylin.apache.org/download&quot;&gt;download page&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Upgrade&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Follow the &lt;a href=&quot;/docs/howto/howto_upgrade.html&quot;&gt;upgrade guide&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Feedback&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;If you have any issues or questions, please send mail to the Apache Kylin dev or user mailing list: [email protected], [email protected]; before sending, please make sure you have subscribed to the mailing list by sending an email to [email protected] or [email protected].&lt;/p&gt;
+
+&lt;p&gt;&lt;em&gt;Many thanks to everyone who contributed to Apache Kylin!&lt;/em&gt;&lt;/p&gt;
+</description>
+        <pubDate>Fri, 18 Jan 2019 12:00:00 -0800</pubDate>
+        <link>http://kylin.apache.org/cn/blog/2019/01/18/release-v2.6.0/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/cn/blog/2019/01/18/release-v2.6.0/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>How Cisco&#39;s Big Data Team Improved the High Concurrent 
Throughput of Apache Kylin by 5x</title>
         <description>&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;
 
@@ -545,70 +702,6 @@ Graphic 10 Process of Querying Cube&lt;/
       </item>
     
       <item>
-        <title>Apache Kylin v2.5.0 Official Release</title>
-        <description>&lt;p&gt;The Apache Kylin community is pleased to announce the official release of Apache Kylin v2.5.0.&lt;/p&gt;
-
-&lt;p&gt;Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on extremely large datasets.&lt;/p&gt;
-
-&lt;p&gt;This is a feature release following 2.4.0 that introduces many valuable improvements. For the complete list of changes, see the &lt;a href=&quot;https://kylin.apache.org/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt;; the highlights are described here:&lt;/p&gt;
-
-&lt;h3 id=&quot;all-in-spark--cubing-&quot;&gt;All-in-Spark cubing engine&lt;/h3&gt;
-&lt;p&gt;Kylin's Spark engine now runs all the distributed jobs of the cube computation with Spark, including fetching the distinct values of each dimension, converting cuboid files to HBase HFiles, merging segments, merging dictionaries, and so on. The default Spark configuration has also been tuned so that users get an out-of-the-box experience. The related tasks are KYLIN-3427, KYLIN-3441, KYLIN-3442.&lt;/p&gt;
-
-&lt;p&gt;Spark job management is also improved: once a Spark job starts, you can get the job link in the web console; if you discard the job, Kylin terminates the Spark job immediately to release resources in time; if Kylin is restarted, it can resume from the previous job instead of submitting a new one.&lt;/p&gt;
-
-&lt;h3 id=&quot;mysql--kylin-&quot;&gt;MySQL as Kylin's metadata store&lt;/h3&gt;
-&lt;p&gt;In the past, HBase was the only option for Kylin's metadata store. HBase does not fit in some cases, for example when multiple HBase clusters provide cross-region high availability for Kylin: the replicated HBase cluster is read-only, so it cannot serve as the metadata store. We have now introduced a MySQL metastore to meet this need. This feature is currently in beta. See KYLIN-3488 for more.&lt;/p&gt;
-
-&lt;h3 id=&quot;hybrid-model-&quot;&gt;Graphical interface for the Hybrid model&lt;/h3&gt;
-&lt;p&gt;Hybrid is an advanced model for assembling multiple cubes. It can be used when a cube's schema needs to change. This feature had no graphical interface in the past, so only a small number of users knew about it. It is now enabled in the web interface so that more users can try it.&lt;/p&gt;
-
-&lt;h3 id=&quot;cube-planner&quot;&gt;Cube planner enabled by default&lt;/h3&gt;
-&lt;p&gt;Cube planner can greatly optimize the cube structure and reduce the number of cuboids built, saving computation/storage resources and improving query performance. It was introduced in v2.3 but was not enabled by default. To let more users see and try it, it is enabled by default in v2.5. The algorithm automatically optimizes the cuboid set based on data statistics when the first segment is built.&lt;/p&gt;
-
-&lt;h3 id=&quot;segment-&quot;&gt;Improved segment pruning&lt;/h3&gt;
-&lt;p&gt;Segment (partition) pruning can effectively reduce disk and network I/O and therefore greatly improve query performance. In the past, Kylin pruned segments only by the value of the partition date column; if a query did not filter on the partition column, pruning did not work and all segments were scanned.&lt;br /&gt;
-Starting with v2.5, Kylin records the minimum/maximum value of each dimension at the segment level. Before scanning a segment, the query's conditions are compared with the min/max index; if they do not match, the segment is skipped. See KYLIN-3370 for more.&lt;/p&gt;
-
-&lt;h3 id=&quot;yarn-&quot;&gt;Merge dictionaries on YARN&lt;/h3&gt;
-&lt;p&gt;When segments are merged, their dictionaries need to be merged too. In the past, dictionary merging happened inside Kylin's JVM, which consumed a lot of local memory and CPU. In extreme cases (with several concurrent jobs) it could crash the Kylin process, so some users had to allocate more memory to the Kylin job node, or run multiple job nodes to balance the workload.&lt;br /&gt;
-Starting with v2.5, Kylin submits this task to Hadoop MapReduce or Spark, which resolves the bottleneck. See KYLIN-3471 for more.&lt;/p&gt;
-
-&lt;h3 id=&quot;cube-&quot;&gt;Improved cube build performance with the global dictionary&lt;/h3&gt;
-&lt;p&gt;The global dictionary (GD) is a prerequisite for bitmap-based precise distinct counting. If the column being counted has very high cardinality, the GD can be very large. During the cube build, Kylin needs to convert non-integer values to integers through the GD. Although the GD is split into slices that can be loaded into memory separately, the column values arrive out of order, so Kylin has to swap slices in and out repeatedly, which makes the build job very slow.&lt;br /&gt;
-This enhancement introduces a new step that builds a shrunken dictionary from the global dictionary for each data block. Each task then only needs to load the shrunken dictionary, avoiding the frequent swapping in and out. The build can be 3x faster than before. See KYLIN-3491 for more.&lt;/p&gt;
-
-&lt;h3 id=&quot;topn-count-distinct--cube-&quot;&gt;Improved size estimation for cubes with TOPN and COUNT DISTINCT&lt;/h3&gt;
-&lt;p&gt;A cube's size is estimated before it is built and is used by several subsequent steps, such as deciding the number of partitions for the MR/Spark jobs and calculating HBase region splits, so its accuracy has a big impact on build performance. When COUNT DISTINCT or TOPN measures are present, the estimate can deviate significantly from the real size because those measures are variable in size. In the past, users had to tune several parameters to bring the estimate closer to the actual size, which was difficult for ordinary users.&lt;br /&gt;
-Now Kylin automatically adjusts the size estimate based on the collected statistics, which brings the estimate much closer to the actual size. See KYLIN-3453 for more.&lt;/p&gt;
-
-&lt;h3 id=&quot;hadoop-30hbase-20&quot;&gt;Support for Hadoop 3.0/HBase 2.0&lt;/h3&gt;
-&lt;p&gt;Hadoop 3 and HBase 2 are being adopted by many users. Kylin now provides a new binary package compiled with the new Hadoop and HBase APIs. We have tested it on Hortonworks HDP 3.0 and Cloudera CDH 6.0.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Download&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;To download the Apache Kylin v2.5.0 source code or binary package, please visit the &lt;a href=&quot;http://kylin.apache.org/download&quot;&gt;download page&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Upgrade&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;Follow the &lt;a href=&quot;/docs/howto/howto_upgrade.html&quot;&gt;upgrade guide&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Feedback&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;If you have any issues or questions, please send mail to the Apache Kylin dev or user mailing list: [email protected], [email protected]; before sending, please make sure you have subscribed to the mailing list by sending an email to [email protected] or [email protected].&lt;/p&gt;
-
-&lt;p&gt;&lt;em&gt;Many thanks to everyone who contributed to Apache Kylin!&lt;/em&gt;&lt;/p&gt;
-</description>
-        <pubDate>Thu, 20 Sep 2018 13:00:00 -0700</pubDate>
-        <link>http://kylin.apache.org/cn/blog/2018/09/20/release-v2.5.0/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/cn/blog/2018/09/20/release-v2.5.0/</guid>
-        
-        
-        <category>blog</category>
-        
-      </item>
-    
-      <item>
         <title>Apache Kylin v2.5.0 Release Announcement</title>
         <description>&lt;p&gt;The Apache Kylin community is pleased to 
announce the release of Apache Kylin v2.5.0.&lt;/p&gt;
 
@@ -681,6 +774,70 @@ Graphic 10 Process of Querying Cube&lt;/
       </item>
     
       <item>
+        <title>Apache Kylin v2.5.0 Official Release</title>
+        <description>&lt;p&gt;The Apache Kylin community is pleased to announce the official release of Apache Kylin v2.5.0.&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on extremely large datasets.&lt;/p&gt;
+
+&lt;p&gt;This is a feature release following 2.4.0 that introduces many valuable improvements. For the complete list of changes, see the &lt;a href=&quot;https://kylin.apache.org/docs/release_notes.html&quot;&gt;release notes&lt;/a&gt;; the highlights are described here:&lt;/p&gt;
+
+&lt;h3 id=&quot;all-in-spark--cubing-&quot;&gt;All-in-Spark cubing engine&lt;/h3&gt;
+&lt;p&gt;Kylin's Spark engine now runs all the distributed jobs of the cube computation with Spark, including fetching the distinct values of each dimension, converting cuboid files to HBase HFiles, merging segments, merging dictionaries, and so on. The default Spark configuration has also been tuned so that users get an out-of-the-box experience. The related tasks are KYLIN-3427, KYLIN-3441, KYLIN-3442.&lt;/p&gt;
+
+&lt;p&gt;Spark job management is also improved: once a Spark job starts, you can get the job link in the web console; if you discard the job, Kylin terminates the Spark job immediately to release resources in time; if Kylin is restarted, it can resume from the previous job instead of submitting a new one.&lt;/p&gt;
+
+&lt;h3 id=&quot;mysql--kylin-&quot;&gt;MySQL as Kylin's metadata store&lt;/h3&gt;
+&lt;p&gt;In the past, HBase was the only option for Kylin's metadata store. HBase does not fit in some cases, for example when multiple HBase clusters provide cross-region high availability for Kylin: the replicated HBase cluster is read-only, so it cannot serve as the metadata store. We have now introduced a MySQL metastore to meet this need. This feature is currently in beta. See KYLIN-3488 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;hybrid-model-&quot;&gt;Graphical interface for the Hybrid model&lt;/h3&gt;
+&lt;p&gt;Hybrid is an advanced model for assembling multiple cubes. It can be used when a cube's schema needs to change. This feature had no graphical interface in the past, so only a small number of users knew about it. It is now enabled in the web interface so that more users can try it.&lt;/p&gt;
+
+&lt;h3 id=&quot;cube-planner&quot;&gt;Cube planner enabled by default&lt;/h3&gt;
+&lt;p&gt;Cube planner can greatly optimize the cube structure and reduce the number of cuboids built, saving computation/storage resources and improving query performance. It was introduced in v2.3 but was not enabled by default. To let more users see and try it, it is enabled by default in v2.5. The algorithm automatically optimizes the cuboid set based on data statistics when the first segment is built.&lt;/p&gt;
+
+&lt;h3 id=&quot;segment-&quot;&gt;Improved segment pruning&lt;/h3&gt;
+&lt;p&gt;Segment (partition) pruning can effectively reduce disk and network I/O and therefore greatly improve query performance. In the past, Kylin pruned segments only by the value of the partition date column; if a query did not filter on the partition column, pruning did not work and all segments were scanned.&lt;br /&gt;
+Starting with v2.5, Kylin records the minimum/maximum value of each dimension at the segment level. Before scanning a segment, the query's conditions are compared with the min/max index; if they do not match, the segment is skipped. See KYLIN-3370 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;yarn-&quot;&gt;Merge dictionaries on YARN&lt;/h3&gt;
+&lt;p&gt;When segments are merged, their dictionaries need to be merged too. In the past, dictionary merging happened inside Kylin's JVM, which consumed a lot of local memory and CPU. In extreme cases (with several concurrent jobs) it could crash the Kylin process, so some users had to allocate more memory to the Kylin job node, or run multiple job nodes to balance the workload.&lt;br /&gt;
+Starting with v2.5, Kylin submits this task to Hadoop MapReduce or Spark, which resolves the bottleneck. See KYLIN-3471 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;cube-&quot;&gt;Improved cube build performance with the global dictionary&lt;/h3&gt;
+&lt;p&gt;The global dictionary (GD) is a prerequisite for bitmap-based precise distinct counting. If the column being counted has very high cardinality, the GD can be very large. During the cube build, Kylin needs to convert non-integer values to integers through the GD. Although the GD is split into slices that can be loaded into memory separately, the column values arrive out of order, so Kylin has to swap slices in and out repeatedly, which makes the build job very slow.&lt;br /&gt;
+This enhancement introduces a new step that builds a shrunken dictionary from the global dictionary for each data block. Each task then only needs to load the shrunken dictionary, avoiding the frequent swapping in and out. The build can be 3x faster than before. See KYLIN-3491 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;topn-count-distinct--cube-&quot;&gt;Improved size estimation for cubes with TOPN and COUNT DISTINCT&lt;/h3&gt;
+&lt;p&gt;A cube's size is estimated before it is built and is used by several subsequent steps, such as deciding the number of partitions for the MR/Spark jobs and calculating HBase region splits, so its accuracy has a big impact on build performance. When COUNT DISTINCT or TOPN measures are present, the estimate can deviate significantly from the real size because those measures are variable in size. In the past, users had to tune several parameters to bring the estimate closer to the actual size, which was difficult for ordinary users.&lt;br /&gt;
+Now Kylin automatically adjusts the size estimate based on the collected statistics, which brings the estimate much closer to the actual size. See KYLIN-3453 for more.&lt;/p&gt;
+
+&lt;h3 id=&quot;hadoop-30hbase-20&quot;&gt;Support for Hadoop 3.0/HBase 2.0&lt;/h3&gt;
+&lt;p&gt;Hadoop 3 and HBase 2 are being adopted by many users. Kylin now provides a new binary package compiled with the new Hadoop and HBase APIs. We have tested it on Hortonworks HDP 3.0 and Cloudera CDH 6.0.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Download&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;To download the Apache Kylin v2.5.0 source code or binary package, please visit the &lt;a href=&quot;http://kylin.apache.org/download&quot;&gt;download page&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Upgrade&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Follow the &lt;a href=&quot;/docs/howto/howto_upgrade.html&quot;&gt;upgrade guide&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Feedback&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;If you have any issues or questions, please send mail to the Apache Kylin dev or user mailing list: [email protected], [email protected]; before sending, please make sure you have subscribed to the mailing list by sending an email to [email protected] or [email protected].&lt;/p&gt;
+
+&lt;p&gt;&lt;em&gt;Many thanks to everyone who contributed to Apache Kylin!&lt;/em&gt;&lt;/p&gt;
+</description>
+        <pubDate>Thu, 20 Sep 2018 13:00:00 -0700</pubDate>
+        <link>http://kylin.apache.org/cn/blog/2018/09/20/release-v2.5.0/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/cn/blog/2018/09/20/release-v2.5.0/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Use Star Schema Benchmark for Apache Kylin</title>
         <description>&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;
 
@@ -1011,432 +1168,6 @@ send mail to Apache Kylin dev mailing li
         
         
         <category>blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Get Your Interactive Analytics Superpower, with Apache Kylin 
and Apache Superset</title>
-        <description>&lt;h2 id=&quot;challenge-of-big-data&quot;&gt;Challenge 
of Big Data&lt;/h2&gt;
-
-&lt;p&gt;In the big data era, all enterprises face the growing demand and challenge of processing large volumes of data, workloads that traditional legacy systems can no longer satisfy. With the emergence of Artificial Intelligence (AI) and Internet-of-Things (IoT) technology, it has become mission-critical for businesses to accelerate the discovery of valuable insights from their massive and ever-growing datasets. Large companies are therefore constantly searching for a solution, often turning to open source technologies. We will introduce two open source technologies that, combined, can meet these pressing big data demands for large enterprises.&lt;/p&gt;
-
-&lt;h2 
id=&quot;apache-kylin-a-leading-opensource-olap-on-hadoop&quot;&gt;Apache 
Kylin: a Leading OpenSource OLAP-on-Hadoop&lt;/h2&gt;
-&lt;p&gt;Modern organizations have had a long history of applying Online 
Analytical Processing (OLAP) technology to analyze data and uncover business 
insights. These insights help businesses make informed decisions and improve 
their service and product. With the emergence of the Hadoop ecosystem, OLAP has 
also embraced new technologies in the big data era.&lt;/p&gt;
-
-&lt;p&gt;Apache Kylin is one such technology that directly addresses the challenge of conducting analytical workloads on massive datasets. It is already widely adopted by enterprises around the world. With powerful pre-calculation technology, Apache Kylin enables sub-second query latency over petabyte-scale datasets. The innovative and intricate design of Apache Kylin allows it to seamlessly consume data from any Hadoop-based data source, as well as other relational database management systems (RDBMS). Analysts can query Apache Kylin with standard SQL through ODBC, JDBC, and the REST API, which enables the platform to integrate with any third-party application.&lt;br /&gt;
-&lt;img src=&quot;/images/Kylin-and-Superset/png/1. kylin_diagram.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 1: Apache Kylin Architecture&lt;/p&gt;
-
-&lt;p&gt;In a fast-paced and rapidly-changing business environment, business users and analysts are expected to uncover insights at the speed of thought. They can meet this expectation with Apache Kylin, and are no longer subject to the predicament of waiting hours for a single query to return results. Such a powerful data processing engine empowers the data scientists, engineers, and business analysts of any enterprise to find the insights behind critical business decisions. However, business decisions cannot be made without rich data visualization. To address this last-mile challenge of big data analytics, Apache Superset comes into the picture.&lt;/p&gt;
-
-&lt;h2 
id=&quot;apache-superset-modern-enterprise-ready-business-intelligence-platform&quot;&gt;Apache
 Superset: Modern, Enterprise-ready Business Intelligence Platform&lt;/h2&gt;
-
-&lt;p&gt;Apache Superset is a data exploration and visualization platform 
designed to be visual, intuitive, and interactive. A user can access data in 
the following two ways:&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;
-    &lt;p&gt;Access data from the following commonly used data sources one 
table at a time: Kylin, Presto, Hive, Impala, SparkSQL, MySQL, Postgres, 
Oracle, Redshift, SQL Server, Druid.&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Use a rich SQL Interactive Development Environment (IDE) called 
SQL Lab that is designed for power users with the ability to write SQL queries 
to analyze multiple tables.&lt;/p&gt;
-  &lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;Users can immediately analyze and visualize their query results using Apache Superset's rich visualization and reporting features.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/2. 
superset_logo.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 2&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/3. 
Superset_screen_shot.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 3: Apache Superset Visualization Interface&lt;/p&gt;
-
-&lt;h2 
id=&quot;integrating-apache-kylin-and-apache-superset-to-boost-your-productivity&quot;&gt;Integrating
 Apache Kylin and Apache Superset to Boost Your Productivity&lt;/h2&gt;
-
-&lt;p&gt;Both Apache Kylin and Apache Superset are built to provide fast and 
interactive analytics for their users. The combination of these two open source 
projects can bring that goal to reality on petabyte-scale datasets, thanks to 
pre-calculated Kylin Cube.&lt;/p&gt;
-
-&lt;p&gt;The Kyligence Data Science team has recently open sourced kylinpy, a 
project that makes this combination possible. Kylinpy is a Python-based Apache 
Kylin client library. Any application that uses SQLAlchemy can now query Apache 
Kylin with this library installed, specifically Apache Superset. Below is a 
brief tutorial that shows how to integrate Apache Kylin and Apache 
Superset.&lt;/p&gt;
-
-&lt;h2 id=&quot;prerequisite&quot;&gt;Prerequisite&lt;/h2&gt;
-&lt;ol&gt;
-  &lt;li&gt;Install Apache Kylin&lt;br /&gt;
-Please refer to this installation tutorial.&lt;/li&gt;
-  &lt;li&gt;Apache Kylin provides a script for you to create a sample Cube. 
After you successfully installed Apache Kylin, you can run the below script 
under Apache Kylin installation directory to generate sample project and Cube. 
&lt;br /&gt;
-  ./${KYLIN_HOME}/bin/sample.sh&lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;When the script finishes running, log onto the Apache Kylin web UI with the default user ADMIN/KYLIN; on the System page click &quot;Reload Metadata&quot;, and you will see a sample project called &quot;Learn Kylin&quot;.&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;Select the sample cube &quot;kylin_sales_cube&quot;, click &quot;Actions&quot; -&amp;gt; &quot;Build&quot;, and pick a date later than 2014-01-01 (to cover all 10000 sample records);&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/4. 
build_cube.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 4: Build Cube in Apache Kylin&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Check the build progress in the &quot;Monitor&quot; tab until it reaches 100%;&lt;/li&gt;
-  &lt;li&gt;Execute SQL in the &quot;Insight&quot; tab, for example:&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;  select part_dt,
-         sum(price) as total_selled,
-         count(distinct seller_id) as sellers
-  from kylin_sales
-  group by part_dt
-  order by part_dt
--- This query will hit the newly built Cube &quot;Kylin_sales_cube&quot;.
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Next, we will install Apache Superset and initialize it.&lt;br /&gt;
-  You may refer to the official Apache Superset website instructions for installation and initialization.&lt;/li&gt;
-  &lt;li&gt;Install kylinpy&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;   $ pip install kylinpy
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Verify your installation; if everything goes well, the Apache Superset daemon should be up and running.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;$ superset runserver -d
-Starting server with command:
-gunicorn -w 2 --timeout 60 -b  0.0.0.0:8088 --limit-request-line 0 
--limit-request-field_size 0 superset:app
-
-[2018-01-03 15:54:03 +0800] [73673] [INFO] Starting gunicorn 19.7.1
-[2018-01-03 15:54:03 +0800] [73673] [INFO] Listening at: http://0.0.0.0:8088 
(73673)
-[2018-01-03 15:54:03 +0800] [73673] [INFO] Using worker: sync
-[2018-01-03 15:54:03 +0800] [73676] [INFO] Booting worker with pid: 73676
-[2018-01-03 15:54:03 +0800] [73679] [INFO] Booting worker with pid: 73679
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-
-&lt;h2 id=&quot;connect-apache-kylin-from-apachesuperset&quot;&gt;Connect to Apache Kylin from Apache Superset&lt;/h2&gt;
-
-&lt;p&gt;Now everything you need is installed and ready to go. Let's try to create an Apache Kylin data source in Apache Superset.&lt;br /&gt;
-1. Open http://localhost:8088 in your web browser and log in with the credentials you set during Apache Superset installation.&lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/5. superset_1.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 5: Apache Superset Login Page&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Go to Source -&amp;gt; Datasource to configure a new data source.
-    &lt;ul&gt;
-      &lt;li&gt;The SQLAlchemy URI pattern is: kylin://&amp;lt;username&amp;gt;:&amp;lt;password&amp;gt;@&amp;lt;hostname&amp;gt;:&amp;lt;port&amp;gt;/&amp;lt;project name&amp;gt;&lt;/li&gt;
-      &lt;li&gt;Check “Expose in SQL Lab” if you want to expose this data 
source in SQL Lab.&lt;/li&gt;
-      &lt;li&gt;Click “Test Connection” to see if the URI is working 
properly.&lt;/li&gt;
-    &lt;/ul&gt;
-  &lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/6. 
superset_2.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 6: Create an Apache Kylin data source&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/7. 
superset_3.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 7: Test Connection to Apache Kylin&lt;/p&gt;
-
-&lt;p&gt;If the connection to Apache Kylin is successful, you will see all the 
tables from Learn_kylin project show up at the bottom of the connection 
page.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/8. 
superset_4.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 8: Tables will show up if connection is successful&lt;/p&gt;
-
-&lt;h3 id=&quot;query-kylin-table&quot;&gt;Query Kylin Table&lt;/h3&gt;
-&lt;ol&gt;
-  &lt;li&gt;Go to Source -&amp;gt; Tables to add a new table; type in a table name from the &quot;Learn_kylin&quot; project, for example, &quot;Kylin_sales&quot;.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/9. 
superset_5.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
Figure 9: Add a Kylin table in Apache Superset&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Click on the table you created. Now you are ready to analyze your 
data from Apache Kylin.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/10. 
superset_6.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
Figure 10: Query a single table from Apache Kylin&lt;/p&gt;
-
-&lt;h3 id=&quot;query-multiple-tables-from-kylin-using-sql-lab&quot;&gt;Query Multiple Tables from Kylin Using SQL Lab&lt;/h3&gt;
-&lt;p&gt;A Kylin Cube is usually based on a data model that joins multiple tables, so it is quite common to query multiple tables at the same time with Apache Kylin. In Apache Superset, you can use SQL Lab to join your data across tables by composing SQL queries. We will use a query that can hit the sample cube &quot;kylin_sales_cube&quot; as an example. &lt;br /&gt;
-When you run your query in SQL Lab, the result will come from the data source, in this case, Apache Kylin.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/11. 
SQL_Lab.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 11 Query multiple tables from Apache Kylin using SQL Lab&lt;/p&gt;
-
-&lt;p&gt;When the query returns results, you may immediately visualize them by 
clicking on the “Visualize” button.&lt;br /&gt;
-&lt;img src=&quot;/images/Kylin-and-Superset/png/12. SQL_Lab_2.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 12 Define your query and visualize it immediately&lt;/p&gt;
-
-&lt;p&gt;You may copy the entire SQL below to experience how you can query the Kylin Cube in SQL Lab. &lt;br /&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-select
-YEAR_BEG_DT,
-MONTH_BEG_DT,
-WEEK_BEG_DT,
-META_CATEG_NAME,
-CATEG_LVL2_NAME,
-CATEG_LVL3_NAME,
-OPS_REGION,
-NAME as BUYER_COUNTRY_NAME,
-sum(PRICE) as GMV,
-sum(ACCOUNT_BUYER_LEVEL) ACCOUNT_BUYER_LEVEL,
-count(*) as CNT
-from KYLIN_SALES
-join KYLIN_CAL_DT on CAL_DT=PART_DT
-join KYLIN_CATEGORY_GROUPINGS on SITE_ID=LSTG_SITE_ID and 
KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID=KYLIN_SALES.LEAF_CATEG_ID
-join KYLIN_ACCOUNT on ACCOUNT_ID=BUYER_ID
-join KYLIN_COUNTRY on ACCOUNT_COUNTRY=COUNTRY
-group by YEAR_BEG_DT, 
MONTH_BEG_DT,WEEK_BEG_DT,META_CATEG_NAME,CATEG_LVL2_NAME, 
CATEG_LVL3_NAME, OPS_REGION, NAME
-&lt;/code&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;experience-all-features-in-apache-superset-with-apache-kylin&quot;&gt;Experience All Features in Apache Superset with Apache Kylin&lt;/h2&gt;
-
-&lt;p&gt;Most of the common reporting features are available in Apache 
Superset. Now let’s see how we can use those features to analyze data from 
Apache Kylin.&lt;/p&gt;
-
-&lt;h3 id=&quot;sorting&quot;&gt;Sorting&lt;/h3&gt;
-&lt;p&gt;You may sort by a measure regardless of how it is 
visualized.&lt;/p&gt;
-
-&lt;p&gt;You may specify a “Sort By” measure or sort the measure on the 
visualization after the query returns.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/13. sort.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-Figure 13 Sort by&lt;/p&gt;
-
-&lt;h3 id=&quot;filtering&quot;&gt;Filtering&lt;/h3&gt;
-&lt;p&gt;There are multiple ways you may filter data from Apache Kylin.&lt;br 
/&gt;
-1. Date Filter&lt;br /&gt;
-  You may filter date and time dimensions with the calendar filter. &lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/14. time_filter.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 14  Filtering time&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;
-    &lt;p&gt;Dimension Filter&lt;br /&gt;
-  For other dimensions, you may filter them with SQL conditions such as “in, not in, equal to, not equal to, greater than or equal to, less than or equal to, greater than, less than, like”.&lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/15. 
dimension_filter.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 15 Filtering dimension&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Search Box&lt;br /&gt;
-  In some visualizations, it is also possible to further narrow down your 
result set after the query is returned from the data source using the “Search 
Box”. &lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/16. search_box.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 16 Search Box&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Filtering the measure&lt;br /&gt;
-  Apache Superset allows you to write a “having” clause to filter the measure. &lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/17. having.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 17 Filtering measure&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Filter Box&lt;br /&gt;
-  The filter box visualization allows you to create a drop-down style filter that can filter all slices on a dashboard dynamically. &lt;br /&gt;
-  As the screenshot below shows, if you filter the CATE_LVL2_NAME dimension 
from the filter box, all the visualizations on this dashboard will be filtered 
based on your selection. &lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/18. filter_box.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 18 The filter box visualization&lt;/p&gt;
-  &lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;h3 id=&quot;top-n&quot;&gt;Top-N&lt;/h3&gt;
-&lt;p&gt;To answer Top-N queries faster, Apache Kylin provides an approximate Top-N measure that pre-calculates the top records. In Apache Superset, you may use both the “Sort By” and “Row Limit” features to make sure your query can utilize the Top-N pre-calculation from the Kylin Cube. &lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/19. top10.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 19 use both “Sort By” and “Row Limit” to get Top 10&lt;/p&gt;
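For illustration only (this query is a sketch over the learn_kylin sample tables, not taken from the screenshots above), the “Sort By” plus “Row Limit” combination corresponds to SQL of roughly this shape, which a matching Top-N measure can answer:

```sql
-- Hedged sketch: a Top-10 query over the sample KYLIN_SALES table.
-- "Sort By" maps to ORDER BY on the measure; "Row Limit" maps to LIMIT.
select LSTG_FORMAT_NAME, sum(PRICE) as GMV
from KYLIN_SALES
group by LSTG_FORMAT_NAME
order by GMV desc
limit 10
```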
-
-&lt;h3 id=&quot;page-length&quot;&gt;Page Length&lt;/h3&gt;
-&lt;p&gt;Apache Kylin users often need to deal with high-cardinality dimensions. When such a dimension is displayed, the visualization may contain so many distinct values that it takes a long time to render. In that case, Apache Superset provides the page length feature to limit the number of rows per page, reducing the up-front rendering effort. &lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/20. page_length.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 20 Limit page length&lt;/p&gt;
-
-&lt;h3 id=&quot;visualizations&quot;&gt;Visualizations&lt;/h3&gt;
-&lt;p&gt;Apache Superset provides a rich and extensive set of visualizations, from basic charts such as pie, bar, and line charts to advanced visualizations such as sunburst, heatmap, world map, and Sankey diagram. &lt;br /&gt;
-  &lt;img src=&quot;/images/Kylin-and-Superset/png/21. viz.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 21&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/22. viz_2.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 22&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/23. map.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 23 World map visualization&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/Kylin-and-Superset/png/24. bubble.png&quot; 
alt=&quot;&quot; /&gt;&lt;br /&gt;
-  Figure 24 bubble chart&lt;/p&gt;
-
-&lt;h3 id=&quot;other-functionalities&quot;&gt;Other functionalities&lt;/h3&gt;
-&lt;p&gt;Apache Superset also supports exporting to CSV, sharing, and viewing 
SQL query.&lt;/p&gt;
-
-&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
-&lt;p&gt;With the right technical synergy of open source projects, you can achieve results greater than the sum of their parts. The pre-calculation technology of Apache Kylin accelerates visualization performance, and the rich functionality of Apache Superset enables all Kylin Cube features to be fully utilized. When you marry the two, you get the superpower of accelerated interactive analytics.&lt;/p&gt;
-
-&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;&lt;a href=&quot;http://kylin.apache.org&quot;&gt;Apache 
Kylin&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/Kyligence/kylinpy&quot;&gt;kylinpy on 
Github&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://medium.com/airbnb-engineering/caravel-airbnb-s-data-exploration-platform-15a72aa610e5&quot;&gt;Superset: Airbnb’s data exploration platform&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/apache/incubator-superset&quot;&gt;Apache 
Superset on Github&lt;/a&gt;&lt;/li&gt;
-&lt;/ol&gt;
-
-</description>
-        <pubDate>Mon, 01 Jan 2018 04:28:00 -0800</pubDate>
-        
<link>http://kylin.apache.org/blog/2018/01/01/kylin-and-superset/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2018/01/01/kylin-and-superset/</guid>
-        
-        
-        <category>blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Improving Spark Cubing in Kylin 2.0</title>
-        <description>&lt;p&gt;Apache Kylin is an OLAP engine that speeds up queries through Cube pre-computation. A Cube is a multi-dimensional dataset that contains the pre-computed values of all measures under all dimension combinations. Before v2.0, Kylin used MapReduce to build Cubes. To get better performance, Kylin 2.0 introduced Spark Cubing. For the principle behind Spark Cubing, please refer to the article &lt;a href=&quot;http://kylin.apache.org/blog/2017/02/23/by-layer-spark-cubing/&quot;&gt;By-layer Spark Cubing&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;In this blog, I will talk about the following topics:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;How to make Spark Cubing support HBase cluster with Kerberos 
enabled&lt;/li&gt;
-  &lt;li&gt;Spark configurations for Cubing&lt;/li&gt;
-  &lt;li&gt;Performance of Spark Cubing&lt;/li&gt;
-  &lt;li&gt;Pros and cons of Spark Cubing&lt;/li&gt;
-  &lt;li&gt;Applicable scenarios of Spark Cubing&lt;/li&gt;
-  &lt;li&gt;Improvement for dictionary loading in Spark Cubing&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;The current (2.0) version of Spark Cubing does not support HBase clusters with Kerberos enabled, because Spark Cubing needs to fetch metadata from HBase. To solve this problem, we have two options: one is to make Spark able to connect to HBase with Kerberos; the other is to avoid connecting to HBase from Spark during Cubing.&lt;/p&gt;
-
-&lt;h3 id=&quot;make-spark-connect-hbase-with-kerberos-enabled&quot;&gt;Make 
Spark connect HBase with Kerberos enabled&lt;/h3&gt;
-&lt;p&gt;If we just want to run Spark Cubing in YARN client mode, we only need to add three lines of code before new SparkConf() in SparkCubingByLayer:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;        Configuration configuration 
= HBaseConnection.getCurrentHBaseConfiguration();        
-        HConnection connection = 
HConnectionManager.createConnection(configuration);
-        //Obtain an authentication token for the given user and add it to the 
user&#39;s credentials.
-        TokenUtil.obtainAndCacheToken(connection, 
UserProvider.instantiate(configuration).create(UserGroupInformation.getCurrentUser()));
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-
-&lt;p&gt;As for how to make Spark connect to HBase with Kerberos in YARN cluster mode, please refer to SPARK-6918, SPARK-12279, and HBASE-17040. That solution may work, but it is not elegant, so I tried the second solution.&lt;/p&gt;
-
-&lt;h3 id=&quot;use-hdfs-metastore-for-spark-cubing&quot;&gt;Use HDFS 
metastore for Spark Cubing&lt;/h3&gt;
-
-&lt;p&gt;The core idea here is to upload the metadata needed by the job to HDFS and use HDFSResourceStore to manage it.&lt;/p&gt;
-
-&lt;p&gt;Before introducing how to use HDFSResourceStore instead of HBaseResourceStore in Spark Cubing, let’s see what Kylin’s metadata format is and how Kylin manages the metadata.&lt;/p&gt;
-
-&lt;p&gt;Each concrete piece of metadata for a table, cube, model, or project is a JSON file in Kylin, and the whole metadata set is organized as a directory tree. The picture below shows the root directory of Kylin’s metadata.&lt;br /&gt;
-&lt;img src=&quot;http://static.zybuluo.com/kangkaisen/t1tc6neiaebiyfoir4fdhs11/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7%202017-07-02%20%E4%B8%8B%E5%8D%883.51.43.png&quot; alt=&quot;Screenshot 2017-07-02 3.51.43 PM&quot; /&gt;&lt;br /&gt;
-The following picture shows the content of the project directory; “learn_kylin” and “kylin_test” are both project names.&lt;br /&gt;
-&lt;img src=&quot;http://static.zybuluo.com/kangkaisen/4dtiioqnw08w6vtj0r9u5f27/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7%202017-07-02%20%E4%B8%8B%E5%8D%883.54.59.png&quot; alt=&quot;Screenshot 2017-07-02 3.54.59 PM&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Kylin manages metadata through ResourceStore, an abstract class that defines the CRUD interface for metadata. ResourceStore has three implementation classes:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;FileResourceStore  (store with Local FileSystem)&lt;/li&gt;
-  &lt;li&gt;HDFSResourceStore&lt;/li&gt;
-  &lt;li&gt;HBaseResourceStore&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Currently, only HBaseResourceStore can be used in a production environment; FileResourceStore is mainly used for testing. HDFSResourceStore doesn’t support massive concurrent writes, but it is ideal for read-only scenarios like Cubing. Kylin uses the “kylin.metadata.url” config to decide which kind of ResourceStore will be used.&lt;/p&gt;
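As an illustration (the exact values below are examples, not taken from this post), the metastore type follows from the form of “kylin.metadata.url” in kylin.properties:

```
# HBase metastore (the usual production setting): metadata table name before '@'
kylin.metadata.url=kylin_metadata@hbase

# Local filesystem metastore (testing): a plain directory path selects FileResourceStore
kylin.metadata.url=/tmp/kylin_metadata
```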
-
-&lt;p&gt;Now, let’s see how to use HDFSResourceStore instead of HBaseResourceStore in Spark Cubing:&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Determine the metadata needed by the Spark Cubing job.&lt;/li&gt;
-  &lt;li&gt;Dump that metadata from HBase to the local filesystem.&lt;/li&gt;
-  &lt;li&gt;Update kylin.metadata.url and then write all Kylin config to a “kylin.properties” file in the local metadata dir.&lt;/li&gt;
-  &lt;li&gt;Use ResourceTool to upload the local metadata to HDFS.&lt;/li&gt;
-  &lt;li&gt;Construct the HDFSResourceStore from the HDFS “kylin.properties” file in the Spark executor.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;Of course, we need to delete the HDFS metadata dir when the job completes. I’m working on a patch for this; please watch KYLIN-2653 for updates.&lt;/p&gt;
-
-&lt;h3 id=&quot;spark-configurations-for-cubing&quot;&gt;Spark configurations 
for Cubing&lt;/h3&gt;
-
-&lt;p&gt;The following is the Spark configuration I used in our environment. It enables Spark dynamic resource allocation, so that our users need to set fewer Spark configurations themselves.&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;//running in yarn-cluster mode
-kylin.engine.spark-conf.spark.master=yarn
-kylin.engine.spark-conf.spark.submit.deployMode=cluster 
-
-//enable dynamic allocation for Spark so users need not set the number of executors explicitly
-kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
-kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=10
-kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1024
-kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
-kylin.engine.spark-conf.spark.shuffle.service.enabled=true
-kylin.engine.spark-conf.spark.shuffle.service.port=7337
-
-//the memory config
-kylin.engine.spark-conf.spark.driver.memory=4G
-//should enlarge the executor.memory when the cube dict is huge
-kylin.engine.spark-conf.spark.executor.memory=4G 
-//because Kylin needs to load the cube dict in the executor
-kylin.engine.spark-conf.spark.executor.cores=1
-
-//enlarge the timeout
-kylin.engine.spark-conf.spark.network.timeout=600
-
-kylin.engine.spark-conf.spark.yarn.queue=root.hadoop.test
-
-kylin.engine.spark.rdd-partition-cut-mb=100
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-
-&lt;h3 id=&quot;performance-test-of-spark-cubing&quot;&gt;Performance test of 
Spark Cubing&lt;/h3&gt;
-
-&lt;p&gt;For source data ranging from millions to hundreds of millions of rows, my test results are consistent with the blog &lt;a href=&quot;http://kylin.apache.org/blog/2017/02/23/by-layer-spark-cubing/&quot;&gt;By-layer Spark Cubing&lt;/a&gt;: the improvement is remarkable. Moreover, I also tested with billions of rows of source data and a huge dictionary.&lt;/p&gt;
-
-&lt;p&gt;The test Cube1 has 2.7 billion rows of source data, 9 dimensions, and one precise count-distinct measure with 70 million cardinality (which means the dict also has 70 million cardinality).&lt;/p&gt;
-
-&lt;p&gt;The test Cube2 has 2.4 billion rows of source data, 13 dimensions, and 38 measures (including 9 precise count-distinct measures).&lt;/p&gt;
-
-&lt;p&gt;The test result is shown in the picture below; the unit of time is minutes.&lt;br /&gt;
-&lt;img src=&quot;http://static.zybuluo.com/kangkaisen/1urzfkal8od52fodi1l6u0y5/image.png&quot; alt=&quot;image.png-38.1kB&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;In one word, &lt;strong&gt;Spark Cubing is much faster than MR Cubing in most scenarios&lt;/strong&gt;.&lt;/p&gt;
-
-&lt;h3 id=&quot;pros-and-cons-of-spark-cubing&quot;&gt;Pros and Cons of Spark 
Cubing&lt;/h3&gt;
-&lt;p&gt;In my opinion, the advantages of Spark Cubing include:&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Because of the RDD cache, Spark Cubing could take full advantage 
of memory to avoid disk I/O.&lt;/li&gt;
-  &lt;li&gt;When enough memory is available, Spark Cubing can use it to achieve better build performance.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;On the contrary, the drawbacks of Spark Cubing include:&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Spark Cubing can’t handle huge dictionaries (hundreds of millions of cardinality) well;&lt;/li&gt;
-  &lt;li&gt;Spark Cubing isn’t stable enough yet for very large-scale data.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;h3 id=&quot;applicable-scenarios-of-spark-cubing&quot;&gt;Applicable 
scenarios of Spark Cubing&lt;/h3&gt;
-&lt;p&gt;In my opinion, except for the huge-dictionary scenario, we could use Spark Cubing to replace MR Cubing everywhere, especially in the following scenarios:&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Many dimensions&lt;/li&gt;
-  &lt;li&gt;Normal dictionaries (e.g., cardinality &amp;lt; 100 million)&lt;/li&gt;
-  &lt;li&gt;Normal-scale data (e.g., less than 10 billion rows to build at once).&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;h3 
id=&quot;improvement-for-dictionary-loading-in-spark-cubing&quot;&gt;Improvement
 for dictionary loading in Spark Cubing&lt;/h3&gt;
-
-&lt;p&gt;As we all know, a big difference between MR and Spark is that an MR task runs in its own process while a Spark task runs in a thread. So in MR Cubing the Cube dict is loaded only once per task process, but in Spark Cubing the dict would be loaded many times within one executor, which causes frequent GC.&lt;/p&gt;
-
-&lt;p&gt;So I made two improvements:&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;Load the dict only once in one executor.&lt;/li&gt;
-  &lt;li&gt;Add a maximumSize to the LoadingCache in AppendTrieDictionary so that dict entries can be evicted as early as possible.&lt;/li&gt;
-&lt;/ol&gt;
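The two improvements above can be sketched as a single process-wide, size-bounded cache. The snippet below is an illustrative stand-alone sketch, not Kylin's actual code (the class and method names are hypothetical); it uses an access-ordered LinkedHashMap to mimic a LoadingCache with a maximumSize:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one shared, size-bounded dictionary cache per executor
// process, so concurrent task threads reuse the same dict instead of each
// loading its own copy.
public class DictCacheSketch {
    static final int MAXIMUM_SIZE = 2; // kept tiny for demonstration

    // Access-ordered LinkedHashMap evicts the least recently used entry
    // once the cache exceeds MAXIMUM_SIZE (like LoadingCache's maximumSize).
    static final Map<String, String> CACHE =
        new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > MAXIMUM_SIZE;
            }
        };

    // All task threads go through this single synchronized loader,
    // so each dict is loaded at most once while it stays cached.
    static synchronized String getDict(String resourcePath) {
        return CACHE.computeIfAbsent(resourcePath,
            p -> "dict-for-" + p); // stands in for the expensive dict load
    }

    public static void main(String[] args) {
        getDict("/dict/a");
        getDict("/dict/b");
        getDict("/dict/a");                 // hit: 'a' becomes most recently used
        getDict("/dict/c");                 // evicts '/dict/b', the LRU entry
        System.out.println(CACHE.keySet()); // [/dict/a, /dict/c]
    }
}
```

In real executor code the loader would read the serialized dictionary from the metastore; bounding the cache is what lets large dicts be released early instead of piling up until GC struggles.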
-
-&lt;p&gt;These two improvements have been contributed to the Kylin repository.&lt;/p&gt;
-
-&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;
-&lt;p&gt;Spark Cubing is a great feature for Kylin 2.0; thanks to the Kylin community. We will apply Spark Cubing in real scenarios in our company, and I believe Spark Cubing will become more robust and efficient in future releases.&lt;/p&gt;
-
-</description>
-        <pubDate>Fri, 21 Jul 2017 15:22:22 -0700</pubDate>
-        
<link>http://kylin.apache.org/blog/2017/07/21/Improving-Spark-Cubing/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2017/07/21/Improving-Spark-Cubing/</guid>
-        
-        
-        <category>blog</category>
         
       </item>
     

