Author: lidong
Date: Thu Mar 24 10:17:53 2016
New Revision: 1736410

URL: http://svn.apache.org/viewvc?rev=1736410&view=rev
Log:
KYLIN-1532 Document issue - Wipe cache request should be PUT

Modified:
    kylin/site/blog/2016/03/19/approximate-topn-measure/index.html
    kylin/site/docs/howto/howto_use_restapi.html
    kylin/site/docs15/howto/howto_use_restapi.html
    kylin/site/feed.xml

Modified: kylin/site/blog/2016/03/19/approximate-topn-measure/index.html
URL: 
http://svn.apache.org/viewvc/kylin/site/blog/2016/03/19/approximate-topn-measure/index.html?rev=1736410&r1=1736409&r2=1736410&view=diff
==============================================================================
--- kylin/site/blog/2016/03/19/approximate-topn-measure/index.html (original)
+++ kylin/site/blog/2016/03/19/approximate-topn-measure/index.html Thu Mar 24 
10:17:53 2016
@@ -188,7 +188,7 @@
   <article class="post-content" >
     <h2 id="background">Background</h2>
 
-<p>Find the Top-N (or Top-K) entities from a dataset is a common scenario and 
requirement in data minding; We often see the reports or news like “Top 100 
companies in the world”, “Most popular 20 electronics sold on eBay”, etc. 
Exploring and analysising the top entities can always find some high value 
information.</p>
+<p>Finding the Top-N (or Top-K) entities in a dataset is a common scenario and requirement in data mining; we often see reports or news like “Top 100 companies in the world”, “Most popular 20 electronics” sold on a big e-commerce platform, etc. Exploring and analyzing the top entities often reveals high-value information.</p>
 
 <p>In the era of big data, this need is stronger than ever, as both the raw dataset and the number of entities can be vast; without pre-calculation, getting the Top-K entities from a distributed big dataset may take a long time, making ad-hoc queries inefficient.</p>
 
@@ -340,16 +340,16 @@
 <p>A couple of modifications are made to let it better fit with Kylin:</p>
 
 <ul>
-  <li>Use double as the counter data type;</li>
-  <li>Simplfy the data strucutre, using one linked list for all entries;</li>
-  <li>Use a more compact serializer;</li>
+  <li>Using double as the counter data type;</li>
+  <li>Simplified data structure, using one linked list for all entries;</li>
+  <li>A more compact serializer;</li>
 </ul>
 
 <p>Besides, in order to run SpaceSaving in parallel on Hadoop, we make it mergeable with the algorithm introduced in <i>[2] A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution</i>.</p>
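For readers unfamiliar with the algorithm itself, the core counter update is small enough to sketch. Below is a minimal, illustrative Python version — not Kylin's Java implementation — of the basic SpaceSaving update plus a naive merge for the parallel case, assuming double-valued counts as mentioned above:

```python
class SpaceSaving:
    """Minimal SpaceSaving sketch: tracks at most `capacity` counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = {}  # item -> estimated count (a double, per the change above)

    def offer(self, item, count=1.0):
        if item in self.counters:
            self.counters[item] += count
        elif len(self.counters) < self.capacity:
            self.counters[item] = count
        else:
            # Evict the minimum counter; the newcomer inherits min + count.
            # This is the source of SpaceSaving's bounded overestimation.
            victim = min(self.counters, key=self.counters.get)
            floor = self.counters.pop(victim)
            self.counters[item] = floor + count

    def merge(self, other):
        # Naive merge: feed the other summary's counters in as weighted
        # offers. Paper [2] is more careful about items present in only
        # one summary; this is a simplification for illustration.
        for item, count in other.counters.items():
            self.offer(item, count)

    def top(self, n):
        return sorted(self.counters.items(), key=lambda kv: -kv[1])[:n]
```

The truncation and merge steps here are exactly where the data loss and distortion discussed in the next section come from.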
 
 <h2 id="accuracy">Accuracy</h2>
 
-<p>Although the experiments in paper [1] has proved SpaceSaving’s efficiency 
and accuracy for realistic Zipfian data, it doesn’t ensure 100% correctness 
for all cases. SpaceSaving uses a fixed space to put the most frequent 
candidates; when the size exceeds the space, the tail elements will be 
truncated, causing data loss. The parallel algorithm will merge multiple 
SpaceSavings into one, at that moment for the elements appeared in one but not 
in the other it had some assumptions, this will also cause some data loss. 
Finally, the result from Top-N measure may have minor difference with the real 
result.</p>
+<p>Although the experiments in paper [1] have proved SpaceSaving’s efficiency and accuracy for realistic Zipfian data, it doesn’t ensure 100% accuracy for all scenarios. SpaceSaving uses a fixed space to hold the most frequent candidates; when the number of entities exceeds the space size, the tail entities are truncated, causing data loss. When the parallel algorithm merges multiple SpaceSavings into one, it makes assumptions about entities that appear in one summary but not the other, which can also cause some distortion. Finally, the result from the Top-N measure may differ slightly from the real result.</p>
 
 <p>A couple of factors can affect the accuracy:</p>
 
@@ -357,27 +357,33 @@
   <li>Zipfian distribution</li>
 </ul>
 
-<p>Many rankings in the world follows the <strong>[3] Zipfian 
distribution</strong>, such as the population ranks of cities in various 
countries, corporation sizes, income rankings, etc. But the exponent of the 
distribution varies in different scenarios, this will affect the correctness of 
the result. The higher the exponent is (the distribution is more sharp), the 
more accurate answer will get. When using SpaceSaving, you’d better have an 
calculation on your data distribution.</p>
+<p>Many rankings in the world follow the <strong>[3] Zipfian distribution</strong>, such as the population ranks of cities in various countries, corporation sizes, income rankings, etc. But the exponent of the distribution varies across scenarios, and this affects the correctness of the result to some extent. The higher the exponent (the sharper the distribution), the more accurate the answer. If the distribution is very flat and entities’ values are very close, the rankings from SpaceSaving will be less accurate. When using SpaceSaving, you’d better analyze your data distribution first.</p>
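To make the exponent's effect concrete, here is a small illustrative computation (ours, not from the paper) with Zipfian probabilities p(k) ∝ 1/k^s: the higher the exponent s, the more probability mass the head ranks hold, and the easier the top entities are to separate:

```python
def zipf_probs(n, s):
    """Normalized Zipfian probabilities for ranks 1..n with exponent s."""
    weights = [1.0 / k ** s for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Mass held by the top-10 ranks out of 1,000 entities, for two exponents.
flat_head = sum(zipf_probs(1000, 0.5)[:10])   # flatter distribution
sharp_head = sum(zipf_probs(1000, 1.0)[:10])  # sharper distribution
# sharp_head > flat_head: a sharper distribution concentrates the head,
# so SpaceSaving's fixed space covers the true top entities more reliably.
```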
 
 <ul>
   <li>Space in SpaceSaving</li>
 </ul>
 
-<p>As mentioned above, SpaceSaving use a small space to put the most frequent 
elements. Giving more space it will provide more accurate answer. For example, 
to calculate Top N elements, using 100 * N space would provide more accurate 
answer than 50 * N space. But more space will take more CPU, memory and 
storage, this need be balanced.</p>
+<p>As mentioned above, SpaceSaving uses a limited space to hold the most frequent elements. Given more space, it provides a more accurate answer. For example, to calculate the Top N elements, using 100 * N space would provide a more accurate answer than 50 * N space. If the space exceeds the entity cardinality, the result will be exact. But more space takes more CPU, memory and storage, so this needs to be balanced.</p>
 
 <ul>
-  <li>Element cardinality</li>
+  <li>Entity cardinality</li>
 </ul>
 
 <p>Entity cardinality is also a factor to consider. Calculating the Top 100 among 10 thousand entities is easier than among 10 million.</p>
 
+<ul>
+  <li>Dataset size</li>
+</ul>
+
+<p>The error ratio on a big dataset is lower than on a small dataset; the same holds for the Top-N calculation.</p>
+
 <h2 id="statistics">Statistics</h2>
 
-<p>We designed a test case to calculate the top 100 elements using the 
parallel SpaceSaving among a data set; The element’s occurancy follows the 
Zipfian distribution, adjust the Zipfian exponent, space, and cardinality time 
to times, compare the result with the accurate result to collect the 
statistics, we get a rough accuracy report in below.</p>
+<p>We designed a test case that calculates the top 100 entities with the parallel SpaceSaving over a generated data set (using commons-math3’s ZipfDistribution); each entity’s occurrence follows the Zipfian distribution. Varying the Zipfian exponent, space, entity cardinality and dataset size from run to run, we compared the result with the exact result (computed with merge sort) to collect statistics, and we get the rough accuracy report below.</p>
 
-<p>The first column is the element cardinality, means among how many elements 
to identify the top 100 elements; The other three columns represent how much 
space using in the algorithm: 20X means using 2,000, 50X means use 5,000. Each 
cell of the table shows how many records are exactly matched with the real 
result. The calculation is executed in parallel with 10 threads.</p>
+<p>The first column is the entity cardinality, i.e., among how many entities to identify the top 100; the other three columns represent how much space the algorithm uses: 20X means 2,000, 50X means 5,000, and so on. Each cell of the table shows how many records match the real result; a record is considered matched if its error (difference) is less than 5 per million of the total data size, e.g., a difference &lt; 5 for a 1 million data set. The SpaceSaving is calculated in parallel with 10 threads.</p>
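The matching criterion in this paragraph can be written down directly. A small illustrative helper (ours, not part of the actual test harness) that applies the 5-per-million tolerance:

```python
def matched_ratio(estimated, exact, total_size, tol_per_million=5):
    """Fraction of estimated top entries whose count is within
    tol_per_million / 1,000,000 of the total data size of the exact count."""
    tolerance = tol_per_million * total_size / 1_000_000
    matched = sum(
        1 for item, est in estimated.items()
        if abs(est - exact.get(item, 0)) < tolerance
    )
    return matched / len(estimated)

# For a 1 million record data set, the tolerance works out to 5, as in the text.
```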
 
-<h3 id="test-1-calculate-top-100-in-1-million-records-zif-exponent--05">Test 
1. Calculate top-100 in 1 million records, zif exponent = 0.5</h3>
+<h3 
id="test-1-calculate-top-100-in-1-million-records-zipf-distribution-exponent--05-error-tolerance--5">Test
 1. Calculate top-100 in 1 million records, Zipf distribution exponent = 0.5, 
error tolerance &lt; 5</h3>
 
 <table>
   <thead>
@@ -390,35 +396,29 @@
   </thead>
   <tbody>
     <tr>
-      <td style="text-align: right">10,000</td>
+      <td style="text-align: right">1,000</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
     <tr>
-      <td style="text-align: right">20,000</td>
-      <td style="text-align: center">100%</td>
+      <td style="text-align: right">10,000</td>
+      <td style="text-align: center">78%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
     <tr>
       <td style="text-align: right">100,000</td>
-      <td style="text-align: center">70%</td>
-      <td style="text-align: center">100%</td>
-      <td style="text-align: center">100%</td>
-    </tr>
-    <tr>
-      <td style="text-align: right">1,000,000</td>
-      <td style="text-align: center">8%</td>
-      <td style="text-align: center">45%</td>
-      <td style="text-align: center">98%</td>
+      <td style="text-align: center">12%</td>
+      <td style="text-align: center">50%</td>
+      <td style="text-align: center">95%</td>
     </tr>
   </tbody>
 </table>
 
-<p>Test 1: More space can get better accuracy.</p>
+<p>Conclusion: more space yields better accuracy.</p>
 
-<h3 id="test-2-calculate-top-100-in-100-million-records-zif-exponent--05">Test 
2. Calculate top-100 in 100 million records, zif exponent = 0.5</h3>
+<h3 
id="test-2-calculate-top-100-in-1-million-records-zipf-distribution-exponent--06-error-tolerance--5">Test
 2. Calculate top-100 in 1 million records, Zipf distribution exponent = 0.6, 
error tolerance &lt; 5</h3>
 
 <table>
   <thead>
@@ -431,35 +431,29 @@
   </thead>
   <tbody>
     <tr>
-      <td style="text-align: right">10,000</td>
+      <td style="text-align: right">1,000</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
     <tr>
-      <td style="text-align: right">20,000</td>
-      <td style="text-align: center">100%</td>
+      <td style="text-align: right">10,000</td>
+      <td style="text-align: center">93%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
     <tr>
       <td style="text-align: right">100,000</td>
-      <td style="text-align: center">60%</td>
-      <td style="text-align: center">100%</td>
-      <td style="text-align: center">100%</td>
-    </tr>
-    <tr>
-      <td style="text-align: right">1,000,000</td>
-      <td style="text-align: center">8%</td>
-      <td style="text-align: center">56%</td>
-      <td style="text-align: center">96%</td>
+      <td style="text-align: center">30%</td>
+      <td style="text-align: center">89%</td>
+      <td style="text-align: center">99%</td>
     </tr>
   </tbody>
 </table>
 
-<p>Test 2: The data size doesn’t impact much.</p>
+<p>Conclusion: the sharper the entity distribution, the better the answer SpaceSaving provides.</p>
 
-<h3 id="test-3-calculate-top-100-in-1-million-records-zif-exponent--06">Test 
3. Calculate top-100 in 1 million records, zif exponent = 0.6</h3>
+<h3 id="test-3-calculate-top-100-in-20-million-records-zif-distribution-exponent--05-error-tolerance--100">Test 3. Calculate top-100 in 20 million records, Zipf distribution exponent = 0.5, error tolerance &lt; 100</h3>
 
 <table>
   <thead>
@@ -472,35 +466,35 @@
   </thead>
   <tbody>
     <tr>
-      <td style="text-align: right">10,000</td>
+      <td style="text-align: right">1,000</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
     <tr>
-      <td style="text-align: right">20,000</td>
+      <td style="text-align: right">10,000</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
     <tr>
       <td style="text-align: right">100,000</td>
-      <td style="text-align: center">94%</td>
+      <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
     <tr>
       <td style="text-align: right">1,000,000</td>
-      <td style="text-align: center">31%</td>
-      <td style="text-align: center">93%</td>
+      <td style="text-align: center">99%</td>
+      <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
   </tbody>
 </table>
 
-<p>Test 3: more sharp the elements distribute, the better answer it 
prvoides</p>
+<p>Conclusion: the result from SpaceSaving will be close to the actual result when the dataset is big enough.</p>
 
-<h3 id="test-4-calculate-top-100-in-1-million-records-zif-exponent--07">Test 
4. Calculate top-100 in 1 million records, zif exponent = 0.7</h3>
+<h3 id="test-4-calculate-top-100-in-20-million-records-zif-distribution-exponent--06-error-tolerance--100">Test 4. Calculate top-100 in 20 million records, Zipf distribution exponent = 0.6, error tolerance &lt; 100</h3>
 
 <table>
   <thead>
@@ -532,29 +526,35 @@
     </tr>
     <tr>
       <td style="text-align: right">1,000,000</td>
-      <td style="text-align: center">62%</td>
+      <td style="text-align: center">99%</td>
       <td style="text-align: center">100%</td>
       <td style="text-align: center">100%</td>
     </tr>
   </tbody>
 </table>
 
-<p>Test 4: same conclusion as test 3.</p>
+<p>Conclusion: same as test 3.</p>
 
 <p>These statistics match what we expected above; they only give a rough estimation of result correctness. To use this feature well in Kylin, you need to know about all these variables and run some pilots before publishing it to the analysts.</p>
 
+<h2 id="query-performance">Query performance</h2>
+
+<p>Coming soon.</p>
+
 <p>##Further works</p>
 
 <p>This feature in v1.5.0 is a basic version, which may solve 80% of cases, while it has some limitations or hard-codings that deserve your attention:</p>
 
 <ul>
-  <li>use SUM() as the default aggregation function;</li>
-  <li>sort in descending order always;</li>
-  <li>use 50X space always;</li>
-  <li>use dictionary encoding for the literal column;</li>
-  <li>the UI only allow selecting top 10, 100 and 1000;</li>
+  <li>SUM() is the default aggregation function;</li>
+  <li>Sorting is always in descending order;</li>
+  <li>50X space is always used;</li>
+  <li>Dictionary encoding is used for the literal column;</li>
+  <li>The UI only allows selecting topn(10), topn(100) and topn(1000) as the return type;</li>
 </ul>
 
+<p>Please note that if you select “topn(10)” as the return type, it doesn’t mean you have to use “limit 10” in your query; you can use other limit numbers. Kylin can return at most the top 500 entities for one combination, but the precision beyond 10 is not tested.</p>
+
 <p>Whether or not to support more aggregations/sortings/encodings is totally based on user need. If you have any comment or suggestion, please subscribe and then drop an email to our dev mailing list <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#064;&#107;&#121;&#108;&#105;&#110;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">&#100;&#101;&#118;&#064;&#107;&#121;&#108;&#105;&#110;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;</a>, thanks for your feedback.</p>
 
 <p>##References</p>

Modified: kylin/site/docs/howto/howto_use_restapi.html
URL: 
http://svn.apache.org/viewvc/kylin/site/docs/howto/howto_use_restapi.html?rev=1736410&r1=1736409&r2=1736410&view=diff
==============================================================================
--- kylin/site/docs/howto/howto_use_restapi.html (original)
+++ kylin/site/docs/howto/howto_use_restapi.html Thu Mar 24 10:17:53 2016
@@ -3022,7 +3022,7 @@ Get descriptor for specified cube instan
 <hr />
 
 <h2 id="wipe-cache">Wipe cache</h2>
-<p><code class="highlighter-rouge">GET /cache/{type}/{name}/{action}</code></p>
+<p><code class="highlighter-rouge">PUT /cache/{type}/{name}/{action}</code></p>
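With the corrected verb, a call to this endpoint can be sketched as follows; a minimal Python illustration (the host, port and the metadata/all/update path values are assumptions — substitute your deployment's values), which builds the request without sending it:

```python
from urllib import request

def wipe_cache_request(host, cache_type, name, action):
    """Build (but do not send) the PUT request for the wipe-cache API."""
    url = f"{host}/kylin/api/cache/{cache_type}/{name}/{action}"
    return request.Request(url, method="PUT")

# Hypothetical example values; adjust to your cluster.
req = wipe_cache_request("http://localhost:7070", "metadata", "all", "update")
```

Actually sending it would additionally need Kylin's authentication header; the point here is only that the verb is PUT, not GET.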
 
 <h4 id="path-variable-10">Path variable</h4>
 <ul>

Modified: kylin/site/docs15/howto/howto_use_restapi.html
URL: 
http://svn.apache.org/viewvc/kylin/site/docs15/howto/howto_use_restapi.html?rev=1736410&r1=1736409&r2=1736410&view=diff
==============================================================================
--- kylin/site/docs15/howto/howto_use_restapi.html (original)
+++ kylin/site/docs15/howto/howto_use_restapi.html Thu Mar 24 10:17:53 2016
@@ -2699,7 +2699,7 @@ Get descriptor for specified cube instan
 <hr />
 
 <h2 id="wipe-cache">Wipe cache</h2>
-<p><code class="highlighter-rouge">GET /cache/{type}/{name}/{action}</code></p>
+<p><code class="highlighter-rouge">PUT /cache/{type}/{name}/{action}</code></p>
 
 <h4 id="path-variable-10">Path variable</h4>
 <ul>

Modified: kylin/site/feed.xml
URL: 
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1736410&r1=1736409&r2=1736410&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Thu Mar 24 10:17:53 2016
@@ -19,15 +19,15 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml"; rel="self" 
type="application/rss+xml"/>
-    <pubDate>Tue, 22 Mar 2016 06:59:19 -0700</pubDate>
-    <lastBuildDate>Tue, 22 Mar 2016 06:59:19 -0700</lastBuildDate>
+    <pubDate>Thu, 24 Mar 2016 11:16:29 -0700</pubDate>
+    <lastBuildDate>Thu, 24 Mar 2016 11:16:29 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
         <title>Approximate Top-N support in Kylin</title>
         <description>&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;
 
-&lt;p&gt;Find the Top-N (or Top-K) entities from a dataset is a common 
scenario and requirement in data minding; We often see the reports or news like 
“Top 100 companies in the world”, “Most popular 20 electronics sold on 
eBay”, etc. Exploring and analysising the top entities can always find some 
high value information.&lt;/p&gt;
+&lt;p&gt;Finding the Top-N (or Top-K) entities in a dataset is a common scenario and requirement in data mining; we often see reports or news like “Top 100 companies in the world”, “Most popular 20 electronics” sold on a big e-commerce platform, etc. Exploring and analyzing the top entities often reveals high-value information.&lt;/p&gt;
 
 &lt;p&gt;In the era of big data, this need is stronger than ever, as both the raw dataset and the number of entities can be vast; without pre-calculation, getting the Top-K entities from a distributed big dataset may take a long time, making ad-hoc queries inefficient.&lt;/p&gt;
 
@@ -179,16 +179,16 @@
 &lt;p&gt;A couple of modifications are made to let it better fit with 
Kylin:&lt;/p&gt;
 
 &lt;ul&gt;
-  &lt;li&gt;Use double as the counter data type;&lt;/li&gt;
-  &lt;li&gt;Simplfy the data strucutre, using one linked list for all 
entries;&lt;/li&gt;
-  &lt;li&gt;Use a more compact serializer;&lt;/li&gt;
+  &lt;li&gt;Using double as the counter data type;&lt;/li&gt;
+  &lt;li&gt;Simplified data structure, using one linked list for all entries;&lt;/li&gt;
+  &lt;li&gt;A more compact serializer;&lt;/li&gt;
 &lt;/ul&gt;
 
 &lt;p&gt;Besides, in order to run SpaceSaving in parallel on Hadoop, we make it mergeable with the algorithm introduced in &lt;i&gt;[2] A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution&lt;/i&gt;.&lt;/p&gt;
 
 &lt;h2 id=&quot;accuracy&quot;&gt;Accuracy&lt;/h2&gt;
 
-&lt;p&gt;Although the experiments in paper [1] has proved SpaceSaving’s 
efficiency and accuracy for realistic Zipfian data, it doesn’t ensure 100% 
correctness for all cases. SpaceSaving uses a fixed space to put the most 
frequent candidates; when the size exceeds the space, the tail elements will be 
truncated, causing data loss. The parallel algorithm will merge multiple 
SpaceSavings into one, at that moment for the elements appeared in one but not 
in the other it had some assumptions, this will also cause some data loss. 
Finally, the result from Top-N measure may have minor difference with the real 
result.&lt;/p&gt;
+&lt;p&gt;Although the experiments in paper [1] have proved SpaceSaving’s efficiency and accuracy for realistic Zipfian data, it doesn’t ensure 100% accuracy for all scenarios. SpaceSaving uses a fixed space to hold the most frequent candidates; when the number of entities exceeds the space size, the tail entities are truncated, causing data loss. When the parallel algorithm merges multiple SpaceSavings into one, it makes assumptions about entities that appear in one summary but not the other, which can also cause some distortion. Finally, the result from the Top-N measure may differ slightly from the real result.&lt;/p&gt;
 
 &lt;p&gt;A couple of factors can affect the accuracy:&lt;/p&gt;
 
@@ -196,27 +196,33 @@
   &lt;li&gt;Zipfian distribution&lt;/li&gt;
 &lt;/ul&gt;
 
-&lt;p&gt;Many rankings in the world follows the &lt;strong&gt;[3] Zipfian 
distribution&lt;/strong&gt;, such as the population ranks of cities in various 
countries, corporation sizes, income rankings, etc. But the exponent of the 
distribution varies in different scenarios, this will affect the correctness of 
the result. The higher the exponent is (the distribution is more sharp), the 
more accurate answer will get. When using SpaceSaving, you’d better have an 
calculation on your data distribution.&lt;/p&gt;
+&lt;p&gt;Many rankings in the world follow the &lt;strong&gt;[3] Zipfian distribution&lt;/strong&gt;, such as the population ranks of cities in various countries, corporation sizes, income rankings, etc. But the exponent of the distribution varies across scenarios, and this affects the correctness of the result to some extent. The higher the exponent (the sharper the distribution), the more accurate the answer. If the distribution is very flat and entities’ values are very close, the rankings from SpaceSaving will be less accurate. When using SpaceSaving, you’d better analyze your data distribution first.&lt;/p&gt;
 
 &lt;ul&gt;
   &lt;li&gt;Space in SpaceSaving&lt;/li&gt;
 &lt;/ul&gt;
 
-&lt;p&gt;As mentioned above, SpaceSaving use a small space to put the most 
frequent elements. Giving more space it will provide more accurate answer. For 
example, to calculate Top N elements, using 100 * N space would provide more 
accurate answer than 50 * N space. But more space will take more CPU, memory 
and storage, this need be balanced.&lt;/p&gt;
+&lt;p&gt;As mentioned above, SpaceSaving uses a limited space to hold the most frequent elements. Given more space, it provides a more accurate answer. For example, to calculate the Top N elements, using 100 * N space would provide a more accurate answer than 50 * N space. If the space exceeds the entity cardinality, the result will be exact. But more space takes more CPU, memory and storage, so this needs to be balanced.&lt;/p&gt;
 
 &lt;ul&gt;
-  &lt;li&gt;Element cardinality&lt;/li&gt;
+  &lt;li&gt;Entity cardinality&lt;/li&gt;
 &lt;/ul&gt;
 
 &lt;p&gt;Entity cardinality is also a factor to consider. Calculating the Top 100 among 10 thousand entities is easier than among 10 million.&lt;/p&gt;
 
+&lt;ul&gt;
+  &lt;li&gt;Dataset size&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;The error ratio on a big dataset is lower than on a small dataset; the same holds for the Top-N calculation.&lt;/p&gt;
+
 &lt;h2 id=&quot;statistics&quot;&gt;Statistics&lt;/h2&gt;
 
-&lt;p&gt;We designed a test case to calculate the top 100 elements using the 
parallel SpaceSaving among a data set; The element’s occurancy follows the 
Zipfian distribution, adjust the Zipfian exponent, space, and cardinality time 
to times, compare the result with the accurate result to collect the 
statistics, we get a rough accuracy report in below.&lt;/p&gt;
+&lt;p&gt;We designed a test case that calculates the top 100 entities with the parallel SpaceSaving over a generated data set (using commons-math3’s ZipfDistribution); each entity’s occurrence follows the Zipfian distribution. Varying the Zipfian exponent, space, entity cardinality and dataset size from run to run, we compared the result with the exact result (computed with merge sort) to collect statistics, and we get the rough accuracy report below.&lt;/p&gt;
 
-&lt;p&gt;The first column is the element cardinality, means among how many 
elements to identify the top 100 elements; The other three columns represent 
how much space using in the algorithm: 20X means using 2,000, 50X means use 
5,000. Each cell of the table shows how many records are exactly matched with 
the real result. The calculation is executed in parallel with 10 
threads.&lt;/p&gt;
+&lt;p&gt;The first column is the entity cardinality, i.e., among how many entities to identify the top 100; the other three columns represent how much space the algorithm uses: 20X means 2,000, 50X means 5,000, and so on. Each cell of the table shows how many records match the real result; a record is considered matched if its error (difference) is less than 5 per million of the total data size, e.g., a difference &amp;lt; 5 for a 1 million data set. The SpaceSaving is calculated in parallel with 10 threads.&lt;/p&gt;
 
-&lt;h3 
id=&quot;test-1-calculate-top-100-in-1-million-records-zif-exponent--05&quot;&gt;Test
 1. Calculate top-100 in 1 million records, zif exponent = 0.5&lt;/h3&gt;
+&lt;h3 
id=&quot;test-1-calculate-top-100-in-1-million-records-zipf-distribution-exponent--05-error-tolerance--5&quot;&gt;Test
 1. Calculate top-100 in 1 million records, Zipf distribution exponent = 0.5, 
error tolerance &amp;lt; 5&lt;/h3&gt;
 
 &lt;table&gt;
   &lt;thead&gt;
@@ -229,35 +235,29 @@
   &lt;/thead&gt;
   &lt;tbody&gt;
     &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;10,000&lt;/td&gt;
+      &lt;td style=&quot;text-align: right&quot;&gt;1,000&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;20,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
+      &lt;td style=&quot;text-align: right&quot;&gt;10,000&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;78%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
       &lt;td style=&quot;text-align: right&quot;&gt;100,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;70%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;1,000,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;45%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;98%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;12%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;50%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;95%&lt;/td&gt;
     &lt;/tr&gt;
   &lt;/tbody&gt;
 &lt;/table&gt;
 
-&lt;p&gt;Test 1: More space can get better accuracy.&lt;/p&gt;
+&lt;p&gt;Conclusion: more space yields better accuracy.&lt;/p&gt;
 
-&lt;h3 
id=&quot;test-2-calculate-top-100-in-100-million-records-zif-exponent--05&quot;&gt;Test
 2. Calculate top-100 in 100 million records, zif exponent = 0.5&lt;/h3&gt;
+&lt;h3 
id=&quot;test-2-calculate-top-100-in-1-million-records-zipf-distribution-exponent--06-error-tolerance--5&quot;&gt;Test
 2. Calculate top-100 in 1 million records, Zipf distribution exponent = 0.6, 
error tolerance &amp;lt; 5&lt;/h3&gt;
 
 &lt;table&gt;
   &lt;thead&gt;
@@ -270,35 +270,29 @@
   &lt;/thead&gt;
   &lt;tbody&gt;
     &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;10,000&lt;/td&gt;
+      &lt;td style=&quot;text-align: right&quot;&gt;1,000&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;20,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
+      &lt;td style=&quot;text-align: right&quot;&gt;10,000&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;93%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
       &lt;td style=&quot;text-align: right&quot;&gt;100,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;60%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;1,000,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;56%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;96%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;30%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;89%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;99%&lt;/td&gt;
     &lt;/tr&gt;
   &lt;/tbody&gt;
 &lt;/table&gt;
 
-&lt;p&gt;Test 2: The data size doesn’t impact much.&lt;/p&gt;
+&lt;p&gt;Conclusion: the sharper the entity distribution, the better the answer SpaceSaving provides.&lt;/p&gt;
 
-&lt;h3 
id=&quot;test-3-calculate-top-100-in-1-million-records-zif-exponent--06&quot;&gt;Test
 3. Calculate top-100 in 1 million records, zif exponent = 0.6&lt;/h3&gt;
+&lt;h3 id=&quot;test-3-calculate-top-100-in-20-million-records-zif-distribution-exponent--05-error-tolerance--100&quot;&gt;Test 3. Calculate top-100 in 20 million records, Zipf distribution exponent = 0.5, error tolerance &amp;lt; 100&lt;/h3&gt;
 
 &lt;table&gt;
   &lt;thead&gt;
@@ -311,35 +305,35 @@
   &lt;/thead&gt;
   &lt;tbody&gt;
     &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;10,000&lt;/td&gt;
+      &lt;td style=&quot;text-align: right&quot;&gt;1,000&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
-      &lt;td style=&quot;text-align: right&quot;&gt;20,000&lt;/td&gt;
+      &lt;td style=&quot;text-align: right&quot;&gt;10,000&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
       &lt;td style=&quot;text-align: right&quot;&gt;100,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;94%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
       &lt;td style=&quot;text-align: right&quot;&gt;1,000,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;31%&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;93%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;99%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
   &lt;/tbody&gt;
 &lt;/table&gt;
 
-&lt;p&gt;Test 3: more sharp the elements distribute, the better answer it 
prvoides&lt;/p&gt;
+&lt;p&gt;Conclusion: the result from SpaceSaving will be close to the actual answer when the dataset is big enough.&lt;/p&gt;
 
-&lt;h3 
id=&quot;test-4-calculate-top-100-in-1-million-records-zif-exponent--07&quot;&gt;Test
 4. Calculate top-100 in 1 million records, zif exponent = 0.7&lt;/h3&gt;
+&lt;h3 
id=&quot;test-4-calculate-top-100-in-20-million-records-zif-distribution-exponent--06-error-tolerance--100&quot;&gt;Test
 4. Calculate top-100 in 20 million records, Zipf distribution exponent = 0.6, 
error tolerance &amp;lt; 100&lt;/h3&gt;
 
 &lt;table&gt;
   &lt;thead&gt;
@@ -371,29 +365,35 @@
     &lt;/tr&gt;
     &lt;tr&gt;
       &lt;td style=&quot;text-align: right&quot;&gt;1,000,000&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;62%&lt;/td&gt;
+      &lt;td style=&quot;text-align: center&quot;&gt;99%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
       &lt;td style=&quot;text-align: center&quot;&gt;100%&lt;/td&gt;
     &lt;/tr&gt;
   &lt;/tbody&gt;
 &lt;/table&gt;
 
-&lt;p&gt;Test 4: same conclusion as test 3.&lt;/p&gt;
+&lt;p&gt;Conclusion: same as test 3.&lt;/p&gt;
 
 &lt;p&gt;These statistics match what we expected above. They only give a rough estimate of the result correctness. To use this feature well in Kylin, you need to understand all these variables and run some pilots before publishing it to the analysts.&lt;/p&gt;
 
+&lt;h2 id=&quot;query-performance&quot;&gt;Query performance&lt;/h2&gt;
+
+&lt;p&gt;Coming soon.&lt;/p&gt;
+
 &lt;p&gt;##Further works&lt;/p&gt;
 
 &lt;p&gt;This feature in v1.5.0 is a basic version, which may cover 80% of the cases; it has some limitations and hard-coded behaviors that deserve your attention:&lt;/p&gt;
 
 &lt;ul&gt;
-  &lt;li&gt;use SUM() as the default aggregation function;&lt;/li&gt;
-  &lt;li&gt;sort in descending order always;&lt;/li&gt;
-  &lt;li&gt;use 50X space always;&lt;/li&gt;
-  &lt;li&gt;use dictionary encoding for the literal column;&lt;/li&gt;
-  &lt;li&gt;the UI only allow selecting top 10, 100 and 1000;&lt;/li&gt;
+  &lt;li&gt;SUM() is the default aggregation function;&lt;/li&gt;
+  &lt;li&gt;It always sorts in descending order;&lt;/li&gt;
+  &lt;li&gt;It always uses 50X space;&lt;/li&gt;
+  &lt;li&gt;It uses dictionary encoding for the literal column;&lt;/li&gt;
+  &lt;li&gt;The UI only allows selecting topn(10), topn(100) and topn(1000) as the return type;&lt;/li&gt;
 &lt;/ul&gt;
 
+&lt;p&gt;Please note: selecting “topn(10)” as the return type doesn’t mean you have to use “limit 10” in your query; other limit values are allowed. Kylin can return at most the top 500 entities for one combination, but the precision beyond the top 10 is not tested.&lt;/p&gt;
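As a rough illustration of how the limits above relate, here is a tiny Java sketch. The 50 × N relationship is my own inference from the “50X space” and “top 500” notes in this post, not a documented Kylin formula, and the helper name is invented:

```java
// Hypothetical illustration only: inferred from the "50X space" and
// "at most top 500" notes above, not a documented Kylin formula.
public class TopNCapacity {
    // a topn(n) return type would keep roughly 50 * n counters
    // per dimension combination
    static int capacityFor(int n) {
        return 50 * n;
    }

    public static void main(String[] args) {
        // topn(10) -> about 500 counters, which lines up with the
        // "at most the top 500 entities" ceiling mentioned above
        System.out.println(capacityFor(10));
    }
}
```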
+
 &lt;p&gt;Whether to support more aggregations/sortings/encodings is entirely based on user need. If you have any comment or suggestion, please 
subscribe and then drop email to our dev mailing list &lt;a 
href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&lt;/a&gt;,
 thanks for your feedback.&lt;/p&gt;
 
 &lt;p&gt;##References&lt;/p&gt;

