Repository: apex-malhar
Updated Branches:
  refs/heads/master 75c7e59bd -> 2e47b4cd7


http://git-wip-us.apache.org/repos/asf/apex-malhar/blob/2e47b4cd/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss
----------------------------------------------------------------------
diff --git 
a/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss
 
b/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss
deleted file mode 100644
index 5e8f52a..0000000
--- 
a/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed_updated.rss
+++ /dev/null
@@ -1,1134 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
-       xmlns:content="http://purl.org/rss/1.0/modules/content/";
-       xmlns:wfw="http://wellformedweb.org/CommentAPI/";
-       xmlns:dc="http://purl.org/dc/elements/1.1/";
-       xmlns:atom="http://www.w3.org/2005/Atom";
-       xmlns:sy="http://purl.org/rss/1.0/modules/syndication/";
-       xmlns:slash="http://purl.org/rss/1.0/modules/slash/";
-       >
-
-<channel>
-       <title>DataTorrent</title>
-       <atom:link href="https://www.datatorrent.com/feed/"; rel="self" 
type="application/rss+xml" />
-       <link>https://www.datatorrent.com</link>
-       <description></description>
-       <lastBuildDate>Tue, 10 Nov 2015 08:00:45 +0000</lastBuildDate>
-       <language>en-US</language>
-       <sy:updatePeriod>hourly</sy:updatePeriod>
-       <sy:updateFrequency>1</sy:updateFrequency>
-       <generator>http://wordpress.org/?v=4.2.5</generator>
-       <item>
-               <title>An introduction to checkpointing in Apache Apex</title>
-               
<link>https://www.datatorrent.com/blog-introduction-to-checkpoint/</link>
-               
<comments>https://www.datatorrent.com/blog-introduction-to-checkpoint/#comments</comments>
-               <pubDate>Tue, 10 Nov 2015 08:00:45 +0000</pubDate>
-               <dc:creator><![CDATA[Gaurav Gupta]]></dc:creator>
-                               <category><![CDATA[How-to]]></category>
-               <category><![CDATA[Apache Apex]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2254</guid>
-               <description><![CDATA[<p>Know how Apex makes checkpointing easy 
Big data is evolving in a big way. As it booms, the issue of fault tolerance 
becomes more and more exigent. What happens if a node fails? Will your 
application recover from the effects of data or process corruption? In a 
conventional world, the simplest solution for such a problem would have 
[&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-introduction-to-checkpoint/";>An 
introduction to checkpointing in Apache Apex</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<h5 
id="an-introduction-to-checkpointing-in-apex" class="c5">Know how Apex makes 
checkpointing easy</h5>
-<p class="c2"><span class="c0">Big data is evolving<span class="c0"> 
</span><span class="c3">i</span><span class="c0">n a big way. As it booms, the 
issue of </span><span class="c0">fault</span><span class="c3"> </span><span 
class="c0">tolerance</span><span class="c0"> becomes more and more exigent. 
What happens if a node fails? Will your application recover from the effects of 
data or process corruption?</span></span></p>
-<p class="c2"><span class="c0">In a conventional world, the simplest solution 
for such a problem would have been a restart of the offending processes from 
the beginning. However, that was the conventional world, with data sizes still 
within the reach of imagination. In the world of big data, the size of data 
cannot be imagined. </span><span class="c3">Let alone imagine, the growth is 
almost incomprehensible.</span><span class="c0"> A complete restart would mean 
</span><span class="c3">wasting </span><span class="c0">precious resources, be 
it time, or CPU capacity. Such a restart in a real-time scenario would also be 
unpredictable. After all, how do you recover data that changed by the second 
(or even less) accurately? </span></p>
-<p class="c2"><span class="c0">Fault tolerance</span><span class="c0"> is not 
just a need, it is an absolute necessity. </span><span class="c3">The lack of a 
fault tolerance mechanism affects SLAs</span><span class="c3">. </span><span 
class="c3">A system can be made fault-tolerant by using 
checkpointing</span><span class="c3">. </span><span class="c0">You can think of 
a checkpointing mechanism as a recovery process; a system process saves 
snapshots</span><span class="c3"> of </span><span class="c0">application 
states periodically, and uses these snapshots for recovery in case of failures. 
</span><span class="c3">A</span><span class="c0"> platform developer alone 
should ensure that a big data platform is checkpoint</span><span 
class="c3">&#8211;</span><span class="c0">compliant. Application developers 
should only be concerned with their business logic, thus ensuring clear 
distinction between operability and functional behavior. </span></p>
-<p class="c2"><span class="c3 c7">Apex treats checkpointing as a native 
function, allowing application developers to stop worrying about application 
consistency</span><span class="c3">. </span></p>
-<p class="c2"><span class="c0">Apache</span><span class="c0 
c12">®</span><span class="c0"> Apex is the industry’s only open-source 
platform that checkpoints intelligently, while relieving application developers 
of the need to worry about their platforms being failsafe.</span><span 
class="c3"> </span><span class="c0">By transparently checkpointing the state 
of operators to HDFS periodically, Apex ensures that the operators are 
recoverable on any node within a cluster</span><span class="c3">. </span><span 
class="c3">The Apex infrastructure is designed to scale, while ensuring easy 
recovery at any point during failure</span><span class="c3">. The </span><span 
class="c0">Apex checkpointing mechanism uses the DFS interfa</span><span 
class="c3">ce. thereby being agnostic to the DFS implementation.</span></p>
-<p class="c2"><span class="c3 c7">Apex introduces checkpointing by maintaining 
operator states within HDFS.</span></p>
-<p class="c2"><span class="c3">Apex serializes the state of operators to local 
disks, and then asynchronously copies serialized state to HDFS. The state is 
asynchronously copied to HDFS in order to ensure that the performance of 
applications is not affected by the HDFS latency. An operator is considered 
“checkpointed” only after the serialized state is copied to HDFS. In the case 
of the </span><span class="c3 c8"><a class="c1" 
href="https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html#h.1gf8i83";>Exactly-Once
 </a></span><span class="c3">recovery mechanism, the platform checkpoints at every 
window boundary and behaves in synchronous mode, i.e., the operator is blocked 
until the state is copied to HDFS. </span></p>
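-<p class="c2">As a minimal, illustrative sketch (the operator variable "op" is 
hypothetical and the call is shown as it would appear inside populateDAG()), 
exactly-once recovery can be requested per operator through the PROCESSING_MODE 
attribute:</p>
-<pre class="prettyprint"><code class=" hljs java">// Ask the engine to recover "op" with exactly-once semantics;
-// as described above, this makes checkpointing synchronous at every window boundary.
-dag.setAttribute(op, Context.OperatorContext.PROCESSING_MODE,
-    Operator.ProcessingMode.EXACTLY_ONCE);</code></pre>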
-<p class="c2"><span class="c3">Although Apex is designed to checkpoint at 
window boundaries, developers can control how often checkpointing is 
triggered.</span><span class="c3"> </span><span class="c3">They can do this by 
configuring the number of windows between checkpoints, using the </span><span class="c3 
c10"><a class="c1" 
href="https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html#CHECKPOINT_WINDOW_COUNT";>CHECKPOINT_WINDOW_COUNT</a></span><span
 class="c3"> attribute. </span><span class="c0">Frequent checkpoints hamper 
overall application performance, whereas sparsely placed checkpoints can make 
application recovery a </span><span class="c3">time-consuming</span><span 
class="c0"> task. </span></p>
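-<p class="c2">As an illustration only (a minimal sketch; the operator variable 
"op" and the count of 10 are hypothetical values chosen for this example), the 
attribute can be set on an operator while populating the DAG:</p>
-<pre class="prettyprint"><code class=" hljs java">// Checkpoint "op" once every 10 streaming windows instead of after every window.
-dag.setAttribute(op, Context.OperatorContext.CHECKPOINT_WINDOW_COUNT, 10);</code></pre>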
-<p class="c2"><span class="c3 c7">See this example to see how the apex 
checkpointing mechanism works. </span></p>
-<p class="c2"><span class="c3">In our example, let us set 
CHECKPOINT_WINDOW_COUNT to 1. This diagram shows the flow of data within window 
boundaries.  You will see that at the end of each window, Apex checkpoints 
data in order to ensure that it is consistent. If the operator crashes during 
window n+1, Apex restores it to the nearest stable state; in this example, the 
state checkpointed at the end of window n. The 
operator now starts processing window n+1 from the beginning. If 
CHECKPOINT_WINDOW_COUNT was set to 2, then there would have been one checkpoint 
before window n, and another checkpoint after window n+1.</span></p>
-<p class="c2"><img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image001.png"; 
alt="checkpointing.png" /></p>
-<p class="c2"><span class="c3 c7">Judicious checkpointing ensures optimum 
application performance</span></p>
-<p class="c2"><span class="c3">Checkpointing is a costly and a 
resource-intensive operation, indicating that an overindulgence will impact an 
application’s performance. </span><span class="c3">To</span><span class="c3 
c9"> </span><span class="c3">act as a deterrent to the performance slowdown 
because of checkpointing, Apex checkpoints non-transient data only.</span><span 
class="c3 c9"> </span><span class="c3">For example, in JDBC operator, 
connection can be reinitialized at the time of setup, it should be marked as 
transient, and thus omitted from getting checkpointed. </span></p>
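-<p class="c2">The snippet below is a minimal sketch of this idea; the class and 
field names are illustrative rather than taken from an actual Malhar operator, 
and imports (java.sql.*, the Apex API) are omitted as in the other snippets in 
this post.</p>
-<pre class="prettyprint"><code class=" hljs java">public class MyJdbcOutputOperator extends BaseOperator
-{
-  private String jdbcUrl;                  // non-transient: part of the checkpointed state
-  private transient Connection connection; // transient: excluded from checkpoints
-
-  @Override
-  public void setup(Context.OperatorContext context)
-  {
-    try {
-      // Re-established on every (re)deployment, so it never needs to be checkpointed.
-      connection = DriverManager.getConnection(jdbcUrl);
-    } catch (SQLException e) {
-      throw new RuntimeException(e);
-    }
-  }
-}</code></pre>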
-<p class="c2"><span class="c0">Application developers must know </span><span 
class="c3">whether</span><span class="c0"> to checkpoint or not; the thumb 
rule dictates that </span><span class="c0">operators</span><span class="c0"> 
for which computation depends on the previous state must be checkpointed. An 
example is the Counter operator, which tracks the number of tuples processed by 
the system. Because the operator relies on </span><span class="c3">its previous 
state to proceed, it</span><span class="c0"> needs to be checkpointed. Some 
operators are stateless; their computation does not depend on the previous 
state. Application developers can omit such operators from checkpointing 
operations by using the </span><span class="c3 c8"><a class="c1" 
href="https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html#STATELESS";>STATELESS</a></span><span
 class="c0"> attribute or </span><span class="c3">by annotating the operator 
with </span><span class="c3 c
 8"><a class="c1" 
href="https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/annotation/Stateless.html";>Stateless</a></span><span
 class="c0">.</span></p>
-<p class="c2"><span class="c0">This post is an introduction to 
checkpointing</span><span class="c3"> </span><span class="c0">in Apex. For 
</span><span class="c3">more details</span><span class="c0">, </span><span 
class="c3">see our </span><span class="c3 c8"><a class="c1" 
href="http://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html";>Application
 Developer Guide</a></span><span class="c3">.</span></p>
-<h3 id="conclusion">Resources</h3>
-<p>Download DataTorrent Sandbox <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Sandbox-Edition-Download.html"; 
target="_blank">here</a></p>
-<p>Download DataTorrent Enterprise Edition <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Enteprise-Edition-Download.html";
 target="_blank">here</a></p>
-<p>Join Apache Apex Meetups <a 
href="https://www.datatorrent.com/meetups/";>here</a></p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-introduction-to-checkpoint/";>An 
introduction to checkpointing in Apache Apex</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-introduction-to-checkpoint/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Dimensions Computation (Aggregate Navigator) Part 2: 
Implementation</title>
-               
<link>https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/</link>
-               
<comments>https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/#comments</comments>
-               <pubDate>Thu, 05 Nov 2015 09:00:10 +0000</pubDate>
-               <dc:creator><![CDATA[tim farkas]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-               <category><![CDATA[Apache Apex]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2401</guid>
-               <description><![CDATA[<p>Overview While the theory of computing 
the aggregations is correct, some more work is required to provide a scalable 
implementation of Dimensions Computation. As can be seen from the formulas 
provided in the previous post, the number of aggregations to maintain grows 
rapidly as the number of unique key values, aggregators, dimension 
combinations, and time [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/";>Dimensions
 Computation (Aggregate Navigator) Part 2: Implementation</a> appeared first on 
<a rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<h3 
id="overview">Overview</h3>
-<p>While the theory of computing the aggregations is correct, some more work 
is required to provide a scalable implementation of Dimensions Computation. As 
can be seen from the formulas provided in the previous post, the number of 
aggregations to maintain grows rapidly as the number of unique key values, 
aggregators, dimension combinations, and time buckets grows. Additionally, a 
scalable implementation of Dimensions Computation must be capable of handling 
hundreds of thousands of events per second. In order to achieve this level of 
performance a balance must be struck between the speed afforded by in memory 
processing and the need to persist large quantities of data. This balance is 
achieved by performing dimensions computation in three phases:</p>
-<ol>
-<li>The <strong>Pre-Aggregation</strong> phase.</li>
-<li>The <strong>Unification</strong> phase.</li>
-<li>The <strong>Aggregation Storage</strong> phase.</li>
-</ol>
-<p>The sections below will describe the details of each phase of Dimensions 
Computation, and will also provide the code snippets required to implement each 
phase in DataTorrent.</p>
-<h3 id="the-pre-aggregation-phase">The Pre-aggregation Phase</h3>
-<h4 id="the-theory">The Theory</h4>
-<p>This phase allows Dimensions Computation to scale by reducing the number of 
events entering the system. How this is achieved can be described by the 
following example:</p>
-<ul>
-<li>Let’s say we have 500,000 <strong>AdEvents</strong>/second entering our 
system, and we want to perform Dimension Computation on those events.</li>
-</ul>
-<p>Although each <strong>AdEvent</strong> will contribute to many aggregations 
(as described by the formulas in the previous post) the number of unique values 
of keys in the <strong>AdEvents</strong> will likely be much smaller than 
500,000. So the total number of aggregations produced by 500,000 events will 
also be much smaller than 500,000. Let’s say for the sake of this example 
that the number of aggregations produced will be on the order of 10,000. This 
means that if we perform Dimension Computation on batches of 500,000 tuples we 
can reduce 500,000 events to 10,000 aggregations.</p>
-<p>The process can be sped up even further by utilizing partitioning. If a 
partition can handle 500,000 events/second, then 8 partitions would be able to 
handle 4,000,000 events/second. And these 4,000,000 events/seconds would then 
be compressed into 80,000 aggregations/second. These aggregations are then 
passed on to the Unification stage of processing.</p>
-<p><strong>Note</strong> that these 80,000 aggregations will not be complete 
aggregations for two reasons:</p>
-<ol>
-<li>The aggregations do not incorporate the values of events received in 
previous batches. This drawback is corrected by the <strong>Aggregation 
Storage</strong> phase.</li>
-<li>The aggregations computed by different partitions may share the same key 
values and time buckets. This drawback is corrected by the 
<strong>Unification</strong> phase.</li>
-</ol>
-<h4 id="the-code">The Code</h4>
-<p>Setting up the Pre-Aggregation phase of Dimensions Computation involves 
configuring a Dimension Computation operator. There are several flavors of the 
Dimension Computation operator, the easiest to use out of the box for Java and 
dtAssemble is <strong>DimensionsComputationFlexibleSingleSchemaPOJO</strong>. 
This operator can receive any POJO as input (like our AdEvent) and requires the 
following configuration:</p>
-<ul>
-<li><strong>A JSON Schema:</strong> The JSON schema specifies the keys, 
aggregates, aggregators, dimension combinations, and time buckets to be used 
for Dimension Computation. An example of a schema that could be used for 
<strong>AdEvents</strong> is the following:</li>
-</ul>
-<pre class="prettyprint"><code class=" hljs json">{"<span 
class="hljs-attribute">keys</span>":<span class="hljs-value">[{"<span 
class="hljs-attribute">name</span>":<span class="hljs-value"><span 
class="hljs-string">"advertiser"</span></span>,"<span 
class="hljs-attribute">type</span>":<span class="hljs-value"><span 
class="hljs-string">"string"</span></span>},
-         {"<span class="hljs-attribute">name</span>":<span 
class="hljs-value"><span class="hljs-string">"location"</span></span>,"<span 
class="hljs-attribute">type</span>":<span class="hljs-value"><span 
class="hljs-string">"string"</span></span>}]</span>,
- "<span class="hljs-attribute">timeBuckets</span>":<span 
class="hljs-value">[<span class="hljs-string">"1m"</span>,<span 
class="hljs-string">"1h"</span>,<span class="hljs-string">"1d"</span>]</span>,
- "<span class="hljs-attribute">values</span>":
-  <span class="hljs-value">[{"<span class="hljs-attribute">name</span>":<span 
class="hljs-value"><span class="hljs-string">"impressions"</span></span>,"<span 
class="hljs-attribute">type</span>":<span class="hljs-value"><span 
class="hljs-string">"long"</span></span>,"<span 
class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span 
class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span 
class="hljs-string">"MIN"</span>]</span>},
-   {"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span 
class="hljs-string">"clicks"</span></span>,"<span 
class="hljs-attribute">type</span>":<span class="hljs-value"><span 
class="hljs-string">"long"</span></span>,"<span 
class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span 
class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span 
class="hljs-string">"MIN"</span>]</span>},
-   {"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span 
class="hljs-string">"cost"</span></span>,"<span 
class="hljs-attribute">type</span>":<span class="hljs-value"><span 
class="hljs-string">"double"</span></span>,"<span 
class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span 
class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span 
class="hljs-string">"MIN"</span>]</span>},
-   {"<span class="hljs-attribute">name</span>":<span class="hljs-value"><span 
class="hljs-string">"revenue"</span></span>,"<span 
class="hljs-attribute">type</span>":<span class="hljs-value"><span 
class="hljs-string">"double"</span></span>,"<span 
class="hljs-attribute">aggregators</span>":<span class="hljs-value">[<span 
class="hljs-string">"SUM"</span>,<span class="hljs-string">"MAX"</span>,<span 
class="hljs-string">"MIN"</span>]</span>}]</span>,
- "<span class="hljs-attribute">dimensions</span>":
-  <span class="hljs-value">[{"<span 
class="hljs-attribute">combination</span>":<span class="hljs-value">[]</span>},
-   {"<span class="hljs-attribute">combination</span>":<span 
class="hljs-value">[<span class="hljs-string">"location"</span>]</span>},
-   {"<span class="hljs-attribute">combination</span>":<span 
class="hljs-value">[<span class="hljs-string">"advertiser"</span>]</span>},
-   {"<span class="hljs-attribute">combination</span>":<span 
class="hljs-value">[<span class="hljs-string">"advertiser"</span>,<span 
class="hljs-string">"location"</span>]</span>}]
-</span>}</code></pre>
-<ul>
-<li>A map from key names to the Java expression used to extract the key from 
an incoming POJO.</li>
-<li>A map from aggregate names to the Java expression used to extract the 
aggregate from an incoming POJO.</li>
-</ul>
-<p>An example of how to configure a Dimensions Computation operator to process 
<strong>AdEvents</strong> is as follows:</p>
-<pre class="prettyprint"><code class=" hljs 
avrasm">DimensionsComputationFlexibleSingleSchemaPOJO dimensions = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"DimensionsComputation"</span>, 
DimensionsComputationFlexibleSingleSchemaPOJO<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-
-Map&lt;String, String&gt; keyToExpression = Maps<span 
class="hljs-preprocessor">.newHashMap</span>()<span 
class="hljs-comment">;</span>
-keyToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"advertiser"</span>, <span 
class="hljs-string">"getAdvertiser()"</span>)<span class="hljs-comment">;</span>
-keyToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"location"</span>, <span 
class="hljs-string">"getLocation()"</span>)<span class="hljs-comment">;</span>
-keyToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"time"</span>, <span 
class="hljs-string">"getTime()"</span>)<span class="hljs-comment">;</span>
-
-Map&lt;String, String&gt; aggregateToExpression = Maps<span 
class="hljs-preprocessor">.newHashMap</span>()<span 
class="hljs-comment">;</span>
-aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"cost"</span>, <span 
class="hljs-string">"getCost()"</span>)<span class="hljs-comment">;</span>
-aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"revenue"</span>, <span 
class="hljs-string">"getRevenue()"</span>)<span class="hljs-comment">;</span>
-aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"impressions"</span>, <span 
class="hljs-string">"getImpressions()"</span>)<span 
class="hljs-comment">;</span>
-aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"clicks"</span>, <span 
class="hljs-string">"getClicks()"</span>)<span class="hljs-comment">;</span>
-
-dimensions<span 
class="hljs-preprocessor">.setKeyToExpression</span>(keyToExpression)<span 
class="hljs-comment">;</span>
-dimensions<span 
class="hljs-preprocessor">.setAggregateToExpression</span>(aggregateToExpression)<span
 class="hljs-comment">;</span>
-//Here eventSchema is a string containing the JSON listed above.
-dimensions<span 
class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span 
class="hljs-comment">;</span></code></pre>
-<h3 id="the-unification-phase">The Unification Phase</h3>
-<h4 id="the-theory-1">The Theory</h4>
-<p>The Unification phase is relatively simple. It combines the outputs of all 
the partitions in the Pre-Aggregation phase into a single stream which 
can be passed on to the storage phase. It has the added benefit of reducing the 
number of aggregations even further. This is because the aggregations produced 
by different partitions which share the same key and time bucket can be 
combined to produce a single aggregation. For example, if the Unification phase 
receives 80,000 aggregations/second, you can expect 20,000 aggregations/second 
after unification.</p>
-<h4 id="the-code-1">The Code</h4>
-<p>The Unification phase is implemented as a unifier that can be set on your 
dimensions computation operator.</p>
-<pre class="prettyprint"><code class=" hljs vbnet">dimensions.setUnifier(<span 
class="hljs-keyword">new</span> DimensionsComputationUnifierImpl&lt;InputEvent, 
<span class="hljs-keyword">Aggregate</span>&gt;());</code></pre>
-<h3 id="the-aggregation-storage-phase">The Aggregation Storage Phase</h3>
-<h4 id="the-theory-2">The Theory</h4>
-<p>The total number of aggregations produced by Dimension Computation is 
large, and it only increases with time (due to time bucketing). Aggregations 
are persisted to HDFS using HDHT. This persistence is performed by the 
Dimensions Store and serves two purposes:</p>
-<ul>
-<li>Functions as storage so that aggregations can be retrieved for 
visualization.</li>
-<li>Functions as storage allowing persisted aggregations to be combined with 
incomplete aggregates produced by Unification.</li>
-</ul>
-<h5 id="visualization">Visualization</h5>
-<p>The DimensionsStore allows you to visualize your aggregations over time. 
This is done by allowing queries and responses to be received from and sent to 
the UI via websocket.</p>
-<h5 id="aggregation">Aggregation</h5>
-<p>The store produces complete aggregations by combining the incomplete 
aggregations received from the Unification stage with aggregations persisted to 
HDFS.</p>
-<h5 id="scalability">Scalability</h5>
-<p>Since the work done by the DimensionsStore is IO intensive, it cannot 
handle hundreds of thousands of events per second. The purpose of the Pre-Aggregation 
and Unification phases is to reduce the cardinality of events so that the Store 
will almost always have a small number of events to handle. However, in cases 
where there are many unique values for keys, the Pre-Aggregation and 
Unification phases will not be sufficient to reduce the cardinality of events 
handled by the Dimension Store. In such cases it is possible to partition the 
Dimensions Store so that each partition handles the aggregates for a subset of 
the dimension combinations and time buckets.</p>
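-<p>As a purely illustrative sketch, Apex lets you attach a partitioner to an 
operator through the PARTITIONER attribute; the generic StatelessPartitioner 
shown here simply creates a fixed number of partitions and is not necessarily 
the partitioner the Dimensions Store uses to split work by dimension combination 
and time bucket:</p>
-<pre class="prettyprint"><code class=" hljs java">// Hypothetical example: run the store with 4 partitions.
-dag.setAttribute(store, Context.OperatorContext.PARTITIONER,
-    new StatelessPartitioner&lt;AppDataSingleSchemaDimensionStoreHDHT&gt;(4));</code></pre>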
-<h4 id="the-code-2">The Code</h4>
-<p>Configuration of the DimensionsStore involves the following:</p>
-<ul>
-<li>Setting the JSON Schema.</li>
-<li>Connecting Query and Result operators that are used to send queries to and 
receive results from the DimensionsStore.</li>
-<li>Setting an HDHT File Implementation.</li>
-<li>Setting an HDFS path for storing aggregation data.</li>
-</ul>
-<p>An example of configuring the store is as follows:</p>
-<pre class="prettyprint"><code class=" hljs 
avrasm">AppDataSingleSchemaDimensionStoreHDHT store = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"Store"</span>, AppDataSingleSchemaDimensionStoreHDHT<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-
-TFileImpl hdsFile = new TFileImpl<span 
class="hljs-preprocessor">.DTFileImpl</span>()<span 
class="hljs-comment">;</span>
-hdsFile<span class="hljs-preprocessor">.setBasePath</span>(basePath)<span 
class="hljs-comment">;</span>
-store<span class="hljs-preprocessor">.setFileStore</span>(hdsFile)<span 
class="hljs-comment">;</span>
-store<span 
class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span 
class="hljs-comment">;</span>
-
-String gatewayAddress = dag<span 
class="hljs-preprocessor">.getValue</span>(DAG<span 
class="hljs-preprocessor">.GATEWAY</span>_CONNECT_ADDRESS)<span 
class="hljs-comment">;</span>
-URI uri = URI<span class="hljs-preprocessor">.create</span>(<span 
class="hljs-string">"ws://"</span> + gatewayAddress + <span 
class="hljs-string">"/pubsub"</span>)<span class="hljs-comment">;</span>
-
-PubSubWebSocketAppDataQuery wsIn = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"Query"</span>, PubSubWebSocketAppDataQuery<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-wsIn<span class="hljs-preprocessor">.setUri</span>(uri)<span 
class="hljs-comment">;</span>
-wsIn<span class="hljs-preprocessor">.setTopic</span>(<span 
class="hljs-string">"Query Topic"</span>)<span class="hljs-comment">;</span>
-
-PubSubWebSocketAppDataResult wsOut = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"QueryResult"</span>, PubSubWebSocketAppDataResult<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-wsOut<span class="hljs-preprocessor">.setUri</span>(uri)<span 
class="hljs-comment">;</span>
-wsOut<span class="hljs-preprocessor">.setTopic</span>(<span 
class="hljs-string">"Result Topic"</span>)<span class="hljs-comment">;</span>
-
-dag<span class="hljs-preprocessor">.addStream</span>(<span 
class="hljs-string">"Query"</span>, wsIn<span 
class="hljs-preprocessor">.outputPort</span>, store<span 
class="hljs-preprocessor">.query</span>)<span class="hljs-comment">;</span>
-dag<span class="hljs-preprocessor">.addStream</span>(<span 
class="hljs-string">"QueryResult"</span>, store<span 
class="hljs-preprocessor">.queryResult</span>, wsOut<span 
class="hljs-preprocessor">.input</span>)<span 
class="hljs-comment">;</span></code></pre>
-<h3 id="putting-it-all-together">Putting it all Together</h3>
-<p>When you combine all the pieces described above, an application that 
visualizes <strong>AdEvents</strong> looks like this:</p>
-<pre class="prettyprint"><code class=" hljs 
avrasm">@ApplicationAnnotation(name=<span 
class="hljs-string">"AdEventDemo"</span>)
-public class AdEventDemo implements StreamingApplication
-{
-  public static final String EVENT_SCHEMA = <span 
class="hljs-string">"adsGenericEventSchema.json"</span><span 
class="hljs-comment">;</span>
-
-  @Override
-  public void populateDAG(DAG dag, Configuration conf)
-  {
-    //This loads the eventSchema<span class="hljs-preprocessor">.json</span> 
file which is a jar resource file.
-    String eventSchema = SchemaUtils<span 
class="hljs-preprocessor">.jarResourceFileToString</span>(<span 
class="hljs-string">"eventSchema.json"</span>)<span 
class="hljs-comment">;</span>
-
-    //Operator that receives Ad Events
-    AdEventReceiver receiver = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span class="hljs-string">"Event 
Receiver"</span>, AdEventReceiver<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-
-    //Dimension Computation
-    DimensionsComputationFlexibleSingleSchemaPOJO dimensions = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"DimensionsComputation"</span>, 
DimensionsComputationFlexibleSingleSchemaPOJO<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-
-    Map&lt;String, String&gt; keyToExpression = Maps<span 
class="hljs-preprocessor">.newHashMap</span>()<span 
class="hljs-comment">;</span>
-    keyToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"advertiser"</span>, <span 
class="hljs-string">"getAdvertiser()"</span>)<span class="hljs-comment">;</span>
-    keyToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"location"</span>, <span 
class="hljs-string">"getLocation()"</span>)<span class="hljs-comment">;</span>
-    keyToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"time"</span>, <span 
class="hljs-string">"getTime()"</span>)<span class="hljs-comment">;</span>
-
-    Map&lt;String, String&gt; aggregateToExpression = Maps<span 
class="hljs-preprocessor">.newHashMap</span>()<span 
class="hljs-comment">;</span>
-    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"cost"</span>, <span 
class="hljs-string">"getCost()"</span>)<span class="hljs-comment">;</span>
-    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"revenue"</span>, <span 
class="hljs-string">"getRevenue()"</span>)<span class="hljs-comment">;</span>
-    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"impressions"</span>, <span 
class="hljs-string">"getImpressions()"</span>)<span 
class="hljs-comment">;</span>
-    aggregateToExpression<span class="hljs-preprocessor">.put</span>(<span 
class="hljs-string">"clicks"</span>, <span 
class="hljs-string">"getClicks()"</span>)<span class="hljs-comment">;</span>
-
-    dimensions<span 
class="hljs-preprocessor">.setKeyToExpression</span>(keyToExpression)<span 
class="hljs-comment">;</span>
-    dimensions<span 
class="hljs-preprocessor">.setAggregateToExpression</span>(aggregateToExpression)<span
 class="hljs-comment">;</span>
-    dimensions<span 
class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span 
class="hljs-comment">;</span>
-
-    dimensions<span class="hljs-preprocessor">.setUnifier</span>(new 
DimensionsComputationUnifierImpl&lt;InputEvent, Aggregate&gt;())<span 
class="hljs-comment">;</span>
-
-    //Dimension Store
-    AppDataSingleSchemaDimensionStoreHDHT store = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"Store"</span>, AppDataSingleSchemaDimensionStoreHDHT<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-
-    TFileImpl hdsFile = new TFileImpl<span 
class="hljs-preprocessor">.DTFileImpl</span>()<span 
class="hljs-comment">;</span>
-    hdsFile<span class="hljs-preprocessor">.setBasePath</span>(<span 
class="hljs-string">"dataStorePath"</span>)<span class="hljs-comment">;</span>
-    store<span class="hljs-preprocessor">.setFileStore</span>(hdsFile)<span 
class="hljs-comment">;</span>
-    store<span 
class="hljs-preprocessor">.setConfigurationSchemaJSON</span>(eventSchema)<span 
class="hljs-comment">;</span>
-
-    String gatewayAddress = dag<span 
class="hljs-preprocessor">.getValue</span>(DAG<span 
class="hljs-preprocessor">.GATEWAY</span>_CONNECT_ADDRESS)<span 
class="hljs-comment">;</span>
-    URI uri = URI<span class="hljs-preprocessor">.create</span>(<span 
class="hljs-string">"ws://"</span> + gatewayAddress + <span 
class="hljs-string">"/pubsub"</span>)<span class="hljs-comment">;</span>
-
-    PubSubWebSocketAppDataQuery wsIn = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"Query"</span>, PubSubWebSocketAppDataQuery<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-    wsIn<span class="hljs-preprocessor">.setUri</span>(uri)<span 
class="hljs-comment">;</span>
-    wsIn<span class="hljs-preprocessor">.setTopic</span>(<span 
class="hljs-string">"Query Topic"</span>)<span class="hljs-comment">;</span>
-
-    PubSubWebSocketAppDataResult wsOut = dag<span 
class="hljs-preprocessor">.addOperator</span>(<span 
class="hljs-string">"QueryResult"</span>, PubSubWebSocketAppDataResult<span 
class="hljs-preprocessor">.class</span>)<span class="hljs-comment">;</span>
-    wsOut<span class="hljs-preprocessor">.setUri</span>(uri)<span 
class="hljs-comment">;</span>
-    wsOut<span class="hljs-preprocessor">.setTopic</span>(<span 
class="hljs-string">"Result Topic"</span>)<span class="hljs-comment">;</span>
-
-    //Configure Streams
-
-    dag<span class="hljs-preprocessor">.addStream</span>(<span 
class="hljs-string">"Query"</span>, wsIn<span 
class="hljs-preprocessor">.outputPort</span>, store<span 
class="hljs-preprocessor">.query</span>)<span class="hljs-comment">;</span>
-    dag<span class="hljs-preprocessor">.addStream</span>(<span 
class="hljs-string">"QueryResult"</span>, store<span 
class="hljs-preprocessor">.queryResult</span>, wsOut<span 
class="hljs-preprocessor">.input</span>)<span class="hljs-comment">;</span>
-
-    dag<span class="hljs-preprocessor">.addStream</span>(<span 
class="hljs-string">"InputStream"</span>, receiver<span 
class="hljs-preprocessor">.output</span>, dimensions<span 
class="hljs-preprocessor">.input</span>)<span class="hljs-comment">;</span>
-    dag<span class="hljs-preprocessor">.addStream</span>(<span 
class="hljs-string">"DimensionalData"</span>, dimensions<span 
class="hljs-preprocessor">.output</span>, store<span 
class="hljs-preprocessor">.input</span>)<span class="hljs-comment">;</span>
-  }
-}</code></pre>
-<h3 id="visualizing-the-aggregations">Visualizing The Aggregations</h3>
-<p>When you launch your application you can visualize the aggregations of 
AdEvents over time by adding a widget to a visualization dashboard.</p>
-<p><img title="" 
src="https://docs.google.com/drawings/d/1wcHlgORqQYRdlnvkp3K7R9-2BllsW2jrgSBERqOq2jg/pub?w=960&amp;h=720";
 alt="enter image description here" /></p>
-<h3 id="conclusion">Conclusion</h3>
-<p>Aggregating huge amounts of data in real time is a challenge that many 
enterprises face today. Dimensions Computation is valuable for aggregating 
data, and DataTorrent provides an implementation of Dimensions Computation 
that allows users to integrate data aggregation with their applications with 
minimal effort.</p>
-<h3 id="conclusion">Resources</h3>
-<p>Download DataTorrent Sandbox <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Sandbox-Edition-Download.html"; 
target="_blank">here</a></p>
-<p>Download DataTorrent Enterprise Edition <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Enteprise-Edition-Download.html";
 target="_blank">here</a></p>
-<p>Join Apache Apex Meetups <a 
href="https://www.datatorrent.com/meetups/";>here</a></p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/";>Dimensions
 Computation (Aggregate Navigator) Part 2: Implementation</a> appeared first on 
<a rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/dimensions-computation-aggregate-navigator-part-2-implementation/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Dimensions Computation (Aggregate Navigator) Part 1: 
Intro</title>
-               
<link>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/</link>
-               
<comments>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/#comments</comments>
-               <pubDate>Tue, 03 Nov 2015 08:00:29 +0000</pubDate>
-               <dc:creator><![CDATA[tim farkas]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2399</guid>
-               <description><![CDATA[<p>Introduction In the world of big data, 
enterprises have a common problem. They have large volumes of data flowing into 
their systems for which they need to observe historical trends in real-time. 
Consider the case of a digital advertising publisher that is receiving hundreds 
of thousands of click events every second. Looking at the history [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/";>Dimensions
 Computation (Aggregate Navigator) Part 1: Intro</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<h2 
id="introduction">Introduction</h2>
-<p>In the world of big data, enterprises have a common problem. They have 
large volumes of data flowing into their systems for which they need to observe 
historical trends in real-time. Consider the case of a digital advertising 
publisher that is receiving hundreds of thousands of click events every second. 
Looking at the history of individual clicks and impressions doesn’t tell the 
publisher much about what is going on. A technique the publisher might employ 
is to track the total number of clicks and impressions for every second, 
minute, hour, and day. Such a technique might help find global trends in their 
systems, but may not provide enough granularity to take action on localized 
trends. The technique will need to be powerful enough to spot local trends. For 
example, the total clicks and impressions for an advertiser, a geographical 
area, or a combination of the two can provide some actionable insight. This 
process of receiving individual events, aggregating them over time, and
  drilling down into the data using some parameters like “advertiser” and 
“location” is called Dimensions Computation.</p>
-<p>Dimensions Computation is a powerful mechanism that allows you to spot 
trends in your streaming data in real-time. In this post we’ll cover the key 
concepts behind Dimensions Computation and outline the process of performing 
Dimensions Computation. We will also show you how to use DataTorrent’s 
out-of-the-box enterprise operators to easily add Dimensions Computation to 
your application.</p>
-<h2 id="the-process">The Process</h2>
-<p>Let us continue with our example of an advertising publisher. Let us now 
see the steps that the publisher might take to ensure that large volumes of raw 
advertisement data are converted into meaningful historical views of their 
advertisement events.</p>
-<h3 id="the-data">The Data</h3>
-<p>Typically advertising publishers receive packets of information for each 
advertising event. The events that a publisher receives might look like 
this:</p>
-<pre class="prettyprint"><code class=" hljs cs">    <span 
class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> 
AdEvent
-    {
-        <span class="hljs-comment">//The name of the company that is 
advertising</span>
-      <span class="hljs-keyword">public</span> String advertiser;
-      <span class="hljs-comment">//The geographical location of the person 
initiating the event</span>
-      <span class="hljs-keyword">public</span> String location;
-      <span class="hljs-comment">//How much the advertiser was charged for the 
event</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">double</span> cost;
-      <span class="hljs-comment">//How much revenue was generated for the 
advertiser</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">double</span> revenue;
-      <span class="hljs-comment">//The number of impressions the advertiser 
received from this event</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">long</span> impressions;
-      <span class="hljs-comment">//The number of clicks the advertiser 
received from this event</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">long</span> clicks;
-      <span class="hljs-comment">//The timestamp of the event in 
milliseconds</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">long</span> time;
-    }</code></pre>
-<p>The class <strong>AdEvent</strong> contains two types of data:</p>
-<ul>
-<li><strong>Aggregates</strong>: The data that is combined using 
aggregators.</li>
-<li><strong>Keys</strong>: The data that is used to select aggregations at a 
finer granularity.</li>
-</ul>
-<h4 id="aggregates">Aggregates</h4>
-<p>The aggregates in our <strong>AdEvent</strong> object are the pieces of 
data, which we must combine using aggregators in order to obtain a meaningful 
historical view. In this case, we can think of combining cost, revenue, 
impressions, and clicks. So these are our aggregates. We won’t obtain 
anything useful by aggregating the location and advertiser strings in our 
<strong>AdEvent</strong>, so those are not considered aggregates. It’s 
important to note that aggregates are considered separate entities. This means 
that the cost field of an <strong>AdEvent</strong> cannot be combined with its 
revenue field; cost values can only be aggregated with other cost values and 
revenue values can only be aggregated with other revenue values.</p>
-<p>In summary the aggregates in our <strong>AdEvent</strong> are:</p>
-<ul>
-<li><strong>cost</strong></li>
-<li><strong>revenue</strong></li>
-<li><strong>impressions</strong></li>
-<li><strong>clicks</strong></li>
-</ul>
-<h4 id="keys">Keys</h4>
-<p>The keys in our <strong>AdEvent</strong> object are used for selecting 
aggregations at a finer granularity. For example, it would make sense to look 
at the number of clicks for a particular advertiser, the number of clicks for a 
certain location, and the number of clicks for a certain location and 
advertiser combination. So location and advertiser are keys. Time is also 
another key since it is useful to look at the number of clicks received in a 
particular time range (for example, 12:00 pm through 1:00 pm or 12:00 pm 
through 12:01 pm).</p>
-<p>In summary the keys in our <strong>AdEvent</strong> are:</p>
-<ul>
-<li><strong>advertiser</strong></li>
-<li><strong>location</strong></li>
-<li><strong>time</strong></li>
-</ul>
-<h3 id="computing-the-aggregations">Computing The Aggregations</h3>
-<p>When the publisher receives a new <strong>AdEvent</strong> the event is 
added to running aggregations in real time. The keys and aggregates in the 
<strong>AdEvent</strong> are used to compute aggregations. How the aggregations 
are computed and the number of aggregations computed are determined by three 
tunable parameters:</p>
-<ul>
-<li><strong>Aggregators</strong></li>
-<li><strong>Dimensions Combinations</strong></li>
-<li><strong>Time Buckets</strong></li>
-</ul>
-<h4 id="aggregators">Aggregators</h4>
-<p>Dimensions Computation supports more than just one type of aggregation, and 
multiple aggregators can be used to combine incoming data at once. Some of the 
aggregators available out-of-the-box are:</p>
-<ul>
-<li><strong>Sum</strong></li>
-<li><strong>Count</strong></li>
-<li><strong>Min</strong></li>
-<li><strong>Max</strong></li>
-</ul>
-<p>As an example, suppose the publisher is not using the keys in their 
<strong>AdEvents</strong> and wants to perform a sum and a max 
aggregation.</p>
-<p><strong>1.</strong> An AdEvent arrives. The AdEvent is aggregated to the 
Sum and Max aggregations.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/1upf5hv-UDT4BKhm7yTrcuFZYqnI263vMTXioKhr_qTo/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /><br />
-<strong>2.</strong> Another AdEvent arrives. The AdEvent is aggregated to the 
Sum and Max aggregations.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/10gTXjMyxanYo9UFc76IShPxOi5G7U5tvQKtfwqGyIws/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<p>As can be seen from the example above, each <strong>AdEvent</strong> 
contributes to two aggregations.</p>
-<h4 id="dimension-combinations">Dimension Combinations</h4>
-<p>Each <strong>AdEvent</strong> does not necessarily contribute to only one 
aggregation for each aggregator. In our advertisement example there are 4 
<strong>dimension combinations</strong> over which aggregations can be 
computed.</p>
-<ul>
-<li><strong>advertiser:</strong> This dimension combination is comprised of 
just the advertiser value. This means that all the aggregates for 
<strong>AdEvents</strong> with a particular value for advertiser (for example, 
Gamestop) are aggregated.</li>
-<li><strong>location:</strong> This dimension combination is comprised of just 
the location value. This means that all the aggregates for 
<strong>AdEvents</strong> with a particular value for location (for example, 
California) are aggregated.</li>
-<li><strong>advertiser, location:</strong> This dimension combination is 
comprised of the advertiser and location values. This means that all the 
aggregates for <strong>AdEvents</strong> with the same advertiser and location 
combination (for example, Gamestop, California) are aggregated.</li>
-<li><strong>the empty combination:</strong> This combination is a <em>global 
aggregation</em> because it doesn’t use any of the keys in the 
<strong>AdEvent</strong>. This means that all the <strong>AdEvents</strong> 
are aggregated.</li>
-</ul>
-<p>Therefore if a publisher is using the four dimension combinations 
enumerated above along with the sum and max aggregators, the number of 
aggregations being maintained would be:</p>
-<p>NUM_AGGS = 2 x <em>(number of unique advertisers)</em> + 2 x <em>(number of 
unique locations)</em> + 2 x <em>(number of unique advertiser and location 
combinations)</em> + 2</p>
-<p>And each individual <strong>AdEvent</strong> will contribute to <em>(number 
of aggregators)</em> x <em>(number of dimension combinations)</em> 
aggregations.</p>
-<p>Here is an example of how NUM_AGGS aggregations are computed:</p>
-<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to aggregations for each aggregator and 
each dimension combination.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/1qx8gLu615KneLDspsGkAS0_OlkX-DyvCUA7DAJtYJys/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /><br />
-<strong>2.</strong> Another <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to aggregations for each aggregator and 
each dimension combination.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/1FA2IyxewwzXtJ9A8JfJPrKtx-pfWHtHpVXp8lb8lKmE/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /><br />
-<strong>3.</strong> Another <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to aggregations for each aggregator and 
each dimension combination.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/15sxwfZeYOKBiapoD2o721M4rZs-bZBxhF3MeXelnu6M/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<p>As can be seen from the example above each <strong>AdEvent</strong> 
contributes to 2 x 4 = 8 aggregations and there are 2 x 2 + 2 x 2 + 2 x 3 + 2 = 
16 aggregations in total.</p>
-<h4 id="time-buckets">Time Buckets</h4>
-<p>In addition to computing multiple aggregations for each dimension 
combination, aggregations can also be performed over time buckets. Time buckets 
are windows of time (for example, 1:00 pm through 1:01 pm) that are specified 
by a simple notation: 1m is one minute, 1h is one hour, 1d is one day. When 
aggregations are performed over time buckets, separate aggregations are 
maintained for each time bucket. Aggregations for a time bucket are comprised 
only of events with a time stamp that falls into that time bucket.</p>
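-<p>As a rough sketch of what the notation means (the exact internal bucketing 
code may differ, and day buckets here assume UTC), a timestamp can be truncated 
to the start of the bucket it falls into:</p>
-<pre class="prettyprint"><code class=" hljs java">long time = adEvent.time;                               // epoch milliseconds from the AdEvent
-long oneMinuteBucketStart = time - (time % 60000L);     // start of the "1m" bucket
-long oneHourBucketStart   = time - (time % 3600000L);   // start of the "1h" bucket
-long oneDayBucketStart    = time - (time % 86400000L);  // start of the "1d" bucket</code></pre>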
-<p>An example of how these time bucketed aggregations are computed is as 
follows:</p>
-<p>Let’s say our advertisement publisher is interested in computing the Sum 
and Max of <strong>AdEvents</strong> for the dimension combinations comprised 
of <strong>advertiser</strong> and <strong>location</strong> over 1 minute and 
1 hour time buckets.</p>
-<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to the aggregations for the appropriate 
aggregator, dimension combination and time bucket.</p>
-<p><img title="" 
src="https://docs.google.com/drawings/d/11voOdqkagpGKcWn5HOiWWAn78fXlpGl7aXUa3tG5sQc/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<p><strong>2.</strong> Another <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to the aggregations for the appropriate 
aggregator, dimension combination and time bucket.</p>
-<p><img title="" 
src="https://docs.google.com/drawings/d/1ffovsxWZfHnSc_Z30RzGIXgzQeHjCnyZBoanO_xT_e4/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<h4 id="conclusion">Conclusion</h4>
-<p>In summary, the three tunable parameters discussed above 
(<strong>Aggregators</strong>, <strong>Dimension Combinations</strong>, and 
<strong>Time Buckets</strong>) determine how aggregations are computed. In the 
examples provided in the <strong>Aggregators</strong>, <strong>Dimension 
Combinations</strong>, and <strong>Time Buckets</strong> sections respectively, 
we have incrementally increased the complexity in which the aggregations are 
performed. The examples provided in the <strong>Aggregators</strong> and 
<strong>Dimension Combinations</strong> sections were for illustration purposes 
only; the example provided in the <strong>Time Buckets</strong> section 
provides an accurate view of how aggregations are computed within 
DataTorrent&#8217;s enterprise operators.</p>
-<p>Download DataTorrent Sandbox <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Sandbox-Edition-Download.html"; 
target="_blank">here</a></p>
-<p>Download DataTorrent Enterprise Edition <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Enteprise-Edition-Download.html";
 target="_blank">here</a></p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/";>Dimensions
 Computation (Aggregate Navigator) Part 1: Intro</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Cisco ACI, Big Data, and DataTorrent</title>
-               <link>https://www.datatorrent.com/blog_cisco_aci/</link>
-               
<comments>https://www.datatorrent.com/blog_cisco_aci/#comments</comments>
-               <pubDate>Tue, 27 Oct 2015 22:30:07 +0000</pubDate>
-               <dc:creator><![CDATA[Charu Madan]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2348</guid>
-               <description><![CDATA[<p>By: Harry Petty, Data Center and Cloud 
Networking, Cisco  (This blog has been developed in association with Farid 
Jiandani, Product Manager with Cisco’s Insieme Networks Business Unit and 
Charu Madan, Director Business Development at DataTorrent. It was originally 
published on Cisco Blogs) If you work for an enterprise that’s looking to hit 
its digital sweet [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog_cisco_aci/";>Cisco ACI, Big Data, and 
DataTorrent</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p>By: Harry Petty, 
Data Center and Cloud Networking, Cisco</p>
-<p class="c0 c11"><a name="h.gjdgxs"></a><span class="c1"> (</span><span 
class="c4 c13">This blog has been developed in association with Farid Jiandani, 
Product Manager with Cisco’s Insieme Networks Business Unit and Charu Madan, 
Director Business Development at DataTorrent. It was originally published on <a 
href="http://blogs.cisco.com/datacenter/aci-big-data-and-datatorrent"; 
target="_blank">Cisco Blogs</a>)</span></p>
-<p class="c0"><span class="c1">If you work for an enterprise that’s looking 
to hit its digital sweet spot, then you’re scrutinizing your sales, marketing 
and operations to see where you should make digital investments to innovate and 
improve productivity. Super-fast data processing at scale is being used to 
obtain real-time insights for digital business and Internet of Things (IoT) 
initiatives.</span></p>
-<p class="c0"><span class="c1">According to Gartner Group, one of the cool 
vendors in this area of providing super-fast big data analysis using in-memory 
streaming analytics is called DataTorrent, a startup founded by long-time 
ex-Yahoo! veterans with vast experience managing big data for leading edge 
applications and infrastructure at massive scale. Their goal is to empower 
today’s enterprises to experience the full potential and business impact of 
big data with a platform that processes and analyzes data in 
real-time.</span></p>
-<p class="c0"><span class="c1 c2">DataTorrent RTS</span></p>
-<p class="c0"><span class="c4 c6">DataTorrent RTS is an open core</span><span 
class="c2 c4 c6">, enterprise-grade product powered by Apache Apex. 
</span><span class="c4 c6">DataTorrent RTS provides a single, unified batch and 
stream processing platform that enables organizations to reduce time to market, 
development costs and operational expenditures for big data analytics 
applications. </span></p>
-<p class="c0"><span class="c1 c2">DataTorrent RTS Integration with 
ACI</span></p>
-<p class="c0"><span class="c4 c6">A member of the Cisco ACI ecosystem, 
DataTorrent announced on September 29th DataTorrent RTS integration with Cisco 
</span><span class="c4 c6"><a class="c7" 
href="https://www.google.com/url?q=http://www.cisco.com/c/en/us/solutions/data-center-virtualization/application-centric-infrastructure/index.html&amp;sa=D&amp;usg=AFQjCNFMZhMYdUmPuuqrUI5IZmrvEhlK5g";>Application
 Centric Infrastructure (ACI)</a></span><span class="c4 c6"> through the 
Application Policy Infrastructure Controller (APIC) to help create more 
efficient IT operations, bringing together network operations management and 
big data application management and development: </span></p>
-<p class="c0"><span class="c4"><a class="c7" 
href="https://www.google.com/url?q=https://www.datatorrent.com/press-releases/datatorrent-integrates-with-cisco-aci-to-help-secure-big-data-processing-through-a-unified-data-and-network-fabric/&amp;sa=D&amp;usg=AFQjCNG4S_2-OY5ox5nCf_0_Qj7s-x9pyw";>DataTorrent
 Integrates with Cisco ACI to Help Secure Big Data Processing Through a Unified 
Data and Network Fabric</a></span><span class="c4">. </span><span class="c2 
c4">The joint solution enables</span></p>
-<ul class="c8 lst-kix_list_2-0 start">
-<li class="c12 c0"><span class="c4">A unified fabric approach for managing 
</span><span class="c2 c4">Applications, Data </span><span class="c4">and 
</span><span class="c2 c4">Network</span></li>
-<li class="c0 c12"><span class="c4">A highly secure and automated Big Data 
application platform which uses the power of Cisco ACI for automation and 
security policy management </span></li>
-<li class="c12 c0"><span class="c4">The creation, repository, and enforcement 
point for Cisco ACI application policies for big data applications</span></li>
-</ul>
-<p class="c0"><span class="c4">With the ACI integration, secure connectivity 
to diverse data sets becomes a part of a user defined policy which is automated 
and does not compromise on security and access management. As an example, if 
one of the DataTorrent Big Data application needs access to say a Kafka source, 
then all nodes need to be opened up. This leaves the environment vulnerable and 
prone to attacks. With ACI, the access management policies and contracts help 
define the connectivity contracts and only the right node and right application 
gets access. See Figure 1 and 2 for the illustration of this concept. 
</span></p>
-<p class="c0"><span class="c1 c2">Figure 1:</span></p>
-<p class="c0"><a 
href="https://www.datatorrent.com/wp-content/uploads/2015/10/image00.jpg";><img 
class="alignnone size-full wp-image-2349" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image00.jpg"; 
alt="image00" width="432" height="219" /></a></p>
-<p class="c0"><span class="c1 c2">Figure 2</span></p>
-<p class="c0"><a 
href="https://www.datatorrent.com/wp-content/uploads/2015/10/image011.png";><img 
class="alignnone size-full wp-image-2350" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image011.png"; 
alt="image01" width="904" height="493" /></a></p>
-<p class="c0"><span class="c1 c2">ACI Support for Big Data Solutions</span></p>
-<p class="c0"><span class="c1">The openness and the flexibility of ACI allow 
big data customers to run a wide variety of different applications within their 
fabric alongside Hadoop. Due to the elasticity of ACI, customers are able to 
run batch processing alongside stream processing and other data base 
applications in a seamless fashion. In traditional Hadoop environments, the 
network is segmented based off of individual server nodes (see Figure 1). This 
makes it difficult to elastically allow access to and from different 
applications. Ultimately, within the ACI framework, logical demarcation points 
can be created based on application workloads rather than physical server 
groups (a set of Hadoop nodes should not be considered as a bunch of individual 
server nodes, rather a single group.)</span></p>
-<p class="c0"><span class="c1 c2">A Robust and Active Ecosystem</span></p>
-<p class="c0"><span class="c1">Many vendors claim they have a broad ecosystem 
of vendors, but sometimes that’s pure marketing, without any real integration 
efforts going on behind the slideware. But Cisco’s Application Centric 
Infrastructure (ACI) has a very active ecosystem of industry leaders who are 
putting significant resources into integration efforts, taking advantage of 
ACI’s open Northbound and Southbound API’s. DataTorrent is just one example 
of an innovative company that is using ACI integration to add value to their 
solutions and deliver real benefits to their channel partners and 
customers.</span></p>
-<p class="c0"><span class="c1">Stay tuned for more success stories to come: 
we’ll continue to showcase industry leaders that are taking advantage of the 
open ACI API’s.</span></p>
-<p class="c0"><span class="c1">Additional References</span></p>
-<p class="c0"><span class="c3"><a class="c7" 
href="https://www.google.com/url?q=https://www.cisco.com/go/aci&amp;sa=D&amp;usg=AFQjCNHPa1zEn6-1fEWQeCgZ-QmP9te5ig";>https://www.cisco.com/go/aci</a></span></p>
-<p class="c0"><span class="c3"><a class="c7" 
href="https://www.google.com/url?q=https://www.cisco.com/go/aciecosystem&amp;sa=D&amp;usg=AFQjCNGmS3P3mOU0DQen5F43--fDi25uWw";>https://www.cisco.com/go/aciecosystem</a></span></p>
-<p class="c11 c0"><span class="c3"><a class="c7" 
href="https://www.google.com/url?q=http://www.datatorrent/com&amp;sa=D&amp;usg=AFQjCNHbzoCVBy0azkWTbjpqdyxPqkCo9g";>http://www.datatorrent/</a></span></p>
-<p>&nbsp;</p>
-<p>&nbsp;</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog_cisco_aci/";>Cisco ACI, Big Data, and 
DataTorrent</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog_cisco_aci/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Write Your First Apache Apex Application in Scala</title>
-               
<link>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/</link>
-               
<comments>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/#comments</comments>
-               <pubDate>Tue, 27 Oct 2015 01:58:25 +0000</pubDate>
-               <dc:creator><![CDATA[Tushar Gosavi]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2280</guid>
-<description><![CDATA[<p>* Extend your Scala expertise to building Apache Apex applications * Scala is a modern, multi-paradigm programming language that elegantly integrates features of functional as well as object-oriented languages. Big Data frameworks are already exploring Scala as a language of choice for implementations. Apache Apex is developed in Java, and the Apex APIs are such that writing [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/";>Write
 Your First Apache Apex Application in Scala</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p><em>* Extend your 
Scala expertise to building Apache Apex applications *</em></p>
-<p>Scala is a modern, multi-paradigm programming language that elegantly integrates features of functional as well as object-oriented languages. Big Data frameworks are already exploring Scala as a language of choice for implementations. Apache Apex is developed in Java, and its APIs are designed so that writing applications is a smooth sail. Developers can use any programming language that runs on the JVM and can access Java classes; because Scala has good interoperability with Java, writing Apex applications in Scala is a fuss-free experience. In this post, we explain how to write an Apache Apex application in Scala.</p>
-<p>Writing an <a href="http://www.datatorrent.com/apex"; 
target="_blank">Apache Apex</a> application in Scala is simple.</p>
-<h2 id="operators-within-the-application">Operators within the application</h2>
-<p>We will develop a word count application in Scala. The application looks for new files in a directory; as new files become available, it reads them, computes a count for each word, and prints the results on stdout. The application requires the following operators:</p>
-<ul>
-<li><strong>LineReader</strong> &#8211; This operator monitors directories for 
new files periodically. After a new file is detected, LineReader reads the file 
line-by-line, and makes lines available on the output port for the next 
operator.</li>
-<li><strong>Parser</strong> &#8211; This operator receives lines read by 
LineReader on its input port. Parser breaks the line into words, and makes 
individual words available on the output port.</li>
-<li><strong>UniqueCounter</strong> &#8211; This operator computes the count of 
each word received on its input port.</li>
-<li><strong>ConsoleOutputOperator</strong> &#8211; This operator prints unique 
counts of words on standard output.</li>
-</ul>
-<h2 id="build-the-scala-word-count-application">Build the Scala word count 
application</h2>
-<p>Now, we will generate a sample application using the Maven archetype:generate goal.</p>
-<h3 id="generate-a-sample-application">Generate a sample application.</h3>
-<pre class="prettyprint"><code class="language-bash hljs ">mvn 
archetype:generate 
-DarchetypeRepository=https://www.datatorrent.com/maven/content/repositories/releases
 -DarchetypeGroupId=com.datatorrent -DarchetypeArtifactId=apex-app-archetype 
-DarchetypeVersion=<span class="hljs-number">3.0</span>.<span 
class="hljs-number">0</span> -DgroupId=com.datatorrent 
-Dpackage=com.datatorrent.wordcount -DartifactId=wordcount -Dversion=<span 
class="hljs-number">1.0</span>-SNAPSHOT</code></pre>
-<p>This creates a directory called <strong>wordcount</strong>, with a sample 
application and build script. Let us see how to modify this application into 
the Scala-based word count application that we are looking to develop.</p>
-<h3 id="add-the-scala-build-plugin">Add the Scala build plugin.</h3>
-<p>Apache Apex uses maven for building the framework and operator library. 
Maven supports a plugin for compiling Scala files. To enable this plugin, add 
the following snippet to the <code>build -&gt; plugins</code> sections of the 
<code>pom.xml</code> file that is located in the application directory.</p>
-<pre class="prettyprint"><code class="language-xml hljs ">  <span 
class="hljs-tag">&lt;<span class="hljs-title">plugin</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">groupId</span>&gt;</span>net.alchim31.maven<span 
class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">artifactId</span>&gt;</span>scala-maven-plugin<span 
class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">version</span>&gt;</span>3.2.1<span 
class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">executions</span>&gt;</span>
-      <span class="hljs-tag">&lt;<span 
class="hljs-title">execution</span>&gt;</span>
-        <span class="hljs-tag">&lt;<span 
class="hljs-title">goals</span>&gt;</span>
-          <span class="hljs-tag">&lt;<span 
class="hljs-title">goal</span>&gt;</span>compile<span 
class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
-          <span class="hljs-tag">&lt;<span 
class="hljs-title">goal</span>&gt;</span>testCompile<span 
class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
-        <span class="hljs-tag">&lt;/<span 
class="hljs-title">goals</span>&gt;</span>
-      <span class="hljs-tag">&lt;/<span 
class="hljs-title">execution</span>&gt;</span>
-    <span class="hljs-tag">&lt;/<span 
class="hljs-title">executions</span>&gt;</span>
-  <span class="hljs-tag">&lt;/<span 
class="hljs-title">plugin</span>&gt;</span></code></pre>
-<p>Also, specify the Scala library as a dependency in the pom.xml file.<br />
-Add the Scala library.</p>
-<pre class="prettyprint"><code class="language-xml hljs "><span 
class="hljs-tag">&lt;<span class="hljs-title">dependency</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">groupId</span>&gt;</span>org.scala-lang<span 
class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">artifactId</span>&gt;</span>scala-library<span 
class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">version</span>&gt;</span>2.11.2<span 
class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
-<span class="hljs-tag">&lt;/<span 
class="hljs-title">dependency</span>&gt;</span></code></pre>
-<p>We are now set to write a Scala application.</p>
-<h2 id="write-your-scala-word-count-application">Write your Scala word count 
application</h2>
-<h3 id="linereader">LineReader</h3>
-<p><a href="https://github.com/apache/incubator-apex-malhar"; 
target="_blank">Apache Malhar library</a> contains an <a 
href="https://github.com/apache/incubator-apex-malhar/blob/1f5676b455f7749d11c7cd200216d0d4ad7fce32/library/src/main/java/com/datatorrent/lib/io/AbstractFTPInputOperator.java";
 target="_blank">AbstractFileInputOperator</a> operator that monitors and reads 
files from a directory. This operator has capabilities such as support for 
scaling, fault tolerance, and exactly once processing. To complete the 
functionality, override a few methods:<br />
-<em>readEntity</em> : Reads a line from a file.<br />
-<em>emit</em> : Emits data read on the output port.<br />
-We have overridden openFile to obtain an instance of BufferedReader that is 
required while reading lines from the file. We also override closeFile for 
closing an instance of BufferedReader.</p>
-<pre class="prettyprint"><code class="language-scala hljs "><span 
class="hljs-class"><span class="hljs-keyword">class</span> <span 
class="hljs-title">LineReader</span> <span class="hljs-keyword">extends</span> 
<span class="hljs-title">AbstractFileInputOperator</span>[<span 
class="hljs-title">String</span>] {</span>
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">val</span> out : DefaultOutputPort[String] = 
<span class="hljs-keyword">new</span> DefaultOutputPort[String]();
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> readEntity(): String = br.readLine()
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> emit(line: String): Unit = out.emit(line)
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> openFile(path: Path): InputStream = {
-    <span class="hljs-keyword">val</span> in = <span 
class="hljs-keyword">super</span>.openFile(path)
-    br = <span class="hljs-keyword">new</span> BufferedReader(<span 
class="hljs-keyword">new</span> InputStreamReader(in))
-    <span class="hljs-keyword">return</span> in
-  }
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> closeFile(is: InputStream): Unit = {
-    br.close()
-    <span class="hljs-keyword">super</span>.closeFile(is)
-  }
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">private</span> <span 
class="hljs-keyword">var</span> br : BufferedReader = <span 
class="hljs-keyword">null</span>
-}</code></pre>
-<p>Some Apex API classes are not serializable, and must be marked transient. Scala supports the @transient annotation for such scenarios: any object that is not part of the operator's state must be annotated with @transient. For example, in this code we have annotated the buffered reader and the output port as transient.</p>
-<h3 id="parser">Parser</h3>
-<p>Parser splits each line using the built-in Java split function, and emits individual words on the output port.</p>
-<pre class="prettyprint"><code class="language-scala hljs "><span 
class="hljs-class"><span class="hljs-keyword">class</span> <span 
class="hljs-title">Parser</span> <span class="hljs-keyword">extends</span> 
<span class="hljs-title">BaseOperator</span> {</span>
-  <span class="hljs-annotation">@BeanProperty</span>
-  <span class="hljs-keyword">var</span> regex : String = <span 
class="hljs-string">" "</span>
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">val</span> out = <span 
class="hljs-keyword">new</span> DefaultOutputPort[String]()
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">val</span> in = <span 
class="hljs-keyword">new</span> DefaultInputPort[String]() {
-    <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> process(t: String): Unit = {
-      <span class="hljs-keyword">for</span>(w &lt;- t.split(regex)) out.emit(w)
-    }
-  }
-}</code></pre>
-<p>Scala simplifies the automatic generation of setters and getters through the @BeanProperty annotation. Properties annotated with @BeanProperty can be modified at application launch time by using configuration files, and can also be modified while the application is running. For example, the regular expression used for splitting can be specified in the configuration file.</p>
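-<p>As a minimal, illustrative sketch (the ParserSettings and ParserSettingsDemo names below are hypothetical and not part of the operators above), this is what the annotation buys you: JavaBean-style accessors, generated from the Scala field, that the platform and the configuration machinery can call.</p>
-<pre class="prettyprint"><code class="language-scala">import scala.beans.BeanProperty
-
-class ParserSettings {
-  // @BeanProperty generates getRegex() and setRegex() alongside the Scala field
-  @BeanProperty
-  var regex: String = " "
-}
-
-object ParserSettingsDemo {
-  def main(args: Array[String]): Unit = {
-    val settings = new ParserSettings
-    settings.setRegex(",")     // generated JavaBean-style setter
-    println(settings.getRegex) // prints ","
-  }
-}</code></pre>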
-<h3 id="uniquecount-and-consoeloutputoperator">UniqueCount and 
ConsoelOutputOperator</h3>
-<p>For this application, let us use UniqueCounter and ConsoleOutputOperator from the Malhar library as is. A simplified sketch of what the counter does in each window is shown below.</p>
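-<p>For readers curious about what the library counter does, here is a simplified per-window sketch written in the same style as the operators above. It is illustrative only, not the Malhar implementation, and the Apex API imports are omitted as in the other snippets.</p>
-<pre class="prettyprint"><code class="language-scala">// Illustrative only: a per-window word counter exposing the same port names
-// (data, count) that the application class below wires up.
-class SimpleUniqueCounter extends BaseOperator {
-  private val counts = new java.util.HashMap[String, Integer]()
-
-  @transient
-  val count = new DefaultOutputPort[java.util.HashMap[String, Integer]]()
-
-  @transient
-  val data = new DefaultInputPort[String]() {
-    override def process(word: String): Unit = {
-      val current = counts.get(word)
-      val updated = if (current == null) 1 else current.intValue() + 1
-      counts.put(word, Integer.valueOf(updated))
-    }
-  }
-
-  // emit the counts accumulated in this window and reset for the next one
-  override def endWindow(): Unit = {
-    count.emit(new java.util.HashMap[String, Integer](counts))
-    counts.clear()
-  }
-}</code></pre>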
-<h3 id="put-together-the-word-count-application">Put together the word count 
application</h3>
-<p>Writing the main application class in Scala is similar to doing it in Java. Override the populateDAG() method, which receives the DAG instance; add operators to that instance using the addOperator() method; finally, connect the operators with the addStream() method.</p>
-<pre class="prettyprint"><code class="language-scala hljs "><span 
class="hljs-annotation">@ApplicationAnnotation</span>(name=<span 
class="hljs-string">"WordCount"</span>)
-<span class="hljs-class"><span class="hljs-keyword">class</span> <span 
class="hljs-title">Application</span> <span class="hljs-keyword">extends</span> 
<span class="hljs-title">StreamingApplication</span> {</span>
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> populateDAG(dag: DAG, configuration: 
Configuration): Unit = {
-    <span class="hljs-keyword">val</span> input = dag.addOperator(<span 
class="hljs-string">"input"</span>, <span class="hljs-keyword">new</span> 
LineReader)
-    <span class="hljs-keyword">val</span> parser = dag.addOperator(<span 
class="hljs-string">"parser"</span>, <span class="hljs-keyword">new</span> 
Parser)
-    <span class="hljs-keyword">val</span> counter = dag.addOperator(<span 
class="hljs-string">"counter"</span>, <span class="hljs-keyword">new</span> 
UniqueCounter[String])
-    <span class="hljs-keyword">val</span> out = dag.addOperator(<span 
class="hljs-string">"console"</span>, <span class="hljs-keyword">new</span> 
ConsoleOutputOperator)
-
-    dag.addStream[String](<span class="hljs-string">"lines"</span>, input.out, 
parser.in)
-    dag.addStream[String](<span class="hljs-string">"words"</span>, 
parser.out, counter.data)
-    dag.addStream[java.util.HashMap[String,Integer]](<span 
class="hljs-string">"counts"</span>, counter.count, out.input)
-  }
-}</code></pre>
-<h2 id="running-application">Running application</h2>
-<p>Before running the word count application, specify the input directory for the input operator. You can use the default configuration file for this. Open the <em>src/main/resources/META-INF/properties.xml</em> file, and add the following lines inside the root configuration element. Do not forget to replace “username” with your Hadoop username.</p>
-<pre class="prettyprint"><code class="language-xml hljs "><span 
class="hljs-tag">&lt;<span class="hljs-title">property</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">name</span>&gt;</span>dt.application.WordCount.operator.input.prop.directory<span
 class="hljs-tag">&lt;/<span class="hljs-title">name</span>&gt;</span>
-  <span class="hljs-tag">&lt;<span 
class="hljs-title">value</span>&gt;</span>/user/username/data<span 
class="hljs-tag">&lt;/<span class="hljs-title">value</span>&gt;</span>
-<span class="hljs-tag">&lt;/<span 
class="hljs-title">property</span>&gt;</span></code></pre>
-<p>Build the application from the application directory using this command:</p>
-<pre class="prettyprint"><code class="language-bash hljs ">mvn clean install 
-DskipTests</code></pre>
-<p>You should now have an application package in the target directory.</p>
-<p>Now, launch this application package using dtcli.</p>
-<pre class="prettyprint"><code class="language-bash hljs ">$ dtcli
-DT CLI <span class="hljs-number">3.2</span>.<span 
class="hljs-number">0</span>-SNAPSHOT <span 
class="hljs-number">28.09</span>.<span class="hljs-number">2015</span> @ <span 
class="hljs-number">12</span>:<span class="hljs-number">45</span>:<span 
class="hljs-number">15</span> IST rev: <span class="hljs-number">8</span>e49cfb 
branch: devel-<span class="hljs-number">3</span>
-dt&gt; launch target/wordcount-<span 
class="hljs-number">1.0</span>-SNAPSHOT.apa
-{<span class="hljs-string">"appId"</span>: <span 
class="hljs-string">"application_1443354392775_0010"</span>}
-dt (application_1443354392775_0010) &gt;</code></pre>
-<p>Add some text files to the <em>/user/username/data</em> directory on your 
HDFS to see how the application works. You can see the words along with their 
counts in the container log of the console operator.</p>
-<h2 id="summary">Summary</h2>
-<p>Scala classes are JVM classes: they can extend Java classes and can create and call Java objects transparently. That is why you can easily extend your Scala capabilities to building Apex applications.<br />
-To get started with creating your first application, see <a 
href="https://www.datatorrent.com/buildingapps/";>https://www.datatorrent.com/buildingapps/</a>.</p>
-<h2 id="see-also">See Also</h2>
-<ul>
-<li>Building Applications with Apache Apex and Malhar <a 
href="https://www.datatorrent.com/buildingapps/";>https://www.datatorrent.com/buildingapps/</a></li>
-<li>Scala tutorial for Java programmers <a href="http://docs.scala-lang.org/tutorials/scala-for-java-programmers.html";>http://docs.scala-lang.org/tutorials/scala-for-java-programmers.html</a></li>
-<li>Application Developer Guide <a 
href="https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html";>https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html</a></li>
-<li>Operator Developer Guide <a 
href="https://www.datatorrent.com/docs/guides/OperatorDeveloperGuide.html";>https://www.datatorrent.com/docs/guides/OperatorDeveloperGuide.html</a></li>
-<li>Malhar Operator Library <a 
href="https://www.datatorrent.com/docs/guides/MalharStandardOperatorLibraryTemplatesGuide.html";>https://www.datatorrent.com/docs/guides/MalharStandardOperatorLibraryTemplatesGuide.html</a></li>
-</ul>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/";>Write
 Your First Apache Apex Application in Scala</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Apache Apex Performance Benchmarks</title>
-               
<link>https://www.datatorrent.com/blog-apex-performance-benchmark/</link>
-               
<comments>https://www.datatorrent.com/blog-apex-performance-benchmark/#comments</comments>
-               <pubDate>Tue, 20 Oct 2015 13:23:27 +0000</pubDate>
-               <dc:creator><![CDATA[Vlad Rozov]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2261</guid>
-               <description><![CDATA[<p>Why another benchmark blog? As 
engineers, we are skeptical of performance benchmarks developed and published 
by software vendors. Most of the time such benchmarks are biased towards the 
company’s own product in comparison to other vendors. Any reported advantage 
may be the result of selecting a specific use case better handled by the 
product or [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-apex-performance-benchmark/";>Apache Apex 
Performance Benchmarks</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p><b 
id="apex-performance-benchmarks" class="c2 c16"><span class="c0">Why another 
benchmark blog?</span></b></p>
-<p class="c2">As engineers, we are skeptical of performance benchmarks 
developed and published by software vendors. Most of the time such benchmarks 
are biased towards the company’s own product in comparison to other vendors. 
Any reported advantage may be the result of selecting a specific use case 
better handled by the product or using optimized configuration for one’s own 
product compared to out of the box configuration for other vendors’ 
products.</p>
-<p class="c2">So, why another blog on the topic? The reason I decided to write 
this blog is to explain the rationale behind <a 
href="http://www.datatorrent.com";>DataTorrent’s</a> effort to develop and 
maintain a benchmark performance suite, how the suite is used to certify 
various releases, and seek community opinion on how the performance benchmark 
may be improved.</p>
-<p class="c2 c4">Note: the performance numbers given here are only for 
reference and by no means a comprehensive performance evaluation of <a 
href="http://apex.apache.org/";>Apache APEX</a>; performance numbers can vary 
depending on different configurations or different use cases.</p>
-<p class="c12 c2 subtitle"><strong>Benchmark application.</strong><img class=" 
aligncenter" title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image02.png"; alt="" 
/></p>
-<p class="c2">To evaluate the performance of the <a 
href="http://apex.apache.org/";>Apache APEX</a>  platform,  we use Benchmark 
application that has a simple <a 
href="https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/";>DAG</a>
 with only two operators. The first operator (<span 
class="c3">wordGenerato</span>r) emits tuples and the second operator (<span 
class="c3">counter</span>) counts tuples received. For such trivial operations, 
operators add minimum overhead to CPU and memory consumption allowing 
measurement of <a href="http://apex.apache.org/";>Apache APEX</a>  platform 
throughput. As operators don’t change from release to release, this 
application is suitable for comparing the platform performance across 
releases.</p>
-<p class="c2">Tuples are byte arrays with configurable length, minimizing 
complexity of tuples serialization and at the same time allowing examination of 
 performance of the platform against several different tuple sizes. The 
emitter (<span class="c3">wordGenerator</span>) operator may be configured to 
use the same byte array avoiding the operator induced garbage collection. Or it 
may be configured to allocate new byte array for every new tuple emitted, more 
closely simulating real application behavior.</p>
-<p class="c2">The consumer (<span class="c3">counter</span>) operator collects 
the total number of tuples received and the wall-clock time in milliseconds 
passed between begin and end window. It writes the collected data to the log at 
the end of every 10th window.</p>
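-<p class="c2">The actual benchmark operators ship with the Malhar benchmark module; purely to illustrate the bookkeeping described above, a simplified Scala sketch of such a consumer might look like the following (class and field names are hypothetical, and the standard Apex API imports are omitted).</p>
-<pre class="prettyprint"><code class="language-scala">// Hypothetical sketch: count tuples and measure per-window wall-clock time,
-// logging the running totals at the end of every 10th window.
-class TupleCounterSketch extends BaseOperator {
-  private var tupleCount: Long = 0L
-  private var windowCount: Long = 0L
-
-  @transient
-  private var windowStartMillis: Long = 0L
-
-  @transient
-  val input = new DefaultInputPort[Array[Byte]]() {
-    override def process(tuple: Array[Byte]): Unit = tupleCount += 1
-  }
-
-  override def beginWindow(windowId: Long): Unit = {
-    windowStartMillis = System.currentTimeMillis()
-  }
-
-  override def endWindow(): Unit = {
-    windowCount += 1
-    val windowMillis = System.currentTimeMillis() - windowStartMillis
-    if (windowCount % 10 == 0) {
-      // a real operator would use a logger; println keeps the sketch self-contained
-      println(s"tuples=$tupleCount lastWindowMillis=$windowMillis")
-    }
-  }
-}</code></pre>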
-<p class="c2">The data stream (<span class="c3">Generator2Counter</span>) 
connects the first operator output port to the second operator input port. The 
benchmark suite exercises all possible configurations for the stream 
locality:</p>
-<ul class="c8 lst-kix_2ql03f9wui4c-0 start">
-<li class="c2 c7">thread local (<span class="c3">THREAD_LOCAL</span>) when 
both operators are deployed into the same thread within a container effectively 
serializing operators computation;</li>
-<li class="c2 c7">container local (<span 
class="c3">CONTAINER_LOCAL)</span><span class="c3"> </span>when both operators 
are deployed into the same container and execute in two different threads;</li>
-<li class="c2 c7">node local (<span class="c3">NODE_LOCAL</span>)<sup><a 
href="#ftnt_ref1">[1]</a></sup> when operators are deployed into two different 
containers running on the same yarn node;</li>
-<li class="c2 c7">rack local (RACK_LOCAL)<sup><a 
href="#ftnt_ref2">[2]</a></sup> when operators are deployed into two different 
containers running on yarn nodes residing on the same rack</li>
-<li class="c2 c7">no locality when operators are deployed into two different 
containers running on any hadoop cluster node.</li>
-</ul>
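-<p class="c2">As a minimal sketch of how a locality is attached to a stream, assuming the hypothetical <span class="c3">TupleCounterSketch</span> consumer shown above and an equally hypothetical emitter standing in for <span class="c3">wordGenerator</span> (standard Apex API imports omitted):</p>
-<pre class="prettyprint"><code class="language-scala">// Hypothetical stand-in for the benchmark's wordGenerator operator:
-// emits a fixed, reused 64-byte payload.
-class ByteArrayEmitterSketch extends BaseOperator with InputOperator {
-  @transient
-  val out = new DefaultOutputPort[Array[Byte]]()
-
-  private val payload = new Array[Byte](64)
-
-  override def emitTuples(): Unit = out.emit(payload)
-}
-
-@ApplicationAnnotation(name = "BenchmarkLocalitySketch")
-class BenchmarkLocalitySketch extends StreamingApplication {
-  override def populateDAG(dag: DAG, conf: Configuration): Unit = {
-    val wordGenerator = dag.addOperator("wordGenerator", new ByteArrayEmitterSketch)
-    val counter = dag.addOperator("counter", new TupleCounterSketch)
-    // Pick one of THREAD_LOCAL, CONTAINER_LOCAL, NODE_LOCAL or RACK_LOCAL;
-    // leaving the locality unset means no locality constraint.
-    dag.addStream[Array[Byte]]("Generator2Counter", wordGenerator.out, counter.input)
-      .setLocality(DAG.Locality.CONTAINER_LOCAL)
-  }
-}</code></pre>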
-<p class="c2 c12 subtitle"><span class="c0"><b><a 
href="http://apex.apache.org/";>Apache APEX</a> release performance 
certification</b></span></p>
-<p class="c2">The benchmark application is a part of <a 
href="http://apex.apache.org/";>Apache APEX</a> release certification. It is 
executed on <a href="http://www.datatorrent.com";>DataTorrent’s</a> 
development Hadoop cluster by an automated script that launches the application 
with all supported <span class="c3">Generator2Counter</span> stream localities 
and 64, 128, 256, 512, 1024, 2048 and a tuple byte array length of 4096. The 
script collects the number of tuples emitted, the number of tuples processed 
and the <span class="c3">counter</span> operator latency for the running 
application and shuts down the application after it runs for 5 minutes, 
whereupon it moves on to the next configuration. For all configurations, the 
script runs between 6 and 8 hours depending on the development cluster load.</p>
-<p class="c12 c2 subtitle"><span class="c0"><b>Benchmark results</b></span></p>
-<p class="c2">As each supported stream locality has distinct performance 
characteristics (with exception of rack local and no locality due to the 
development cluster being setup on a single rack), I use a separate chart for 
each stream locality.</p>
-<p class="c2">Overall the results are self explanatory and I hope that anyone 
who uses, plans to use or plans to contribute to the <a 
href="http://apex.apache.org/";>Apache APEX</a> project finds it useful. A few 
notes that seems to be worth mentioning:</p>
-<ul class="c8 lst-kix_5u2revq5rd1r-0 start">
-<li class="c2 c7">There is no performance regression in APEX release 3.0 
compared to release 2.0</li>
-<li class="c2 c7">Benchmark was executed with default settings for buffer 
server spooling (turned on by default in release 3.0 and off in release 2.0). 
As the result, the benchmark application required just 2 GB of memory for the 
<span class="c3">wordGenerator</span> operator container in release 3.0, while 
it was necessary to allocate 8 GB to the same in release 2.0</li>
-<li class="c2 c7">When tuple size increases, JVM garbage collection starts to 
play a major role in performance benchmark compared to locality</li>
-<li class="c2 c7">Thread local outperforms all other stream localities only 
for trivial operators that we specifically designed for the benchmark.</li>
-<li class="c2 c7">The benchmark was performed on the development cluster while 
other developers were using it<img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image03.png"; alt="" 
/></li>
-</ul>
-<p class="c2"><img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image01.png"; alt="" 
/></p>
-<p class="c2 c17"><img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image002.png"; 
alt="" /></p>
-<hr class="c10" />
-<div>
-<p class="c2 c13"><a name="ftnt_ref1"></a>[1]<span class="c6"> NODE_LOCAL is 
currently excluded from the benchmark test due to known limitation. Please see 
</span><span class="c6 c9"><a class="c5" 
href="https://malhar.atlassian.net/browse/APEX-123";>APEX-123</a></span></p>
-</div>
-<div>
-<p class="c2 c13"><a name="ftnt_ref2"></a>[2]<span class="c6"> RACK_LOCAL is 
not yet fully implemented by APEX and is currently equivalent to no locality 
specified</span></p>
-</div>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-apex-performance-benchmark/";>Apache Apex 
Performance Benchmarks</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-apex-performance-benchmark/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Introduction to dtGateway</title>
-               
<link>https://www.datatorrent.com/blog-introduction-to-dtgateway/</link>
-               
<comments>https://www.datatorrent.com/blog-introduction-to-dtgateway/#comments</comments>
-               <pubDate>Tue, 06 Oct 2015 13:00:48 +0000</pubDate>
-               <dc:creator><![CDATA[David Yan]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2247</guid>
-               <description><![CDATA[<p>A platform, no matter how much it can 
do, and how technically superb it is, does not delight users without a proper 
UI or an API. That’s why there are products such as Cloudera Manager and 
Apache Ambari to improve the usability of the Hadoop platform. At DataTorrent, 
in addition to excellence in technology, we [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-introduction-to-dtgateway/";>Introduction 
to dtGateway</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p>A platform, no 
matter how much it can do, and how technically superb it is, does not delight 
users without a proper UI or an API. That’s why there are products such as 
Cloudera Manager and Apache Ambari to improve the usability of the Hadoop 
platform. At DataTorrent, in addition to excellence in technology, we strive 
for user delight. One of the main components of DataTorrent RTS is dtGateway. 
dtGateway is the windo

<TRUNCATED>
