http://git-wip-us.apache.org/repos/asf/apex-malhar/blob/2e47b4cd/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed.rss
----------------------------------------------------------------------
diff --git 
a/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed.rss
 
b/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed.rss
deleted file mode 100644
index 1397ab3..0000000
--- 
a/contrib/src/test/resources/com/datatorrent/contrib/romesyndication/datatorrent_feed.rss
+++ /dev/null
@@ -1,894 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
-       xmlns:content="http://purl.org/rss/1.0/modules/content/";
-       xmlns:wfw="http://wellformedweb.org/CommentAPI/";
-       xmlns:dc="http://purl.org/dc/elements/1.1/";
-       xmlns:atom="http://www.w3.org/2005/Atom";
-       xmlns:sy="http://purl.org/rss/1.0/modules/syndication/";
-       xmlns:slash="http://purl.org/rss/1.0/modules/slash/";
-       >
-
-<channel>
-       <title>DataTorrent</title>
-       <atom:link href="https://www.datatorrent.com/feed/"; rel="self" 
type="application/rss+xml" />
-       <link>https://www.datatorrent.com</link>
-       <description></description>
-       <lastBuildDate>Tue, 03 Nov 2015 08:00:45 +0000</lastBuildDate>
-       <language>en-US</language>
-       <sy:updatePeriod>hourly</sy:updatePeriod>
-       <sy:updateFrequency>1</sy:updateFrequency>
-       <generator>http://wordpress.org/?v=4.2.5</generator>
-               <item>
-               <title>Dimensions Computation (Aggregate Navigator) Part 1: 
Intro</title>
-               
<link>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/</link>
-               
<comments>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/#comments</comments>
-               <pubDate>Tue, 03 Nov 2015 08:00:29 +0000</pubDate>
-               <dc:creator><![CDATA[tim farkas]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2399</guid>
-               <description><![CDATA[<p>Introduction In the world of big data, 
enterprises have a common problem. They have large volumes of data flowing into 
their systems for which they need to observe historical trends in real-time. 
Consider the case of a digital advertising publisher that is receiving hundreds 
of thousands of click events every second. Looking at the history [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/";>Dimensions
 Computation (Aggregate Navigator) Part 1: Intro</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<h2 
id="introduction">Introduction</h2>
-<p>In the world of big data, enterprises have a common problem. They have 
large volumes of data flowing into their systems for which they need to observe 
historical trends in real-time. Consider the case of a digital advertising 
publisher that is receiving hundreds of thousands of click events every second. 
Looking at the history of individual clicks and impressions doesn’t tell the 
publisher much about what is going on. A technique the publisher might employ 
is to track the total number of clicks and impressions for every second, 
minute, hour, and day. Such a technique might help find global trends in their 
systems, but may not provide enough granularity to take action on localized 
trends. The technique will need to be powerful enough to spot local trends. For 
example, the total clicks and impressions for an advertiser, a geographical 
area, or a combination of the two can provide some actionable insight. This 
process of receiving individual events, aggregating them over time, and
  drilling down into the data using some parameters like “advertiser” and 
“location” is called Dimensions Computation.</p>
-<p>Dimensions Computation is a powerful mechanism that allows you to spot 
trends in your streaming data in real-time. In this post we’ll cover the key 
concepts behind Dimensions Computation and outline the process of performing 
Dimensions Computation. We will also show you how to use DataTorrent’s 
out-of-the-box enterprise operators to easily add Dimensions Computation to 
your application.</p>
-<h2 id="the-process">The Process</h2>
-<p>Let us continue with our example of an advertising publisher. Let us now 
see the steps that the publisher might take to ensure that large volumes of raw 
advertisement data is converted into meaningful historical views of their 
advertisement events.</p>
-<h3 id="the-data">The Data</h3>
-<p>Typically advertising publishers receive packets of information for each 
advertising event. The events that a publisher receives might look like 
this:</p>
-<pre class="prettyprint"><code class=" hljs cs">    <span 
class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> 
AdEvent
-    {
-        <span class="hljs-comment">//The name of the company that is 
advertising</span>
-      <span class="hljs-keyword">public</span> String advertiser;
-      <span class="hljs-comment">//The geographical location of the person 
initiating the event</span>
-      <span class="hljs-keyword">public</span> String location;
-      <span class="hljs-comment">//How much the advertiser was charged for the 
event</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">double</span> cost;
-      <span class="hljs-comment">//How much revenue was generated for the 
advertiser</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">double</span> revenue;
-      <span class="hljs-comment">//The number of impressions the advertiser 
received from this event</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">long</span> impressions;
-      <span class="hljs-comment">//The number of clicks the advertiser 
received from this event</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">long</span> clicks;
-      <span class="hljs-comment">//The timestamp of the event in 
milliseconds</span>
-      <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">long</span> time;
-    }</code></pre>
-<p>The class <strong>AdEvent</strong> contains two types of data:</p>
-<ul>
-<li><strong>Aggregates</strong>: The data that is combined using 
aggregators.</li>
-<li><strong>Keys</strong>: The data that is used to select aggregations at a 
finer granularity.</li>
-</ul>
-<h4 id="aggregates">Aggregates</h4>
-<p>The aggregates in our <strong>AdEvent</strong> object are the pieces of 
data that we must combine using aggregators in order to obtain a meaningful 
historical view. In this case, we can think of combining cost, revenue, 
impressions, and clicks. So these are our aggregates. We won’t obtain 
anything useful by aggregating the location and advertiser strings in our 
<strong>AdEvent</strong>, so those are not considered aggregates. It’s 
important to note that aggregates are considered separate entities. This means 
that the cost field of an <strong>AdEvent</strong> cannot be combined with its 
revenue field; cost values can only be aggregated with other cost values and 
revenue values can only be aggregated with other revenue values.</p>
-<p>In summary the aggregates in our <strong>AdEvent</strong> are:</p>
-<ul>
-<li><strong>cost</strong></li>
-<li><strong>revenue</strong></li>
-<li><strong>impressions</strong></li>
-<li><strong>clicks</strong></li>
-</ul>
-<h4 id="keys">Keys</h4>
-<p>The keys in our <strong>AdEvent</strong> object are used for selecting 
aggregations at a finer granularity. For example, it would make sense to look 
at the number of clicks for a particular advertiser, the number of clicks for a 
certain location, and the number of clicks for a certain location and 
advertiser combination. So location and advertiser are keys. Time is also 
another key since it is useful to look at the number of clicks received in a 
particular time range (for example, 12:00 pm through 1:00 pm, or 12:00 pm 
through 12:01 pm).</p>
-<p>In summary the keys in our <strong>AdEvent</strong> are:</p>
-<ul>
-<li><strong>advertiser</strong></li>
-<li><strong>location</strong></li>
-<li><strong>time</strong></li>
-</ul>
-<h3 id="computing-the-aggregations">Computing The Aggregations</h3>
-<p>When the publisher receives a new <strong>AdEvent</strong> the event is 
added to running aggregations in real time. The keys and aggregates in the 
<strong>AdEvent</strong> are used to compute aggregations. How the aggregations 
are computed and the number of aggregations computed are determined by three 
tunable parameters:</p>
-<ul>
-<li><strong>Aggregators</strong></li>
-<li><strong>Dimensions Combinations</strong></li>
-<li><strong>Time Buckets</strong></li>
-</ul>
-<h4 id="aggregators">Aggregators</h4>
-<p>Dimensions Computation supports more than just one type of aggregation, and 
multiple aggregators can be used to combine incoming data at once. Some of the 
aggregators available out-of-the-box are:</p>
-<ul>
-<li><strong>Sum</strong></li>
-<li><strong>Count</strong></li>
-<li><strong>Min</strong></li>
-<li><strong>Max</strong></li>
-</ul>
-<p>As an example, suppose the publisher is not using the keys in their 
<strong>AdEvents</strong> and this publisher wants to perform a sum and a max 
aggregation.</p>
-<p><strong>1.</strong> An AdEvent arrives. The AdEvent is aggregated to the 
Sum and Max aggregations.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/1upf5hv-UDT4BKhm7yTrcuFZYqnI263vMTXioKhr_qTo/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /><br />
-<strong>2.</strong> Another AdEvent arrives. The AdEvent is aggregated to the 
Sum and Max aggregations.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/10gTXjMyxanYo9UFc76IShPxOi5G7U5tvQKtfwqGyIws/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<p>As can be seen from the example above, each <strong>AdEvent</strong> 
contributes to two aggregations.</p>
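-<p>To make this concrete with hypothetical numbers (they are not taken from the 
figures above): if the first <strong>AdEvent</strong> has 3 clicks and the second 
has 5 clicks, then after both events the Sum aggregation for clicks is 3 + 5 = 8 
and the Max aggregation for clicks is max(3, 5) = 5. Each aggregate field (cost, 
revenue, impressions, and clicks) is combined independently in the same way.</p>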
-<h4 id="dimension-combinations">Dimension Combinations</h4>
-<p>Each <strong>AdEvent</strong> does not necessarily contribute to only one 
aggregation for each aggregator. In our advertisement example there are 4 
<strong>dimension combinations</strong> over which aggregations can be 
computed.</p>
-<ul>
-<li><strong>advertiser:</strong> This dimension combination is comprised of 
just the advertiser value. This means that all the aggregates for 
<strong>AdEvents</strong> with a particular value for advertiser (for example, 
Gamestop) are aggregated.</li>
-<li><strong>location:</strong> This dimension combination is comprised of just 
the location value. This means that all the aggregates for 
<strong>AdEvents</strong> with a particular value for location (for example, 
California) are aggregated.</li>
-<li><strong>advertiser, location:</strong> This dimension combination is 
comprised of the advertiser and location values. This means that all the 
aggregates for <strong>AdEvents</strong> with the same advertiser and location 
combination (for example, Gamestop, California) are aggregated.</li>
-<li><strong>the empty combination:</strong> This combination is a <em>global 
aggregation</em> because it doesn’t use any of the keys in the 
<strong>AdEvent</strong>. This means that all the <strong>AdEvents</strong> 
are aggregated.</li>
-</ul>
-<p>Therefore if a publisher is using the four dimension combinations 
enumerated above along with the sum and max aggregators, the number of 
aggregations being maintained would be:</p>
-<p>NUM_AGGS = 2 x <em>(number of unique advertisers)</em> + 2 x <em>(number of 
unique locations)</em> + 2 x <em>(number of unique advertiser and location 
combinations)</em> + 2</p>
-<p>And each individual <strong>AdEvent</strong> will contribute to <em>(number 
of aggregators)</em> x <em>(number of dimension combinations)</em> 
aggregations.</p>
-<p>Here is an example of how NUM_AGGS aggregations are computed:</p>
-<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to aggregations for each aggregator and 
each dimension combination.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/1qx8gLu615KneLDspsGkAS0_OlkX-DyvCUA7DAJtYJys/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /><br />
-<strong>2.</strong> Another <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to aggregations for each aggregator and 
each dimension combination.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/1FA2IyxewwzXtJ9A8JfJPrKtx-pfWHtHpVXp8lb8lKmE/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /><br />
-<strong>3.</strong> Another <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to aggregations for each aggregator and 
each dimension combination.<br />
-<img title="" 
src="https://docs.google.com/drawings/d/15sxwfZeYOKBiapoD2o721M4rZs-bZBxhF3MeXelnu6M/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<p>As can be seen from the example above each <strong>AdEvent</strong> 
contributes to 2 x 4 = 8 aggregations and there are 2 x 2 + 2 x 2 + 2 x 3 + 2 = 
16 aggregations in total.</p>
-<h4 id="time-buckets">Time Buckets</h4>
-<p>In addition to computing multiple aggregations for each dimension 
combination, aggregations can also be performed over time buckets. Time buckets 
are windows of time (for example, 1:00 pm through 1:01 pm) that are specified 
by a simple notation: 1m is one minute, 1h is one hour, 1d is one day. When 
aggregations are performed over time buckets, separate aggregations are 
maintained for each time bucket. Aggregations for a time bucket are comprised 
only of events with a time stamp that falls into that time bucket.</p>
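-<p>As a minimal sketch (this is illustrative only, not the implementation used 
inside the enterprise operators), the start of the 1m bucket that an event 
timestamp falls into can be computed by truncating the timestamp to the minute; 
a 1h or 1d bucket works the same way with a different bucket length:</p>
-<pre class="prettyprint"><code class="language-java hljs ">    // Sketch only: derive the 1m (one-minute) bucket containing a timestamp.
-    public class TimeBucketExample
-    {
-      public static final long ONE_MINUTE_MS = 60L * 1000L;
-
-      // Returns the start (in milliseconds) of the 1m bucket for the timestamp.
-      public static long minuteBucketStart(long timeMs)
-      {
-        return timeMs - (timeMs % ONE_MINUTE_MS);
-      }
-
-      public static void main(String[] args)
-      {
-        long eventTime = 1446537629000L; // hypothetical AdEvent.time value
-        System.out.println("1m bucket start: " + minuteBucketStart(eventTime));
-      }
-    }</code></pre>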
-<p>An example of how these time bucketed aggregations are computed is as 
follows:</p>
-<p>Let’s say our advertisement publisher is interested in computing the Sum 
and Max of <strong>AdEvents</strong> for the dimension combinations comprised 
of <strong>advertiser</strong> and <strong>location</strong> over 1 minute and 
1 hour time buckets.</p>
-<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to the aggregations for the appropriate 
aggregator, dimension combination and time bucket.</p>
-<p><img title="" 
src="https://docs.google.com/drawings/d/11voOdqkagpGKcWn5HOiWWAn78fXlpGl7aXUa3tG5sQc/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<p><strong>2.</strong> Another <strong>AdEvent</strong> arrives. The 
<strong>AdEvent</strong> is applied to the aggregations for the appropriate 
aggregator, dimension combination and time bucket.</p>
-<p><img title="" 
src="https://docs.google.com/drawings/d/1ffovsxWZfHnSc_Z30RzGIXgzQeHjCnyZBoanO_xT_e4/pub?w=960&amp;h=720";
 alt="Adding Aggregate" /></p>
-<h4 id="conclusion">Conclusion</h4>
-<p>In summary, the three tunable parameters discussed above 
(<strong>Aggregators</strong>, <strong>Dimension Combinations</strong>, and 
<strong>Time Buckets</strong>) determine how aggregations are computed. In the 
examples provided in the <strong>Aggregators</strong>, <strong>Dimension 
Combinations</strong>, and <strong>Time Buckets</strong> sections respectively, 
we have incrementally increased the complexity in which the aggregations are 
performed. The examples provided in the <strong>Aggregators</strong>, and 
<strong>Dimension Combinations</strong> sections were for illustration purposes 
only; the example provided in the <strong>Time Buckets</strong> section 
provides an accurate view of how aggregations are computed within 
DataTorrent&#8217;s enterprise operators.</p>
-<p>Download DataTorrent Sandbox <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Sandbox-Edition-Download.html"; 
target="_blank">here</a></p>
-<p>Download DataTorrent Enterprise Edition <a 
href="http://web.datatorrent.com/DataTorrent-RTS-Enteprise-Edition-Download.html";
 target="_blank">here</a></p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/";>Dimensions
 Computation (Aggregate Navigator) Part 1: Intro</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-dimensions-computation-aggregate-navigator-part-1-intro/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Cisco ACI, Big Data, and DataTorrent</title>
-               <link>https://www.datatorrent.com/blog_cisco_aci/</link>
-               
<comments>https://www.datatorrent.com/blog_cisco_aci/#comments</comments>
-               <pubDate>Tue, 27 Oct 2015 22:30:07 +0000</pubDate>
-               <dc:creator><![CDATA[Charu Madan]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2348</guid>
-               <description><![CDATA[<p>By: Harry Petty, Data Center and Cloud 
Networking, Cisco  (This blog has been developed in association with Farid 
Jiandani, Product Manager with Cisco’s Insieme Networks Business Unit and 
Charu Madan, Director Business Development at DataTorrent. It was originally 
published on Cisco Blogs) If you work for an enterprise that’s looking to hit 
its digital sweet [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog_cisco_aci/";>Cisco ACI, Big Data, and 
DataTorrent</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p>By: Harry Petty, 
Data Center and Cloud Networking, Cisco</p>
-<p class="c0 c11"><a name="h.gjdgxs"></a><span class="c1"> (</span><span 
class="c4 c13">This blog has been developed in association with Farid Jiandani, 
Product Manager with Cisco’s Insieme Networks Business Unit and Charu Madan, 
Director Business Development at DataTorrent. It was originally published on <a 
href="http://blogs.cisco.com/datacenter/aci-big-data-and-datatorrent"; 
target="_blank">Cisco Blogs</a>)</span></p>
-<p class="c0"><span class="c1">If you work for an enterprise that’s looking 
to hit its digital sweet spot, then you’re scrutinizing your sales, marketing 
and operations to see where you should make digital investments to innovate and 
improve productivity. Super-fast data processing at scale is being used to 
obtain real-time insights for digital business and Internet of Things (IoT) 
initiatives.</span></p>
-<p class="c0"><span class="c1">According to Gartner Group, one of the cool 
vendors in this area of providing super-fast big data analysis using in-memory 
streaming analytics is called DataTorrent, a startup founded by long-time 
ex-Yahoo! veterans with vast experience managing big data for leading edge 
applications and infrastructure at massive scale. Their goal is to empower 
today’s enterprises to experience the full potential and business impact of 
big data with a platform that processes and analyzes data in 
real-time.</span></p>
-<p class="c0"><span class="c1 c2">DataTorrent RTS</span></p>
-<p class="c0"><span class="c4 c6">DataTorrent RTS is an open core</span><span 
class="c2 c4 c6">, enterprise-grade product powered by Apache Apex. 
</span><span class="c4 c6">DataTorrent RTS provides a single, unified batch and 
stream processing platform that enables organizations to reduce time to market, 
development costs and operational expenditures for big data analytics 
applications. </span></p>
-<p class="c0"><span class="c1 c2">DataTorrent RTS Integration with 
ACI</span></p>
-<p class="c0"><span class="c4 c6">A member of the Cisco ACI ecosystem, 
DataTorrent announced on September 29th DataTorrent RTS integration with Cisco 
</span><span class="c4 c6"><a class="c7" 
href="https://www.google.com/url?q=http://www.cisco.com/c/en/us/solutions/data-center-virtualization/application-centric-infrastructure/index.html&amp;sa=D&amp;usg=AFQjCNFMZhMYdUmPuuqrUI5IZmrvEhlK5g";>Application
 Centric Infrastructure (ACI)</a></span><span class="c4 c6"> through the 
Application Policy Infrastructure Controller (APIC) to help create more 
efficient IT operations, bringing together network operations management and 
big data application management and development: </span></p>
-<p class="c0"><span class="c4"><a class="c7" 
href="https://www.google.com/url?q=https://www.datatorrent.com/press-releases/datatorrent-integrates-with-cisco-aci-to-help-secure-big-data-processing-through-a-unified-data-and-network-fabric/&amp;sa=D&amp;usg=AFQjCNG4S_2-OY5ox5nCf_0_Qj7s-x9pyw";>DataTorrent
 Integrates with Cisco ACI to Help Secure Big Data Processing Through a Unified 
Data and Network Fabric</a></span><span class="c4">. </span><span class="c2 
c4">The joint solution enables</span></p>
-<ul class="c8 lst-kix_list_2-0 start">
-<li class="c12 c0"><span class="c4">A unified fabric approach for managing 
</span><span class="c2 c4">Applications, Data </span><span class="c4">and 
</span><span class="c2 c4">Network</span></li>
-<li class="c0 c12"><span class="c4">A highly secure and automated Big Data 
application platform which uses the power of Cisco ACI for automation and 
security policy management </span></li>
-<li class="c12 c0"><span class="c4">The creation, repository, and enforcement 
point for Cisco ACI application policies for big data applications</span></li>
-</ul>
-<p class="c0"><span class="c4">With the ACI integration, secure connectivity 
to diverse data sets becomes a part of a user defined policy which is automated 
and does not compromise on security and access management. As an example, if 
one of the DataTorrent Big Data applications needs access to, say, a Kafka source, 
then all nodes need to be opened up. This leaves the environment vulnerable and 
prone to attacks. With ACI, the access management policies help define the 
connectivity contracts, and only the right node and right application 
gets access. See Figures 1 and 2 for an illustration of this concept. 
</span></p>
-<p class="c0"><span class="c1 c2">Figure 1:</span></p>
-<p class="c0"><a 
href="https://www.datatorrent.com/wp-content/uploads/2015/10/image00.jpg";><img 
class="alignnone size-full wp-image-2349" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image00.jpg"; 
alt="image00" width="432" height="219" /></a></p>
-<p class="c0"><span class="c1 c2">Figure 2</span></p>
-<p class="c0"><a 
href="https://www.datatorrent.com/wp-content/uploads/2015/10/image011.png";><img 
class="alignnone size-full wp-image-2350" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image011.png"; 
alt="image01" width="904" height="493" /></a></p>
-<p class="c0"><span class="c1 c2">ACI Support for Big Data Solutions</span></p>
-<p class="c0"><span class="c1">The openness and the flexibility of ACI allow 
big data customers to run a wide variety of different applications within their 
fabric alongside Hadoop. Due to the elasticity of ACI, customers are able to 
run batch processing alongside stream processing and other database 
applications in a seamless fashion. In traditional Hadoop environments, the 
network is segmented based on individual server nodes (see Figure 1). This 
makes it difficult to elastically allow access to and from different 
applications. Ultimately, within the ACI framework, logical demarcation points 
can be created based on application workloads rather than physical server 
groups (a set of Hadoop nodes should not be considered as a bunch of individual 
server nodes, but rather as a single group).</span></p>
-<p class="c0"><span class="c1 c2">A Robust and Active Ecosystem</span></p>
-<p class="c0"><span class="c1">Many vendors claim they have a broad ecosystem 
of vendors, but sometimes that’s pure marketing, without any real integration 
efforts going on behind the slideware. But Cisco’s Application Centric 
Infrastructure (ACI) has a very active ecosystem of industry leaders who are 
putting significant resources into integration efforts, taking advantage of 
ACI’s open Northbound and Southbound API’s. DataTorrent is just one example 
of an innovative company that is using ACI integration to add value to their 
solutions and deliver real benefits to their channel partners and 
customers.</span></p>
-<p class="c0"><span class="c1">Stay tuned for more success stories to come: 
we’ll continue to showcase industry leaders that are taking advantage of the 
open ACI API’s.</span></p>
-<p class="c0"><span class="c1">Additional References</span></p>
-<p class="c0"><span class="c3"><a class="c7" 
href="https://www.google.com/url?q=https://www.cisco.com/go/aci&amp;sa=D&amp;usg=AFQjCNHPa1zEn6-1fEWQeCgZ-QmP9te5ig";>https://www.cisco.com/go/aci</a></span></p>
-<p class="c0"><span class="c3"><a class="c7" 
href="https://www.google.com/url?q=https://www.cisco.com/go/aciecosystem&amp;sa=D&amp;usg=AFQjCNGmS3P3mOU0DQen5F43--fDi25uWw";>https://www.cisco.com/go/aciecosystem</a></span></p>
-<p class="c11 c0"><span class="c3"><a class="c7" 
href="https://www.google.com/url?q=http://www.datatorrent/com&amp;sa=D&amp;usg=AFQjCNHbzoCVBy0azkWTbjpqdyxPqkCo9g";>http://www.datatorrent/</a></span></p>
-<p>&nbsp;</p>
-<p>&nbsp;</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog_cisco_aci/";>Cisco ACI, Big Data, and 
DataTorrent</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog_cisco_aci/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Write Your First Apache Apex Application in Scala</title>
-               
<link>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/</link>
-               
<comments>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/#comments</comments>
-               <pubDate>Tue, 27 Oct 2015 01:58:25 +0000</pubDate>
-               <dc:creator><![CDATA[Tushar Gosavi]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2280</guid>
-               <description><![CDATA[<p>* Extend your Scala expertise to 
building Apache Apex applications * Scala is a modern, multi-paradigm programming 
language that integrates features of functional as well as object-oriented 
languages elegantly. Big Data frameworks are already exploring Scala as a 
language of choice for implementations. Apache Apex is developed in Java, the 
Apex APIs are such that writing [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/";>Write
 Your First Apache Apex Application in Scala</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p><em>* Extend your 
Scala expertise to building Apache Apex applications *</em></p>
-<p>Scala is a modern, multi-paradigm programming language that integrates 
features of functional as well as object-oriented languages elegantly. Big Data 
frameworks are already exploring Scala as a language of choice for 
implementations. Apache Apex is developed in Java, and its APIs are designed so 
that writing applications is a smooth sail. Developers can use any programming 
language that runs on the JVM and can access Java classes; because Scala has 
good interoperability with Java, writing Apex applications in Scala is a 
fuss-free experience. In this post we will explain how to write an Apache Apex 
application in Scala.</p>
-<p>Writing an <a href="http://www.datatorrent.com/apex"; 
target="_blank">Apache Apex</a> application in Scala is simple.</p>
-<h2 id="operators-within-the-application">Operators within the application</h2>
-<p>We will develop a word count application in Scala. This application will 
look for new files in a directory. When new files become available, the word 
count application will read them, compute a count for each word, and print the 
result on stdout. The application requires the following operators:</p>
-<ul>
-<li><strong>LineReader</strong> &#8211; This operator monitors directories for 
new files periodically. After a new file is detected, LineReader reads the file 
line-by-line, and makes lines available on the output port for the next 
operator.</li>
-<li><strong>Parser</strong> &#8211; This operator receives lines read by 
LineReader on its input port. Parser breaks the line into words, and makes 
individual words available on the output port.</li>
-<li><strong>UniqueCounter</strong> &#8211; This operator computes the count of 
each word received on its input port.</li>
-<li><strong>ConsoleOutputOperator</strong> &#8211; This operator prints unique 
counts of words on standard output.</li>
-</ul>
-<h2 id="build-the-scala-word-count-application">Build the Scala word count 
application</h2>
-<p>Now, we will generate a sample application using maven 
archetype:generate.</p>
-<h3 id="generate-a-sample-application">Generate a sample application.</h3>
-<pre class="prettyprint"><code class="language-bash hljs ">mvn 
archetype:generate 
-DarchetypeRepository=https://www.datatorrent.com/maven/content/repositories/releases
 -DarchetypeGroupId=com.datatorrent -DarchetypeArtifactId=apex-app-archetype 
-DarchetypeVersion=<span class="hljs-number">3.0</span>.<span 
class="hljs-number">0</span> -DgroupId=com.datatorrent 
-Dpackage=com.datatorrent.wordcount -DartifactId=wordcount -Dversion=<span 
class="hljs-number">1.0</span>-SNAPSHOT</code></pre>
-<p>This creates a directory called <strong>wordcount</strong>, with a sample 
application and build script. Let us see how to modify this application into 
the Scala-based word count application that we are looking to develop.</p>
-<h3 id="add-the-scala-build-plugin">Add the Scala build plugin.</h3>
-<p>Apache Apex uses maven for building the framework and operator library. 
Maven supports a plugin for compiling Scala files. To enable this plugin, add 
the following snippet to the <code>build -&gt; plugins</code> sections of the 
<code>pom.xml</code> file that is located in the application directory.</p>
-<pre class="prettyprint"><code class="language-xml hljs ">  <span 
class="hljs-tag">&lt;<span class="hljs-title">plugin</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">groupId</span>&gt;</span>net.alchim31.maven<span 
class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">artifactId</span>&gt;</span>scala-maven-plugin<span 
class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">version</span>&gt;</span>3.2.1<span 
class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
-    <span class="hljs-tag">&lt;<span 
class="hljs-title">executions</span>&gt;</span>
-      <span class="hljs-tag">&lt;<span 
class="hljs-title">execution</span>&gt;</span>
-        <span class="hljs-tag">&lt;<span 
class="hljs-title">goals</span>&gt;</span>
-          <span class="hljs-tag">&lt;<span 
class="hljs-title">goal</span>&gt;</span>compile<span 
class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
-          <span class="hljs-tag">&lt;<span 
class="hljs-title">goal</span>&gt;</span>testCompile<span 
class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
-        <span class="hljs-tag">&lt;/<span 
class="hljs-title">goals</span>&gt;</span>
-      <span class="hljs-tag">&lt;/<span 
class="hljs-title">execution</span>&gt;</span>
-    <span class="hljs-tag">&lt;/<span 
class="hljs-title">executions</span>&gt;</span>
-  <span class="hljs-tag">&lt;/<span 
class="hljs-title">plugin</span>&gt;</span></code></pre>
-<p>Also, specify the Scala library as a dependency in the pom.xml file.<br />
-Add the Scala library.</p>
-<pre class="prettyprint"><code class="language-xml hljs "><span 
class="hljs-tag">&lt;<span class="hljs-title">dependency</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">groupId</span>&gt;</span>org.scala-lang<span 
class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">artifactId</span>&gt;</span>scala-library<span 
class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">version</span>&gt;</span>2.11.2<span 
class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
-<span class="hljs-tag">&lt;/<span 
class="hljs-title">dependency</span>&gt;</span></code></pre>
-<p>We are now set to write a Scala application.</p>
-<h2 id="write-your-scala-word-count-application">Write your Scala word count 
application</h2>
-<h3 id="linereader">LineReader</h3>
-<p><a href="https://github.com/apache/incubator-apex-malhar"; 
target="_blank">Apache Malhar library</a> contains an <a 
href="https://github.com/apache/incubator-apex-malhar/blob/1f5676b455f7749d11c7cd200216d0d4ad7fce32/library/src/main/java/com/datatorrent/lib/io/AbstractFTPInputOperator.java";
 target="_blank">AbstractFileInputOperator</a> operator that monitors and reads 
files from a directory. This operator has capabilities such as support for 
scaling, fault tolerance, and exactly once processing. To complete the 
functionality, override a few methods:<br />
-<em>readEntity</em> : Reads a line from a file.<br />
-<em>emit</em> : Emits data read on the output port.<br />
-We have overridden openFile to obtain an instance of BufferedReader that is 
required while reading lines from the file. We also override closeFile for 
closing an instance of BufferedReader.</p>
-<pre class="prettyprint"><code class="language-scala hljs "><span 
class="hljs-class"><span class="hljs-keyword">class</span> <span 
class="hljs-title">LineReader</span> <span class="hljs-keyword">extends</span> 
<span class="hljs-title">AbstractFileInputOperator</span>[<span 
class="hljs-title">String</span>] {</span>
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">val</span> out : DefaultOutputPort[String] = 
<span class="hljs-keyword">new</span> DefaultOutputPort[String]();
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> readEntity(): String = br.readLine()
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> emit(line: String): Unit = out.emit(line)
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> openFile(path: Path): InputStream = {
-    <span class="hljs-keyword">val</span> in = <span 
class="hljs-keyword">super</span>.openFile(path)
-    br = <span class="hljs-keyword">new</span> BufferedReader(<span 
class="hljs-keyword">new</span> InputStreamReader(in))
-    <span class="hljs-keyword">return</span> in
-  }
-
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> closeFile(is: InputStream): Unit = {
-    br.close()
-    <span class="hljs-keyword">super</span>.closeFile(is)
-  }
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">private</span> <span 
class="hljs-keyword">var</span> br : BufferedReader = <span 
class="hljs-keyword">null</span>
-}</code></pre>
-<p>Some Apex API classes are not serializable, and must be defined as 
transient. Scala supports the transient annotation for such scenarios. If you 
see objects that are not a part of the state of the operator, you must annotate 
them with @transient. For example, in this code, we have annotated the buffered 
reader and the output port as transient.</p>
-<h3 id="parser">Parser</h3>
-<p>Parser splits lines using the built-in Java split function, and emits 
individual words to the output port.</p>
-<pre class="prettyprint"><code class="language-scala hljs "><span 
class="hljs-class"><span class="hljs-keyword">class</span> <span 
class="hljs-title">Parser</span> <span class="hljs-keyword">extends</span> 
<span class="hljs-title">BaseOperator</span> {</span>
-  <span class="hljs-annotation">@BeanProperty</span>
-  <span class="hljs-keyword">var</span> regex : String = <span 
class="hljs-string">" "</span>
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">val</span> out = <span 
class="hljs-keyword">new</span> DefaultOutputPort[String]()
-
-  <span class="hljs-annotation">@transient</span>
-  <span class="hljs-keyword">val</span> in = <span 
class="hljs-keyword">new</span> DefaultInputPort[String]() {
-    <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> process(t: String): Unit = {
-      <span class="hljs-keyword">for</span>(w &lt;- t.split(regex)) out.emit(w)
-    }
-  }
-}</code></pre>
-<p>Scala simplifies automatic generation of setters and getters based on the 
@BeanProperty annotation. Properties annotated with @BeanProperty can be modified 
at the time of launching an application by using configuration files. You can 
also modify such properties while an application is running. You can specify 
the regular expression used for splitting within the configuration file.</p>
-<h3 id="uniquecount-and-consoeloutputoperator">UniqueCount and 
ConsoelOutputOperator</h3>
-<p>For this application, let us use UniqueCounter and ConsoleOutputOperator as 
is.</p>
-<h3 id="put-together-the-word-count-application">Put together the word count 
application</h3>
-<p>Writing the main application class in Scala is similar to doing it in Java. 
You override the populateDAG() method, which receives the DAG instance to 
populate; add operators to this instance using the addOperator() method; and 
finally connect the operators with the addStream() method.</p>
-<pre class="prettyprint"><code class="language-scala hljs "><span 
class="hljs-annotation">@ApplicationAnnotation</span>(name=<span 
class="hljs-string">"WordCount"</span>)
-<span class="hljs-class"><span class="hljs-keyword">class</span> <span 
class="hljs-title">Application</span> <span class="hljs-keyword">extends</span> 
<span class="hljs-title">StreamingApplication</span> {</span>
-  <span class="hljs-keyword">override</span> <span 
class="hljs-keyword">def</span> populateDAG(dag: DAG, configuration: 
Configuration): Unit = {
-    <span class="hljs-keyword">val</span> input = dag.addOperator(<span 
class="hljs-string">"input"</span>, <span class="hljs-keyword">new</span> 
LineReader)
-    <span class="hljs-keyword">val</span> parser = dag.addOperator(<span 
class="hljs-string">"parser"</span>, <span class="hljs-keyword">new</span> 
Parser)
-    <span class="hljs-keyword">val</span> counter = dag.addOperator(<span 
class="hljs-string">"counter"</span>, <span class="hljs-keyword">new</span> 
UniqueCounter[String])
-    <span class="hljs-keyword">val</span> out = dag.addOperator(<span 
class="hljs-string">"console"</span>, <span class="hljs-keyword">new</span> 
ConsoleOutputOperator)
-
-    dag.addStream[String](<span class="hljs-string">"lines"</span>, input.out, 
parser.in)
-    dag.addStream[String](<span class="hljs-string">"words"</span>, 
parser.out, counter.data)
-    dag.addStream[java.util.HashMap[String,Integer]](<span 
class="hljs-string">"counts"</span>, counter.count, out.input)
-  }
-}</code></pre>
-<h2 id="running-application">Running application</h2>
-<p>Before running the word count application, specify the input directory for 
the input operator. You can use the default configuration file for this. Open 
the <em>src/main/resources/META-INF/properties.xml</em> file, and add the 
following lines inside the root element of the file. Do not forget to replace 
“username” with your Hadoop username.</p>
-<pre class="prettyprint"><code class="language-xml hljs "><span 
class="hljs-tag">&lt;<span class="hljs-title">property</span>&gt;</span>
- <span class="hljs-tag">&lt;<span 
class="hljs-title">name</span>&gt;</span>dt.application.WordCount.operator.input.prop.directory<span
 class="hljs-tag">&lt;/<span class="hljs-title">name</span>&gt;</span>
-  <span class="hljs-tag">&lt;<span 
class="hljs-title">value</span>&gt;</span>/user/username/data<span 
class="hljs-tag">&lt;/<span class="hljs-title">value</span>&gt;</span>
-<span class="hljs-tag">&lt;/<span 
class="hljs-title">property</span>&gt;</span></code></pre>
-<p>Build the application from the application directory using this command:</p>
-<pre class="prettyprint"><code class="language-bash hljs ">mvn clean install 
-DskipTests</code></pre>
-<p>You should now have an application package in the target directory.</p>
-<p>Now, launch this application package using dtcli.</p>
-<pre class="prettyprint"><code class="language-bash hljs ">$ dtcli
-DT CLI <span class="hljs-number">3.2</span>.<span 
class="hljs-number">0</span>-SNAPSHOT <span 
class="hljs-number">28.09</span>.<span class="hljs-number">2015</span> @ <span 
class="hljs-number">12</span>:<span class="hljs-number">45</span>:<span 
class="hljs-number">15</span> IST rev: <span class="hljs-number">8</span>e49cfb 
branch: devel-<span class="hljs-number">3</span>
-dt&gt; launch target/wordcount-<span 
class="hljs-number">1.0</span>-SNAPSHOT.apa
-{<span class="hljs-string">"appId"</span>: <span 
class="hljs-string">"application_1443354392775_0010"</span>}
-dt (application_1443354392775_0010) &gt;</code></pre>
-<p>Add some text files to the <em>/user/username/data</em> directory on your 
HDFS to see how the application works. You can see the words along with their 
counts in the container log of the console operator.</p>
-<h2 id="summary">Summary</h2>
-<p>Scala classes compile to JVM classes: they can extend Java classes and can 
create and call Java objects transparently. That is why you can easily apply 
your Scala expertise to building Apex applications.<br />
-To get started with creating your first application, see <a 
href="https://www.datatorrent.com/buildingapps/";>https://www.datatorrent.com/buildingapps/</a>.</p>
-<h2 id="see-also">See Also</h2>
-<ul>
-<li>Building Applications with Apache Apex and Malhar <a 
href="https://www.datatorrent.com/buildingapps/";>https://www.datatorrent.com/buildingapps/</a></li>
-<li>Scala tutorial for Java programmers <a 
href="http://docs.scala-lang.org/tutorials/scala-for-java-programmers.html";>http://docs.scala-lang.org/tutorials/scala-for-java-programmers.html</a></li>
-<li>Application Developer Guide <a 
href="https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html";>https://www.datatorrent.com/docs/guides/ApplicationDeveloperGuide.html</a></li>
-<li>Operator Developer Guide <a 
href="https://www.datatorrent.com/docs/guides/OperatorDeveloperGuide.html";>https://www.datatorrent.com/docs/guides/OperatorDeveloperGuide.html</a></li>
-<li>Malhar Operator Library <a 
href="https://www.datatorrent.com/docs/guides/MalharStandardOperatorLibraryTemplatesGuide.html";>https://www.datatorrent.com/docs/guides/MalharStandardOperatorLibraryTemplatesGuide.html</a></li>
-</ul>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/";>Write
 Your First Apache Apex Application in Scala</a> appeared first on <a 
rel="nofollow" href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-writing-apache-apex-application-in-scala/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Apache Apex Performance Benchmarks</title>
-               
<link>https://www.datatorrent.com/blog-apex-performance-benchmark/</link>
-               
<comments>https://www.datatorrent.com/blog-apex-performance-benchmark/#comments</comments>
-               <pubDate>Tue, 20 Oct 2015 13:23:27 +0000</pubDate>
-               <dc:creator><![CDATA[Vlad Rozov]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2261</guid>
-               <description><![CDATA[<p>Why another benchmark blog? As 
engineers, we are skeptical of performance benchmarks developed and published 
by software vendors. Most of the time such benchmarks are biased towards the 
company’s own product in comparison to other vendors. Any reported advantage 
may be the result of selecting a specific use case better handled by the 
product or [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-apex-performance-benchmark/";>Apache Apex 
Performance Benchmarks</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p><b 
id="apex-performance-benchmarks" class="c2 c16"><span class="c0">Why another 
benchmark blog?</span></b></p>
-<p class="c2">As engineers, we are skeptical of performance benchmarks 
developed and published by software vendors. Most of the time such benchmarks 
are biased towards the company’s own product in comparison to other vendors. 
Any reported advantage may be the result of selecting a specific use case 
better handled by the product or using optimized configuration for one’s own 
product compared to out of the box configuration for other vendors’ 
products.</p>
-<p class="c2">So, why another blog on the topic? The reason I decided to write 
this blog is to explain the rationale behind <a 
href="http://www.datatorrent.com";>DataTorrent’s</a> effort to develop and 
maintain a benchmark performance suite, how the suite is used to certify 
various releases, and seek community opinion on how the performance benchmark 
may be improved.</p>
-<p class="c2 c4">Note: the performance numbers given here are only for 
reference and by no means a comprehensive performance evaluation of <a 
href="http://apex.apache.org/";>Apache APEX</a>; performance numbers can vary 
depending on different configurations or different use cases.</p>
-<p class="c12 c2 subtitle"><strong>Benchmark application.</strong><img class=" 
aligncenter" title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image02.png"; alt="" 
/></p>
-<p class="c2">To evaluate the performance of the <a 
href="http://apex.apache.org/";>Apache APEX</a>  platform,  we use Benchmark 
application that has a simple <a 
href="https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/";>DAG</a>
 with only two operators. The first operator (<span 
class="c3">wordGenerato</span>r) emits tuples and the second operator (<span 
class="c3">counter</span>) counts tuples received. For such trivial operations, 
operators add minimal overhead to CPU and memory consumption, allowing 
measurement of <a href="http://apex.apache.org/";>Apache APEX</a> platform 
throughput. As operators don’t change from release to release, this 
application is suitable for comparing the platform performance across 
releases.</p>
-<p class="c2">Tuples are byte arrays with configurable length, minimizing 
complexity of tuples serialization and at the same time allowing examination of 
 performance of the platform against several different tuple sizes. The 
emitter (<span class="c3">wordGenerator</span>) operator may be configured to 
use the same byte array avoiding the operator induced garbage collection. Or it 
may be configured to allocate new byte array for every new tuple emitted, more 
closely simulating real application behavior.</p>
-<p class="c2">The consumer (<span class="c3">counter</span>) operator collects 
the total number of tuples received and the wall-clock time in milliseconds 
passed between begin and end window. It writes the collected data to the log at 
the end of every 10th window.</p>
-<p class="c2">The data stream (<span class="c3">Generator2Counter</span>) 
connects the first operator output port to the second operator input port. The 
benchmark suite exercises all possible configurations for the stream 
locality:</p>
-<ul class="c8 lst-kix_2ql03f9wui4c-0 start">
-<li class="c2 c7">thread local (<span class="c3">THREAD_LOCAL</span>) when 
both operators are deployed into the same thread within a container effectively 
serializing operators computation;</li>
-<li class="c2 c7">container local (<span 
class="c3">CONTAINER_LOCAL)</span><span class="c3"> </span>when both operators 
are deployed into the same container and execute in two different threads;</li>
-<li class="c2 c7">node local (<span class="c3">NODE_LOCAL</span>)<sup><a 
href="#ftnt_ref1">[1]</a></sup> when operators are deployed into two different 
containers running on the same yarn node;</li>
-<li class="c2 c7">rack local (RACK_LOCAL)<sup><a 
href="#ftnt_ref2">[2]</a></sup> when operators are deployed into two different 
containers running on yarn nodes residing on the same rack</li>
-<li class="c2 c7">no locality when operators are deployed into two different 
containers running on any hadoop cluster node.</li>
-</ul>
-<p class="c2 c12 subtitle"><span class="c0"><b><a 
href="http://apex.apache.org/";>Apache APEX</a> release performance 
certification</b></span></p>
-<p class="c2">The benchmark application is a part of <a 
href="http://apex.apache.org/";>Apache APEX</a> release certification. It is 
executed on <a href="http://www.datatorrent.com";>DataTorrent’s</a> 
development Hadoop cluster by an automated script that launches the application 
with all supported <span class="c3">Generator2Counter</span> stream localities 
and tuple byte array lengths of 64, 128, 256, 512, 1024, 2048, and 4096. The 
script collects the number of tuples emitted, the number of tuples processed 
and the <span class="c3">counter</span> operator latency for the running 
application and shuts down the application after it runs for 5 minutes, 
whereupon it moves on to the next configuration. Across all configurations, the 
script runs for between 6 and 8 hours, depending on the development cluster load.</p>
-<p class="c12 c2 subtitle"><span class="c0"><b>Benchmark results</b></span></p>
-<p class="c2">As each supported stream locality has distinct performance 
characteristics (with the exception of rack local and no locality, due to the 
development cluster being set up on a single rack), I use a separate chart for 
each stream locality.</p>
-<p class="c2">Overall the results are self explanatory and I hope that anyone 
who uses, plans to use or plans to contribute to the <a 
href="http://apex.apache.org/";>Apache APEX</a> project finds it useful. A few 
notes that seems to be worth mentioning:</p>
-<ul class="c8 lst-kix_5u2revq5rd1r-0 start">
-<li class="c2 c7">There is no performance regression in APEX release 3.0 
compared to release 2.0</li>
-<li class="c2 c7">Benchmark was executed with default settings for buffer 
server spooling (turned on by default in release 3.0 and off in release 2.0). 
As the result, the benchmark application required just 2 GB of memory for the 
<span class="c3">wordGenerator</span> operator container in release 3.0, while 
it was necessary to allocate 8 GB to the same in release 2.0</li>
-<li class="c2 c7">When tuple size increases, JVM garbage collection starts to 
play a major role in the performance benchmark compared to locality</li>
-<li class="c2 c7">Thread local outperforms all other stream localities only 
for trivial operators that we specifically designed for the benchmark.</li>
-<li class="c2 c7">The benchmark was performed on the development cluster while 
other developers were using it<img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image03.png"; alt="" 
/></li>
-</ul>
-<p class="c2"><img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image01.png"; alt="" 
/></p>
-<p class="c2 c17"><img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/image002.png"; 
alt="" /></p>
-<hr class="c10" />
-<div>
-<p class="c2 c13"><a name="ftnt_ref1"></a>[1]<span class="c6"> NODE_LOCAL is 
currently excluded from the benchmark test due to a known limitation. Please see 
</span><span class="c6 c9"><a class="c5" 
href="https://malhar.atlassian.net/browse/APEX-123";>APEX-123</a></span></p>
-</div>
-<div>
-<p class="c2 c13"><a name="ftnt_ref2"></a>[2]<span class="c6"> RACK_LOCAL is 
not yet fully implemented by APEX and is currently equivalent to no locality 
specified</span></p>
-</div>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-apex-performance-benchmark/";>Apache Apex 
Performance Benchmarks</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-apex-performance-benchmark/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Introduction to dtGateway</title>
-               
<link>https://www.datatorrent.com/blog-introduction-to-dtgateway/</link>
-               
<comments>https://www.datatorrent.com/blog-introduction-to-dtgateway/#comments</comments>
-               <pubDate>Tue, 06 Oct 2015 13:00:48 +0000</pubDate>
-               <dc:creator><![CDATA[David Yan]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2247</guid>
-               <description><![CDATA[<p>A platform, no matter how much it can 
do, and how technically superb it is, does not delight users without a proper 
UI or an API. That’s why there are products such as Cloudera Manager and 
Apache Ambari to improve the usability of the Hadoop platform. At DataTorrent, 
in addition to excellence in technology, we [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-introduction-to-dtgateway/";>Introduction 
to dtGateway</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p>A platform, no 
matter how much it can do, and how technically superb it is, does not delight 
users without a proper UI or an API. That’s why there are products such as 
Cloudera Manager and Apache Ambari to improve the usability of the Hadoop 
platform. At DataTorrent, in addition to excellence in technology, we strive 
for user delight. One of the main components of DataTorrent RTS is dtGateway. 
dtGateway is the window to your DataTorrent RTS installation. It is a 
Java-based multithreaded web server that allows you to easily access 
information and perform various operations on DataTorrent RTS, and it is the 
server behind dtManage. It can run on any node in your Hadoop cluster or any 
other node that can access your Hadoop nodes, and is installed as a system 
service automatically by the RTS installer.</p>
-<p>dtGateway talks to all running Apex App Masters, as well as the Node 
Managers and the Resource Manager in the Hadoop cluster, in order to gather all 
the information and to perform all the operations users may need.</p>
-<p><img title="" 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/Blog-dtGateway.png"; 
alt="dtGateway diagram" /></p>
-<p>These features are exposed through a REST API. Here are some of the things you 
can do with the REST API:</p>
-<ul>
-<li>Get performance metrics (e.g. CPU, memory usage, tuples per second, 
latency, etc.) and other details of all Apex application instances</li>
-<li>Get performance metrics and other details of physical and logical 
operators of each Apex application instance</li>
-<li>Get performance metrics and other details of individual containers used by 
each Apex application instance</li>
-<li>Retrieve container logs</li>
-<li>Dynamically change operator properties, and add and remove operators from 
the DAG of a running Apex application</li>
-<li>Record and retrieve tuples on the fly</li>
-<li>Shut down a running container or an entire Apex application</li>
-<li>Dynamically change logging level of a container</li>
-<li>Create, manage, and view custom system alerts</li>
-<li>Create, manage, and interact with dtDashboard</li>
-<li>Create, manage, and launch Apex App Packages</li>
-<li>Perform basic health checks of the cluster</li>
-</ul>
-<p>Here is an example of using the curl command to access dtGateway’s REST API to get the details of the physical operator with ID 40 of the application instance with ID application_1442448722264_14891, assuming dtGateway is listening at localhost:9090:</p>
-<pre class="prettyprint"><code class="language-bash hljs ">$ curl 
http://localhost:<span 
class="hljs-number">9090</span>/ws/v2/applications/application_1442448722264_14891/physicalPlan/operators/<span
 class="hljs-number">40</span>
-{
-    <span class="hljs-string">"checkpointStartTime"</span>: <span 
class="hljs-string">"1442512091772"</span>, 
-    <span class="hljs-string">"checkpointTime"</span>: <span 
class="hljs-string">"175"</span>, 
-    <span class="hljs-string">"checkpointTimeMA"</span>: <span 
class="hljs-string">"164"</span>, 
-    <span class="hljs-string">"className"</span>: <span 
class="hljs-string">"com.datatorrent.contrib.kafka.KafkaSinglePortOutputOperator"</span>,
 
-    <span class="hljs-string">"container"</span>: <span 
class="hljs-string">"container_e08_1442448722264_14891_01_000017"</span>, 
-    <span class="hljs-string">"counters"</span>: null, 
-    <span class="hljs-string">"cpuPercentageMA"</span>: <span 
class="hljs-string">"0.2039266316727741"</span>, 
-    <span class="hljs-string">"currentWindowId"</span>: <span 
class="hljs-string">"6195527785184762469"</span>, 
-    <span class="hljs-string">"failureCount"</span>: <span 
class="hljs-string">"0"</span>, 
-    <span class="hljs-string">"host"</span>: <span 
class="hljs-string">"node22.morado.com:8041"</span>, 
-    <span class="hljs-string">"id"</span>: <span 
class="hljs-string">"40"</span>, 
-    <span class="hljs-string">"lastHeartbeat"</span>: <span 
class="hljs-string">"1442512100742"</span>, 
-    <span class="hljs-string">"latencyMA"</span>: <span 
class="hljs-string">"5"</span>, 
-    <span class="hljs-string">"logicalName"</span>: <span 
class="hljs-string">"QueryResult"</span>, 
-    <span class="hljs-string">"metrics"</span>: {}, 
-    <span class="hljs-string">"name"</span>: <span 
class="hljs-string">"QueryResult"</span>, 
-    <span class="hljs-string">"ports"</span>: [
-        {
-            <span class="hljs-string">"bufferServerBytesPSMA"</span>: <span 
class="hljs-string">"0"</span>, 
-            <span class="hljs-string">"name"</span>: <span 
class="hljs-string">"inputPort"</span>, 
-            <span class="hljs-string">"queueSizeMA"</span>: <span 
class="hljs-string">"1"</span>, 
-            <span class="hljs-string">"recordingId"</span>: null, 
-            <span class="hljs-string">"totalTuples"</span>: <span 
class="hljs-string">"6976"</span>, 
-            <span class="hljs-string">"tuplesPSMA"</span>: <span 
class="hljs-string">"0"</span>, 
-            <span class="hljs-string">"type"</span>: <span 
class="hljs-string">"input"</span>
-        }
-    ], 
-    <span class="hljs-string">"recordingId"</span>: null, 
-    <span class="hljs-string">"recoveryWindowId"</span>: <span 
class="hljs-string">"6195527785184762451"</span>, 
-    <span class="hljs-string">"status"</span>: <span 
class="hljs-string">"ACTIVE"</span>, 
-    <span class="hljs-string">"totalTuplesEmitted"</span>: <span 
class="hljs-string">"0"</span>, 
-    <span class="hljs-string">"totalTuplesProcessed"</span>: <span 
class="hljs-string">"6976"</span>, 
-    <span class="hljs-string">"tuplesEmittedPSMA"</span>: <span 
class="hljs-string">"0"</span>, 
-    <span class="hljs-string">"tuplesProcessedPSMA"</span>: <span 
class="hljs-string">"20"</span>, 
-    <span class="hljs-string">"unifierClass"</span>: null
-}</code></pre>
-<p>For the complete spec of the REST API, please refer to our dtGateway REST 
API documentation <a 
href="https://www.datatorrent.com/docs/guides/DTGatewayAPISpecification.html"; 
target="_blank">here</a>.</p>
-<p>With great power comes great responsibility. Given all the information dtGateway has and all the operations it can perform, the admin of DataTorrent RTS may want to restrict access to certain information and operations to only certain groups of users. This means dtGateway must support authentication and authorization.</p>
-<p>For authentication, dtGateway can easily be integrated with an existing LDAP, Kerberos, or PAM framework. You can also choose to have dtGateway manage its own user database.</p>
-<p>For authorization, dtGateway provides built-in role-based access control. The admin can decide which roles can view what information and perform what operations in dtGateway. The user-to-role mapping can be managed by dtGateway, or be integrated with LDAP roles.<br />
-In addition, we provide access control with granularity down to the application instance level as well as the application package level. For example, you can control which users and which roles have read or write access to which application instances and to which application packages.</p>
-<p>For more information, visit our dtGateway security documentation <a 
href="https://www.datatorrent.com/docs/guides/GatewaySecurity.html"; 
target="_blank">here</a>.</p>
-<p>An important part of user delight is backward compatibility. Imagine that after a version upgrade things start breaking because a “new feature” or a “bug fix” changes an API, so that components expecting the old API no longer work with the new version. That has to be a frustrating experience for the user!</p>
-<p>When a user upgrades to a newer version of DataTorrent RTS, we guarantee that existing components or applications that work with previous minor versions of DataTorrent RTS still work. And that includes the REST API. Even when we release a major RTS version that has backward-incompatible changes to the REST API spec, we will maintain backward compatibility by versioning the resource paths of the REST API (e.g. by changing the path prefix from /ws/v2 to /ws/v3) and maintaining the old spec until the end-of-life of the old version is reached.</p>
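-<p>To illustrate what pinning to a versioned resource path looks like from a client’s perspective, here is a minimal sketch of a Java client that hard-codes the /ws/v2 prefix and calls the same physical-operator endpoint shown in the curl example above. The host, port, and application ID are placeholders, and only standard JDK classes are used:</p>
-<pre class="prettyprint"><code class="language-java">import java.io.BufferedReader;
-import java.io.InputStreamReader;
-import java.net.HttpURLConnection;
-import java.net.URL;
-
-public class GatewayClientSketch
-{
-  // Pinning the client to one API version keeps it working even after a
-  // newer /ws/v3 prefix is introduced alongside the old one.
-  private static final String BASE = "http://localhost:9090/ws/v2";
-
-  public static void main(String[] args) throws Exception
-  {
-    // Same physical-operator endpoint as the curl example above.
-    URL url = new URL(BASE + "/applications/application_1442448722264_14891/physicalPlan/operators/40");
-    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
-    conn.setRequestMethod("GET");
-    BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
-    String line;
-    while ((line = reader.readLine()) != null) {
-      System.out.println(line); // raw JSON response from dtGateway
-    }
-    reader.close();
-    conn.disconnect();
-  }
-}</code></pre>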
-<p>We hope dtGateway is a delight to use for DataTorrent RTS users. We welcome 
any feedback.</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-introduction-to-dtgateway/";>Introduction 
to dtGateway</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-introduction-to-dtgateway/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Tracing DAGs from specification to execution</title>
-               
<link>https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/</link>
-               
<comments>https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/#comments</comments>
-               <pubDate>Thu, 01 Oct 2015 04:09:00 +0000</pubDate>
-               <dc:creator><![CDATA[Thomas Weise]]></dc:creator>
-                               <category><![CDATA[Uncategorized]]></category>
-               <category><![CDATA[Apache Apex]]></category>
-               <category><![CDATA[dag]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2151</guid>
-               <description><![CDATA[<p>How Apex orchestrates the DAG 
lifecycle Apache Apex (incubating) uses the concept of a DAG to represent an 
application’s processing logic. This blog will introduce the different 
perspectives within the architecture, starting from specification by the user 
to execution within the engine. Understanding DAGs DAG, or Directed Acyclic 
Graph, expresses processing logic as operators (vertices) [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/";>Tracing
 DAGs from specification to execution</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<h2 
id="how-apex-orchestrates-the-dag-lifecycle">How Apex orchestrates the DAG 
lifecycle</h2>
-<p><a href="http://apex.apache.org/";>Apache Apex (incubating)</a> uses the 
concept of a DAG to represent an application’s processing logic. This blog 
will introduce the different perspectives within the architecture, starting 
from specification by the user to execution within the engine.</p>
-<h3 id="understanding-dags">Understanding DAGs</h3>
-<p>DAG, or Directed Acyclic Graph, expresses processing logic as operators (vertices) and streams (edges) that together make up an Apache® Apex (incubating) application. Just as the name suggests, the resulting graph must be acyclic, while specifying the logic that will be executed in sequence or in parallel. DAGs are used to express dependencies, such as in event-based systems where earlier events lead to newer ones. The DAG concept is widely used, for example in revision control systems such as Git. Apex leverages the concept of a DAG to express how data is processed. Operators function as nodes within the graph and are connected by streams of events called tuples.</p>
-<p>There are several frameworks in the wider Hadoop ecosystem that employ the DAG concept to model dependencies. Some of those trace back to MapReduce, where the processing logic is a two-operator sequence: map and reduce. This is simple but also rigid, as most processing pipelines have a more complex structure. Therefore, when using MapReduce directly, multiple map-reduce stages need to be chained together to achieve the overall goal. Coordination is not trivial, which led to the rise of higher-level frameworks that attempt to shield the user from this complexity, such as Pig, Hive, Cascading, etc. Earlier on, Pig and Hive translated directly into a MapReduce execution layer; later, Tez came into the picture as an alternative, common layer for optimization and execution. Other platforms such as Storm and Spark also represent the logic as a DAG, each with its own flavor of specification and its own execution-layer architecture.</p>
-<h3 id="dag-of-operators-represents-the-business-logic">DAG of operators 
represents the business logic</h3>
-<p>Apex permits any operation to be applied to a stream of events and there is 
practically no limitation on the complexity of the ensuing DAG of operators. 
The full DAG blueprint is visible to the engine, which means that it can be 
translated into an end-to-end, fault-tolerant, scalable execution layer.</p>
-<p>The operators represent the point where business logic is introduced. Operators receive events via input ports and emit events via output ports, representing the execution of user-defined functionality. Operators that don’t receive events on an input port are called input operators. They receive events from external systems, thus acting as roots of the DAG.</p>
-<p>Operators can implement any functionality. This can be code that is very specific to a use case, or generic and broadly applicable functionality like the operators that are part of the Apex Malhar operator library, which supports reading from various sources, transformations, filtering, dimensional computation, and writing to a variety of destinations.</p>
-<h3 id="specifying-a-dag">Specifying a DAG</h3>
-<p>As discussed earlier, a DAG is composed of connections between output ports and input ports. Operators can have multiple input and output ports, each of a different type. This simplifies operator programming because the port concept clearly highlights the source and type of the event. This information is visible to the Java compiler, thus enabling immediate feedback to the developer working in the IDE.</p>
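-<p>To make the typed-port concept concrete, here is a minimal sketch of a custom operator that declares one typed input port and one typed output port, similar in spirit to the parser used in the WordCount example below. It assumes the BaseOperator, DefaultInputPort, and DefaultOutputPort classes from the Apex API; exact package locations may differ between Apex versions:</p>
-<pre class="prettyprint"><code class="language-java">import com.datatorrent.api.DefaultInputPort;
-import com.datatorrent.api.DefaultOutputPort;
-import com.datatorrent.common.util.BaseOperator;
-
-// Splits each incoming line into words and emits them one by one.
-public class LineSplitter extends BaseOperator
-{
-  // Output port: downstream operators are connected to it via dag.addStream().
-  public final transient DefaultOutputPort&lt;String&gt; output = new DefaultOutputPort&lt;String&gt;();
-
-  // Input port: the engine calls process() for every tuple arriving on the stream.
-  public final transient DefaultInputPort&lt;String&gt; input = new DefaultInputPort&lt;String&gt;()
-  {
-    @Override
-    public void process(String line)
-    {
-      for (String word : line.split("\\s+")) {
-        output.emit(word);
-      }
-    }
-  };
-}</code></pre>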
-<p>Similar to the DOM in a web browser, which can result from a static HTML source or a piece of JavaScript that creates it on the fly, an Apex DAG can be created from different source representations, and dynamically modified while the application is running!</p>
-<p><img 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/Development-Workflow_02_with_ports.png";
 alt="Logical Plan" title=""></p>
-<p>We refer to the DAG that was specified by the user as the “logical 
plan”. This is because upon launch it will be translated into a physical 
plan, and then mapped to an execution layer (more on this process below).</p>
-<h3 id="a-simple-example">A simple example</h3>
-<p>Let’s consider the example of the WordCount application, which is the de facto hello-world application of Hadoop. Here is how this simple, sequential DAG will look: the input operator reads a file and emits lines. The “lines” form a stream, which in turn becomes the input for the parser operator. The parser operator parses each line to generate words for the counter operator. The counter operator emits (word, count) tuples to the console.</p>
-<p><img 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/wordcount-dag1.png"; 
alt="WordCount DAG" title=""></p>
-<p>The source for the logical plan can be in different formats. Using the Apex 
Java API, the WordCount example could look like this:</p>
-<pre class="prettyprint"><code class="language-java hljs "><span 
class="hljs-annotation">@ApplicationAnnotation</span>(name=<span 
class="hljs-string">"MyFirstApplication"</span>)
-<span class="hljs-keyword">public</span> <span class="hljs-class"><span 
class="hljs-keyword">class</span> <span class="hljs-title">Application</span> 
<span class="hljs-keyword">implements</span> <span 
class="hljs-title">StreamingApplication</span>
-{</span>
-  <span class="hljs-annotation">@Override</span>
-  <span class="hljs-keyword">public</span> <span 
class="hljs-keyword">void</span> <span 
class="hljs-title">populateDAG</span>(DAG dag, Configuration conf)
-  {
-    LineReader lineReader = dag.addOperator(<span 
class="hljs-string">"input"</span>, <span class="hljs-keyword">new</span> 
LineReader());
-    Parser parser = dag.addOperator(<span class="hljs-string">"parser"</span>, 
<span class="hljs-keyword">new</span> Parser());
-    UniqueCounter&lt;String&gt; counter = dag.addOperator(<span 
class="hljs-string">"counter"</span>, <span class="hljs-keyword">new</span> 
UniqueCounter&lt;String&gt;());
-    ConsoleOutputOperator cons = dag.addOperator(<span 
class="hljs-string">"console"</span>, <span class="hljs-keyword">new</span> 
ConsoleOutputOperator());
-    dag.addStream(<span class="hljs-string">"lines"</span>, lineReader.output, 
parser.input);
-    dag.addStream(<span class="hljs-string">"words"</span>, parser.output, 
counter.data);
-    dag.addStream(<span class="hljs-string">"counts"</span>, counter.count, 
cons.input);
-  }
-}</code></pre>
-<p>The same WordCount application can be specified through JSON format 
(typically generated by a tool, such as the DataTorrent RTS application builder 
known as dtAssemble):</p>
-<pre class="prettyprint"><code class="language-json hljs ">{
-  "<span class="hljs-attribute">displayName</span>": <span 
class="hljs-value"><span class="hljs-string">"WordCountJSON"</span></span>,
-  "<span class="hljs-attribute">operators</span>": <span class="hljs-value">[
-    {
-      "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"input"</span></span>,
-      ...
-    },
-    {
-      "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"parse"</span></span>,
-      ...
-    },
-    {
-      "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"count"</span></span>,
-      "<span class="hljs-attribute">class</span>": <span 
class="hljs-value"><span 
class="hljs-string">"com.datatorrent.lib.algo.UniqueCounter"</span></span>,
-      "<span class="hljs-attribute">properties</span>": <span 
class="hljs-value">{
-        "<span 
class="hljs-attribute">com.datatorrent.lib.algo.UniqueCounter</span>": <span 
class="hljs-value">{
-          "<span class="hljs-attribute">cumulative</span>": <span 
class="hljs-value"><span class="hljs-literal">false</span>
-        </span>}
-      </span>}
-    </span>},
-    {
-      "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"console"</span></span>,
-      ...
-    }
-  ]</span>,
-  "<span class="hljs-attribute">streams</span>": <span class="hljs-value">[
-    {
-      "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"lines"</span></span>,
-      "<span class="hljs-attribute">sinks</span>": <span class="hljs-value">[
-        {
-          "<span class="hljs-attribute">operatorName</span>": <span 
class="hljs-value"><span class="hljs-string">"parse"</span></span>,
-          "<span class="hljs-attribute">portName</span>": <span 
class="hljs-value"><span class="hljs-string">"input"</span>
-        </span>}
-      ]</span>,
-      "<span class="hljs-attribute">source</span>": <span class="hljs-value">{
-        "<span class="hljs-attribute">operatorName</span>": <span 
class="hljs-value"><span class="hljs-string">"input"</span></span>,
-        "<span class="hljs-attribute">portName</span>": <span 
class="hljs-value"><span class="hljs-string">"output"</span>
-      </span>}
-    </span>},
-    {
-      "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"words"</span></span>,
-      ...
-    },
-    {
-      "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"counts"</span></span>,
-      ...
-    }
-  ]
-</span>}</code></pre>
-<p>As mentioned previously, the DAG can also be modified after an application has been launched. In the following example we add another console operator to display the lines emitted by the input operator:</p>
-<pre class="prettyprint"><code class="language-bash hljs ">Connected to 
application application_1442901180806_0001
-dt (application_1442901180806_0001) &gt; begin-logical-plan-change 
-logical-plan-change (application_1442901180806_0001) &gt; create-operator 
linesConsole com.datatorrent.lib.io.ConsoleOutputOperator
-logical-plan-change (application_1442901180806_0001) &gt; add-stream-sink 
lines linesConsole input
-logical-plan-change (application_1442901180806_0001) &gt; submit 
-{}</code></pre>
-<p><img 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/wordcount-dag2.png"; 
alt="Altered WordCount DAG" title=""></p>
-<h3 id="translation-of-logical-dag-into-physical-plan">Translation of logical 
DAG into physical plan</h3>
-<p>Users specify the logical DAG. This logical representation is provided to the Apex client that bootstraps an application. When running on a <a href="http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/"; target="_blank">YARN</a> cluster, this client will launch the StrAM (Streaming Application Master) along with the logical plan, and then exit. StrAM takes over and, as a first task, converts the logical DAG into a physical plan.</p>
-<p>To do so, StrAM assigns the operators within the DAG to containers, which will later correspond to actual YARN containers in the execution layer. You can influence many aspects of this translation using (optional) attributes in the logical plan, as shown in the sketch after the list below. The physical plan layout determines the performance and scalability of the application, which is why the configuration will typically specify more attributes as the application evolves.</p>
-<p>Here are a few examples of attributes:</p>
-<ul>
-<li>The amount of memory that an operator requires</li>
-<li>The operator partitioning</li>
-<li>Affinity between operators (aka stream locality)</li>
-<li>Windows (sliding, tumbling)</li>
-<li>Checkpointing</li>
-<li>JVM options for a container process</li>
-<li>Timeout and interval settings for monitoring</li>
-<li>Queue sizes</li>
-</ul>
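-<p>As a sketch of how a few of these attributes could be set while populating the WordCount DAG from the example above, the snippet below requests operator memory, a custom checkpoint interval, and a stream locality (the “lines” stream is added here with an explicit locality instead of the plain addStream call). The attribute names come from the Apex Context.OperatorContext and DAG.Locality APIs, although the exact set available may vary by version:</p>
-<pre class="prettyprint"><code class="language-java">import com.datatorrent.api.Context.OperatorContext;
-import com.datatorrent.api.DAG;
-import com.datatorrent.api.DAG.Locality;
-
-public class AttributeTuning
-{
-  // Illustrative helper, called from populateDAG() with the operators that
-  // were just added to the DAG in the WordCount example above.
-  static void tune(DAG dag, LineReader lineReader, Parser parser)
-  {
-    // Ask StrAM to request a 1 GB container for the parser operator.
-    dag.setAttribute(parser, OperatorContext.MEMORY_MB, 1024);
-
-    // Checkpoint the parser every 120 streaming windows
-    // (about 60 s with the default 500 ms windows).
-    dag.setAttribute(parser, OperatorContext.CHECKPOINT_WINDOW_COUNT, 120);
-
-    // Affinity (stream locality): add the "lines" stream with an explicit
-    // locality so both ends run in the same container and bypass the buffer server.
-    dag.addStream("lines", lineReader.output, parser.input)
-       .setLocality(Locality.CONTAINER_LOCAL);
-  }
-}</code></pre>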
-<p><img 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/Physical-Plan.png"; 
alt="Physical Plan" title=""></p>
-<h3 id="the-physical-plan-works-as-the-blueprint-for-the-execution-layer">The 
physical plan works as the blueprint for the execution layer</h3>
-<p>The physical plan lays the foundation for the execution layer, but because 
both are distinct, the same physical plan can be mapped to different execution 
layers. </p>
-<p>Apex was designed to run on YARN natively and take full advantage of its 
features. When executing on YARN, resource scheduling and allocation are the 
responsibility of the underlying infrastructure. </p>
-<p>There is only one other execution-layer implementation, intended for development purposes: local mode, which hosts an entire application within a single JVM. This makes it possible to do all work, including functional testing and efficient debugging, within an IDE before packaging the application and taking it to the YARN cluster.</p>
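-<p>For example, here is a minimal sketch of running the WordCount application from the example above in local mode, e.g. from a unit test or a main method inside the IDE. It assumes the LocalMode utility from the Apex API and a Hadoop Configuration on the classpath:</p>
-<pre class="prettyprint"><code class="language-java">import org.apache.hadoop.conf.Configuration;
-
-import com.datatorrent.api.LocalMode;
-
-public class WordCountLocalRun
-{
-  public static void main(String[] args) throws Exception
-  {
-    // Build the same logical plan that would be submitted to YARN,
-    // but host the whole application inside this JVM.
-    LocalMode lma = LocalMode.newInstance();
-    lma.prepareDAG(new Application(), new Configuration(false));
-
-    // Run the DAG in-process for ten seconds, then stop.
-    LocalMode.Controller controller = lma.getController();
-    controller.run(10000);
-  }
-}</code></pre>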
-<h3 id="executing-the-physical-plan-on-yarn">Executing the physical plan on 
YARN</h3>
-<p>When running on YARN, each container in the physical plan is mapped to a separate process (an actual YARN container). The containers are requested by StrAM based on the resource specification prescribed by the physical plan. Once the resource manager allocates a container, StrAM will launch the process on the respective node manager. We call these processes Streaming Containers, reflecting their role in facilitating the data flow. Each container, once launched, initiates the heartbeat protocol for passing status information about the execution to StrAM.</p>
-<p><img 
src="https://www.datatorrent.com/wp-content/uploads/2015/10/execution-2.jpg"; 
alt="Execution Layer" title=""></p>
-<p>Each container provisions a buffer server – the component that enables the pub/sub-based inter-process data flow. Once all containers are up and StrAM knows the buffer server endpoints, deployment of operators commences. The deploy instructions (and other control commands) are passed as part of the heartbeat response from StrAM to the streaming containers. There is no further scheduling or provisioning activity unless a process fails or an operator is undeployed due to dynamic changes in the physical plan.</p>
-<h3 id="deployment-of-operators">Deployment of operators</h3>
-<p>The container, upon receiving the operator deployment request from StrAM, will bring the operator to life from its frozen state (the initial checkpoint). It will create a separate thread for each operator, in which all user code will run (except, of course, where operators share a thread because the user defined a stream as <code>THREAD_LOCAL</code>). User code comprises all the callbacks defined in the <code>Operator</code> interface. The nice thing about this is that the user code is not concerned with thread synchronization, making it easier to develop and typically more efficient to run, as the heavy lifting is left to the engine and overhead is avoided.</p>
-<p>The very first thing after operator instantiation is the (one-time) call to its <code>setup</code> method, which gives the operator the opportunity to initialize state required for its processing before the streams are connected. There is also an optional interface, <code>ActivationListener</code>, whose <code>activate</code> method is called after the operator is wired and just before window processing starts.</p>
-<p>Now the operator is ready to process the data, framed in streaming windows. The engine will call <code>beginWindow</code>, then <code>process</code> on the respective input port(s) for every data tuple, and finally <code>endWindow</code>. This repeats until either something catastrophic happens or StrAM requests an operator undeploy due to dynamic plan changes. It is clear at this point that this lifecycle minimizes scheduling and the expense of bootstrapping processing: it is a one-time cost.</p>
-<p>There are a few other things that happen between invocations of the user code, demarcated by windows. For example, checkpoints are taken periodically (every 30s by default, tunable by the user). There are also optional callbacks defined by <code>CheckpointListener</code> that can be used to implement synchronization with external systems (think database transactions or the copying of finalized files, for example).</p>
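-<p>To illustrate these callbacks, here is a minimal sketch of an operator that buffers tuples per window and only finalizes them to an external system once the corresponding checkpoint has been committed. It assumes the Operator.CheckpointListener interface and BaseOperator base class from the Apex API; interface and package names may differ slightly between Apex versions:</p>
-<pre class="prettyprint"><code class="language-java">import java.util.ArrayList;
-import java.util.List;
-
-import com.datatorrent.api.Context.OperatorContext;
-import com.datatorrent.api.DefaultInputPort;
-import com.datatorrent.api.Operator;
-import com.datatorrent.common.util.BaseOperator;
-
-// Buffers tuples per streaming window and only finalizes them once StrAM
-// reports the corresponding checkpoint as committed across the DAG.
-public class CommittedWriter extends BaseOperator implements Operator.CheckpointListener
-{
-  private final List&lt;String&gt; pending = new ArrayList&lt;String&gt;();
-
-  public final transient DefaultInputPort&lt;String&gt; input = new DefaultInputPort&lt;String&gt;()
-  {
-    @Override
-    public void process(String tuple)
-    {
-      pending.add(tuple); // collected between beginWindow() and endWindow()
-    }
-  };
-
-  @Override
-  public void setup(OperatorContext context)
-  {
-    // One-time initialization, e.g. opening a connection to the external system.
-  }
-
-  @Override
-  public void beginWindow(long windowId)
-  {
-    // Marks the start of a streaming window.
-  }
-
-  @Override
-  public void endWindow()
-  {
-    // Window boundary reached; buffered state becomes part of the next checkpoint.
-  }
-
-  @Override
-  public void checkpointed(long windowId)
-  {
-    // The engine has persisted this operator's state up to windowId.
-  }
-
-  @Override
-  public void committed(long windowId)
-  {
-    // All operators have checkpointed past windowId; it is now safe to finalize
-    // side effects (e.g. commit a database transaction) and clear the buffer.
-    pending.clear();
-  }
-}</code></pre>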
-<h3 id="monitoring-the-execution">Monitoring the execution</h3>
-<p>Once the containers are fully provisioned, StrAM records the periodic heartbeat updates and watches operator processing as data flows through the pipeline. StrAM does not contribute to the data flow itself; processing is decentralized and asynchronous. StrAM collects the stats from the heartbeats and uses them to provide the central view of the execution. For example, it calculates latency based on the window timestamps that are reported, which is vital in identifying processing bottlenecks. It also uses the window information to monitor the progress of operators and identify operators that are stuck (and, when necessary, restarts them, with an interval controllable by the user). StrAM also hosts a REST API that clients such as the CLI can use to collect data. Here is an example of the information that can be obtained through this API:</p>
-<pre class="prettyprint"><code class="language-json hljs ">  {
-    "<span class="hljs-attribute">id</span>": <span class="hljs-value"><span 
class="hljs-string">"3"</span></span>,
-    "<span class="hljs-attribute">name</span>": <span class="hljs-value"><span 
class="hljs-string">"counter"</span></span>,
-    "<span class="hljs-attribute">className</span>": <span 
class="hljs-value"><span 
class="hljs-string">"com.datatorrent.lib.algo.UniqueCounter"</span></span>,
-    "<span class="hljs-attribute">container</span>": <span 
class="hljs-value"><span 
class="hljs-string">"container_1443668714920_0001_01_000003"</span></span>,
-    "<span class="hljs-attribute">host</span>": <span class="hljs-value"><span 
class="hljs-string">"localhost:8052"</span></span>,
-    "<span class="hljs-attribute">totalTuplesProcessed</span>": <span 
class="hljs-value"><span class="hljs-string">"198"</span></span>,
-    "<span class="hljs-attribute">totalTuplesEmitted</span>": <span 
class="hljs-value"><span class="hljs-string">"1"</span></span>,
-    "<span class="hljs-attribute">tuplesProcessedPSMA</span>": <span 
class="hljs-value"><span class="hljs-string">"0"</span></span>,
-    "<span class="hljs-attribute">tuplesEmittedPSMA</span>": <span 
class="hljs-value"><span class="hljs-string">"0"</span></span>,
-    "<span class="hljs-attribute">cpuPercentageMA</span>": <span 
class="hljs-value"><span class="hljs-string">"1.5208279931258353"</span></span>,
-    "<span class="hljs-attribute">latencyMA</span>": <span 
class="hljs-value"><span class="hljs-string">"10"</span></span>,
-    "<span class="hljs-attribute">status</span>": <span 
class="hljs-value"><span class="hljs-string">"ACTIVE"</span></span>,
-    "<span class="hljs-attribute">lastHeartbeat</span>": <span 
class="hljs-value"><span class="hljs-string">"1443670671506"</span></span>,
-    "<span class="hljs-attribute">failureCount</span>": <span 
class="hljs-value"><span class="hljs-string">"0"</span></span>,
-    "<span class="hljs-attribute">recoveryWindowId</span>": <span 
class="hljs-value"><span 
class="hljs-string">"6200516265145009027"</span></span>,
-    "<span class="hljs-attribute">currentWindowId</span>": <span 
class="hljs-value"><span 
class="hljs-string">"6200516265145009085"</span></span>,
-    "<span class="hljs-attribute">ports</span>": <span class="hljs-value">[
-      {
-        "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"data"</span></span>,
-        "<span class="hljs-attribute">type</span>": <span 
class="hljs-value"><span class="hljs-string">"input"</span></span>,
-        "<span class="hljs-attribute">totalTuples</span>": <span 
class="hljs-value"><span class="hljs-string">"198"</span></span>,
-        "<span class="hljs-attribute">tuplesPSMA</span>": <span 
class="hljs-value"><span class="hljs-string">"0"</span></span>,
-        "<span class="hljs-attribute">bufferServerBytesPSMA</span>": <span 
class="hljs-value"><span class="hljs-string">"16"</span></span>,
-        "<span class="hljs-attribute">queueSizeMA</span>": <span 
class="hljs-value"><span class="hljs-string">"1"</span></span>,
-        "<span class="hljs-attribute">recordingId</span>": <span 
class="hljs-value"><span class="hljs-literal">null</span>
-      </span>},
-      {
-        "<span class="hljs-attribute">name</span>": <span 
class="hljs-value"><span class="hljs-string">"count"</span></span>,
-        "<span class="hljs-attribute">type</span>": <span 
class="hljs-value"><span class="hljs-string">"output"</span></span>,
-        "<span class="hljs-attribute">totalTuples</span>": <span 
class="hljs-value"><span class="hljs-string">"1"</span></span>,
-        "<span class="hljs-attribute">tuplesPSMA</span>": <span 
class="hljs-value"><span class="hljs-string">"0"</span></span>,
-        "<span class="hljs-attribute">bufferServerBytesPSMA</span>": <span 
class="hljs-value"><span class="hljs-string">"12"</span></span>,
-        "<span class="hljs-attribute">queueSizeMA</span>": <span 
class="hljs-value"><span class="hljs-string">"0"</span></span>,
-        "<span class="hljs-attribute">recordingId</span>": <span 
class="hljs-value"><span class="hljs-literal">null</span>
-      </span>}
-    ]</span>,
-    "<span class="hljs-attribute">unifierClass</span>": <span 
class="hljs-value"><span class="hljs-literal">null</span></span>,
-    "<span class="hljs-attribute">logicalName</span>": <span 
class="hljs-value"><span class="hljs-string">"counter"</span></span>,
-    "<span class="hljs-attribute">recordingId</span>": <span 
class="hljs-value"><span class="hljs-literal">null</span></span>,
-    "<span class="hljs-attribute">counters</span>": <span 
class="hljs-value"><span class="hljs-literal">null</span></span>,
-    "<span class="hljs-attribute">metrics</span>": <span 
class="hljs-value">{}</span>,
-    "<span class="hljs-attribute">checkpointStartTime</span>": <span 
class="hljs-value"><span class="hljs-string">"1443670642472"</span></span>,
-    "<span class="hljs-attribute">checkpointTime</span>": <span 
class="hljs-value"><span class="hljs-string">"42"</span></span>,
-    "<span class="hljs-attribute">checkpointTimeMA</span>": <span 
class="hljs-value"><span class="hljs-string">"129"</span>
-  </span>}</code></pre>
-<p>This blog covered the lifecycle of a DAG. Future posts will cover the 
inside view of the Apex engine, including checkpointing, processing semantics, 
partitioning and more. Watch this space! </p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/";>Tracing
 DAGs from specification to execution</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></content:encoded>
-                       
<wfw:commentRss>https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/feed/</wfw:commentRss>
-               <slash:comments>0</slash:comments>
-               </item>
-               <item>
-               <title>Meet &amp; Name the Apache Apex Logo</title>
-               
<link>https://www.datatorrent.com/name-the-apache-apex-logo/</link>
-               
<comments>https://www.datatorrent.com/name-the-apache-apex-logo/#comments</comments>
-               <pubDate>Fri, 25 Sep 2015 15:02:32 +0000</pubDate>
-               <dc:creator><![CDATA[John Fanelli]]></dc:creator>
-                               <category><![CDATA[Big Data in Everyday 
Life]]></category>
-               <category><![CDATA[How-to]]></category>
-               <category><![CDATA[Apache Apex]]></category>
-               <category><![CDATA[Big Data]]></category>
-               <category><![CDATA[Fast Batch]]></category>
-               <category><![CDATA[Logo]]></category>
-               <category><![CDATA[Streaming]]></category>
-
-               <guid 
isPermaLink="false">https://www.datatorrent.com/?p=2113</guid>
-               <description><![CDATA[<p>Apache Apex, the open source, 
enterprise-grade unified stream and fast batch processing engine, is gathering 
momentum staggeringly fast. The timeline has been aggressive: Project Apex was 
announced on June 5, code was dropped on GitHub July 30 and an Apache 
Incubation proposal was posted on August 12 and accepted on August 17. Apache 
Apex has [&#8230;]</p>
-<p>The post <a rel="nofollow" 
href="https://www.datatorrent.com/name-the-apache-apex-logo/";>Meet &#038; Name 
the Apache Apex Logo</a> appeared first on <a rel="nofollow" 
href="https://www.datatorrent.com";>DataTorrent</a>.</p>
-]]></description>
-                               <content:encoded><![CDATA[<p>Apache Apex, the 
open source, enterprise-grade unified stream and fast batch processing engine, 
is gathering momentum staggeringly fast.</p>
-<p>The timeline has been aggressive: Project Apex <a 
href="https://www.datatorrent.com/press-releases/datatorrent-open-sources-datatorrent-rts-industrys-only-enterprise-grade-unified-stream-and-batch-processing-platform/";>was
 announced</a> on June 5, code was dropped on <a 
href="https://github.com/apache/incubator-apex-core";>GitHub</a> July 30 and an 
<a href="https://wiki.apache.org/incubator/ApexProposal";>Apache Incubation 
proposal</a> was posted on August 12 and <a 
href="https://www.datatorrent.com/apex-accepted-as-apache-incubator-project/";>accepted</a>
 on August 17.</p>
-<p>Apache Apex has hit some great milestones already. We are just past the one 
month anniversary and Apache Apex has already <a 
href="http://www.infoworld.com/article/2982429/open-source-tools/bossie-awards-2015-the-best-open-source-big-data-tools.html#slide5";>been
 named one of the best open source big data tools</a> of 2015 by InfoWorld, 
hosted its <a href="http://www.meetup.com/Apex-Bay-Area-Chapter/";>first 
meetup</a> and <a href="https://twitter.com/ApacheApex";>@ApacheApex</a> is 
quickly gaining Twitter followers.</p>
-<p><strong>What’s in a logo?<br />
-</strong>Today, we are pleased to introduce the Apache Apex logo. Meet, uhm, 
actually, he or she doesn&#8217;t have a name yet, and we need your help here! 
Well, let me describe what (he or she) represents first.</p>
-<p>Apache Apex has a lofty goal to be at the top of its game or at the 
“Apex” of stream and batch processing pipeline engines. As such, you will 
see a mountain peak that reflects our aspiration to always be the best, and at 
the peak of our industry.</p>
-<p>As a YARN native application, Apache Apex not only runs on, but is also 
optimized for Hadoop deployments. In a nod to that design center, the logo 
acknowledges the Hadoop foundation we have built on with feet similar to 
Hadoop’s logo.</p>
-<p>Finally, open <a href="http://www.mysql.com/common/logos/logo-mysql-110x5

<TRUNCATED>

Reply via email to