This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-dev-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8c01223 2025/02/17 04:50:48: Generated dev website from
groovy-website@81ba492
8c01223 is described below
commit 8c012239d66cf437939798851342abaa0f62f185
Author: jenkins <[email protected]>
AuthorDate: Mon Feb 17 04:50:48 2025 +0000
2025/02/17 04:50:48: Generated dev website from groovy-website@81ba492
---
blog/feed.atom | 4 +-
blog/index.html | 2 +-
blog/using-groovy-with-apache-wayang.html | 259 ++++++++++++++++++++----------
3 files changed, 174 insertions(+), 91 deletions(-)
diff --git a/blog/feed.atom b/blog/feed.atom
index e03525b..2aef4e9 100644
--- a/blog/feed.atom
+++ b/blog/feed.atom
@@ -4,7 +4,7 @@
<link href="http://groovy.apache.org/blog"/>
<link href="http://groovy.apache.org/blog/feed.atom" rel="self"/>
<id>http://groovy.apache.org/blog</id>
- <updated>2024-12-22T08:45:00Z</updated>
+ <updated>2025-02-15T14:30:00Z</updated>
<entry>
<id>http://groovy.apache.org/blog/groovy-lucene</id>
<author>
@@ -639,7 +639,7 @@
</author>
<title type="html">Using Groovy with Apache Wayang and Apache Spark</title>
<link
href="http://groovy.apache.org/blog/using-groovy-with-apache-wayang"/>
- <updated>2022-06-19T13:01:07Z</updated>
+ <updated>2025-02-15T14:30:00Z</updated>
<published>2022-06-19T13:01:07Z</published>
<summary type="html">This post looks at using Apache Wayang and Apache
Spark with Apache Groovy to cluster various Whiskies.</summary>
</entry>
diff --git a/blog/index.html b/blog/index.html
index 7157f79..4b36fe6 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -53,7 +53,7 @@
</ul>
</div>
</div>
- </div><div id='content' class='page-1'><div
class='row'><div class='row-fluid'><div class='col-lg-3' id='blog-index'><ul
class='nav-sidebar list'><li class='active'><a
href='/blog/'>Blogs</a></li><li><a href='groovy-lucene'>Searching with
Lucene</a></li><li><a href='groovy-graph-databases'>Using Graph Databases with
Groovy</a></li><li><a
href='solving-simple-optimization-problems-with-groovy'>Solving simple
optimization problems with Groovy using Commons Math, Hip [...]
+ </div><div id='content' class='page-1'><div
class='row'><div class='row-fluid'><div class='col-lg-3' id='blog-index'><ul
class='nav-sidebar list'><li class='active'><a
href='/blog/'>Blogs</a></li><li><a href='using-groovy-with-apache-wayang'>Using
Groovy with Apache Wayang and Apache Spark</a></li><li><a
href='groovy-lucene'>Searching with Lucene</a></li><li><a
href='groovy-graph-databases'>Using Graph Databases with Groovy</a></li><li><a
href='solving-simple-opti [...]
<div class='row'>
<div class='colset-3-footer'>
<div class='col-1'>
diff --git a/blog/using-groovy-with-apache-wayang.html
b/blog/using-groovy-with-apache-wayang.html
index b5b9df3..4a02d6d 100644
--- a/blog/using-groovy-with-apache-wayang.html
+++ b/blog/using-groovy-with-apache-wayang.html
@@ -53,7 +53,13 @@
</ul>
</div>
</div>
- </div><div id='content' class='page-1'><div
class='row'><div class='row-fluid'><div class='col-lg-3'><ul
class='nav-sidebar'><li><a href='./'>Blog index</a></li><li class='active'><a
href='#doc'>Using Groovy with Apache Wayang and Apache Spark</a></li><li><a
href='#_whiskey_clustering' class='anchor-link'>Whiskey
Clustering</a></li><li><a href='#_implementation_details'
class='anchor-link'>Implementation Details</a></li><li><a
href='#_running_with_the_java_streams [...]
+ </div><div id='content' class='page-1'><div
class='row'><div class='row-fluid'><div class='col-lg-3'><ul
class='nav-sidebar'><li><a href='./'>Blog index</a></li><li class='active'><a
href='#doc'>Using Groovy with Apache Wayang and Apache Spark</a></li><li><a
href='#_whiskey_clustering' class='anchor-link'>Whiskey
Clustering</a></li><li><a href='#_implementation_details'
class='anchor-link'>Implementation Details</a></li><li><a
href='#_running_with_the_java_streams [...]
+<a href="https://github.com/paulk-asert/" target="_blank" rel="noopener
noreferrer"><img style="border-radius:50%;height:48px;width:auto"
src="https://github.com/paulk-asert.png" alt="Paul King"></a>
+<div style="display:grid;align-items:center;margin:0.1ex;padding:0ex">
+ <div><a href="https://github.com/paulk-asert/" target="_blank" rel="noopener
noreferrer"><span>Paul King</span></a></div>
+ <div><small><i>PMC Member</i></small></div>
+</div>
+ </div><br/><span>Published: 2022-06-19 01:01PM (Last updated:
2025-02-15 02:30PM)</span></p><hr/><div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p><span class="image right"><img
src="https://www.apache.org/logos/res/wayang/default.png" alt="wayang logo"
width="100"></span>
@@ -115,34 +121,35 @@ in that cluster.</p>
</div>
<div class="listingblock">
<div class="content">
-<pre class="prettyprint highlight"><code data-lang="groovy">record
Point(double[] pts) implements Serializable {
- static Point fromLine(String line) { new
Point(line.split(',')[2..-1]*.toDouble() as double[]) }
-}</code></pre>
+<pre class="prettyprint highlight"><code data-lang="groovy">record
Point(double[] pts) implements Serializable { }</code></pre>
</div>
</div>
<div class="paragraph">
-<p>We’ve made it <code>Serializable</code> (more on that later) and
included
-a <code>fromLine</code> factory method to help us make points from a CSV
-file. We’ll do that ourselves rather than rely on other libraries
-which could assist. It’s not a 2D or 3D point for us but 12D
+<p>We’ve made it <code>Serializable</code> (more on that later).
+It’s not a 2D or 3D point for us but 12D
corresponding to the 12 criteria. We just use a <code>double</code> array,
so any dimension would be supported but the 12 comes from the
number of columns in our data file.</p>
</div>
<div class="paragraph">
-<p>We’ll define a related <code>TaggedPointCounter</code> record.
It’s like
+<p>We’ll define a related <code>PointGrouping</code> record. It’s
like
<code>Point</code> but tracks an <code>int</code> cluster id and
<code>long</code> count used
when clustering the points:</p>
</div>
<div class="listingblock">
<div class="content">
-<pre class="prettyprint highlight"><code data-lang="groovy">record
TaggedPointCounter(double[] pts, int cluster, long count) implements
Serializable {
- TaggedPointCounter plus(TaggedPointCounter that) {
- new TaggedPointCounter((0..<pts.size()).collect{ pts[it] +
that.pts[it] } as double[], cluster, count + that.count)
+<pre class="prettyprint highlight"><code data-lang="groovy">record
PointGrouping(double[] pts, int cluster, long count) implements Serializable {
+ PointGrouping(List<Double> pts, int cluster, long count) {
+ this(pts as double[], cluster, count)
+ }
+
+ PointGrouping plus(PointGrouping that) {
+ var newPts = pts.indices.collect{ pts[it] + that.pts[it] }
+ new PointGrouping(newPts, cluster, count + that.count)
}
- TaggedPointCounter average() {
- new TaggedPointCounter(pts.collect{ double d -> d/count } as
double[], cluster, 0)
+ PointGrouping average() {
+ new PointGrouping(pts.collect{ double d -> d/count }, cluster, 1)
}
}</code></pre>
</div>
@@ -162,24 +169,26 @@ class to capture this part of the algorithm:</p>
</div>
<div class="listingblock">
<div class="content">
-<pre class="prettyprint highlight"><code data-lang="groovy">class
SelectNearestCentroid implements ExtendedSerializableFunction<Point,
TaggedPointCounter> {
- Iterable<TaggedPointCounter> centroids
+<pre class="prettyprint highlight"><code data-lang="groovy">class
SelectNearestCentroid implements ExtendedSerializableFunction<Point,
PointGrouping> {
+ Iterable<PointGrouping> centroids
void open(ExecutionContext context) {
- centroids = context.getBroadcast("centroids")
+ centroids = context.getBroadcast('centroids')
}
- TaggedPointCounter apply(Point p) {
- def minDistance = Double.POSITIVE_INFINITY
- def nearestCentroidId = -1
+ PointGrouping apply(Point p) {
+ var minDistance = Double.POSITIVE_INFINITY
+ var nearestCentroidId = -1
for (c in centroids) {
- def distance = sqrt((0..<p.pts.size()).collect{ p.pts[it] -
c.pts[it] }.sum{ it ** 2 } as double)
+ var distance = sqrt(p.pts.indices
+ .collect{ p.pts[it] - c.pts[it] }
+ .sum{ it ** 2 } as double)
if (distance < minDistance) {
minDistance = distance
nearestCentroidId = c.cluster
}
}
- new TaggedPointCounter(p.pts, nearestCentroidId, 1)
+ new PointGrouping(p.pts, nearestCentroidId, 1)
}
}</code></pre>
</div>
@@ -191,25 +200,15 @@ functionality where an optimization decision can be made
about
where to run the operation.</p>
</div>
<div class="paragraph">
-<p>Once we get to using Spark, the classes in the map/reduce part
-of our algorithm will need to be serializable. Method closures
-in dynamic Groovy aren’t serializable. We have a few options to
-avoid using them. I’ll show one approach here which is to use
-some helper classes in places where we might typically use method
-references. Here are the helper classes:</p>
+<p>To make our pipeline definitions a little shorter,
+we’ll define some useful operators in a <code>PipelineOps</code> helper
class:</p>
</div>
<div class="listingblock">
<div class="content">
-<pre class="prettyprint highlight"><code data-lang="groovy">class Cluster
implements SerializableFunction<TaggedPointCounter, Integer> {
- Integer apply(TaggedPointCounter tpc) { tpc.cluster() }
-}
-
-class Average implements SerializableFunction<TaggedPointCounter,
TaggedPointCounter> {
- TaggedPointCounter apply(TaggedPointCounter tpc) { tpc.average() }
-}
-
-class Plus implements SerializableBinaryOperator<TaggedPointCounter> {
- TaggedPointCounter apply(TaggedPointCounter tpc1, TaggedPointCounter tpc2)
{ tpc1.plus(tpc2) }
+<pre class="prettyprint highlight"><code data-lang="groovy">class PipelineOps {
+ public static SerializableFunction<PointGrouping, Integer> cluster =
tpc -> tpc.cluster
+ public static SerializableFunction<PointGrouping, PointGrouping>
average = tpc -> tpc.average()
+ public static SerializableBinaryOperator<PointGrouping> plus =
(tpc1, tpc2) -> tpc1 + tpc2
}</code></pre>
</div>
</div>
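Wayang's `SerializableFunction` and `SerializableBinaryOperator` accept lambdas because a lambda whose target type extends `Serializable` can itself be Java-serialized and shipped to remote workers. Here is a minimal sketch of that mechanism in plain Java; the `SerFunction` interface and `roundTrip` helper are our own illustrative names, not Wayang API:

```java
import java.io.*;
import java.util.function.Function;

public class SerializableLambda {
    // A functional interface that is also Serializable, analogous in spirit
    // to Wayang's SerializableFunction: lambdas targeting it can be serialized
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    // Serialize an object to bytes and read it back, proving it round-trips
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        SerFunction<Integer, Integer> square = x -> x * x;
        SerFunction<Integer, Integer> copy = roundTrip(square);
        System.out.println(copy.apply(7)); // prints 49
    }
}
```

An ordinary `java.util.function.Function` lambda would fail the same round-trip with a `NotSerializableException`, which is why distributed engines such as Spark insist on serializable variants of the functional interfaces.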
@@ -219,42 +218,43 @@ class Plus implements
SerializableBinaryOperator<TaggedPointCounter> {
<div class="listingblock">
<div class="content">
<pre class="prettyprint highlight"><code data-lang="groovy">int k = 5
-int iterations = 20
+int iterations = 10
// read in data from our file
-def url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
-def pointsData = new File(url).readLines()[1..-1].collect{ Point.fromLine(it) }
-def dims = pointsData[0].pts().size()
+var url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
+def rows = new File(url).readLines()[1..-1]*.split(',')
+var distilleries = rows*.getAt(1)
+var pointsData = rows.collect{ new Point(it[2..-1] as double[]) }
+var dims = pointsData[0].pts.size()
// create some random points as initial centroids
-def r = new Random()
-def initPts = (1..k).collect { (0..<dims).collect { r.nextGaussian() + 2 }
as double[] }
+var r = new Random()
+var randomPoint = { (0..<dims).collect { r.nextGaussian() + 2 } as double[]
}
+var initPts = (1..k).collect(randomPoint)
-// create planbuilder with Java and Spark enabled
-def configuration = new Configuration()
-def context = new WayangContext(configuration)
+var context = new WayangContext()
.withPlugin(Java.basicPlugin())
.withPlugin(Spark.basicPlugin())
-def planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k,
iterations=$iterations)")
+var planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k,
iterations=$iterations)")
-def points = planBuilder
+var points = planBuilder
.loadCollection(pointsData).withName('Load points')
-def initialCentroids = planBuilder
- .loadCollection((0..<k).collect{ idx -> new
TaggedPointCounter(initPts[idx], idx, 0) })
- .withName("Load random centroids")
+var initialCentroids = planBuilder
+ .loadCollection((0..<k).collect{ idx -> new
PointGrouping(initPts[idx], idx, 0) })
+ .withName('Load random centroids')
-def finalCentroids = initialCentroids
- .repeat(iterations, currentCentroids ->
- points.map(new SelectNearestCentroid())
- .withBroadcast(currentCentroids, "centroids").withName("Find
nearest centroid")
- .reduceByKey(new Cluster(), new Plus()).withName("Add up points")
- .map(new Average()).withName("Average points")
- .withOutputClass(TaggedPointCounter)).withName("Loop").collect()
+var finalCentroids = initialCentroids.repeat(iterations, currentCentroids ->
+ points.map(new SelectNearestCentroid())
+ .withBroadcast(currentCentroids, 'centroids').withName('Find nearest
centroid')
+ .reduceByKey(cluster, plus).withName('Aggregate points')
+ .map(average).withName('Average points')
+ .withOutputClass(PointGrouping)
+).withName('Loop').collect()
println 'Centroids:'
finalCentroids.each { c ->
- println "Cluster$c.cluster: ${c.pts.collect{ sprintf('%.3f', it) }.join(',
')}"
+ println "Cluster $c.cluster: ${c.pts.collect('%.2f'::formatted).join(',
')}"
}</code></pre>
</div>
</div>
@@ -272,6 +272,21 @@ at each iteration, all the points to their closest current
centroid and then calculating the new centroids given those
assignments. Finally, we output the results.</p>
</div>
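The loop just described — assign every point to its nearest centroid, then average each cluster — mirrors what `SelectNearestCentroid`, `plus` and `average` do inside the Wayang plan. Stripped of the distributed plumbing, one k-means iteration can be sketched in plain Java like this (the class name and toy data are ours, not from the post):

```java
import java.util.List;

public class KMeansSketch {
    // Squared Euclidean distance between two equal-dimension points
    static double dist2(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    // One k-means iteration: assign each point to its nearest centroid,
    // then replace each centroid with the average of its assigned points
    static double[][] iterate(List<double[]> points, double[][] centroids) {
        int k = centroids.length, dims = centroids[0].length;
        double[][] sums = new double[k][dims];
        long[] counts = new long[k];
        for (double[] p : points) {
            int nearest = 0;
            for (int c = 1; c < k; c++)
                if (dist2(p, centroids[c]) < dist2(p, centroids[nearest]))
                    nearest = c;
            for (int d = 0; d < dims; d++) sums[nearest][d] += p[d];
            counts[nearest]++;
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) { sums[c] = centroids[c]; continue; } // keep empty cluster in place
            for (int d = 0; d < dims; d++) sums[c][d] /= counts[c];
        }
        return sums;
    }

    public static void main(String[] args) {
        var points = List.of(new double[]{0, 0}, new double[]{0, 1},
                             new double[]{10, 10}, new double[]{10, 11});
        double[][] centroids = { {1, 1}, {9, 9} };
        for (int i = 0; i < 10; i++) centroids = iterate(points, centroids);
        for (double[] c : centroids) System.out.println(java.util.Arrays.toString(c));
    }
}
```

Running the iteration repeatedly until the centroids stop moving (or a fixed iteration budget is exhausted, as in the Wayang `repeat` call) completes the algorithm.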
+<div class="paragraph">
+<p>Optionally, we might want to print out the distilleries allocated to each
cluster.
+The code looks like this:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre class="prettyprint highlight"><code data-lang="groovy">var allocator =
new SelectNearestCentroid(centroids: finalCentroids)
+var allocations = pointsData.withIndex()
+ .collect{ pt, idx -> [allocator.apply(pt).cluster, distilleries[idx]] }
+ .groupBy{ cluster, ds -> "Cluster $cluster" }
+ .collectValues{ v -> v.collect{ it[1] } }
+ .sort{ it.key }
+allocations.each{ c, ds -> println "$c (${ds.size()} members): ${ds.join(',
')}" }</code></pre>
+</div>
+</div>
</div>
</div>
<div class="sect1">
@@ -295,10 +310,10 @@ the script is run, but here is one output:</p>
<div class="content">
<pre class="prettyprint highlight"><code data-lang="shell">> Task
:WhiskeyWayang:run
Centroids:
-Cluster0: 2.548, 2.419, 1.613, 0.194, 0.097, 1.871, 1.742, 1.774, 1.677,
1.935, 1.806, 1.613
-Cluster2: 1.464, 2.679, 1.179, 0.321, 0.071, 0.786, 1.429, 0.429, 0.964,
1.643, 1.929, 2.179
-Cluster3: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375,
1.375, 1.250, 0.250
-Cluster4: 1.684, 1.842, 1.211, 0.421, 0.053, 1.316, 0.632, 0.737, 1.895,
2.000, 1.842, 1.737
+Cluster0: 2.55, 2.42, 1.61, 0.19, 0.10, 1.87, 1.74, 1.77, 1.68, 1.93, 1.81,
1.61
+Cluster2: 1.46, 2.68, 1.18, 0.32, 0.07, 0.79, 1.43, 0.43, 0.96, 1.64, 1.93,
2.18
+Cluster3: 3.25, 1.50, 3.25, 3.00, 0.50, 0.25, 1.62, 0.37, 1.37, 1.37, 1.25,
0.25
+Cluster4: 1.68, 1.84, 1.21, 0.42, 0.05, 1.32, 0.63, 0.74, 1.89, 2.00, 1.84,
1.74
...</code></pre>
</div>
</div>
@@ -331,11 +346,9 @@ change in our code:</p>
<div class="listingblock">
<div class="content">
<pre class="prettyprint highlight"><code data-lang="groovy">...
-def configuration = new Configuration()
-def context = new WayangContext(configuration)
-// .withPlugin(Java.basicPlugin()) <b class="conum">(1)</b>
+var context = new WayangContext()
+// .withPlugin(Java.basicPlugin()) <b class="conum">(1)</b>
.withPlugin(Spark.basicPlugin())
-def planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k,
iterations=$iterations)")
...</code></pre>
</div>
</div>
@@ -353,15 +366,14 @@ Spark and Wayang log information - truncated for
presentation purposes):</p>
</div>
<div class="listingblock">
<div class="content">
-<pre>[main] INFO org.apache.spark.SparkContext - Running Spark version 3.3.0
+<pre>[main] INFO org.apache.spark.SparkContext - Running Spark version 3.5.4
[main] INFO org.apache.spark.util.Utils - Successfully started service
'sparkDriver' on port 62081.
...
Centroids:
-Cluster4: 1.414, 2.448, 0.966, 0.138, 0.034, 0.862, 1.000, 0.483, 1.345,
1.690, 2.103, 2.138
-Cluster0: 2.773, 2.455, 1.455, 0.000, 0.000, 1.909, 1.682, 1.955, 2.091,
2.045, 2.136, 1.818
-Cluster1: 1.762, 2.286, 1.571, 0.619, 0.143, 1.714, 1.333, 0.905, 1.190,
1.952, 1.095, 1.524
-Cluster2: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375,
1.375, 1.250, 0.250
-Cluster3: 2.167, 2.000, 2.167, 1.000, 0.333, 0.333, 2.000, 0.833, 0.833,
1.500, 2.333, 1.667
+Cluster 4: 1.63, 2.26, 1.68, 0.63, 0.16, 1.47, 1.42, 0.89, 1.16, 1.95, 0.89,
1.58
+Cluster 0: 2.76, 2.44, 1.44, 0.04, 0.00, 1.88, 1.68, 1.92, 1.92, 2.04, 2.16,
1.72
+Cluster 1: 3.11, 1.44, 3.11, 2.89, 0.56, 0.22, 1.56, 0.44, 1.44, 1.44, 1.33,
0.44
+Cluster 2: 1.52, 2.42, 1.09, 0.24, 0.06, 0.91, 1.09, 0.45, 1.30, 1.64, 2.18,
2.09
...
[shutdown-hook-0] INFO org.apache.spark.SparkContext - Successfully stopped
SparkContext
[shutdown-hook-0] INFO org.apache.spark.util.ShutdownHookManager - Shutdown
hook called</pre>
@@ -370,6 +382,66 @@ Cluster3: 2.167, 2.000, 2.167, 1.000, 0.333, 0.333, 2.000,
0.833, 0.833, 1.500,
</div>
</div>
<div class="sect1">
+<h2 id="_using_ml4all">Using ML4all</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>In recent versions of Wayang, a new abstraction, called ML4all, has been
introduced.
+It frees users from the burden of machine learning algorithm selection and
low-level implementation details. Many readers will be familiar with how
systems supporting
+<em>MapReduce</em> split functionality into <code>map</code>,
<code>filter</code> or <code>shuffle</code>, and <code>reduce</code> steps.
+ML4all abstracts machine learning algorithm functionality into 7 operators:
+<code>Transform</code>, <code>Stage</code>, <code>Compute</code>,
<code>Update</code>, <code>Sample</code>, <code>Converge</code>, and
<code>Loop</code>.</p>
+</div>
+<div class="paragraph">
+<p>Wayang comes bundled with implementations for many of these operators, but
+you can write your own, as we have done here for the Transform operator:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre class="prettyprint highlight"><code data-lang="groovy">class TransformCSV
extends Transform<double[], String> {
+ double[] transform(String input) {
+ input.split(',')[2..-1] as double[]
+ }
+}</code></pre>
+</div>
+</div>
+<div class="paragraph">
+<p>With this operator defined, we can now write our ML4all plan:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre class="prettyprint highlight"><code data-lang="groovy">var dims = 12
+var context = new WayangContext()
+ .withPlugin(Spark.basicPlugin())
+ .withPlugin(Java.basicPlugin())
+
+var plan = new ML4allPlan(
+ transformOp: new TransformCSV(),
+ localStage: new KMeansStageWithRandoms(k: k, dimension: dims),
+ computeOp: new KMeansCompute(),
+ updateOp: new KMeansUpdate(),
+ loopOp: new KMeansConvergeOrMaxIterationsLoop(accuracy, maxIterations)
+)
+
+var model = plan.execute('file:' + url, context)
+model.getByKey("centers").eachWithIndex { center, idx ->
+ var pts = center.collect('%.2f'::formatted).join(', ')
+ println "Cluster$idx: $pts"
+}</code></pre>
+</div>
+</div>
+<div class="paragraph">
+<p>When run we get this output:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre>Cluster0: 1.57, 2.32, 1.32, 0.45, 0.09, 1.08, 1.19, 0.60, 1.26, 1.74,
1.72, 1.85
+Cluster1: 3.43, 1.57, 3.43, 3.14, 0.57, 0.14, 1.71, 0.43, 1.29, 1.43, 1.29,
0.14
+Cluster2: 2.73, 2.42, 1.46, 0.04, 0.04, 1.88, 1.69, 1.88, 1.92, 2.04, 2.12,
1.81</pre>
+</div>
+</div>
+</div>
+</div>
+<div class="sect1">
<h2 id="_discussion">Discussion</h2>
<div class="sectionbody">
<div class="paragraph">
@@ -379,10 +451,9 @@ the abstractions aren’t perfect. As an example, if I
know I
am only using the streams-backed platform, I don’t need to worry
about making any of my classes serializable (which is a Spark
requirement). In our example, we could have omitted the
-<code>implements Serializable</code> part of the
<code>TaggedPointCounter</code> record,
-and we could have used a method reference
-<code>TaggedPointCounter::average</code> instead of our <code>Average</code>
-helper class. This isn’t meant to be a criticism of Wayang,
+<code>implements Serializable</code> part of the <code>PointGrouping</code>
record,
+and several of our pipeline operators could have been reduced to simple closures.
+This isn’t meant to be a criticism of Wayang,
after all if you want to write cross-platform UDFs, you might
expect to have to follow some rules. Instead, it is meant to
just indicate that abstractions often have leaks around the edges.
@@ -390,13 +461,9 @@ Sometimes those leaks can be beneficially used, other
times they
are traps waiting for unknowing developers.</p>
</div>
<div class="paragraph">
-<p>To summarise, if using the Java streams-backed platform, you can
-run the application on JDK17 (which uses native records) as well
-as JDK11 and JDK8 (where Groovy provides emulated records).
-Also, we could make numerous simplifications if we desired.
-When using the Spark processing platform, the potential
-simplifications aren’t applicable, and we can run on JDK8 and
-JDK11 (Spark isn’t yet supported on JDK17).</p>
+<p>We ran this example using JDK17, but on earlier
+JDK versions, Groovy will use emulated records
+instead of native records without changing the source code.</p>
</div>
</div>
</div>
@@ -434,16 +501,32 @@ in achieving this goal.</p>
<p>Repo containing the source code: <a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyWayang">WhiskeyWayang</a></p>
</li>
<li>
-<p>Repo containing similar examples using a variety of libraries including
Apache Commons CSV, Weka, Smile, Tribuo and others: <a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/Whiskey">Whiskey</a></p>
+<p>Repo containing solutions to this problem using a variety of
non-distributed libraries including Apache Commons CSV, Weka, Smile, Tribuo and
others:
+<a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/Whiskey">Whiskey</a></p>
+</li>
+<li>
+<p>A similar example using <a href="https://spark.apache.org/">Apache
Spark</a> directly but with a built-in parallelized KMeans from the
<code>spark-mllib</code> library:
+<a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeySpark">WhiskeySpark</a></p>
</li>
<li>
-<p>A similar example using Apache Spark directly but with a built-in
parallelized KMeans from the <code>spark-mllib</code> library rather than a
hand-crafted algorithm: <a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeySpark">WhiskeySpark</a></p>
+<p>A similar example using <a href="https://ignite.apache.org/">Apache
Ignite</a> with the built-in clustered KMeans from the <code>ignite-ml</code>
library:
+<a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyIgnite">WhiskeyIgnite</a></p>
</li>
<li>
-<p>A similar example using Apache Ignite directly but with a built-in
clustered KMeans from the <code>ignite-ml</code> library rather than a
hand-crafted algorithm: <a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyIgnite">WhiskeyIgnite</a></p>
+<p>A similar example using <a href="https://flink.apache.org/">Apache
Flink</a> with KMeans from the Flink ML (<code>flink-ml-uber</code>) library:
+<a
href="https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyFlink">WhiskeyFlink</a></p>
</li>
</ul>
</div>
+<div class="sidebarblock">
+<div class="content">
+<div class="title">Update history</div>
+<div class="paragraph">
+<p><strong>19/Jun/2022</strong>: Initial version.<br>
+<strong>15/Feb/2025</strong>: Updated for Apache Wayang 1.0.0.</p>
+</div>
+</div>
+</div>
</div>
</div></div></div></div></div><footer id='footer'>
<div class='row'>