This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 783d0dd  refine Matrix examples
783d0dd is described below

commit 783d0ddbf799ed4f83dd097d39a907e63a58d45e
Author: Paul King <[email protected]>
AuthorDate: Sun Apr 20 20:39:11 2025 +1000

    refine Matrix examples
---
 site/src/site/blog/whisky-revisited.adoc | 155 +++++++++++++++++++++++--------
 1 file changed, 115 insertions(+), 40 deletions(-)

diff --git a/site/src/site/blog/whisky-revisited.adoc b/site/src/site/blog/whisky-revisited.adoc
index 15964d6..9acb41f 100644
--- a/site/src/site/blog/whisky-revisited.adoc
+++ b/site/src/site/blog/whisky-revisited.adoc
@@ -8,7 +8,7 @@ Paul King
 ++++
 <table><tr><td style="padding: 0px; padding-left: 20px; padding-right: 20px; font-size: 18pt; line-height: 1.5; margin: 0px">
 ++++
-[blue]#_Let's take a first look at Underdog and Matrix, two new Groovy powered dataframe libraries.
+[blue]#_Let's take a first look at Underdog and Matrix, two new Groovy-powered dataframe libraries.
 We'll explore Whisky flavor profiles!_#
 ++++
 </td></tr></table>
@@ -21,25 +21,14 @@ In previous blog posts, we have looked at clustering whisky profiles using:
 
 The https://github.com/paulk-asert/groovy-data-science[groovy-data-science] repo also has examples of this case study using other technologies including:
 
-[cols="1,4"]
-|===
-| Data manipulation
-| Tablesaw, Datumbox, Apache Commons CSV, Tribuo
+* *Data manipulation*: Tablesaw, Datumbox, Apache Commons CSV, Tribuo
+* *Clustering*: Smile, Apache Commons Math, Datumbox, Weka, Encog, Elki, Tribuo
+* *Visualization*: XChart, Tablesaw Plot.ly, Smile visualization, JFreeChart
+* *Scaling clustering*: Apache Ignite, Apache Spark, Apache Wayang, Apache Flink, Apache Beam
 
-| Clustering
-| Smile, Apache Commons Math, Datumbox, Weka, Encog, Elki, Tribuo
-
-| Visualization
-| XChart, Tablesaw Plot.ly, Smile visualization, JFreeChart
-
-| Scaling clustering
-| Apache Ignite, Apache Spark, Apache Wayang, Apache Flink, Apache Beam
-|===
-
-Let's take a first look at two new Groovy powered dataframe libraries,
+Let's explore the same case study, but this time, taking a first look at two new Groovy-powered dataframe libraries:
 https://grooviter.github.io/underdog/[Underdog] and
-https://github.com/Alipsa/matrix[Matrix],
-to explore the same case study.
+https://github.com/Alipsa/matrix[Matrix].
 
 == The Case Study
 
@@ -59,12 +48,14 @@ https://grooviter.github.io/underdog/[Underdog].
 Let's use it to explore Whisky profiles.
 It has many Groovy-powered features delivering a very expressive developer experience.
 
+Underdog has the following modules: underdog-dataframe, underdog-graphs, underdog-plots, underdog-ml, and underdog-ta. We'll use all but the last of these.
+
 Underdog sits on top of some well-known data-science libraries in the JVM ecosystem
 like Smile, Tablesaw, and https://echarts.apache.org/[Apache ECharts].
 If you have used any of those libraries, you'll recognise parts of the functionality
 shining through.
 
-First, we'll load our CSV file into an Underdog dataframe:
+First, we'll load our CSV file into an Underdog dataframe (removing a column we don't need):
 
 [source,groovy]
 ----
@@ -102,7 +93,13 @@ It gives this output:
     12  |       Floral  |      INTEGER  |
 ----
 
-Let's look at a correlation matrix plot of the data:
+When data has many dimensions, understanding the relationship between the columns can be hard.
+We can look at a correlation matrix to help us understand whether there is any redundant data,
+e.g. are _Sweetness_ and _Honey_, or _Tobacco_ and _Smoky_, two measures of the same thing
+or different things.
+
+Underdog has a built-in plot for this, so let's
+gather the numeric features and plot the correlation matrix:
 
 [source,groovy]
 ----
@@ -115,6 +112,10 @@ Which has this output:
 
 image:img/underdogCorrelationPlot.png[correlation plot,50%]
 
+We can see that the different flavor measures are quite distinct.
+The highest correlations are between _Smoky_ and _Medicinal_, and _Smoky_ and _Body_.
+Some, like _Floral_ and _Medicinal_, are very unrelated.
+
 Let's now explore searching for whiskies of a particular flavor,
 in this case profiles that are somewhat _fruity_ and somewhat _sweet_ in flavor.
 
@@ -204,6 +205,17 @@ println df.agg([Distillery:'count'])
     .rename('Whisky Cluster Sizes')
 ----
 
+Which has this output:
+
+----
+       Whisky Cluster Sizes
+ Cluster  |  Count [Distillery]  |
+----------------------------------
+       0  |                  25  |
+       2  |                  44  |
+       1  |                  17  |
+----
+
 Or, we can easily print out the distilleries in each cluster:
 
 [source,groovy]
@@ -271,8 +283,15 @@ image:img/underdogClustersAgglomerative.png[scatter plot agglomerative,50%]
 The
 https://github.com/Alipsa/matrix/tree/main[Matrix]
 library makes it easy to work with a matrix of tabular data.
+The Matrix project consists of the following modules: matrix-core, matrix-stats, matrix-datasets, matrix-spreadsheet, matrix-csv, matrix-json, matrix-xchart, matrix-sql, matrix-parquet, matrix-bigquery, matrix-charts, and matrix-tablesaw.
+
+While new, Matrix does build upon common JVM data science libraries, like Tablesaw and Apache Commons Math.
+For certain functionality, like clustering and dimension reduction, Matrix works well with libraries like Smile.
 
-Let's read in our data and explore its size:
+For a first intro, we'll look at the
+matrix-core, matrix-stats, matrix-csv, and matrix-xchart modules.
+
+Let's read in our data, remove a column we don't need, and explore its size:
 
 [source,groovy]
 ----
@@ -288,24 +307,22 @@ This outputs:
 ----
 
 Currently, the data is all strings. Matrix provides a `convert` option for getting data
-into the right type including handling missing values. It also has powerful normalization
-functionality. We'll want to normalize our data because some of the algorithms and certainly
-the radar plot assume normalized data (values between 0 and 1).
+into the right type. It also has various normalization
+methods. We want our data as numbers, and some of the functionality we'll use, e.g.
+the radar plot, assumes our data is normalized (values between 0 and 1).
 
-But, here we'll show off the `apply` functionality which will convert and normalize all-in-one
-by hand:
+Rather than using `convert` or the normalization methods, here we'll show off the `apply`
+functionality which will achieve the same thing for our example:
 
 [source,groovy]
 ----
 def features = m.columnNames() - 'Distillery'
 def size = features.size()
-features.each { feature ->
-    m.apply(feature) { it.toDouble() / 4 }
-}
+features.each(feature -> m.apply(feature) { it.toDouble() / 4 })
 ----
 
-Now, like we did with Underdog, we want to perform a query to find the
-whiskies which are somewhat _fruity_ and somewhat _sweet_ in flavor:
+Now, like we did with Underdog, we want to perform a query to find and display
+the whiskies which are somewhat _fruity_ and somewhat _sweet_ in flavor:
 
 [source,groovy]
 ----
@@ -337,11 +354,15 @@ def rc = RadarChart.create(aberlour).addSeries('Distillery', transparency)
 new SwingWrapper(rc.exportSwing().chart).displayChart()
 ----
 
+NOTE: If `matrix-xchart` doesn't have the functionality you are after, consider
+looking at the `matrix-charts` library. They offer many similar charts but there
+are some differences too.
+
 The output looks like this:
 
 image:img/matrixAberlourRadar.png[aberlour profile,50%]
 
-Or, for all selected whiskies:
+The same chart also works to display all selected whiskies:
 
 [source,groovy]
 ----
@@ -353,7 +374,9 @@ Which looks like this:
 
 image:img/matrixWhiskySelectionsRadar.png[selected whisky profiles,50%]
 
-Let's now apply K-Means, placing the allocated clusters back into the matrix:
+Let's now cluster our whiskies. We'll use the K-Means functionality from
+https://haifengl.github.io/clustering.html[Smile].
+Let's apply K-Means, and place the allocated clusters back into the matrix:
 
 [source,groovy]
 ----
@@ -363,7 +386,60 @@ def model = KMeans.fit(data,3, iterations)
 m['Cluster'] = model.group().toList()
 ----
 
-We can also project onto two dimensions using PCA:
+We can examine the cluster allocation using groovy-ginq functionality, which works well with Matrix:
+
+[source,groovy]
+----
+def result = GQ {
+    from w in m
+    groupby w.Cluster
+    orderby w.Cluster
+    select w.Cluster, count(w.Cluster) as Count
+}
+println result
+----
+
+Which has this output:
+
+----
++---------+-------+
+| Cluster | Count |
++---------+-------+
+| 0       | 51    |
+| 1       | 23    |
+| 2       | 12    |
++---------+-------+
+----
+
+We can convert the ginq result back into a matrix like this:
+
+[source,groovy]
+----
+println Matrix.builder('Cluster allocation').ginqResult(result).build().content()
+----
+
+Which has this output:
+
+----
+Cluster allocation: 3 obs * 2 variables
+Cluster        Count
+      0           51
+      1           23
+      2           12
+----
+
+For the particular problem of checking cluster allocation, we can also
+use the normal Groovy extension methods:
+
+[source,groovy]
+----
+assert m.rows().countBy{ it.Cluster } == [0:51, 1:23, 2:12]
+----
+
+We can also project onto two dimensions using Principal Component Analysis (PCA).
+We'll again use the
+https://haifengl.github.io/feature.html#dimension-reduction[Smile] functionality for this.
+Let's project onto 2 dimensions and place the projected coordinates back into the matrix:
 
 [source,groovy]
 ----
@@ -373,9 +449,9 @@ m['X'] = projected*.getAt(0)
 m['Y'] = projected*.getAt(1)
 ----
 
-We've placed the projected coordinates back into the matrix.
-Let's now create a scatter plot with the distilleries for each cluster
-added in distinct series:
+Let's now create a scatter plot showing the distilleries mapped according
+to the projected coordinates. The most compact form of the `ScatterPlot#create`
+method assumes one series, but it's not hard to add each series ourselves:
 
 [source,groovy]
 ----
@@ -393,7 +469,7 @@ When run, we get the following output:
 
 image:img/matrixWhiskyScatterPlot.png[scatter plot,50%]
 
-Matrix doesn't have a correlation heatmap out of the box, but it does have heatmap plots,
+Matrix doesn't have a correlation heatmap plot out of the box, but it does have heatmap plots,
 and it does have correlation functionality.
 It's easy enough to roll our own:
 
@@ -411,8 +487,7 @@ def corrMatrix = Matrix.builder().data(X: 0..<corr.size(), Heat: corr)
 def hc = HeatmapChart.create(corrMatrix)
     .addSeries('Heat Series', features.reverse(), features,
         corrMatrix.column('Heat').collate(size))
-hc.exportPng('matrixWhiskyCorrHeatmap.png' as File)
-new SwingWrapper(hc.exportSwing().chart).displayChart()
+
 ----
 
 Which has this output:
