This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 1927aa2 update for lucene 10.1.0 plus add some colors
1927aa2 is described below
commit 1927aa2f21de6be6dc6c2e485ed37e769a782d73
Author: Paul King <[email protected]>
AuthorDate: Sun Dec 22 07:57:49 2024 +1000
update for lucene 10.1.0 plus add some colors
---
site/src/site/blog/groovy-lucene.adoc | 204 +++++++++++++++++-----------------
1 file changed, 101 insertions(+), 103 deletions(-)
diff --git a/site/src/site/blog/groovy-lucene.adoc b/site/src/site/blog/groovy-lucene.adoc
index 65cb60f..0b8077d 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -72,11 +72,26 @@ are wanting to follow along and run these examples:
----
String baseDir = '/projects/groovy-website/site/src/site/blog' // <1>
----
-<1> You'd need to check out the Groovy website and point to it here
+<1> You'd need to check out the Groovy website and point `baseDir` to it here
-Now our script will traverse all the files in that directory, processing them with our regex
-and track the hits we find.
+First, let's create a little helper method for printing a pretty
+graph of our results (we'll use the `colorize` method from https://github.com/dialex/JColor[JColor]):
+[source,groovy]
+----
+def display(Map<String, Integer> data, int max, int scale = 1) {
+    data.each { k, v ->
+        var label = "$k ($v)"
+        var color = k.startsWith('apache') ? MAGENTA_TEXT() : BLUE_TEXT()
+        println "${label.padRight(32)} ${colorize(bar(v * scale, 0, max, max), color)}"
+    }
+}
+----
+
+Now our script will traverse all the files in that directory,
+processing them with our regex and tracking the hits we find.
+
+// Matcher.groovy
[source,groovy]
----
var histogram = [:].withDefault { 0 } // <1>
@@ -92,10 +107,7 @@ new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file -> // <2>
}
println "\nFrequency of total hits mentioning a project:"
-histogram.sort { e -> -e.value }.each { k, v -> // <8>
- var label = "$k ($v)"
- println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
-}
+display(histogram.sort { e -> -e.value }, 50) // <8>
----
<1> This is a map which provides a default value for non-existent keys
<2> This traverses the directory processing each AsciiDoc file
@@ -124,7 +136,7 @@ groovy-2-5-clibuilder-renewal.adoc: [apache commons cli:2]
groovy-graph-databases.adoc: [apache age:11, apache hugegraph:3, apache tinkerpop:3]
groovy-haiku-processing.adoc: [eclipse collections:3]
groovy-list-processing-cheat-sheet.adoc: [eclipse collections:4, apache commons collections:3]
-groovy-lucene.adoc: [apache nutch:1, apache solr:1, apache lucene:2, apache commons:1, apache commons math:2]
+groovy-lucene.adoc: [apache nutch:1, apache solr:1, apache lucene:3, apache commons:4, apache commons math:2, apache spark:1]
groovy-null-processing.adoc: [eclipse collections:6, apache commons collections:4]
groovy-pekko-gpars.adoc: [apache pekko:4]
groovy-record-performance.adoc: [apache commons codec:1]
@@ -141,33 +153,33 @@ wordle-checker.adoc: [eclipse collections:3]
zipping-collections-with-groovy.adoc: [eclipse collections:4]
Frequency of total hits mentioning a project:
-eclipse collections (50) ██████████████████████████████████████████████████▏
-apache commons math (18) ██████████████████▏
-apache ignite (17) █████████████████▏
-apache spark (13) █████████████▏
-apache mxnet (12) ████████████▏
-apache wayang (11) ███████████▏
-apache age (11) ███████████▏
-eclipse deeplearning4j (8) ████████▏
-apache commons collections (7) ███████▏
-apache commons csv (6) ██████▏
-apache nlpcraft (5) █████▏
-apache pekko (4) ████▏
-apache hugegraph (3) ███▏
-apache tinkerpop (3) ███▏
-apache flink (2) ██▏
-apache commons cli (2) ██▏
-apache lucene (2) ██▏
-apache commons (2) ██▏
-apache opennlp (2) ██▏
-apache ofbiz (1) █▏
-apache beam (1) █▏
-apache commons numbers (1) █▏
-apache nutch (1) █▏
-apache solr (1) █▏
-apache commons codec (1) █▏
-apache commons io (1) █▏
-apache kie (1) █▏
+eclipse collections (50) <span style="color:blue">██████████████████████████████████████████████████</span>▏
+apache commons math (18) <span style="color:purple">██████████████████</span>▏
+apache ignite (17) <span style="color:purple">█████████████████</span>▏
+apache spark (14) <span style="color:purple">██████████████</span>▏
+apache mxnet (12) <span style="color:purple">████████████</span>▏
+apache wayang (11) <span style="color:purple">███████████</span>▏
+apache age (11) <span style="color:purple">███████████</span>▏
+eclipse deeplearning4j (8) <span style="color:blue">████████</span>▏
+apache commons collections (7) <span style="color:purple">███████</span>▏
+apache commons csv (6) <span style="color:purple">██████</span>▏
+apache nlpcraft (5) <span style="color:purple">█████</span>▏
+apache pekko (4) <span style="color:purple">████</span>▏
+apache hugegraph (3) <span style="color:purple">███</span>▏
+apache tinkerpop (3) <span style="color:purple">███</span>▏
+apache lucene (3) <span style="color:purple">███</span>▏
+apache flink (2) <span style="color:purple">██</span>▏
+apache commons cli (2) <span style="color:purple">██</span>▏
+apache commons (2) <span style="color:purple">██</span>▏
+apache opennlp (2) <span style="color:purple">██</span>▏
+apache ofbiz (1) <span style="color:purple">█</span>▏
+apache beam (1) <span style="color:purple">█</span>▏
+apache commons numbers (1) <span style="color:purple">█</span>▏
+apache nutch (1) <span style="color:purple">█</span>▏
+apache solr (1) <span style="color:purple">█</span>▏
+apache commons codec (1) <span style="color:purple">█</span>▏
+apache commons io (1) <span style="color:purple">█</span>▏
+apache kie (1) <span style="color:purple">█</span>▏
</pre>
++++
@@ -205,6 +217,7 @@ class ProjectNameAnalyzer extends Analyzer {
Let's now tokenize our documents and let Lucene index them.
+// LuceneWithRegexAnalyzer.groovy
[source,groovy]
----
var analyzer = new ProjectNameAnalyzer() // <1>
@@ -268,19 +281,13 @@ println "\nFrequency of total hits mentioning a project (top 10):"
var termFreq = terms.collectEntries { term ->
[term.text(), reader.totalTermFreq(term)] // <3>
}
-termFreq.sort(byReverseValue).take(10).each { k, v ->
- var label = "$k ($v)"
- println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
-}
+display(termFreq.sort(byReverseValue).take(10), 50)
println "\nFrequency of documents mentioning a project (top 10):"
var docFreq = terms.collectEntries { term ->
[term.text(), reader.docFreq(term)] // <4>
}
-docFreq.sort(byReverseValue).take(10).each { k, v ->
- var label = "$k ($v)"
- println "${label.padRight(32)} ${bar(v * 2, 0, 20, 20)}"
-}
+display(docFreq.sort(byReverseValue).take(10), 20, 2)
----
<1> Get all index terms
<2> Look for terms which match project names, so we can save them to a set
@@ -305,7 +312,7 @@ groovy-2-5-clibuilder-renewal.adoc: [apache commons cli:2]
groovy-graph-databases.adoc: [apache age:11, apache hugegraph:3, apache tinkerpop:3]
groovy-haiku-processing.adoc: [eclipse collections:3]
groovy-list-processing-cheat-sheet.adoc: [apache commons collections:3, eclipse collections:4]
-groovy-lucene.adoc: [apache commons:1, apache commons math:2, apache lucene:2, apache nutch:1, apache solr:1]
+groovy-lucene.adoc: [apache commons:4, apache commons math:2, apache lucene:3, apache nutch:1, apache solr:1, apache spark:1]
groovy-null-processing.adoc: [apache commons collections:4, eclipse collections:6]
groovy-pekko-gpars.adoc: [apache pekko:4]
groovy-record-performance.adoc: [apache commons codec:1]
@@ -322,28 +329,28 @@ wordle-checker.adoc: [eclipse collections:3]
zipping-collections-with-groovy.adoc: [eclipse collections:4]
Frequency of total hits mentioning a project (top 10):
-eclipse collections (50) ██████████████████████████████████████████████████▏
-apache commons math (17) █████████████████▏
-apache ignite (17) █████████████████▏
-apache spark (13) █████████████▏
-apache mxnet (12) ████████████▏
-apache wayang (11) ███████████▏
-apache age (11) ███████████▏
-eclipse deeplearning4j (8) ████████▏
-apache commons collections (7) ███████▏
-apache commons csv (6) ██████▏
+eclipse collections (50) <span style="color:blue">██████████████████████████████████████████████████</span>▏
+apache commons math (17) <span style="color:purple">█████████████████</span>▏
+apache ignite (17) <span style="color:purple">█████████████████</span>▏
+apache spark (14) <span style="color:purple">██████████████</span>▏
+apache mxnet (12) <span style="color:purple">████████████</span>▏
+apache wayang (11) <span style="color:purple">███████████</span>▏
+apache age (11) <span style="color:purple">███████████</span>▏
+eclipse deeplearning4j (8) <span style="color:blue">████████</span>▏
+apache commons collections (7) <span style="color:purple">███████</span>▏
+apache commons csv (6) <span style="color:purple">██████</span>▏
Frequency of documents mentioning a project (top 10):
-eclipse collections (10) ████████████████████▏
-apache commons math (7) ██████████████▏
-apache spark (5) ██████████▏
-apache ignite (4) ████████▏
-apache commons csv (4) ████████▏
-eclipse deeplearning4j (3) ██████▏
-apache wayang (3) ██████▏
-apache flink (2) ████▏
-apache commons collections (2) ████▏
-apache commons (2) ████▏
+eclipse collections (10) <span style="color:blue">████████████████████</span>▏
+apache commons math (7) <span style="color:purple">██████████████</span>▏
+apache spark (6) <span style="color:purple">██████████</span>▏
+apache ignite (4) <span style="color:purple">████████</span>▏
+apache commons csv (4) <span style="color:purple">████████</span>▏
+eclipse deeplearning4j (3) <span style="color:blue">██████</span>▏
+apache wayang (3) <span style="color:purple">██████</span>▏
+apache flink (2) <span style="color:purple">████</span>▏
+apache commons collections (2) <span style="color:purple">████</span>▏
+apache commons (2) <span style="color:purple">████</span>▏
</pre>
++++
@@ -396,6 +403,7 @@ pick out the terms of interest, project names that match our query.
For the highlight functionality to work, we ask the indexer to store some additional information
when indexing, in particular term positions and offsets. The index code changes to look like this:
+// Lucene.groovy
[source,groovy]
----
new IndexWriter(indexDir, config).withCloseable { writer ->
@@ -451,10 +459,7 @@ results.scoreDocs.each { ScoreDoc scoreDoc -> // <3>
}
println "\nFrequency of total hits mentioning a project (top 10):"
-histogram.sort { e -> -e.value }.take(10).each { k, v -> // <6>
- var label = "$k ($v)"
- println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
-}
+display(histogram.sort { e -> -e.value }.take(10), 50) // <6>
----
<1> Search for terms with the apache or eclipse prefixes
<2> Perform our query with a limit of 30 results
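As an aside, the prefix search described by <1> and <2> isn't visible in this hunk. A minimal sketch of one way such a query could be expressed is shown below; the `content` field name and the freshly opened `searcher` are assumptions, not taken from the blog:
[source,groovy]
----
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.index.Term
import org.apache.lucene.search.BooleanClause
import org.apache.lucene.search.BooleanQuery
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.search.PrefixQuery

// reuse the index written earlier; 'content' is a hypothetical field name
var searcher = new IndexSearcher(DirectoryReader.open(indexDir))

// terms with the apache or eclipse prefixes
var query = new BooleanQuery.Builder()
    .add(new PrefixQuery(new Term('content', 'apache')), BooleanClause.Occur.SHOULD)
    .add(new PrefixQuery(new Term('content', 'eclipse')), BooleanClause.Occur.SHOULD)
    .build()

// perform the query with a limit of 30 results
var results = searcher.search(query, 30)
----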
@@ -485,7 +490,7 @@ fun-with-obfuscated-groovy.adoc: [apache commons math:1]
groovy-2-5-clibuilder-renewal.adoc: [apache commons cli:2]
groovy-graph-databases.adoc: [apache age:11, apache hugegraph:3, apache tinkerpop:3]
groovy-haiku-processing.adoc: [eclipse collections:3]
-groovy-lucene.adoc: [apache nutch:1, apache solr:1, apache lucene:2, apache commons:1, apache commons math:2]
+groovy-lucene.adoc: [apache nutch:1, apache solr:1, apache lucene:3, apache commons:4, apache commons math:2, apache spark:1]
groovy-pekko-gpars.adoc: [apache pekko:4]
groovy-record-performance.adoc: [apache commons codec:1]
handling-byte-order-mark-characters.adoc: [apache commons io:1]
@@ -500,16 +505,16 @@ wordle-checker.adoc: [eclipse collections:3]
zipping-collections-with-groovy.adoc: [eclipse collections:4]
Frequency of total hits mentioning a project (top 10):
-eclipse collections (50) ██████████████████████████████████████████████████▏
-apache commons math (18) ██████████████████▏
-apache ignite (17) █████████████████▏
-apache spark (13) █████████████▏
-apache mxnet (12) ████████████▏
-apache wayang (11) ███████████▏
-apache age (11) ███████████▏
-eclipse deeplearning4j (8) ████████▏
-apache commons collections (7) ███████▏
-apache commons csv (6) ██████▏
+eclipse collections (50) <span style="color:blue">██████████████████████████████████████████████████</span>▏
+apache commons math (18) <span style="color:purple">██████████████████</span>▏
+apache ignite (17) <span style="color:purple">█████████████████</span>▏
+apache spark (14) <span style="color:purple">█████████████</span>▏
+apache mxnet (12) <span style="color:purple">████████████</span>▏
+apache wayang (11) <span style="color:purple">███████████</span>▏
+apache age (11) <span style="color:purple">███████████</span>▏
+eclipse deeplearning4j (8) <span style="color:blue">████████</span>▏
+apache commons collections (7) <span style="color:purple">███████</span>▏
+apache commons csv (6) <span style="color:purple">██████</span>▏
</pre>
++++
@@ -563,6 +568,7 @@ We'll use our regex to find project names and store the information in our facet
Lucene creates a special _taxonomy_ index for indexing facet information.
We'll also enable that.
+// LuceneFacets.groovy
[source,groovy]
----
var analyzer = new ProjectNameAnalyzer()
@@ -626,7 +632,7 @@ groovy-2-5-clibuilder-renewal.adoc: [apache commons cli:2]
groovy-graph-databases.adoc: [apache age:11, apache hugegraph:3, apache tinkerpop:3]
groovy-haiku-processing.adoc: [eclipse collections:3]
groovy-list-processing-cheat-sheet.adoc: [eclipse collections:4, apache commons collections:3]
-groovy-lucene.adoc: [apache nutch:1, apache solr:1, apache lucene:2, apache commons:1, apache commons math:2]
+groovy-lucene.adoc: [apache nutch:1, apache solr:1, apache lucene:3, apache commons:4, apache commons math:2, apache spark:1]
groovy-null-processing.adoc: [eclipse collections:6, apache commons collections:4]
groovy-pekko-gpars.adoc: [apache pekko:4]
groovy-record-performance.adoc: [apache commons codec:1]
@@ -665,16 +671,10 @@ var projects = new TaxonomyFacetIntAssociations('$projectHitCounts', taxonReader
var hitData = projects.getTopChildren(topN, 'projectHitCounts').labelValues
println "\nFrequency of total hits mentioning a project (top $topN):"
-hitData.each { m ->
- var label = "$m.label ($m.value)"
- println "${label.padRight(32)} ${bar(m.value, 0, 50, 50)}"
-}
+display(hitData.collectEntries { lv -> [lv.label, lv.value] }, 50)
println "\nFrequency of documents mentioning a project (top $topN):"
-hitData.each { m ->
- var label = "$m.label ($m.count)"
- println "${label.padRight(32)} ${bar(m.count * 2, 0, 20, 20)}"
-}
+display(hitData.collectEntries { lv -> [lv.label, lv.count] }, 20, 2)
----
When running this we can see the frequencies for the total hits and number of files:
@@ -683,25 +683,22 @@ When running this we can see the frequencies for the total hits and number of fi
++++
<pre>
Frequency of total hits mentioning a project (top 5):
-eclipse collections (50) ██████████████████████████████████████████████████▏
-apache commons math (18) ██████████████████▏
-apache ignite (17) █████████████████▏
-apache spark (13) █████████████▏
-apache mxnet (12) ████████████▏
+eclipse collections (50) <span style="color:blue">██████████████████████████████████████████████████</span>▏
+apache commons math (18) <span style="color:purple">██████████████████</span>▏
+apache ignite (17) <span style="color:purple">█████████████████</span>▏
+apache spark (14) <span style="color:purple">██████████████</span>▏
+apache mxnet (12) <span style="color:purple">████████████</span>▏
Frequency of documents mentioning a project (top 5):
-eclipse collections (10) ████████████████████▏
-apache commons math (7) ██████████████▏
-apache spark (5) ██████████▏
-apache ignite (4) ████████▏
-apache mxnet (1) ██▏
+eclipse collections (10) <span style="color:blue">████████████████████</span>▏
+apache commons math (7) <span style="color:purple">██████████████</span>▏
+apache ignite (4) <span style="color:purple">████████</span>▏
+apache spark (6) <span style="color:purple">████████████</span>▏
+apache mxnet (1) <span style="color:purple">██</span>▏
</pre>
++++
-NOTE: At the time of writing, there is a bug in sorting for the second of these graphs.
-A https://github.com/apache/lucene/issues/14008[fix] is coming.
-
Now, the taxonomy information about document frequency is for the top hits scored using the number of hits.
One of our other facets (`projectFileCounts`) tracks document frequency independently.
Let's look at how we can query that information:
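The query for that facet isn't part of this diff. A rough sketch of how it might look is shown below, reusing `taxonReader` and `topN` from the snippet above; the `fc` facets collector, the `facetsConfig`, and the `'$projectFileCounts'` index field name are assumed rather than taken from the blog:
[source,groovy]
----
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts

// a plain counting facet over the (assumed) '$projectFileCounts' field
// yields per-project document frequencies directly
var fileCounts = new FastTaxonomyFacetCounts('$projectFileCounts', taxonReader, facetsConfig, fc)
println "\nFrequency of documents mentioning a project (top $topN):"
println fileCounts.getTopChildren(topN, 'projectFileCounts')
----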
@@ -724,7 +721,7 @@ Frequency of documents mentioning a project (top 5):
dim=projectFileCounts path=[] value=-1 childCount=27
eclipse collections (10)
apache commons math (7)
- apache spark (5)
+ apache spark (6)
apache ignite (4)
apache commons csv (4)
@@ -764,7 +761,7 @@ dim=projectNameCounts path=[] value=-1 childCount=2
Frequency of documents mentioning a project with path [apache] (top 5):
dim=projectNameCounts path=[apache] value=-1 childCount=18
commons (16)
- spark (5)
+ spark (6)
ignite (4)
wayang (3)
flink (2)
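The drill-down behind this output is likewise not shown in the diff; under the same assumptions as the previous sketch, it might look roughly like this:
[source,groovy]
----
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts

// restrict the top children to the [apache] branch of the (assumed)
// hierarchical '$projectNameCounts' facet field
var nameCounts = new FastTaxonomyFacetCounts('$projectNameCounts', taxonReader, facetsConfig, fc)
println "\nFrequency of documents mentioning a project with path [apache] (top $topN):"
println nameCounts.getTopChildren(topN, 'projectNameCounts', 'apache')
----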
@@ -805,6 +802,7 @@ Let's have a look at what the code for that scenario could look like.
First, we'll do indexing with the `StandardAnalyzer`.
+// LuceneWithStandardAnalyzer.groovy
[source,groovy]
----
var analyzer = new StandardAnalyzer()