This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new cb17eb7 ready for publishing
cb17eb7 is described below
commit cb17eb76792f9ba320f05964d6fea0ea3894c9ea
Author: Paul King <[email protected]>
AuthorDate: Mon Nov 25 07:23:00 2024 +1000
ready for publishing
---
site/src/site/blog/groovy-lucene.adoc | 50 ++++++++++++++++++++---------------
1 file changed, 29 insertions(+), 21 deletions(-)
diff --git a/site/src/site/blog/groovy-lucene.adoc b/site/src/site/blog/groovy-lucene.adoc
index 0fe78da..0dba500 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -1,7 +1,6 @@
= Searching with Lucene
Paul King
:revdate: 2024-11-18T20:30:00+00:00
-:draft: true
:keywords: aggregation, search, lucene, groovy
:description: This post looks at using Lucene to find references to other projects in Groovy's blog posts.
@@ -383,7 +382,7 @@ When exploring query results, we are going to use some classes in the `vectorhighlight`
package in the `lucene-highlight` module. You'd typically use functionality in that
module to highlight hits as part of potentially displaying them on a web page
as part of some web search functionality. For us, we are going to just
-pick out the terms of interest, project names that matching our query.
+pick out the terms of interest, project names that match our query.
For the highlight functionality to work, we ask the indexer to store some additional information
when indexing, in particular term positions and offsets. The index code changes to look like this:
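(The changed indexing code itself is elided from this hunk. A minimal sketch of what enabling term positions and offsets on a Lucene field typically looks like, assuming standard Lucene APIs and not necessarily the post's exact code:)

[source,groovy]
----
import org.apache.lucene.document.FieldType
import org.apache.lucene.document.TextField
import org.apache.lucene.index.IndexOptions

// start from a stored text field and additionally record positions and offsets,
// which vector highlighting needs to locate matched terms
var offsetsType = new FieldType(TextField.TYPE_STORED)
offsetsType.indexOptions = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
offsetsType.freeze()    // make the field type immutable before use
----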
@@ -442,8 +441,8 @@ results.scoreDocs.each { ScoreDoc scoreDoc -> // <3>
found.each { histogram[it.replaceAll('\n', ' ')] += 1 } // <5>
}
-println "\nFrequency of total hits mentioning a project:"
-histogram.sort { e -> -e.value }.each { k, v -> // <6>
+println "\nFrequency of total hits mentioning a project (top 10):"
+histogram.sort { e -> -e.value }.take(10).each { k, v -> // <6>
var label = "$k ($v)"
println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
}
@@ -453,7 +452,7 @@ histogram.sort { e -> -e.value }.each { k, v -> // <6>
<3> Process each result
<4> Pull out the actual matched terms
<5> Also aggregate the counts
-<6> Display the aggregates as a pretty barchart
+<6> Display the top 10 aggregates as a pretty barchart
The output is essentially the same as before:
@@ -508,11 +507,26 @@ apache commons csv (6) ██████▏
== Using Lucene Facets
As well as the metadata Lucene stores for its own purposes in the index,
-it provides a mechanism for storing custom metadata called facets. If we wanted to we could
-store referenced project names using this mechanism.
+Lucene provides a mechanism, called facets, for storing custom metadata.
+Facets allow for more powerful searching. They are often used for grouping
+search results into categories. The search user can drill down into
+categories to refine their search.
+
+Let's use facets to store project names for each document.
+One facet capturing the project name information might be all we need,
+but to illustrate some Lucene features, we'll use three facets and
+store slightly different information in each one.
+
+NOTE: Facets are a really powerful feature. Given that we are indexing asciidoc source
+files, we could even use libraries like https://github.com/asciidoctor/asciidoctorj[AsciidoctorJ]
+to extract more metadata from our source files and store them as facets.
+We could for instance extract titles, author(s), keywords, publication dates and so forth.
+This would allow us to make some pretty powerful searches.
+We leave this as an exercise for the reader.
+But if you try, please let us know how you go!
-Let's use our regex to find project names and store the information in various facets.
-Lucene has a special taxonomy index which stores metadata about our metadata.
+We'll use our regex to find project names and store the information in our facets.
+Lucene creates a special _taxonomy_ index for indexing facet information.
We'll also enable that.
[source,groovy]
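(The post's faceting code is elided from this diff. For orientation, a hypothetical sketch of adding facet fields to a document with a taxonomy writer, using standard Lucene facet APIs rather than the post's exact code, might look like:)

[source,groovy]
----
import org.apache.lucene.document.Document
import org.apache.lucene.facet.FacetField
import org.apache.lucene.facet.FacetsConfig
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter
import org.apache.lucene.store.ByteBuffersDirectory

var config = new FacetsConfig()
// allow multi-segment paths like Apache/Commons/CSV (hypothetical facet name)
config.setHierarchical('projectNameCounts', true)

var taxonWriter = new DirectoryTaxonomyWriter(new ByteBuffersDirectory())
var doc = new Document()
doc.add(new FacetField('projectNameCounts', 'Apache', 'Commons', 'CSV'))
// config.build resolves facet paths against the taxonomy index before adding:
// indexWriter.addDocument(config.build(taxonWriter, doc))
----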
@@ -561,7 +575,7 @@ taxonWriter.close()
<3> Define some properties for the facets we are interested in
<4> We add our facets of interest to our document
-Since we are collecting this data during indexing, we can print it out:
+Since we are collecting our project names during indexing, we can print them out:
++++
<pre>
@@ -664,7 +678,9 @@ println "\nFrequency of documents mentioning a project (top $topN):"
println facets.getTopChildren(topN, 'projectFileCounts')
----
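(For context, a `facets` object like the one queried above is typically obtained at search time by collecting hits into a `FacetsCollector` and counting them against the taxonomy. A hedged sketch, assuming standard Lucene facet APIs and hypothetical `searcher`, `taxonReader`, and `config` variables from earlier code:)

[source,groovy]
----
import org.apache.lucene.facet.FacetsCollector
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts
import org.apache.lucene.search.MatchAllDocsQuery

var fc = new FacetsCollector()
// run the query and gather matching docs for facet counting
FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc)
var facets = new FastTaxonomyFacetCounts(taxonReader, config, fc)
println facets.getTopChildren(10, 'projectFileCounts')
----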
-The output looks like this:
+We could display our search result (a `FacetResult` instance) as a barchart
+like we've done before, but the `toString` for the result is also quite informative.
+Here is what running the above code looks like:
++++
<pre>
@@ -687,8 +703,8 @@ the `projectFileCounts` facet if you didn't need that extra information.
Our final facet (`projectNameCounts`) is a hierarchical facet. These are typically used interactively
when "browsing" search results. We can look at project names by first word, e.g. the foundation.
-We could then drill down into "Apache" and find referenced projects, and then in the
-case of commons, we could drill down into its subprojects.
+We could then drill down into one of the foundations, e.g. "Apache", and find referenced projects,
+and then in the case of commons, we could drill down into its subprojects.
Here is the code which does that.
[source,groovy]
@@ -744,14 +760,6 @@ assert results.totalHits.value() == 1 &&
This query shows that there is exactly one blog post that mentions
Apache projects, Eclipse projects, and also emojis.
-Facets are a really powerful feature. Given that we are indexing asciidoc source
-files, we could even use libraries like https://github.com/asciidoctor/asciidoctorj[AsciidoctorJ]
-to extract more metadata from our source files and store them as facets.
-We could for instance extra titles, author(s), keywords, publication dates and so forth.
-This would allow us to make some pretty powerful searches.
-We leave this as an exercise for the reader.
-But if you try, please let us know how you go!
-
== More complex queries
As a final example, we chose earlier to extract project names at index time.