This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new cb17eb7  ready for publishing
cb17eb7 is described below

commit cb17eb76792f9ba320f05964d6fea0ea3894c9ea
Author: Paul King <[email protected]>
AuthorDate: Mon Nov 25 07:23:00 2024 +1000

    ready for publishing
---
 site/src/site/blog/groovy-lucene.adoc | 50 ++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/site/src/site/blog/groovy-lucene.adoc b/site/src/site/blog/groovy-lucene.adoc
index 0fe78da..0dba500 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -1,7 +1,6 @@
 = Searching with Lucene
 Paul King
 :revdate: 2024-11-18T20:30:00+00:00
-:draft: true
 :keywords: aggregation, search, lucene, groovy
 :description: This post looks at using Lucene to find references to other projects in Groovy's blog posts.
 
@@ -383,7 +382,7 @@ When exploring query results, we are going to use some classes in the `vectorhig
 package in the `lucene-highlight` module. You'd typically use functionality in that
 module to highlight hits as part of potentially displaying them on a web page
 as part of some web search functionality. For us, we are going to just
-pick out the terms of interest, project names that matching our query.
+pick out the terms of interest, project names that match our query.
 
 For the highlight functionality to work, we ask the indexer to store some additional information
 when indexing, in particular term positions and offsets. The index code changes to look like this:
@@ -442,8 +441,8 @@ results.scoreDocs.each { ScoreDoc scoreDoc -> // <3>
     found.each { histogram[it.replaceAll('\n', ' ')] += 1 } // <5>
 }
 
-println "\nFrequency of total hits mentioning a project:"
-histogram.sort { e -> -e.value }.each { k, v -> // <6>
+println "\nFrequency of total hits mentioning a project (top 10):"
+histogram.sort { e -> -e.value }.take(10).each { k, v -> // <6>
     var label = "$k ($v)"
     println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
 }
@@ -453,7 +452,7 @@ histogram.sort { e -> -e.value }.each { k, v -> // <6>
 <3> Process each result
 <4> Pull out the actual matched terms
 <5> Also aggregate the counts
-<6> Display the aggregates as a pretty barchart
+<6> Display the top 10 aggregates as a pretty barchart
 
 The output is essentially the same as before:
 
@@ -508,11 +507,26 @@ apache&nbsp;commons csv (6)           ██████▏
 == Using Lucene Facets
 
 As well as the metadata Lucene stores for its own purposes in the index,
-it provides a mechanism for storing custom metadata called facets. If we wanted to we could
-store referenced project names using this mechanism.
+Lucene provides a mechanism, called facets, for storing custom metadata.
+Facets allow for more powerful searching. They are often used for grouping
+search results into categories. The search user can drill down into
+categories to refine their search.
+
+Let's use facets to store project names for each document.
+One facet capturing the project name information might be all we need,
+but to illustrate some Lucene features, we'll use three facets and
+store slightly different information in each one.
+
+NOTE: Facets are a really powerful feature. Given that we are indexing asciidoc source
+files, we could even use libraries like https://github.com/asciidoctor/asciidoctorj[AsciidoctorJ]
+to extract more metadata from our source files and store them as facets.
+We could for instance extra titles, author(s), keywords, publication dates and so forth.
+This would allow us to make some pretty powerful searches.
+We leave this as an exercise for the reader.
+But if you try, please let us know how you go!
 
-Let's use our regex to find project names and store the information in various facets.
-Lucene has a special taxonomy index which stores metadata about our metadata.
+We'll use our regex to find project names and store the information in our facets.
+Lucene creates a special _taxonomy_ index for indexing facet information.
 We'll also enable that.
 
 [source,groovy]
@@ -561,7 +575,7 @@ taxonWriter.close()
 <3> Define some properties for the facets we are interested in
 <4> We add our facets of interest to our document
 
-Since we are collecting this data during indexing, we can print it out:
+Since we are collecting our project names during indexing, we can print then out:
 
 ++++
 <pre>
@@ -664,7 +678,9 @@ println "\nFrequency of documents mentioning a project (top $topN):"
 println facets.getTopChildren(topN, 'projectFileCounts')
 ----
 
-The output looks like this:
+We could display our search result (a `FacetResult` instance) as a barchart
+like we've done before, but the `toString` for the result is also quite informative.
+Here is what running the above code looks like:
 
 ++++
 <pre>
@@ -687,8 +703,8 @@ the `projectFileCounts` facet if you didn't need that extra information.
 
 Our final facet (`projectNameCounts`) is a hierarchical facet. These are typically used interactively
 when "browsing" search results. We can look at project names by first word, e.g. the foundation.
-We could then drill down into "Apache" and find referenced projects, and then in the
-case of commons, we could drill down into its subprojects.
+We could then drill down into one of the foundations, e.g. "Apache", and find referenced projects,
+and then in the case of commons, we could drill down into its subprojects.
 Here is the code which does that.
 
 [source,groovy]
@@ -744,14 +760,6 @@ assert results.totalHits.value() == 1 &&
 This query shows that there is exactly one blog post that mentions
 Apache projects, Eclipse projects, and also emojis.
 
-Facets are a really powerful feature. Given that we are indexing asciidoc source
-files, we could even use libraries like https://github.com/asciidoctor/asciidoctorj[AsciidoctorJ]
-to extract more metadata from our source files and store them as facets.
-We could for instance extra titles, author(s), keywords, publication dates and so forth.
-This would allow us to make some pretty powerful searches.
-We leave this as an exercise for the reader.
-But if you try, please let us know how you go!
-
 == More complex queries
 
 As a final example, we chose earlier to extract project names at index time.
