(jackrabbit-oak) branch trunk updated: OAK-10640: fix external link on lucene multi-word synonyms (#1298)

fortino Thu, 08 Feb 2024 00:40:53 -0800

This is an automated email from the ASF dual-hosted git repository.

fortino pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/jackrabbit-oak.git



The following commit(s) were added to refs/heads/trunk by this push:
     new bbee74e96f OAK-10640: fix external link on lucene multi-word synonyms (#1298)
bbee74e96f is described below

commit bbee74e96f8cbe65b08e592edac6de8a2cc89bff
Author: Fabrizio Fortino <[email protected]>
AuthorDate: Thu Feb 8 09:40:44 2024 +0100

    OAK-10640: fix external link on lucene multi-word synonyms (#1298)
    
    * OAK-10640: fix external link on lucene multi-word synonyms
    
    * OAK-10640: fix clear typos + other minor improvements
---
 oak-doc/src/site/markdown/query/lucene.md | 66 +++++++++++++++----------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/oak-doc/src/site/markdown/query/lucene.md b/oak-doc/src/site/markdown/query/lucene.md
index 8e41b314a8..6cff4b1bdc 100644
--- a/oak-doc/src/site/markdown/query/lucene.md
+++ b/oak-doc/src/site/markdown/query/lucene.md
@@ -344,13 +344,13 @@ in the index definition:
 Please note that typically, those settings don't need to be explicitly set.
 Cost per execution is the overhead of one query.
 Cost per entry is the cost per node in the index.
-Using 0.5 means the cost is half, which means the index would be used used 
more often
+Using 0.5 means the cost is half, which means the index would be used more 
often
 (that is, even if there is a different index with similar cost).
 
 ##### <a name="indexing-rule-inheritence"></a>Indexing Rule inheritance
 
 `indexRules` are defined per nodeType and support nodeType inheritance. For
-example while indexing any node the indexer would lookup for applicable
+example while indexing any node the indexer would look up for applicable
 indexRule for that node based on its _primaryType_. If a direct match is
 found then that rule would be used otherwise it would look for rule for any
 of the parent types. The rules are looked up in the order of there entry
@@ -362,7 +362,7 @@ applicable if exact match is found
 
 ##### <a name="property-definitions"></a>Property Definitions
 
-Each index rule consist of one ore more property definition defined under
+Each index rule consist of one or more property definition defined under
 `properties`. Order of property definition node is important as some properties
 are based on regular expressions. Below is the canonical property definition
 structure
@@ -435,7 +435,7 @@ nodeScopeIndex
     * /jcr:root/content//element(*, app:Asset)[jcr:contains(., 'image')]_
 
   In case of aggregation all properties would be indexed at node level by 
default
-  if the property type is part of `includePropertyTypes`. However if there is 
an
+  if the property type is part of `includePropertyTypes`. However, if there is 
an
   explicit property definition provided then it would only be included if
   `nodeScopeIndex` is set to true.
 
@@ -574,7 +574,7 @@ Consider a query like
 
     select * from [app:Asset] as a where isdescendantnode(a, 
[/content/app/old]) AND contains(*, 'white')
 
-By default the index would return all node which _contain white_ and Query
+By default, the index would return all node which _contain white_ and Query
 engine would filter out nodes which are not under _/content/app/old_. This
 can perform slow if lots of nodes are not under that path. To speed up such
 queries one can enable `evaluatePathRestrictions` in Lucene index and index
@@ -608,7 +608,7 @@ e.g. transient system data.
 If the application stores logs under `/var/log`, and this data is
 not supposed to be indexed, then it can be excluded, by setting
 `excludedPaths` to `["/var/log"]`.
-However it is typically better to set `includedPaths` and `queryPaths`.
+However, it is typically better to set `includedPaths` and `queryPaths`.
 
 <a name="query-paths"></a>
 **queryPaths**
@@ -676,7 +676,7 @@ includes the content of the jcr:content node:
           + include0
             - path = "jcr:content"
 
-By default all properties whose type matches `includePropertyTypes` and are
+By default, all properties whose type matches `includePropertyTypes` and are
 part of child nodes as per the aggregation pattern are included for indexing.
 For excluding certain properties define a property definition with relative
 path and set `excludeFromAggregation` to `true`. Such properties would then be
@@ -913,7 +913,7 @@ Points to note
    i.e. just use `domain => range`.
 4. Precedence: Specifying analyzer class directly has precedence over analyzer 
configuration
    by composition. If you want to configure analyzers by composition then 
analyzer class
-   MUST NOT not be specified. In-build analyzer has least precedence and comes 
into play only
+   MUST NOT be specified. In-build analyzer has least precedence and comes 
into play only
    if no custom analyzer has been configured. Similarly, setting 
`indexOriginalTerm` on
    analyzers node to modify behavior of in-built analyzer also works only when 
no custom
    analyzer has been configured.
@@ -932,14 +932,14 @@ Points to note
     analyzer for indexing as well as querying, using one-way synonyms in
     any practical way is not supported at the moment.
     * special characters have to be escaped
-    * multi word synonyms need particular attention (see 
https://lucidworks.com/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter)
+    * multi word synonyms need particular attention (see 
https://lucidworks.com/post/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/)
 
 Note that currently only one analyzer can be configured per index. Its not 
possible to specify separate
 analyzer for query and index time currently.
 
 #### <a name="codec"></a>Codec
 
-Name of [Lucene Codec][lucene-codec] to use. By default if the index involves
+Name of [Lucene Codec][lucene-codec] to use. By default, if the index involves
 fulltext indexing then Oak Lucene uses `OakCodec` which disables compression.
 Due to this the index size may grow large. To enable compression you can set
 the codec to `Lucene46`
@@ -973,7 +973,7 @@ then search result would those node coming earlier where 
searched term is found
 in title field
 
 For that to work ensure that for each such property (which need to be 
preferred)
-both `nodeScopeIndex` and `analyzed` are set to true. In addition you can 
specify
+both `nodeScopeIndex` and `analyzed` are set to true. In addition, you can 
specify
 `boost` property so give higher weightage to values found in specific property
 
 Note that even without setting explicit `boost` and just setting 
`nodeScopeIndex`
@@ -1020,17 +1020,17 @@ way and that would start affecting the query execution 
leading to inconsistent r
 
 Since Oak 1.6 the index definitions are cloned upon reindexing and stored in a 
hidden structure.
 For further incremental indexing and for query plan calculation the stored 
index definition is used.
-So any changes done post reindex to index definition would not be applicable 
untill a reindex is done.
+So any changes done post reindex to index definition would not be applicable 
until a reindex is done.
 
 There would be some cases where changes in index definition does not require a 
reindex. For e.g. if a new property
-is being introduced in content model and no prior content exist with such a 
property then its safe to index such
+is being introduced in content model and no prior content exist with such a 
property then it's safe to index such
 a property without doing a reindex. For such cases user must follow below steps
 
 1. Make the required changes
 2. Set `refresh` property to `true` in index definition node
 3. Save the changes
 
-On next async indexing cycle this flag would be pickedup and stored index 
definition would be refreshed.
+On next async indexing cycle this flag would be picked up and stored index 
definition would be refreshed.
 _Post this the flag would be automatically removed and a log message would be 
logged_. You would also see a
 log message like below
 
@@ -1116,7 +1116,7 @@ the config file via `tika/config.xml` node in index 
config.
 #### <a name="mime-type-usage"></a>Mime type usage
 
 A binary is only indexed if there is an associated property `jcr:mimeType` 
defined
-and that is supported by Tika. By default indexer uses [TypeDetector][OAK-2895]
+and that is supported by Tika. By default, indexer uses 
[TypeDetector][OAK-2895]
 instead of default `DefaultDetector` which relies on the `jcr:mimeType` to 
pick up the
 right parser.
 
@@ -1238,7 +1238,7 @@ should be used
 
 ### <a name="persisting-indexes"></a>Persisting indexes to FileSystem
 
-By default Lucene indexes are stored in the `NodeStore`. If required they can
+By default, Lucene indexes are stored in the `NodeStore`. If required they can
 be stored on the file system directly
 
     - jcr:primaryType = "oak:QueryIndexDefinition"
@@ -1307,7 +1307,7 @@ index content e.g. size of index, number of documents 
present in index etc
 
 ![Lucene Index MBean](lucene-index-mbean.png)
 
-This MBean supports retriving index fields and terms using the 
`getFieldTermsInfo(java.lang.String indexPath, java.lang.String field, int max)`
+This MBean supports retrieving index fields and terms using the 
`getFieldTermsInfo(java.lang.String indexPath, java.lang.String field, int max)`
 and the `getFieldTermsInfo(java.lang.String indexPath, java.lang.String field, 
java.lang.String fieldType, int max)` methods. 
 
 The first method always assumes the return type is a String, the second method 
allows you to specify the return type as either:
@@ -1445,7 +1445,7 @@ Analyzed suggestions can be enabled by setting 
"suggestAnalyzed" property to tru
     - suggestUpdateFrequencyMinutes = 20
     - suggestAnalyzed = true
 ```
-_Note that up till Oak 1.3.14/1.2.14, `suggestAnalyzed` was to be setup at 
index definition node itself. That is is still
+_Note that up till Oak 1.3.14/1.2.14, `suggestAnalyzed` was to be setup at 
index definition node itself. That is still
 supported for backward compatibility, but having a separate `suggestion` node 
is preferred._
 
 Setting up `useInSuggest=true` for a property definition having 
`name=:nodeName` would add node names to
@@ -1476,7 +1476,7 @@ properties terms to be used for spellcheck corrections 
will be taken.
 Sample configuration for spellchecking based on terms contained in `jcr:title`
 property.
 
-Since Oak 1.3.11/1.2.14, the each suggestion would be returned per row.
+Since Oak 1.3.11/1.2.14, each suggestion would be returned per row.
 
 ```
 /oak:index/lucene-spellcheck
@@ -1655,7 +1655,7 @@ _Note that showing explanation score is expensive. So, 
this feature should be us
 `@since Oak 1.3.14`
 
 The following features is now deprecated:
-In OSGi enviroment, implementations of `IndexFieldProvider` and 
`FulltextQueryTermsProvider` under
+In OSGi environment, implementations of `IndexFieldProvider` and 
`FulltextQueryTermsProvider` under
 `org.apache.jackrabbit.oak.plugins.index.lucene.spi` (see javadoc 
[here][oak-lucene]) are called during indexing
 and querying as documented in javadocs.
 
@@ -1671,9 +1671,9 @@ the index definition should have a rule for each such 
property with the _useInSi
 As a result, after (re)indexing, each vector will be indexed so that an 
approximate nearest neighbour search is possible,
 not requiring brute force nearest neighbour search over the entire set of 
indexed vectors.
 
-By default another property for feature vector similarity search, called 
_similarityRerank_, is set to _true_ in order
+By default, another property for feature vector similarity search, called 
_similarityRerank_, is set to _true_ in order
 to allow reranking of the top 15 results using brute force nearest neighbour.
-Therefore in a first iteration an approximate nearest neighbour search is 
performed to obtain all the possibly relevant
+Therefore, in a first iteration an approximate nearest neighbour search is 
performed to obtain all the possibly relevant
 results (expecting high recall), then a brute force nearest neighbour over the 
top 15 search results is performed to
 improve precision (see 
[OAK-7824](https://issues.apache.org/jira/browse/OAK-7824), 
[OAK-7962](https://issues.apache.org/jira/browse/OAK-7962),
 [OAK-8119](https://issues.apache.org/jira/browse/OAK-8119)).
@@ -1717,16 +1717,16 @@ While defining the index definition do consider the 
following aspects
     data is stored under specific repository path and all queries are made 
under
     those path.
 
-    In fact its recommended to use single index if all the properties being 
indexed
+    In fact, it's recommended to use single index if all the properties being 
indexed
     are related. This would enable Lucene index to evaluate as much property
-    restriction as possible  natively (which is faster) and also save on 
storage
+    restriction as possible natively (which is faster) and also save on storage
     cost incurred in storing the node path.
 
 7.  Use features when required - There are certain features provided by Lucene
-    index  which incur extra cost in terms of storage space when enabled. For
+    index which incur extra cost in terms of storage space when enabled. For
     example enabling `evaluatePathRestrictions`, `ordering` etc. Enable such
     option only when you make use of those features and further enable them for
-    only those properties. So `ordering`  should be enabled only when sorting 
is
+    only those properties. So `ordering` should be enabled only when sorting is
     being performed for those properties and `evaluatePathRestrictions` should
     only be enabled if you are going to specify path restrictions.
 
@@ -1735,7 +1735,7 @@ While defining the index definition do consider the 
following aspects
     Index selection logic does not make use of the `includedPaths` and 
`excludedPaths`
     for index selection. Index selection is done only on cost basis and 
`queryPaths`.
     Having multiple definition for same type would cause ambiguity in index 
selection
-    and may lead to unexpected results. Instead have a single index definition 
for same
+    and may lead to unexpected results. Instead, have a single index 
definition for same
     type.
 
 Following analogy might be helpful to people coming from RDBMS world. Treat 
your
@@ -1804,7 +1804,7 @@ WHERE
 ```
 
 Here we can either add another property to the above definition or create a new
-index definition altogether. By default prefer to club such indexes together
+index definition altogether. By default, prefer to club such indexes together
 
 ```
 /oak:index/assetType
@@ -1866,8 +1866,8 @@ This can also be clubbed in same index definition above
 #### <a name="queries-structured-content"></a>B - Queries for structured 
content
 
 Queries in previous examples were based on mostly unstructured content where no
-nodeType restrictions were applied. However in many cases the nodes being 
queried
-confirm to certain structure. For example you have following content
+nodeType restrictions were applied. However, in many cases the nodes being 
queried
+confirm to certain structure. For example, you have the following content
 
 ```
 /content/dam/assets/december/banner.png
@@ -1902,7 +1902,7 @@ SELECT * FROM [app:Asset] AS a
 WHERE a.[jcr:content/metadata/status] = 'published'
 ```
 
-For this following index definition would be have to be created
+For this following index definition would have to be created
 
 ```
 /oak:index/assetType
@@ -2049,7 +2049,7 @@ Above index definition
     `original` node
 
 *   Aggregation would include by default all properties which are part of
-    **`includePropertyTypes`**. However if any property has a explicit property
+    **`includePropertyTypes`**. However, if any property has an explicit 
property
     definition provided like `comment` then `nodeScopeIndex` would need to be
     set to true
 
@@ -2089,7 +2089,7 @@ or
 ```
 would require an index on `app:Asset`  containing all nodes of the type. That, 
in
 turn, means that either the index needs to be a fulltext index or needs to be
-indexing `jcr:primaryType` property. All of the following definitions would 
work
+indexing `jcr:primaryType` property. All the following definitions would work
 for such a case:
 ```
    + /oak:index/index1
