Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===================================================================
--- trunk/content/documentation/query/text-query.mdtext (revision 1821724)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -2,6 +2,8 @@
 
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -80,6 +82,7 @@
         - [Queries with graphs](#queries-with-graphs)
         - [Queries across multiple `Fields`](#queries-across-multiple-fields)
         - [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+        - [Highlighting](#highlighting)
     - [Good practice](#good-practice)
 - [Configuration](#configuration)
     - [Text Dataset Assembler](#text-dataset-assembler)
@@ -242,6 +245,7 @@
 | query string | Lucene query string fragment |
 | limit | (optional) `int` limit on the number of results |
 | lang:xx | (optional) language tag spec |
+| highlight:xx | (optional) highlighting options |
 
 The `property` URI is only necessary if multiple properties have been
 indexed and the property being searched over is not the [default field
@@ -258,8 +262,10 @@
 indexed with the tag _xx_. Searches may be restricted to field values with no
 language tag via `"lang:none"`.
 
-If both `limit` and `lang:xx` are present, then `limit` must precede `lang:xx`.
+The `highlight:xx` specification is an optional string where _xx_ are options that control the highlighting of search result literals. See [below](#highlighting) for details.
 
+If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, then `limit` must precede these arguments.
+
 If only the query string is required, the surrounding `( )` _may be_ omitted.
 
 #### Output arguments:
@@ -495,7 +501,52 @@
 **Always surround the query string with `( )` if more than a single term or phrase
 are involved.**
 
+#### Highlighting
 
+The highlighting option uses the Lucene `Highlighter` and `SimpleHTMLFormatter` to insert highlighting markup into the literals returned from search results (hence the text dataset must be configured to store the literals). The highlighted results are returned via the _literal_ output argument.
+
+The simplest way to request highlighting is via `'highlight:'`. This will apply all the defaults:
+
+| Option             | Key             | Default             |
+|--------------------|-----------------|---------------------|
+| maxFrags           | m:              | 3                   |
+| fragSize           | z:              | 128                 |
+| start              | s:              | RIGHT_ARROW         |
+| end                | e:              | LEFT_ARROW          |
+| fragSep            | f:              | DIVIDES             |
+| joinHi             | jh:             | true                |
+| joinFrags          | jf:             | true                |
+
+to the highlighting of the search results. For example if the query is:
+
+    (?s ?sc ?lit) text:query ( "brown fox" "highlight:" )
+
+then a resulting literal binding might be:
+
+    "the quick ↦brown fox↤ jumped over the lazy baboon"
+
+The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. These are chosen to be single characters that in most situations will be very unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be large enough that in many situations the matches will result in single fragments. If the literal is larger than 128 characters and there are several matches in the literal then there may be additional fragments separated by the `DIVIDES`, Unicode \u2223.
+
+Depending on the analyzer used and the tokenizer, the highlighting will result in marking each token rather than an entire phrase. The `joinHi` option is by default `true` so that entire phrases are highlighted together rather than as individual tokens as in:
+
+    "the quick ↦brown↤ ↦fox↤ jumped over the lazy baboon"
+
+which would result from:
+
+    (?s ?sc ?lit) text:query ( "brown fox" "highlight:jh:n" )
+
+The `jh` and `jf` boolean options are set `false` via `n`. Any other value is `true`. The defaults for these options have been selected to be reasonable for most applications.
+
+The joining is performed post highlighting via Java `String replaceAll` rather than using the Lucene Unified Highlighter facility which requires that term vectors and positions be stored. The joining deletes _extra_ highlighting with only intervening Unicode separators, `\p{Z}`.
+
+The more conventional output of the Lucene `SimpleHTMLFormatter` with html emphasis markup is achieved via, `"highlight:s:<em class='hiLite'> | e:</em>"` (highlight options are separated by a Unicode vertical line, \u007c. The spaces are not necessary). The result with the above example will be:
+
+    "the quick <em class='hiLite'>brown fox</em> jumped over the lazy baboon"
+
+which would result from the query:
+
+    (?s ?sc ?lit) text:query ( "brown fox" "highlight:s:<em class='hiLite'> | e:</em>" )
+
 ### Good practice
 
 From the above it should be clear that best practice, except in the simplest cases