Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
Alexis Miara
Index: trunk/content/documentation/query/text-query.mdtext
===================================================================
--- trunk/content/documentation/query/text-query.mdtext (revision 1682942)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -40,6 +40,7 @@
- [Configuring an analyzer](#configuring-an-analyzer)
- [Configuration by Code](#configuration-by-code)
- [Graph-specific Indexing](#graph-specific-indexing)
+ - [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
- [Working with Fuseki](#working-with-fuseki)
- [Building a Text Index](#building-a-text-index)
- [Deletion of Indexed Entities](#deletion-of-indexed-entities)
@@ -242,11 +243,20 @@
### Configuring an Analyzer
Text to be indexed is passed through a text analyzer that divides it into tokens
-and may perform other transformations such as eliminating stop words. If a Lucene
-text index is used then, by default a `StandardAnalyzer` is used. If a Solr text
+and may perform other transformations such as eliminating stop words. If a Solr text
index is used, the analyzer used is determined by the Solr configuration.
+If a Lucene text index is used, then by default a `StandardAnalyzer` is used. However,
+it can be replaced by another analyzer with the `text:analyzer` property.
+For example, with a `SimpleAnalyzer`:
-It is possible to configure an alternative analyzer for each field indexed in a
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:analyzer [
+            a text:SimpleAnalyzer
+        ]
+        .
+
+It is also possible to configure an alternative analyzer for each field indexed in a
Lucene index. For example:
<#entMap> a text:EntityMap ;
@@ -271,9 +281,15 @@
In addition, Jena provides `LowerCaseKeywordAnalyzer`,
which is a case-insensitive version of `KeywordAnalyzer`.
-New in Jena 2.13.0:
+Jena 3.0.0 introduces `LocalizedAnalyzer`, which gives access to Lucene's
+language-specific analyzers.
+See the [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
+section for details.
-There is an ability to specify an analyzer to be used for the
+
+#### Analyzer for Query
+
+New in Jena 2.13.0 is the optional ability to specify an analyzer to be used for the
query string itself. It will find terms in the query text. If not set, then the
analyzer used for the document will be used. The query analyzer is specified on
the `TextIndexLucene` resource:
@@ -338,6 +354,116 @@
**Note:** If you migrate from a global (non-graph-aware) index to a graph-aware index,
you need to rebuild the index to ensure that the graph information is stored.
+### Linguistic Support with Lucene Index
+
+It is now possible to use the language tags of literals to improve both
+indexing and querying. The sub-sections below describe the relevant index settings
+and show how to use them in SPARQL queries.
+
+#### Explicit Language Field in the Index
+
+The language tag of each literal can be stored in the index (when the triple is added)
+to extend query capabilities.
+To enable this, set the new `text:langField` property in the EntityMap assembler:
+
+    <#entMap> a text:EntityMap ;
+        text:entityField "uri" ;
+        text:defaultField "text" ;
+        text:langField "lang" ;
+        .
+
+If you configure the index via Java code, you need to set this parameter on the
+`EntityDefinition` instance, e.g.
+
+    EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
+    docDef.setLangField("lang");
+
+
+#### SPARQL Linguistic Clause Forms
+
+Once the `langField` is set, you can use it directly inside SPARQL queries: the `'lang:xx'`
+argument lets you target literals in a specific language. For example:
+
+    # target English literals
+    ?s text:query (rdfs:label 'word' 'lang:en')
+
+    # target literals with no language tag
+    ?s text:query (rdfs:label 'word' 'lang:none')
+
+    # ignore the language field
+    ?s text:query (rdfs:label 'word')
+
+
+#### LocalizedAnalyzer
+
+You can use a `LocalizedAnalyzer` to benefit from Lucene's language-specific
+analyzers (stemming, stop words, ...). Like any other analyzer, it can be
+configured as the default analyzer, for an individual field, or for the query.
+
+With an assembler configuration, the `text:language` property needs to be provided, e.g.:
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:entityMap <#entMap> ;
+        text:analyzer [
+            a text:LocalizedAnalyzer ;
+            text:language "fr"
+        ]
+        .
+
+This configures the index to analyze values of the `text` field using a `FrenchAnalyzer`.
+
+To configure the same example via Java code, you need to provide the analyzer to the
+index configuration object:
+
+    TextIndexConfig config = new TextIndexConfig(def);
+    Analyzer analyzer = Util.getLocalizedAnalyzer("fr");
+    config.setAnalyzer(analyzer);
+    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config);
+
+Here `def`, `ds1` and `dir` are instances of the `EntityDefinition`, `Dataset` and
+`Directory` classes, respectively.
+
+**Note**: You do not have to set the `text:langField` property when using a single
+localized analyzer.
+
+#### Multilingual Support
+
+Suppose the dataset contains many triples whose literals are in many different
+languages. All of these languages can be taken into account in later queries that mix languages.
+Set the `text:multilingualSupport` property to `true` to automatically enable localized
+indexing (and the localized analyzer for queries):
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory "mem" ;
+        text:multilingualSupport true ;
+        .
+
+Via Java code, set the multilingual support flag:
+
+    TextIndexConfig config = new TextIndexConfig(def);
+    config.setMultilingualSupport(true);
+    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config);
+
+A multilingual index therefore dynamically combines the localized analyzers for every language present in the data with
+the storage of the `langField` value.
+
+For example, different languages can be combined in the same text search query:
+
+    SELECT ?s
+    WHERE {
+        { ?s text:query ( rdfs:label 'institut' 'lang:fr' ) }
+        UNION
+        { ?s text:query ( rdfs:label 'institute' 'lang:en' ) }
+    }
+
+The result set therefore contains subjects related to "institute"
+(institution, institutional, ...) in both French and English.
+
+**Note**: If the `text:langField` property is not set, the `lang` field is
+used by default, because a multilingual index cannot work without it.
+
+
## Working with Fuseki
The Fuseki configuration simply points to the text dataset as the
@@ -500,3 +626,6 @@
adjusting the version <code>X.Y.Z</code> as necessary. This will automatically
include a compatible version of Lucene and the Solr java client, but not Solr
server.
+
+
+
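
The three `'lang:xx'` clause forms documented in the "SPARQL Linguistic Clause Forms" section of this patch can also be generated programmatically. The following is a minimal sketch in plain Java and is not part of Jena: the `LangClause` class and its `pattern` and `quote` methods are hypothetical names introduced only for illustration, and the sketch assumes the `text:` and `rdfs:` prefixes are declared elsewhere in the query.

```java
/**
 * Hypothetical helper (not a Jena API) that builds the text:query
 * triple patterns described in the patch as plain strings.
 */
final class LangClause {
    private LangClause() {}

    /** Wraps a value in single quotes, escaping backslashes and single quotes. */
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /**
     * Builds a text:query pattern for the given subject variable and property.
     * lang may be a language code such as "en", the special value "none"
     * (literals without a language), or null to omit the language argument.
     */
    static String pattern(String subjectVar, String property, String term, String lang) {
        StringBuilder sb = new StringBuilder();
        sb.append(subjectVar)
          .append(" text:query (")
          .append(property)
          .append(' ')
          .append(quote(term));
        if (lang != null) {
            sb.append(' ').append(quote("lang:" + lang));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(pattern("?s", "rdfs:label", "word", "en"));
        System.out.println(pattern("?s", "rdfs:label", "word", "none"));
        System.out.println(pattern("?s", "rdfs:label", "word", null));
    }
}
```

Passing `null` as the language omits the third argument, which corresponds to the form that ignores the language field.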