CMS diff: Text searches with SPARQL

Alexis Miara Fri, 05 Jun 2015 04:14:49 -0700

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext


Alexis Miara

Index: trunk/content/documentation/query/text-query.mdtext
===================================================================
--- trunk/content/documentation/query/text-query.mdtext (revision 1682942)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -40,6 +40,7 @@
     -   [Configuring an analyzer](#configuring-an-analyzer)
     -   [Configuration by Code](#configuration-by-code)
     -   [Graph-specific Indexing](#graph-specific-indexing)
+    -   [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
 - [Working with Fuseki](#working-with-fuseki)
 - [Building a Text Index](#building-a-text-index)
 - [Deletion of Indexed Entities](#deletion-of-indexed-entities)
@@ -242,11 +243,20 @@
 ### Configuring an Analyzer
 
 Text to be indexed is passed through a text analyzer that divides it into tokens
-and may perform other transformations such as eliminating stop words.  If a Lucene
-text index is used then, by default a `StandardAnalyzer` is used.  If a Solr text
+and may perform other transformations such as eliminating stop words. If a Solr text
 index is used, the analyzer used is determined by the Solr configuration.
+If a Lucene text index is used, a `StandardAnalyzer` is used by default. However,
+it can be replaced by another analyzer using the `text:analyzer` property.
+For example, to use a `SimpleAnalyzer`:
 
-It is possible to configure an alternative analyzer for each field indexed in a
+    <#indexLucene> a text:TextIndexLucene ;
+            text:directory <file:Lucene> ;
+            text:analyzer [
+                a text:SimpleAnalyzer
+            ]
+            . 
+
+It is also possible to configure an alternative analyzer for each field indexed in a
 Lucene index.  For example:
 
     <#entMap> a text:EntityMap ;
@@ -271,9 +281,15 @@
 In addition, Jena provides `LowerCaseKeywordAnalyzer`,
 which is a case-insensitive version of `KeywordAnalyzer`.
 
-New in Jena 2.13.0:
+In Jena 3.0.0, the new `LocalizedAnalyzer` was introduced to support Lucene's
+language-specific analyzers.
+See the [Linguistic Support with Lucene Index](#linguistic-support-with-lucene-index)
+section for details.
 
-There is an ability to specify an analyzer to be used for the
+
+#### Analyzer for Query
+
+New in Jena 2.13.0 is the optional ability to specify an analyzer to be used for the
 query string itself.  It will find terms in the query text.  If not set, then the
 analyzer used for the document will be used.  The query analyzer is specified on
 the `TextIndexLucene` resource:
@@ -338,6 +354,116 @@
 **Note:** If you migrate from a global (non-graph-aware) index to a graph-aware index,
 you need to rebuild the index to ensure that the graph information is stored.
 
+### Linguistic Support with Lucene Index
+
+The languages of triple literals can now be used to enhance indexing and
+querying. The sub-sections below describe the relevant index settings
+and use cases with SPARQL queries.
+
+#### Explicit Language Field in the Index 
+
+The language of each literal can be stored in the index (when the triple is added)
+to extend query capabilities.
+To enable this, set the new `text:langField` property in the `EntityMap` assembler:
+
+    <#entMap> a text:EntityMap ;
+        text:entityField      "uri" ;
+        text:defaultField     "text" ;        
+        text:langField        "lang" ;       
+        . 
+
+If you configure the index via Java code, you need to set this parameter on the
+`EntityDefinition` instance, e.g.
+
+    EntityDefinition docDef = new EntityDefinition(entityField, defaultField);
+    docDef.setLangField("lang");
+
+ 
+#### SPARQL Linguistic Clause Forms
+
+Once the `langField` is set, you can use it directly inside SPARQL queries: the `'lang:xx'`
+argument lets you target literals in a specific language. For example:
+
+    # target English literals
+    ?s text:query (rdfs:label 'word' 'lang:en' ) 
+    
+    # target unlocalized literals
+    ?s text:query (rdfs:label 'word' 'lang:none') 
+    
+    # ignore language field
+    ?s text:query (rdfs:label 'word')
+
+
+#### LocalizedAnalyzer
+
+You can configure a `LocalizedAnalyzer` to benefit from Lucene's language-specific
+analyzers (stemming, stop words, ...). Like any other analyzer, it can
+be set as the default analyzer for indexing, for an individual field, or for queries.
+
+With an assembler configuration, the `text:language` property needs to be provided, e.g.:
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory <file:Lucene> ;
+        text:entityMap <#entMap> ;
+        text:analyzer [
+            a text:LocalizedAnalyzer ;
+            text:language "fr"
+        ]
+        .
+
+This configures the index to analyze values of the 'text' field using a `FrenchAnalyzer`.
+
+To configure the same example via Java code, you need to provide the analyzer to the
+index configuration object:
+
+        TextIndexConfig config = new TextIndexConfig(def);
+        Analyzer analyzer = Util.getLocalizedAnalyzer("fr");
+        config.setAnalyzer(analyzer);
+        Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+Here `def`, `ds1` and `dir` are instances of the `EntityDefinition`, `Dataset` and
+`Directory` classes.
+
+**Note**: You do not have to set the `text:langField` property when using a single
+localized analyzer.
+
+#### Multilingual Support
+
+Suppose that we have many triples with localized literals in many different
+languages. All of these languages can be taken into account for mixed-language queries.
+Just set the `text:multilingualSupport` property to `true` to automatically enable localized
+indexing (and also the localized analyzer for queries):
+
+    <#indexLucene> a text:TextIndexLucene ;
+        text:directory "mem" ;
+        text:multilingualSupport true;     
+        .
+
+Via Java code, set the multilingual support flag:
+
+        TextIndexConfig config = new TextIndexConfig(def);
+        config.setMultilingualSupport(true);
+        Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
+
+This multilingual index thus dynamically combines the localized analyzers of all existing languages with
+the storage of `langField` values.
+
+For example, different languages can be used in the same text search query:
+
+    SELECT ?s
+    WHERE {
+        { ?s text:query ( rdfs:label 'institut' 'lang:fr' ) }
+        UNION
+        { ?s text:query ( rdfs:label 'institute' 'lang:en' ) }
+    }
+
+Hence, the result set of the query will contain subjects related to "institute"
+(institution, institutional, ...) in both French and English.
+
+**Note**: If the `text:langField` property is not set, the "lang" field will be
+used by default, because a multilingual index cannot work without it.
+
+
 ## Working with Fuseki
 
 The Fuseki configuration simply points to the text dataset as the
@@ -500,3 +626,6 @@
 
 adjusting the version <code>X.Y.Z</code> as necessary.  This will automatically
 include a compatible version of Lucene and the Solr java client, but not Solr server.
+
+
+
