CMS diff: Text searches with SPARQL

Chris Dollin Mon, 23 Mar 2015 04:48:07 -0700

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext


Chris Dollin

Index: trunk/content/documentation/query/text-query.mdtext
===================================================================
--- trunk/content/documentation/query/text-query.mdtext (revision 1655891)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -43,6 +43,7 @@
 - [Working with Fuseki](#working-with-fuseki)
 - [Building a Text Index](#building-a-text-index)
 - [Deletion of Indexed Entities](#deletion-of-indexed-entities)
+- [Configuring Alternative TextDocProducers](#configuring-alternative-textdocproducers)
 - [Maven Dependency](#maven-dependency)
 
 ## Architecture
@@ -405,6 +406,73 @@
 It may be necessary to periodically rebuild the index if a large proportion
 of the RDF data changes.
 
+# Configuring Alternative TextDocProducers
+
+The default behaviour when text indexing is to index a single
+property as a single field, generating a different `Document`
+for each indexed triple. Changing this behaviour requires
+writing and configuring an alternative `TextDocProducer`.
+
+To configure a `TextDocProducer`, say `MyProducer`, in a dataset assembly,
+use the property `textDocProducer`, e.g.:
+
+       <#ds-with-lucene> rdf:type text:TextDataset;
+               text:index <#indexLucene> ;
+               text:dataset <#ds> ;
+               text:textDocProducer <java:CLASSNAME> ;
+               .
+
+where CLASSNAME is the `TextDocProducer` class; it must have either
+a single-argument constructor taking a `TextIndex`, or a two-argument
+constructor taking `(DatasetGraph, TextIndex)`. The `TextIndex` argument
+will be the configured text index, and the `DatasetGraph` argument
+will be the graph of the configured dataset.
+
+For example, to explicitly create the default `TextDocProducer` use:
+
+       ...
+       text:textDocProducer <java:org.apache.jena.query.text.TextDocProducerTriples> ;
+       ...
+
+`TextDocProducerTriples` produces a new `Document` for each subject/field
+added to the dataset, using `TextIndex.addEntity(Entity)`. 
+
+## Example 
+
+The example class below is a `TextDocProducer` that only indexes
+`ADD`s of quads for which the subject already had at least one
+value for that property. It uses the two-argument constructor to give it
+access to the dataset so that it can count the `(?G, S, P, ?O)` quads
+with that subject and predicate, and delegates the indexing to
+`TextDocProducerTriples` if there are at least two values for
+that property (one of those values, of course, is the one that
+gives rise to this `change()`).
+
+
+       public class Example extends TextDocProducerTriples {
+       
+               final DatasetGraph dg;
+               
+               public Example(DatasetGraph dg, TextIndex indexer) {
+                       super(indexer);
+                       this.dg = dg;
+               }
+               
+               public void change(QuadAction qaction, Node g, Node s, Node p, Node o) {
+                       if (qaction == QuadAction.ADD) {
+                               if (alreadyHasOne(s, p)) super.change(qaction, g, s, p, o);
+                       }
+               }
+       
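+               // True if the dataset holds at least two quads with this subject and
+               // predicate, i.e. at least one besides the quad being added.
+               private boolean alreadyHasOne(Node s, Node p) {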
+                       int count = 0;
+                       Iterator<Quad> quads = dg.find( null, s, p, null );
+                       while (quads.hasNext()) { quads.next(); count += 1; }
+                       return count > 1;
+               }
+       
+       }
+
 ## Maven Dependency
 
 The <code>jena-text</code> module is included in Fuseki.  To use it within application code,
@@ -417,4 +485,4 @@
     </dependency>
 
 adjusting the version <code>X.Y.Z</code> as necessary.  This will automatically
-include a compatible version of Lucene and the Solr java client, but not Solr server.
\ No newline at end of file
+include a compatible version of Lucene and the Solr java client, but not Solr server.
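
The filtering rule in the `Example` class above (pass an `ADD` on for indexing only when the subject and predicate already have another value) can be tried out without Jena. The following is a minimal stand-alone sketch, not Jena code: `SecondValueFilterSketch`, its `Quad` class and its two lists are hypothetical stand-ins for `Example`, Jena's `Quad`, the `DatasetGraph` and the text index. It assumes, as the prose in the diff implies, that the quad is already in the dataset when `change()` is called.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of the filtering rule in the Example class.
// None of these types are Jena classes; they are hypothetical stand-ins.
class SecondValueFilterSketch {

    /** Hypothetical stand-in for a quad (graph, subject, predicate, object). */
    static final class Quad {
        final String g, s, p, o;
        Quad(String g, String s, String p, String o) {
            this.g = g; this.s = s; this.p = p; this.o = o;
        }
    }

    private final List<Quad> dataset = new ArrayList<>(); // stands in for the DatasetGraph
    private final List<Quad> indexed = new ArrayList<>(); // stands in for the text index

    /** Stores the quad, then notifies the filter (assumed ordering: store first). */
    public void add(String g, String s, String p, String o) {
        Quad q = new Quad(g, s, p, o);
        dataset.add(q);
        change(q);
    }

    /** Indexes the quad only when its subject/predicate pair has more than one value. */
    private void change(Quad q) {
        if (countValues(q.s, q.p) > 1) indexed.add(q);
    }

    /** Counts stored quads with this subject and predicate, in any graph. */
    private int countValues(String s, String p) {
        int count = 0;
        for (Quad q : dataset) {
            if (q.s.equals(s) && q.p.equals(p)) count += 1;
        }
        return count;
    }

    /** Returns the objects of the quads that were passed on for indexing. */
    public List<String> indexedObjects() {
        List<String> out = new ArrayList<>();
        for (Quad q : indexed) out.add(q.o);
        return out;
    }

    public static void main(String[] args) {
        SecondValueFilterSketch f = new SecondValueFilterSketch();
        f.add("g1", "s1", "label", "first");   // only value so far: skipped
        f.add("g2", "s1", "label", "second");  // second value, any graph: indexed
        f.add("g1", "s2", "label", "other");   // different subject: skipped
        System.out.println(f.indexedObjects()); // prints [second]
    }
}
```

Under this rule the first value for a subject and predicate is never indexed, not even once a second value arrives; only the later additions are passed on.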
