Re: jena-text limit by named graph (and language?)

Osma Suominen Wed, 04 Dec 2013 10:15:33 -0800

Hi!

Sorry for spamming the list again :) This turned out to be easier to implement than I thought.

Attached is a new version of the patch. It adds support for storing the graph URI in the text index and for using it at query time. Storing and using graph URIs is optional and is enabled by defining the text:graphField property, as in the attached config file. By default, no graph information is stored, i.e. nothing changes, so the enhancement should be 100% backward compatible and should not cause trouble when upgrading.



To test this, do the following:

1. Rebuild and reinstall jena-text and Fuseki with the attached patch

2. Start Fuseki with the attached config file:
   ./fuseki-server --config config-text-tdb-graph.ttl

3. Put this in the named graph <http://example.com/graphA>:

<http://example.com/resourceA> <http://www.w3.org/2000/01/rdf-schema#label> "resourceA" .


...and this in the named graph <http://example.com/graphB>:

<http://example.com/resourceB> <http://www.w3.org/2000/01/rdf-schema#label> "resourceB" .


4. Run the following SPARQL query:

PREFIX text: <http://jena.apache.org/text#>
SELECT ?s {
  GRAPH <http://example.com/graphA> {
    ?s text:query 'res*' .
  }
}

If everything worked, you should get only one result, <http://example.com/resourceA>. Without this patch (or with graph indexing disabled), you will also get <http://example.com/resourceB>.
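
The query-time side of the patch boils down to appending a graph clause to the Lucene query string when a single named graph is active. Here is a minimal standalone sketch of just that step, without any Jena or Lucene dependency. The class and method names (GraphFilterSketch, escape, restrictToGraph) are hypothetical, and the escape() helper is only an approximation of what QueryParser.escape() does in the patch; the list of special characters is an assumption based on the classic Lucene query syntax.

```java
// Illustrative only: GraphFilterSketch and its methods are hypothetical names,
// not part of jena-text or Lucene.
final class GraphFilterSketch {

    // Characters assumed to be query syntax in the classic Lucene query parser.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Backslash-escape query syntax characters so the URI is matched literally. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /**
     * Returns the query unchanged when graph indexing is disabled
     * (graphField == null) or no single named graph is active
     * (graphUri == null); otherwise appends a graph-field clause with AND.
     */
    static String restrictToGraph(String queryString, String graphField, String graphUri) {
        if (graphField == null || graphUri == null) {
            return queryString;
        }
        return queryString + " AND " + graphField + ":" + escape(graphUri);
    }

    public static void main(String[] args) {
        // prints: res* AND graph:http\:\/\/example.com\/graphA
        System.out.println(restrictToGraph("res*", "graph", "http://example.com/graphA"));
    }
}
```

With text:graphField unset, the first argument comes back untouched, which is why nothing changes for existing setups. Note that a query string containing its own OR operators may need parentheses around it before the clause is appended; the sketch does not handle that case.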

I haven't yet tested the performance of this modification, but I expect it to perform much better than current jena-text for queries targeted at a single named graph, where the index currently returns hits from all graphs. I'll try to find out soon.

I did find that the increase in index size is negligible, about 0.16% in the du output below (this is after loading the STW Thesaurus, UNESCO Thesaurus, GEMET and Reegle thesaurus into distinct named graphs, using skos:prefLabel as the indexed predicate):


$ du -s Lucene*
5004    Lucene
5012    Lucene-graph


Comments? Any chances of getting this merged?

-Osma


04.12.2013 17:59, Osma Suominen wrote:

04.12.2013 15:40, Osma Suominen wrote:

So my question is: if we assume that we're dealing with TDB graphs, and
the SPARQL pattern limits the context to a single graph URI (as e.g.
<http://example.com/mygraph> in the example below), how can the
text:search property function know that and find out the graph URI?


Ah, nevermind, I got it now. The object available from
execCxt.getActiveGraph() inside TextQueryPF.exec() is actually a
GraphTDB instance in this case. GraphTDB inherits the getGraphName()
method from GraphView. And it seems I can use that method (as well as
isDefaultGraph() and isUnionGraph() for sanity checks) to determine the
graph URI to query for in the Lucene/Solr index.

I will try to implement the query side now, but it might take a while.

-Osma



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi

Index: src/test/java/org/apache/jena/query/text/TestTextTDB.java
===================================================================
--- src/test/java/org/apache/jena/query/text/TestTextTDB.java	(revision 1547859)
+++ src/test/java/org/apache/jena/query/text/TestTextTDB.java	(working copy)
@@ -39,7 +39,7 @@
     private static Dataset create() {
         Dataset ds1 = TDBFactory.createDataset() ;
         Directory dir = new RAMDirectory() ;
-        EntityDefinition eDef = new EntityDefinition("iri", "text", RDFS.label) ;
+        EntityDefinition eDef = new EntityDefinition("iri", "text", null, RDFS.label) ;
         TextIndex tidx = new TextIndexLucene(dir, eDef) ;
         Dataset ds = TextDatasetFactory.create(ds1, tidx) ;
         return ds ;
Index: src/test/java/org/apache/jena/query/text/TestBuildTextDataset.java
===================================================================
--- src/test/java/org/apache/jena/query/text/TestBuildTextDataset.java	(revision 1547859)
+++ src/test/java/org/apache/jena/query/text/TestBuildTextDataset.java	(working copy)
@@ -110,7 +110,7 @@
         Dataset ds1 = DatasetFactory.createMem() ;
 
         // Define the index mapping
-        EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label.asNode()) ;
+        EntityDefinition entDef = new EntityDefinition("uri", "text", null, RDFS.label.asNode()) ;
 
         // Lucene, in memory.
         Directory dir = new RAMDirectory() ;
Index: src/main/java/examples/JenaTextExample1.java
===================================================================
--- src/main/java/examples/JenaTextExample1.java	(revision 1547859)
+++ src/main/java/examples/JenaTextExample1.java	(working copy)
@@ -59,7 +59,7 @@
         Dataset ds1 = DatasetFactory.createMem() ; 
 
         // Define the index mapping 
-        EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label.asNode()) ;
+        EntityDefinition entDef = new EntityDefinition("uri", "text", null, RDFS.label.asNode()) ;
 
         // Lucene, in memory.
         Directory dir =  new RAMDirectory();
Index: src/main/java/org/apache/jena/query/text/assembler/EntityMapAssembler.java
===================================================================
--- src/main/java/org/apache/jena/query/text/assembler/EntityMapAssembler.java	(revision 1547859)
+++ src/main/java/org/apache/jena/query/text/assembler/EntityMapAssembler.java	(working copy)
@@ -62,7 +62,10 @@
                                         "SELECT * {" ,
                                         "  ?eMap  :entityField  ?entityField ;" ,
                                         "         :map ?map ;",
-                                        "         :defaultField ?dftField" , 
+                                        "         :defaultField ?dftField ." ,
+                                        "  OPTIONAL {" ,
+                                        "    ?eMap :graphField ?graphField" ,
+                                        "  }",
                                         "}") ;
         ParameterizedSparqlString pss = new ParameterizedSparqlString(qs1) ;
         pss.setIri("eMap", root.getURI()) ;
@@ -83,7 +86,7 @@
         
         QuerySolution qsol1 = results.get(0) ;
         String entityField = qsol1.getLiteral("entityField").getLexicalForm() ;
-        
+        String graphField = qsol1.contains("graphField") ? qsol1.getLiteral("graphField").getLexicalForm() : null;
         String defaultField = qsol1.contains("dftField") ? qsol1.getLiteral("dftField").getLexicalForm() : null ;
         
         String qs2 = StrUtils.strjoinNL("SELECT * { ?map list:member [ :field ?field ; :predicate ?predicate ] }") ;
@@ -107,7 +110,7 @@
         }
         
         
-        EntityDefinition docDef = new EntityDefinition(entityField, defaultField) ;
+        EntityDefinition docDef = new EntityDefinition(entityField, defaultField, graphField) ;
         for ( String f : mapDefs.keys() ) {
             for ( Node p : mapDefs.get(f)) 
                 docDef.set(f, p) ;
Index: src/main/java/org/apache/jena/query/text/TextQueryPF.java
===================================================================
--- src/main/java/org/apache/jena/query/text/TextQueryPF.java	(revision 1547859)
+++ src/main/java/org/apache/jena/query/text/TextQueryPF.java	(working copy)
@@ -32,6 +32,8 @@
 import com.hp.hpl.jena.graph.Node ;
 import com.hp.hpl.jena.query.QueryBuildException ;
 import com.hp.hpl.jena.sparql.core.DatasetGraph ;
+import com.hp.hpl.jena.sparql.core.GraphView ;
+import com.hp.hpl.jena.sparql.core.Quad ;
 import com.hp.hpl.jena.sparql.core.Substitute ;
 import com.hp.hpl.jena.sparql.core.Var ;
 import com.hp.hpl.jena.sparql.engine.ExecutionContext ;
@@ -111,6 +113,7 @@
             // Not a text dataset - no-op
             return IterLib.result(binding, execCxt) ;
         }
+        
 
         DatasetGraph dsg = execCxt.getDataset() ;
 
@@ -181,6 +184,18 @@
     }
 
     private List<Node> query(String queryString, int limit, ExecutionContext execCxt) {
+        // use the graph information in the text index if possible
+        if (server.getDocDef().getGraphField() != null
+            && execCxt.getActiveGraph() instanceof GraphView) {
+            GraphView activeGraph = (GraphView)execCxt.getActiveGraph() ;
+            if (activeGraph.getGraphName() != null && !Quad.isUnionGraph(activeGraph.getGraphName())) {
+                String uri = activeGraph.getGraphName().getURI() ;
+                String escaped = QueryParser.escape(uri) ;
+                String qs2 = server.getDocDef().getGraphField() + ":" + escaped ;
+                queryString = queryString + " AND " + qs2 ;
+            }
+        }    
+    
         Explain.explain(execCxt.getContext(), "Text query: "+queryString) ;
         if ( log.isDebugEnabled())
             log.debug("Text query: {} ({})", queryString,limit) ;
Index: src/main/java/org/apache/jena/query/text/TextIndexLucene.java
===================================================================
--- src/main/java/org/apache/jena/query/text/TextIndexLucene.java	(revision 1547859)
+++ src/main/java/org/apache/jena/query/text/TextIndexLucene.java	(working copy)
@@ -23,10 +23,13 @@
 import java.util.Map.Entry ;
 
 import org.apache.lucene.analysis.Analyzer ;
+import org.apache.lucene.analysis.core.KeywordAnalyzer ;
+import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper ;
 import org.apache.lucene.analysis.standard.StandardAnalyzer ;
 import org.apache.lucene.document.Document ;
 import org.apache.lucene.document.Field ;
 import org.apache.lucene.document.FieldType ;
+import org.apache.lucene.document.StringField ;
 import org.apache.lucene.document.TextField ;
 import org.apache.lucene.index.DirectoryReader ;
 import org.apache.lucene.index.IndexReader ;
@@ -61,6 +64,7 @@
         ftIRI.setIndexed(true) ;
         ftIRI.freeze() ;
     }
+    public static final FieldType ftString = StringField.TYPE_NOT_STORED ;
     public static final FieldType ftText = TextField.TYPE_NOT_STORED ;
     // Bigger index, easier to debug!
     // public static final FieldType ftText = TextField.TYPE_STORED ;
@@ -68,13 +72,21 @@
     private final EntityDefinition docDef ;
     private final Directory directory ;
     private IndexWriter indexWriter ;
-    private Analyzer analyzer = new StandardAnalyzer(VER);
+    private Analyzer analyzer ;
     
     public TextIndexLucene(Directory directory, EntityDefinition def)
     {
         this.directory = directory ;
         this.docDef = def ;
         
+        // create the analyzer as a wrapper that uses KeywordAnalyzer for
+        // entity and graph fields and StandardAnalyzer for all other
+        Map<String,Analyzer> analyzerPerField = new HashMap<String,Analyzer>() ;
+        analyzerPerField.put(def.getEntityField(), new KeywordAnalyzer()) ;
+        if (def.getGraphField() != null)
+            analyzerPerField.put(def.getGraphField(), new KeywordAnalyzer()) ;
+        this.analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(VER), analyzerPerField) ;
+        
         // force creation of the index if it don't exist
         // otherwise if we get a search before data is written we get an exception
         startIndexing();
@@ -136,6 +148,13 @@
         Document doc = new Document() ;
         Field entField = new Field(docDef.getEntityField(), entity.getId(), ftIRI) ;
         doc.add(entField) ;
+
+        String graphField = docDef.getGraphField() ;
+        if ( graphField != null )
+        {
+            Field gField = new Field(graphField, entity.getGraph(), ftString) ;
+            doc.add(gField) ;
+        }
         
         for ( Entry<String, Object> e : entity.getMap().entrySet() )
         {
Index: src/main/java/org/apache/jena/query/text/Entity.java
===================================================================
--- src/main/java/org/apache/jena/query/text/Entity.java	(revision 1547859)
+++ src/main/java/org/apache/jena/query/text/Entity.java	(working copy)
@@ -24,12 +24,20 @@
 public class Entity
 {
     private final String id ;
+    private final String graph ;
     private final Map<String, Object> map = new HashMap<String, Object>() ;
 
-    public Entity(String entityId)          { this.id = entityId ; }
+    public Entity(String entityId, String entityGraph) {
+        this.id = entityId ;
+        this.graph = entityGraph;
+    }
+
+    public Entity(String entityId)          { this(entityId, null) ; }
     
     public String getId()                   { return id ; }
 
+    public String getGraph()                { return graph ; }
+
     public void put(String key, Object value)
     { map.put(key, value) ; }
     
Index: src/main/java/org/apache/jena/query/text/TextIndexSolr.java
===================================================================
--- src/main/java/org/apache/jena/query/text/TextIndexSolr.java	(revision 1547859)
+++ src/main/java/org/apache/jena/query/text/TextIndexSolr.java	(working copy)
@@ -87,6 +87,13 @@
     {
         SolrInputDocument doc = new SolrInputDocument() ;
         doc.addField(docDef.getEntityField(), entity.getId()) ;
+        
+        String graphField = docDef.getGraphField() ;
+        if ( graphField != null )
+        {
+            doc.addField(graphField, entity.getGraph()) ;
+        }
+        
         // the addition needs to be done as a partial update
         // otherwise, if we have multiple fields, each successive
         // addition will replace the previous one and we are left
Index: src/main/java/org/apache/jena/query/text/TextDocProducerTriples.java
===================================================================
--- src/main/java/org/apache/jena/query/text/TextDocProducerTriples.java	(revision 1547859)
+++ src/main/java/org/apache/jena/query/text/TextDocProducerTriples.java	(working copy)
@@ -65,7 +65,8 @@
             return ;
 
         String x = (s.isURI() ) ? s.getURI() : s.getBlankNodeLabel() ;
-        Entity entity = new Entity(x) ;
+        String graph = (g.isURI() ) ? g.getURI() : g.getBlankNodeLabel() ;
+        Entity entity = new Entity(x, graph) ;
 
         if ( ! o.isLiteral() )
         {
Index: src/main/java/org/apache/jena/query/text/EntityDefinition.java
===================================================================
--- src/main/java/org/apache/jena/query/text/EntityDefinition.java	(revision 1547859)
+++ src/main/java/org/apache/jena/query/text/EntityDefinition.java	(working copy)
@@ -39,6 +39,7 @@
     // Collections.unmodifiableCollection(fieldToPredicate.keySet()) ;
     private final String                 entityField ;
     private final String                 primaryField ;
+    private final String                 graphField ;
     //private final Node                   primaryPredicate ;
 
     /**
@@ -46,10 +47,13 @@
      *            The entity being indexed (e.g. it's URI).
      * @param primaryField
      *            The primary/default field to search
+     * @param graphField
+     *            The field that stores graph URI, or null
      */
-    public EntityDefinition(String entityField, String primaryField) {
+    public EntityDefinition(String entityField, String primaryField, String graphField) {
         this.entityField = entityField ;
         this.primaryField = primaryField ;
+        this.graphField = graphField ;
     }
 
     /**
@@ -57,11 +61,13 @@
      *            The entity being indexed (e.g. it's URI).
      * @param primaryField
      *            The primary/default field to search
+     * @param graphField
+     *            The field that stores graph URI, or null
      * @param primaryPredicate
      *            The property associated with the primary/default field
      */
-    public EntityDefinition(String entityField, String primaryField, Resource primaryPredicate) {
-        this(entityField, primaryField, primaryPredicate.asNode()) ;
+    public EntityDefinition(String entityField, String primaryField, String graphField, Resource primaryPredicate) {
+        this(entityField, primaryField, graphField, primaryPredicate.asNode()) ;
     }
 
     /**
@@ -69,11 +75,13 @@
      *            The entity being indexed (e.g. it's URI).
      * @param primaryField
      *            The primary/default field to search
+     * @param graphField
+     *            The field that stores graph URI, or null
      * @param primaryPredicate
      *            The property associated with the primary/default field
      */
-    public EntityDefinition(String entityField, String primaryField, Node primaryPredicate) {
-        this(entityField, primaryField) ;
+    public EntityDefinition(String entityField, String primaryField, String graphField, Node primaryPredicate) {
+        this(entityField, primaryField, graphField) ;
         set(primaryField, primaryPredicate) ;
     }
 
@@ -107,6 +115,10 @@
         return getOne(c) ;
     }
 
+    public String getGraphField() {
+        return graphField ;
+    }
+
     public Collection<String> fields() {
         return fields ;
     }

## Example of a TDB dataset and text index published using Fuseki

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

[] rdf:type fuseki:Server ;
   # Timeout - server-wide default: milliseconds.
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;
   # ja:loadClass "your.code.Class" ;

   fuseki:services (
     <#service_text_tdb>
   ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "ds" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  <#text_dataset> ;
    .

<#text_dataset> rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    ##text:index   <#indexSolr> ;
    text:index     <#indexLucene> ;
    .

<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    ##tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ;
    text:defaultField     "text" ;        ## Should be defined in the text:map.
    text:map (
         # rdfs:label            
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .
