jena-text limit by named graph (and language?)

Osma Suominen Wed, 04 Dec 2013 02:11:00 -0800

Hi,

I'm reposting the message below from the users mailing list, as this seems to be a more appropriate place to submit new patches.

I'd like to add support to jena-text to store the named graph (URI) of the indexed triples, to get faster text query performance when the query is intended for only one named graph.

The attached patch adds this information to the index. What is missing is proper support for actually using the graph information at query time. I had some problems implementing that, as detailed in my message below.


Any comments are very welcome!

Best regards
Osma Suominen


-------- Original Message --------
Subject: Re: jena-text limit by language and/or named graph
Date: Fri, 29 Nov 2013 14:02:32 +0200
From: Osma Suominen <[email protected]>
To: [email protected]

Hi Andy!

> Should this be per map entry/ per predicate?  I don't know which is
> best - whether an index-wide configuration or whether it might be
> some predicates are indexed one way and some another.


For now, I think this can be global, i.e. not possible to set per predicate.

> (and if there is no lang, presumably "").


Probably yes, though I'll defer the lang discussion for now and
concentrate on getting the graph information into the index first
because that is more critical for me - I have dozens of graphs, but only
a few languages in each graph.

> Sounds sane.


Great!

> What would the query predicate in SPARQL look like?


For the graph part, I think there is no need to introduce any new
syntax. Simply having the text:query within the context of a specific
graph should be enough, i.e. this should work:

GRAPH <http://example.com/mygraph> {
   ?s text:query "keyword" .
}

For the language part, I'm not so sure, but I'll defer the discussion
for now.

> If it all defaults back to the current mode of operations, we have a
> non-disruptive upgrade path, which would be better if possible.  It's a
> change of disk-format which is always more of an issue for existing
> use.


Yes, that is my intent, to not disrupt existing use in any way.

Attached is a first draft patch which is my attempt at adding graph
information to the index, iff graphField has been set in the config
file, as in the attached config file.

With this patch, you can use a query such as this:

SELECT ?s {
   ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
}

and you will only get results from within the specified graph. This is
obviously a bit awkward since you have to know the name of the graph
field, and also the URI quoting is ugly. But at least it proves that the
graph information was successfully stored in the index and can be used
for retrieval.
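
As a rough illustration of the quoting involved, the sketch below builds such a
query string in plain Java. It is not part of the patch: the class and method
names are hypothetical, and escapeLucene is a hand-written stand-in that
backslash-escapes the characters assumed here to be special in Lucene's classic
query syntax. Real code would more likely use the escaping that Lucene itself
provides.

```java
final class GraphQuerySketch {
    // Characters assumed to be special in Lucene's classic query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Backslash-escapes every character listed in SPECIAL. */
    static String escapeLucene(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /**
     * Adds a mandatory graph clause to a user query. Returns the query
     * unchanged when no graph field is configured or no graph is given,
     * which mirrors the "defaults back to current behaviour" intent.
     */
    static String restrictToGraph(String userQuery, String graphField, String graphUri) {
        if (graphField == null || graphUri == null) {
            return userQuery;
        }
        return "+(" + userQuery + ") +" + graphField + ":\"" + escapeLucene(graphUri) + "\"";
    }

    /** Wraps a Lucene query in a single-quoted SPARQL string literal. */
    static String toSparqlLiteral(String luceneQuery) {
        return "'" + luceneQuery.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }
}
```

For example, restrictToGraph("res*", "graph", "http://example.com/graphA")
yields +(res*) +graph:"http\:\/\/example.com\/graphA", and toSparqlLiteral then
doubles each backslash so the result can be pasted into a text:query pattern.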

However, I couldn't figure out how to get the URI of the current graph
at query time so that an explicit "graph:" query part could be avoided.

An ExecutionContext is passed to TextQueryPF methods and it has a
getActiveGraph() method which looks promising. But neither the Graph
interface nor the GraphBase implementation seem to be aware of the URI
(or Node in general) they are identified by. The only (possible,
untested) way that I could think of would be to also call
ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
the result matches the Graph that getActiveGraph() returned. But this
seems awfully inefficient, especially if there are lots of graphs. Is
there a better way to find out the URI of the current graph within
TextQueryPF methods?
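
The untested reverse lookup described above could be sketched roughly as
follows. To keep the example self-contained it does not use Jena's types: the
NamedGraphs interface is a hypothetical stand-in exposing only the two
DatasetGraph methods mentioned (listGraphNodes and getGraph), and fromMap is a
helper invented for the demonstration. Comparing graphs with equals() is also
an assumption; whether a dataset returns a comparable Graph object on every
call is exactly the kind of detail that would need checking.

```java
import java.util.Iterator;
import java.util.Map;

final class ActiveGraphLookupSketch {
    /** Hypothetical stand-in for the two dataset methods used in the lookup. */
    interface NamedGraphs<N, G> {
        Iterator<N> listGraphNodes();
        G getGraph(N node);
    }

    /**
     * Linear scan over all graph names. Returns the name whose graph equals
     * the active graph, or null if none matches (e.g. the default graph).
     * Cost grows with the number of named graphs, which is the concern
     * raised in the message.
     */
    static <N, G> N findGraphName(NamedGraphs<N, G> dataset, G activeGraph) {
        Iterator<N> names = dataset.listGraphNodes();
        while (names.hasNext()) {
            N name = names.next();
            G candidate = dataset.getGraph(name);
            if (candidate != null && candidate.equals(activeGraph)) {
                return name;
            }
        }
        return null;
    }

    /** Demo helper: backs the stand-in interface with an ordinary map. */
    static <N, G> NamedGraphs<N, G> fromMap(final Map<N, G> graphs) {
        return new NamedGraphs<N, G>() {
            @Override
            public Iterator<N> listGraphNodes() {
                return graphs.keySet().iterator();
            }

            @Override
            public G getGraph(N node) {
                return graphs.get(node);
            }
        };
    }
}
```

A lookup like this would run once per text:query evaluation, so with dozens of
graphs it may be tolerable, but it does not scale to datasets with very many
named graphs.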

Finally some misc notes:
- TextDocProducerEntities seems to be unused - not touched
- TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
- TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
   when you could directly create a Query programmatically - not touched
- I think get$ was broken anyway because it doesn't take into account
   that the query is tokenized by StandardAnalyzer - but this should now
   be fixed as a side effect of using PerFieldAnalyzerWrapper
- I made similar changes in TextIndexSolr as in TextIndexLucene, but
   have so far tested only the Lucene part

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi

Index: src/main/java/examples/JenaTextExample1.java
===================================================================
--- src/main/java/examples/JenaTextExample1.java	(revision 1546529)
+++ src/main/java/examples/JenaTextExample1.java	(working copy)
@@ -59,7 +59,7 @@
         Dataset ds1 = DatasetFactory.createMem() ; 
 
         // Define the index mapping 
-        EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label.asNode()) ;
+        EntityDefinition entDef = new EntityDefinition("uri", "text", null, RDFS.label.asNode()) ;
 
         // Lucene, in memory.
         Directory dir =  new RAMDirectory();
Index: src/main/java/org/apache/jena/query/text/EntityDefinition.java
===================================================================
--- src/main/java/org/apache/jena/query/text/EntityDefinition.java	(revision 1546529)
+++ src/main/java/org/apache/jena/query/text/EntityDefinition.java	(working copy)
@@ -39,6 +39,7 @@
     // Collections.unmodifiableCollection(fieldToPredicate.keySet()) ;
     private final String                 entityField ;
     private final String                 primaryField ;
+    private final String                 graphField ;
     //private final Node                   primaryPredicate ;
 
     /**
@@ -46,10 +47,13 @@
      *            The entity being indexed (e.g. it's URI).
      * @param primaryField
      *            The primary/default field to search
+     * @param graphField
+     *            The field that stores graph URI, or null
      */
-    public EntityDefinition(String entityField, String primaryField) {
+    public EntityDefinition(String entityField, String primaryField, String graphField) {
         this.entityField = entityField ;
         this.primaryField = primaryField ;
+        this.graphField = graphField ;
     }
 
     /**
@@ -57,11 +61,13 @@
      *            The entity being indexed (e.g. it's URI).
      * @param primaryField
      *            The primary/default field to search
+     * @param graphField
+     *            The field that stores graph URI, or null
      * @param primaryPredicate
      *            The property associated with the primary/default field
      */
-    public EntityDefinition(String entityField, String primaryField, Resource primaryPredicate) {
-        this(entityField, primaryField, primaryPredicate.asNode()) ;
+    public EntityDefinition(String entityField, String primaryField, String graphField, Resource primaryPredicate) {
+        this(entityField, primaryField, graphField, primaryPredicate.asNode()) ;
     }
 
     /**
@@ -69,11 +75,13 @@
      *            The entity being indexed (e.g. it's URI).
      * @param primaryField
      *            The primary/default field to search
+     * @param graphField
+     *            The field that stores graph URI, or null
      * @param primaryPredicate
      *            The property associated with the primary/default field
      */
-    public EntityDefinition(String entityField, String primaryField, Node primaryPredicate) {
-        this(entityField, primaryField) ;
+    public EntityDefinition(String entityField, String primaryField, String graphField, Node primaryPredicate) {
+        this(entityField, primaryField, graphField) ;
         set(primaryField, primaryPredicate) ;
     }
 
@@ -107,6 +115,10 @@
         return getOne(c) ;
     }
 
+    public String getGraphField() {
+        return graphField ;
+    }
+
     public Collection<String> fields() {
         return fields ;
     }
Index: src/main/java/org/apache/jena/query/text/assembler/EntityMapAssembler.java
===================================================================
--- src/main/java/org/apache/jena/query/text/assembler/EntityMapAssembler.java	(revision 1546529)
+++ src/main/java/org/apache/jena/query/text/assembler/EntityMapAssembler.java	(working copy)
@@ -62,7 +62,10 @@
                                         "SELECT * {" ,
                                         "  ?eMap  :entityField  ?entityField ;" ,
                                         "         :map ?map ;",
-                                        "         :defaultField ?dftField" , 
+                                        "         :defaultField ?dftField ." ,
+                                        "  OPTIONAL {" ,
+                                        "    ?eMap :graphField ?graphField" ,
+                                        "  }",
                                         "}") ;
         ParameterizedSparqlString pss = new ParameterizedSparqlString(qs1) ;
         pss.setIri("eMap", root.getURI()) ;
@@ -83,7 +86,7 @@
         
         QuerySolution qsol1 = results.get(0) ;
         String entityField = qsol1.getLiteral("entityField").getLexicalForm() ;
-        
+        String graphField = qsol1.contains("graphField") ? qsol1.getLiteral("graphField").getLexicalForm() : null;
         String defaultField = qsol1.contains("dftField") ? qsol1.getLiteral("dftField").getLexicalForm() : null ;
         
         String qs2 = StrUtils.strjoinNL("SELECT * { ?map list:member [ :field ?field ; :predicate ?predicate ] }") ;
@@ -107,7 +110,7 @@
         }
         
         
-        EntityDefinition docDef = new EntityDefinition(entityField, defaultField) ;
+        EntityDefinition docDef = new EntityDefinition(entityField, defaultField, graphField) ;
         for ( String f : mapDefs.keys() ) {
             for ( Node p : mapDefs.get(f)) 
                 docDef.set(f, p) ;
Index: src/main/java/org/apache/jena/query/text/TextIndexLucene.java
===================================================================
--- src/main/java/org/apache/jena/query/text/TextIndexLucene.java	(revision 1546529)
+++ src/main/java/org/apache/jena/query/text/TextIndexLucene.java	(working copy)
@@ -23,6 +23,8 @@
 import java.util.Map.Entry ;
 
 import org.apache.lucene.analysis.Analyzer ;
+import org.apache.lucene.analysis.core.KeywordAnalyzer ;
+import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper ;
 import org.apache.lucene.analysis.standard.StandardAnalyzer ;
 import org.apache.lucene.document.Document ;
 import org.apache.lucene.document.Field ;
@@ -68,13 +70,21 @@
     private final EntityDefinition docDef ;
     private final Directory directory ;
     private IndexWriter indexWriter ;
-    private Analyzer analyzer = new StandardAnalyzer(VER);
+    private Analyzer analyzer ;
     
     public TextIndexLucene(Directory directory, EntityDefinition def)
     {
         this.directory = directory ;
         this.docDef = def ;
         
+        // create the analyzer as a wrapper that uses KeywordAnalyzer for
+        // entity and graph fields and StandardAnalyzer for all other
+        Map<String,Analyzer> analyzerPerField = new HashMap<String,Analyzer>() ;
+        analyzerPerField.put(def.getEntityField(), new KeywordAnalyzer()) ;
+        if (def.getGraphField() != null)
+            analyzerPerField.put(def.getGraphField(), new KeywordAnalyzer()) ;
+        this.analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(VER), analyzerPerField) ;
+        
         // force creation of the index if it don't exist
         // otherwise if we get a search before data is written we get an exception
         startIndexing();
@@ -136,6 +146,13 @@
         Document doc = new Document() ;
         Field entField = new Field(docDef.getEntityField(), entity.getId(), ftIRI) ;
         doc.add(entField) ;
+
+        String graphField = docDef.getGraphField() ;
+        if ( graphField != null )
+        {
+            Field gField = new Field(graphField, entity.getGraph(), ftIRI) ;
+            doc.add(gField) ;
+        }
         
         for ( Entry<String, Object> e : entity.getMap().entrySet() )
         {
Index: src/main/java/org/apache/jena/query/text/TextIndexSolr.java
===================================================================
--- src/main/java/org/apache/jena/query/text/TextIndexSolr.java	(revision 1546529)
+++ src/main/java/org/apache/jena/query/text/TextIndexSolr.java	(working copy)
@@ -87,6 +87,13 @@
     {
         SolrInputDocument doc = new SolrInputDocument() ;
         doc.addField(docDef.getEntityField(), entity.getId()) ;
+        
+        String graphField = docDef.getGraphField() ;
+        if ( graphField != null )
+        {
+            doc.addField(graphField, entity.getGraph()) ;
+        }
+        
         // the addition needs to be done as a partial update
         // otherwise, if we have multiple fields, each successive
         // addition will replace the previous one and we are left
Index: src/main/java/org/apache/jena/query/text/Entity.java
===================================================================
--- src/main/java/org/apache/jena/query/text/Entity.java	(revision 1546529)
+++ src/main/java/org/apache/jena/query/text/Entity.java	(working copy)
@@ -24,12 +24,20 @@
 public class Entity
 {
     private final String id ;
+    private final String graph ;
     private final Map<String, Object> map = new HashMap<String, Object>() ;
 
-    public Entity(String entityId)          { this.id = entityId ; }
+    public Entity(String entityId, String entityGraph) {
+        this.id = entityId ;
+        this.graph = entityGraph;
+    }
+
+    public Entity(String entityId)          { this(entityId, null) ; }
     
     public String getId()                   { return id ; }
 
+    public String getGraph()                { return graph ; }
+
     public void put(String key, Object value)
     { map.put(key, value) ; }
     
Index: src/main/java/org/apache/jena/query/text/TextDocProducerTriples.java
===================================================================
--- src/main/java/org/apache/jena/query/text/TextDocProducerTriples.java	(revision 1546529)
+++ src/main/java/org/apache/jena/query/text/TextDocProducerTriples.java	(working copy)
@@ -65,7 +65,8 @@
             return ;
 
         String x = (s.isURI() ) ? s.getURI() : s.getBlankNodeLabel() ;
-        Entity entity = new Entity(x) ;
+        String graph = (g.isURI() ) ? g.getURI() : g.getBlankNodeLabel() ;
+        Entity entity = new Entity(x, graph) ;
 
         if ( ! o.isLiteral() )
         {
Index: src/test/java/org/apache/jena/query/text/TestTextTDB.java
===================================================================
--- src/test/java/org/apache/jena/query/text/TestTextTDB.java	(revision 1546529)
+++ src/test/java/org/apache/jena/query/text/TestTextTDB.java	(working copy)
@@ -39,7 +39,7 @@
     private static Dataset create() {
         Dataset ds1 = TDBFactory.createDataset() ;
         Directory dir = new RAMDirectory() ;
-        EntityDefinition eDef = new EntityDefinition("iri", "text", RDFS.label) ;
+        EntityDefinition eDef = new EntityDefinition("iri", "text", null, RDFS.label) ;
         TextIndex tidx = new TextIndexLucene(dir, eDef) ;
         Dataset ds = TextDatasetFactory.create(ds1, tidx) ;
         return ds ;
Index: src/test/java/org/apache/jena/query/text/TestBuildTextDataset.java
===================================================================
--- src/test/java/org/apache/jena/query/text/TestBuildTextDataset.java	(revision 1546529)
+++ src/test/java/org/apache/jena/query/text/TestBuildTextDataset.java	(working copy)
@@ -110,7 +110,7 @@
         Dataset ds1 = DatasetFactory.createMem() ;
 
         // Define the index mapping
-        EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label.asNode()) ;
+        EntityDefinition entDef = new EntityDefinition("uri", "text", null, RDFS.label.asNode()) ;
 
         // Lucene, in memory.
         Directory dir = new RAMDirectory() ;

## Example of a TDB dataset and text index published using Fuseki

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

[] rdf:type fuseki:Server ;
   # Timeout - server-wide default: milliseconds.
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for rest of query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;
   # ja:loadClass "your.code.Class" ;

   fuseki:services (
     <#service_text_tdb>
   ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "ds" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  <#text_dataset> ;
    .

<#text_dataset> rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    ##text:index   <#indexSolr> ;
    text:index     <#indexLucene> ;
    .

<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    ##tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ;
    text:defaultField     "text" ;        ## Should be defined in the text:map.
    text:map (
         # rdfs:label            
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .
