jena-text : experimental module

Andy Seaborne Fri, 05 Apr 2013 04:20:01 -0700

There's a new module under /Experimental - jena-text.

This is a possible replacement for LARQ (whether to call it "LARQ2" or something else is for discussion).


== Example query

# text search on rdfs:label for occurrences of "word"
# then retrieve the actual value from the RDF data
PREFIX : <http://example/>
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
{ ?s text:query (rdfs:label 'word') ;
     rdfs:label ?label
}

== Example Fuseki config -- see end of message.

* works in Fuseki, with assembler setup, without the need for additional Java code.


* tracks additions to the dataset

* works with Lucene4, and with Solr4 for sharing
  the text index with non-SPARQL apps.

* incompatible with LARQ1 (and the property function is different).

* simpler and smaller index design

It's a complete rewrite and uses some new machinery to track changes to a dataset so the index is kept in step (if desired - there are different usage patterns).
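
As a rough illustration of the change-tracking idea (not the actual jena-text machinery), the sketch below wraps a toy dataset so that every addition is reported to registered listeners; an index could subscribe and stay in step. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical names throughout; this is not the jena-text API.
class TrackedDatasetSketch {
    /** A plain subject/predicate/object holder standing in for an RDF triple. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    private final List<Triple> triples = new ArrayList<>();
    private final List<Consumer<Triple>> addListeners = new ArrayList<>();

    /** Register a callback that is invoked for every triple added afterwards. */
    void onAdd(Consumer<Triple> listener) { addListeners.add(listener); }

    /** Store the triple, then notify listeners so an index can follow the change. */
    void add(String s, String p, String o) {
        Triple t = new Triple(s, p, o);
        triples.add(t);
        for (Consumer<Triple> l : addListeners) l.accept(t);
    }

    int size() { return triples.size(); }

    public static void main(String[] args) {
        String label = "http://www.w3.org/2000/01/rdf-schema#label";
        TrackedDatasetSketch ds = new TrackedDatasetSketch();
        List<String> indexed = new ArrayList<>();
        // Only react to the predicate that the entity map says should be indexed.
        ds.onAdd(t -> { if (t.p.equals(label)) indexed.add(t.s); });
        ds.add("http://example/a", label, "a word");
        ds.add("http://example/a", "http://example/other", "ignored");
        System.out.println(indexed);
    }
}
```

An application that prefers to build the index in bulk would simply not register a listener, which is one of the different usage patterns mentioned above.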

The core design is that the index is only an index. It answers text searches with a list of URIs. Unlike LARQ1, there aren't multiple modes, and the literal indexed is not stored in the index itself. Only indexing information and the URI are in the index; if the app wants to find the data that led to an index hit, it retrieves it from the RDF data, as in the example query.
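
The "index is only an index" design can be pictured with a small stand-alone sketch. It does not use Lucene or Jena; the class, its methods and the naive tokenizer are assumptions made for illustration. The point it shows is that the index maps terms to URIs only, and the literal is fetched from the data afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; a real index would use a Lucene analyzer, not split().
class TextIndexSketch {
    // term -> URIs. The indexed literal itself is deliberately not kept.
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Index the terms of a literal against the URI of its subject. */
    void indexLiteral(String uri, String literal) {
        for (String term : literal.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(uri);
            }
        }
    }

    /** Answer a text search with a list of URIs, nothing more. */
    List<String> query(String word) {
        Set<String> hits = index.getOrDefault(word.toLowerCase(), Collections.<String>emptySet());
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        // Stands in for the RDF data: subject URI -> rdfs:label value.
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("http://example/a", "A word here");
        labels.put("http://example/b", "Nothing relevant");

        TextIndexSketch idx = new TextIndexSketch();
        for (Map.Entry<String, String> e : labels.entrySet()) {
            idx.indexLiteral(e.getKey(), e.getValue());
        }
        // The index yields URIs; the label is then looked up in the data.
        for (String uri : idx.query("word")) {
            System.out.println(uri + " -> " + labels.get(uri));
        }
    }
}
```

The second lookup in `main` plays the role of the `rdfs:label ?label` triple pattern in the example query.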

Currently, it does not expose the score - the real requirement for that, we found, is to retain ordering in text search results: score is a partial solution to that (two hits can have the same score). There is an included patch from an earlier version checked into SVN. (An alternative is to add a "row id" variable to the results.)
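
To make the ordering point concrete, here is a small sketch of the "row id" alternative. It is only an illustration with hypothetical names, not the included patch: each hit carries the position at which the index returned it, so the original order can be restored even when scores tie.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "row id" idea; not part of jena-text.
class RowIdOrderingSketch {
    /** One text-search hit: the URI, its score, and its position in the result list. */
    static final class Hit {
        final String uri;
        final double score;
        final int rowId;
        Hit(String uri, double score, int rowId) {
            this.uri = uri;
            this.score = score;
            this.rowId = rowId;
        }
    }

    /** Recover the index's original ordering from hits that arrive in any order. */
    static List<String> restoreOrder(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingInt((Hit h) -> h.rowId));
        List<String> uris = new ArrayList<>();
        for (Hit h : copy) {
            uris.add(h.uri);
        }
        return uris;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        // Two hits share a score, so the score alone cannot say which came first.
        hits.add(new Hit("http://example/b", 1.0, 1));
        hits.add(new Hit("http://example/c", 0.5, 2));
        hits.add(new Hit("http://example/a", 1.0, 0));
        System.out.println(restoreOrder(hits));
    }
}
```

Sorting by score would leave the relative order of the two tied hits undetermined, whereas the row id gives a total order.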


While it works, it is not ready yet:

* documentation in text-query.mdtext needs completing.

* not tested heavily at scale (at some point, a better bulk loader and integration with the TDB loader would be good - not a block on a first release).


* needs examples

* machinery for change tracking and graph views of datasets is general purpose and needs to migrate to the proper module.


* needs tidying up

Many thanks to Brian McBride (Epimorphics) who has contributed testing, bug fixes and generally made it better.


Epimorphics has agreed to contribute this to Apache.

        Andy



## Example of a TDB dataset and text index published using Fuseki

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

[] rdf:type fuseki:Server ;
   fuseki:services (
     <#service_text_tdb>
   ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "ds" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  <#text_dataset> ;
    .

<#text_dataset> rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    ##text:index   <#indexSolr> ;
    text:index     <#indexLucene> ;
    .

<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ; ## Must be defined in the text:map
    text:map (
         # rdfs:label
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .
