There's a new module under /Experimental - jena-text.

This is a possible replacement for LARQ (whether to call it "LARQ2" or something else is for discussion).

== Example query

# text search on rdfs:label for occurrences of "word"
# then retrieve the actual value from the RDF data
PREFIX : <http://example/>
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
{ ?s text:query (rdfs:label 'word') ;
     rdfs:label ?label
}

== Example Fuseki config -- see end of message.


* works in Fuseki, with assembler setup, without the need for additional Java code.

* tracks additions to the dataset

* works with Lucene4, and with Solr4 for sharing
  the text index with non-SPARQL apps.

* incompatible with LARQ1 (and the property function is different).

* simpler and smaller index design

It's a complete rewrite and uses some new machinery to track changes to a dataset so the index is kept in step (if desired - there are different usage patterns).

The core design is that the index is only an index. It answers text searches with a list of URIs. Unlike LARQ1, there aren't multiple modes, and the literal indexed is not stored in the index itself. Only indexing information and the URI are in the index; if the app wants to find the data that led to an index hit, it has to get it from the RDF data, as the example query above does.
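To make the index-only design concrete, here is a toy sketch in plain Python. It is not the jena-text API: the class `ToyTextDataset` and its methods are hypothetical names invented for illustration, and the tokenizer is a crude regex rather than a Lucene analyzer. It only shows the shape of the idea: adding data updates the index, the index holds URIs but no literals, and the label is fetched from the data after a hit.

```python
import re
from collections import defaultdict

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class ToyTextDataset:
    """Toy stand-in for a dataset paired with a text index (hypothetical, not jena-text)."""

    def __init__(self, indexed_predicates):
        self.triples = set()            # the RDF data: (subject, predicate, literal)
        self.index = defaultdict(set)   # (predicate, token) -> subject URIs only
        self.indexed_predicates = set(indexed_predicates)

    def add(self, s, p, o):
        """Add a triple and keep the index in step with the data."""
        self.triples.add((s, p, o))
        if p in self.indexed_predicates:
            for token in re.findall(r"\w+", o.lower()):
                self.index[(p, token)].add(s)

    def text_query(self, p, word):
        """Answer a text search with URIs; the literal is not kept in the index."""
        return sorted(self.index.get((p, word.lower()), set()))

    def values(self, s, p):
        """Go back to the data for the literals behind an index hit."""
        return sorted(o for (s2, p2, o) in self.triples if s2 == s and p2 == p)


ds = ToyTextDataset([RDFS_LABEL])
ds.add("http://example/a", RDFS_LABEL, "A word in a label")
ds.add("http://example/b", RDFS_LABEL, "Nothing relevant here")

# Step 1: the index returns URIs. Step 2: the data supplies the labels.
hits = ds.text_query(RDFS_LABEL, "word")
labels = {s: ds.values(s, RDFS_LABEL) for s in hits}
```

The two steps at the end mirror the two triple patterns in the example query: the text search finds `?s`, and the ordinary `rdfs:label` pattern retrieves the value.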

Currently, it does not expose the score. The real requirement we found behind that is retaining the ordering of text search results, and score is only a partial solution (two hits can have the same score). There is an included patch from an earlier version checked into SVN. (An alternative is to add a "row id" variable to the results.)

While it works, it is not ready yet:

* documentation in text-query.mdtext needs completing.

* not tested heavily at scale (at some point, a better bulk loader and integration with the TDB loader would be good - not a blocker for a first release).

* needs examples

* machinery for change tracking and graph views of datasets is general purpose and needs to migrate to the proper module.

* needs tidying up

Many thanks to Brian McBride (Epimorphics), who has contributed testing and bug fixes, and generally made it better.

Epimorphics has agreed to contribute this to Apache.

        Andy



## Example of a TDB dataset and text index published using Fuseki

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

[] rdf:type fuseki:Server ;
   fuseki:services (
     <#service_text_tdb>
   ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "ds" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  <#text_dataset> ;
    .

<#text_dataset> rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    ##text:index   <#indexSolr> ;
    text:index     <#indexLucene> ;
    .

<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ; ## Must be defined in the text:map
    text:map (
         # rdfs:label
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .
