If I might throw my 2 cents into the mix...

In dotNetRDF's most recent release (two weeks ago) we added the ability to link a dataset to a full-text index automatically and keep that index in sync with changes to the dataset. My approach was to use the decorator pattern: a base decorator [1] is simply an implementation of our dataset interface which passes all calls through to the underlying dataset, and a second decorator [2] extends that base class, adding logic to intercept the calls that alter the dataset so that it updates the index as well as passing each call through.

Since all updates go through the dataset interface, this lets us catch every update and keep the full-text index current. Whether this is applicable to Jena depends on whether all updates there go through a single dataset interface, which is a part of the code base I am not so familiar with.
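
For illustration, a minimal Java sketch of the decorator idea (the Dataset, Quad and FullTextIndex types below are hypothetical stand-ins, not dotNetRDF's actual C# API; see [1] and [2] for the real thing):

    interface Dataset {
        void add(Quad q);
        void delete(Quad q);
        boolean contains(Quad q);
    }

    record Quad(String g, String s, String p, String o) {}

    interface FullTextIndex {
        void index(Quad q);
        void unindex(Quad q);
    }

    // Base decorator: passes every call through to the wrapped dataset.
    class WrapperDataset implements Dataset {
        protected final Dataset inner;
        WrapperDataset(Dataset inner) { this.inner = inner; }
        public void add(Quad q)         { inner.add(q); }
        public void delete(Quad q)      { inner.delete(q); }
        public boolean contains(Quad q) { return inner.contains(q); }
    }

    // Indexing decorator: intercepts only the mutating calls.
    class FullTextIndexedDataset extends WrapperDataset {
        private final FullTextIndex index;
        FullTextIndexedDataset(Dataset inner, FullTextIndex index) {
            super(inner);
            this.index = index;
        }
        @Override public void add(Quad q)    { index.index(q);   super.add(q); }
        @Override public void delete(Quad q) { index.unindex(q); super.delete(q); }
    }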

Rob

[1] http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/core/Query/Datasets/WrapperDataset.cs?revision=2157&view=markup
[2] http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/query.fulltext/Datasets/FullTextIndexedDataset.cs?revision=2157&view=markup

On 3/6/12 10:08 AM, Paolo Castagna wrote:
Hi Alexander,
thank you for sharing the details of data.ox.ac.uk with us and pointing
me at https://github.com/oucs/humfrey (who needs documentation when you
have the source code? ;-)).

I have been thinking about pros/cons of having a custom/additional index
coupled with a TDB dataset and keeping it up-to-date (and/or in sync with
TDB).

I see two approaches:

  1. internal to Jena
      - pros
         - simplicity for users, it works out of the box
         - ...
      - cons
         - it requires an internal notification sub-system (which we have,
           but it does not cover all possible update paths and it might
           impact performance)
         - it might create the expectation that indexes will never go out
           of sync (when in fact they can)
         - ...
  2. external to Jena
      - pros
         - relatively easy to implement assuming SPARQL and external index
           APIs (see the sketch after this list)
         - ...
      - cons
         - it requires an additional service
         - it isn't simple (or possible) with certain SPARQL Update requests
         - ...
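
To make option 2 concrete, a rough Java sketch (the endpoint URLs, the query and the index document layout are all assumptions, and real code would JSON-escape the values properly):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;

    public class ExternalReindex {
        public static void main(String[] args) throws Exception {
            String sparql = "SELECT ?s ?label WHERE { ?s " +
                    "<http://www.w3.org/2000/01/rdf-schema#label> ?label }";
            HttpClient http = HttpClient.newHttpClient();
            // Query the store over SPARQL ...
            try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                    "http://localhost:3030/ds/query", sparql)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    QuerySolution row = rs.next();
                    String doc = String.format("{\"uri\":\"%s\",\"label\":\"%s\"}",
                            row.getResource("s").getURI(),
                            row.getLiteral("label").getString());
                    // ... and push each row into the external index via its REST API.
                    http.send(HttpRequest.newBuilder(
                                    URI.create("http://localhost:9200/resources/_doc"))
                            .header("Content-Type", "application/json")
                            .POST(HttpRequest.BodyPublishers.ofString(doc))
                            .build(), HttpResponse.BodyHandlers.ofString());
                }
            }
        }
    }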

Re: sanity/insanity, I don't comment.

We have a system which intercepts all update requests and puts them into a
sort of key-value store on S3, with a cache in front of it. A
queuing/messaging system takes changes from there and applies them to
replicas on different nodes. Nodes can be of different types: RDF stores,
free-text indexes, etc. In this scenario, update requests cannot be
unconstrained SPARQL queries, but you can replay updates and apply them
to different types of nodes/indexes. Some of the stuff is available here:
https://github.com/talis/
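
The replay idea, reduced to a hypothetical sketch (none of these names come from the Talis code; constrained change records stand in for arbitrary SPARQL):

    import java.util.List;

    // A change record constrained enough to be replayable anywhere.
    record Change(String op, String graph, String s, String p, String o) {}

    // Each node type applies a change in its own way: an RDF store adds or
    // removes the quad, a free-text index (re)indexes the literal, etc.
    interface ReplicaNode {
        void apply(Change c);
    }

    class Replayer {
        private final List<ReplicaNode> nodes;
        Replayer(List<ReplicaNode> nodes) { this.nodes = nodes; }

        // Drain the queue, applying every change to every node type.
        void replay(Iterable<Change> queue) {
            for (Change c : queue)
                for (ReplicaNode n : nodes)
                    n.apply(c);
        }
    }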

I imagine it is the same for you: it's not something you can just
download, unzip and run as-is. That sort of simplicity is, IMHO, not to
be underestimated, and it is what drives me towards option 1 above.

Knowing what others are doing is certainly useful for better understanding
what's needed.

Thanks,
Paolo

PS:
I am not going to ask you what you do for: monitoring, backups,
high-availability, load balancing, etc. ;-)

Alexander Dutton wrote:
Hi Paolo,

On 06/03/12 14:56, Paolo Castagna wrote:
Alexander Dutton wrote:
This is the way we're going with our site, data.ox.ac.uk. After
each update to the triplestore we'll regenerate an ElasticSearch
index from a SPARQL query. […]
interesting...

How do you update your triplestore (SPARQL Update, Jena APIs via
custom code, manually from command line, ...)?
Our administration interface handles grabbing data from elsewhere and
transforming it in various ways, then uses the graph store HTTP protocol
to push it into Fuseki. Once that's done it fires off a notification on a
redis pubsub channel to say "this update just completed".
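
Rendered as a minimal Java sketch for consistency with the rest of this thread (humfrey itself is Python; the endpoint URLs, channel name and message format are assumptions):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import redis.clients.jedis.Jedis;

    public class PushAndNotify {
        public static void main(String[] args) throws Exception {
            String graph = "http://data.example.org/graph/organisations";
            String turtle = "<http://example/org1> a <http://www.w3.org/ns/org#Organization> .";

            // PUT the graph into Fuseki via the graph store HTTP protocol.
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create("http://localhost:3030/ds/data?graph="
                                    + URLEncoder.encode(graph, StandardCharsets.UTF_8)))
                            .header("Content-Type", "text/turtle")
                            .PUT(HttpRequest.BodyPublishers.ofString(turtle))
                            .build(),
                    HttpResponse.BodyHandlers.ofString());

            // Only announce the update once the store has accepted it.
            if (resp.statusCode() / 100 == 2) {
                try (Jedis redis = new Jedis("localhost")) {
                    redis.publish("updates", graph);
                }
            }
        }
    }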

There's then something that listens on the relevant channel which will
perform the ElasticSearch update. (There are other things that handle
uploading dataset metadata to thedatahub, and archiving datasets for
bulk download).
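
And the listening side, under the same assumptions (rebuildElasticSearchIndex() is a hypothetical placeholder):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPubSub;

    public class UpdateListener {
        public static void main(String[] args) {
            try (Jedis redis = new Jedis("localhost")) {
                // subscribe() blocks, invoking onMessage for each notification.
                redis.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String graph) {
                        rebuildElasticSearchIndex(graph);
                    }
                }, "updates");
            }
        }

        static void rebuildElasticSearchIndex(String graph) {
            // Hypothetical: SPARQL the affected graph, POST JSON documents to ES.
            System.out.println("re-indexing " + graph);
        }
    }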

There's code at https://github.com/oucs/humfrey, but it's a bit of a
nightmare to set up and (surprise, surprise) lacks documentation. The
ElasticSearch stuff is still in development on the elasticsearch branch.
At some point I'll find the time to make it easier to install and create
a demo site. (As you may have noticed, the whole thing is an eclectic
mix of technologies: Django, ElasticSearch, redis, PostgreSQL, Apache
httpd…)

We (still) have two related JIRA 'issues':

- LARQ needs to update the Lucene index when a SPARQL Update request
is received https://issues.apache.org/jira/browse/JENA-164

- Refactor LARQ so that it becomes easy to plug in different indexes
  such as Solr or ElasticSearch instead of Lucene (one possible shape is
  sketched below) https://issues.apache.org/jira/browse/JENA-17
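
Purely as a strawman for JENA-17, not an existing LARQ API:

    import java.util.Iterator;

    // A pluggable indexer SPI that LARQ could program against, with Lucene,
    // Solr and ElasticSearch each supplying an implementation. Hypothetical.
    interface TextIndexer {
        void index(String subjectUri, String text);    // a literal was added
        void unindex(String subjectUri, String text);  // a literal was removed
        Iterator<String> search(String query);         // subject URIs matching a text query
    }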

I am still unclear how to intercept all the possible update routes
(e.g. SPARQL Update, APIs, bulk loaders, etc.).
Our approach is to limit the ways in which updates can happen (i.e.
things will become inconsistent if an update doesn't happen through our
admin interface). This obviously doesn't work in the general case, but could
be a useful half-way house (e.g. say "'INSERT … WHERE …' will leave you
with a stale index. If you care, use 'CONSTRUCT' and the graph store
protocol instead").

But, I think it would be useful to allow people to use Apache Solr
and/or ElasticSearch indexes (and/or other custom indexes) and keep
those up-to-date when changes come in.
For external indexes presumably you either need something that gets
hooked into the JVM and listens for updates there, or a way to push
notifications to external applications/services when things happen.
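
For the in-JVM route, Jena graphs do expose an event manager; a sketch using its GraphListener hook (pushToIndex() is a hypothetical stand-in for notifying the external service):

    import java.util.Iterator;
    import java.util.List;
    import org.apache.jena.graph.Graph;
    import org.apache.jena.graph.GraphListener;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;

    public class IndexNotifyingListener implements GraphListener {
        public void notifyAddTriple(Graph g, Triple t)    { pushToIndex("add", t); }
        public void notifyDeleteTriple(Graph g, Triple t) { pushToIndex("delete", t); }

        // The bulk notifications just fan out to the single-triple hooks.
        public void notifyAddArray(Graph g, Triple[] ts)  { for (Triple t : ts) notifyAddTriple(g, t); }
        public void notifyAddList(Graph g, List<Triple> ts) { ts.forEach(t -> notifyAddTriple(g, t)); }
        public void notifyAddIterator(Graph g, Iterator<Triple> it) { it.forEachRemaining(t -> notifyAddTriple(g, t)); }
        public void notifyAddGraph(Graph g, Graph added) {
            added.find(Node.ANY, Node.ANY, Node.ANY).forEachRemaining(t -> notifyAddTriple(g, t));
        }
        public void notifyDeleteArray(Graph g, Triple[] ts) { for (Triple t : ts) notifyDeleteTriple(g, t); }
        public void notifyDeleteList(Graph g, List<Triple> ts) { ts.forEach(t -> notifyDeleteTriple(g, t)); }
        public void notifyDeleteIterator(Graph g, Iterator<Triple> it) { it.forEachRemaining(t -> notifyDeleteTriple(g, t)); }
        public void notifyDeleteGraph(Graph g, Graph removed) {
            removed.find(Node.ANY, Node.ANY, Node.ANY).forEachRemaining(t -> notifyDeleteTriple(g, t));
        }
        public void notifyEvent(Graph source, Object value) { /* e.g. start/finish markers */ }

        private void pushToIndex(String op, Triple t) {
            // Hypothetical: serialise the triple and notify the indexer here.
            System.out.println(op + " " + t);
        }
    }
    // Registration: model.getGraph().getEventManager().register(new IndexNotifyingListener());

Note this only sees changes made through that Graph in the same JVM (not, say, a TDB bulk load), which is exactly the coverage problem Paolo mentions above.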

What do you store in ElasticSearch?
Technically, nothing yet, as I'm still implementing it ;-). Once it's
implemented it'll build indexes tailored to the types of modelling
patterns we expect to have in the store. For example, we might SPARQL
for organisations like <http://is.gd/gsc1Zs> and for each create a chunk
of JSON to feed into ElasticSearch. Targets for indexing so far include
organisations, people, vacancies, courses, and equipment. We'll add more
indexes as we add new types of things.


All the best,

Alex

PS. I'd be interested to know whether our approach is generally
considered sane…
