Re: jena-text limit by named graph (and language?)

Osma Suominen Wed, 04 Dec 2013 05:44:18 -0800

Hi Andy!

Thanks for your comments. I already kind of guessed that 1/ is the problem here.

As I see it, this is a performance optimization, and it is fine not to use the graph information in the index in difficult cases. So e.g. if there are multiple graphs, this part of the query can be omitted and the index may then return hits from any graph (as it currently always does).

So my question is: if we assume that we're dealing with TDB graphs, and the SPARQL pattern limits the context to a single graph URI (as e.g. <http://example.com/mygraph> in the example below), how can the text:search property function know that and find out the graph URI?

I didn't quite understand 2/ as I don't know what quadding means in this context, but as I understood your comment, this is not a problem for the property function?


-Osma

On 04.12.2013 15:16, Andy Seaborne wrote:

Osma,

Good to see the patch - sorry I missed it on users@ - I was quite busy
at the end of last week.

There are two reasons why you can't get the graph name from the graph:

1/ Graphs might have more than one name - i.e. be in the dataset, or
another dataset, multiple times.

Graphs from TDB do know their name - they are views on the dataset.

2/ Quads.  When flattened to quads, the idea of a current graph is undefined.

At first glance, it looks quite easy to add the current graph name when
not quadded.  Property functions don't get tangled with quads.

However, the big question is which is best - whether no graph means
index wide (cf. unionDefaultGraph) or the current graph.  I don't know.

     Andy

On 04/12/13 10:09, Osma Suominen wrote:

Hi,

I'm reposting the below message from the users mailing list as this
seems to be a more appropriate place to submit new patches.

I'd like to add support to jena-text to store the named graph (URI) of
the indexed triples, to get faster text query performance when the query
is intended for only one named graph.

The attached patch adds this information to the index. What is missing
is proper support for actually using the graph information at query time
- I had some problems implementing that, as detailed in my message below.

Any comments are very welcome!

Best regards
Osma Suominen


-------- Original Message --------
Subject: Re: jena-text limit by language and/or named graph
Date: Fri, 29 Nov 2013 14:02:32 +0200
From: Osma Suominen <[email protected]>
To: [email protected]

Hi Andy!

Should this be per map entry / per predicate?  I don't know which is
best - whether an index-wide configuration or whether it might be
that some predicates are indexed one way and some another.


For now, I think this can be global, i.e. not possible to set per
predicate.

(and if there is no lang, presumably "") .


Probably yes, though I'll defer the lang discussion for now and
concentrate on getting the graph information into the index first
because that is more critical for me - I have dozens of graphs, but only
a few languages in each graph.

Sounds sane.


Great!

What would the query predicate in SPARQL look like?


For the graph part, I think there is no need to introduce any new
syntax. Simply having the text:query within the context of a specific
graph should be enough, i.e. this should work:

GRAPH <http://example.com/mygraph> {
    ?s text:query "keyword" .
}

For the language part, I'm not so sure, but I'll defer the discussion
for now.

If it all defaults back to the current mode of operations, we have a
non-disruptive upgrade path, which would be better if possible.  It's a
change of disk-format, which is always more of an issue for existing
use.


Yes, that is my intent, to not disrupt existing use in any way.

Attached is a first draft patch which is my attempt at adding graph
information to the index, iff graphField has been set in the config
file, as in the attached config file.

With this patch, you can use a query such as this:

SELECT ?s {
    ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
}

and you will only get results from within the specified graph. This is
obviously a bit awkward since you have to know the name of the graph
field, and also the URI quoting is ugly. But at least it proves that the
graph information was successfully stored in the index and can be used
for retrieval.
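
To make the quoting concrete, here is a minimal sketch of how such a query literal could be assembled. The class and method names are hypothetical and not part of jena-text; the field name "graph" is whatever graphField is set to in the config. It assumes that inside a quoted Lucene phrase only the backslash and double quote need escaping, and it escapes ':' as well, purely to reproduce the string above. The search terms are passed through as raw Lucene syntax.

```java
/** Hypothetical helper for illustration only; not part of jena-text. */
final class GraphFilterQuery {
    private GraphFilterQuery() {}

    /**
     * Backslash-escapes '\' and '"' (needed inside a quoted Lucene phrase)
     * and ':' (assumption: done only to match the example in this message).
     */
    static String escapeForLucene(String uri) {
        StringBuilder sb = new StringBuilder();
        for (char c : uri.toCharArray()) {
            if (c == '\\' || c == '"' || c == ':') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Escapes backslashes and single quotes for a SPARQL '...' string literal. */
    static String escapeForSparql(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }

    /**
     * Builds the single-quoted SPARQL literal for text:query.
     * 'terms' is inserted as-is, so the caller is responsible for its Lucene syntax.
     */
    static String textQueryLiteral(String terms, String graphField, String graphUri) {
        String lucene = terms + " +" + graphField + ":\"" + escapeForLucene(graphUri) + "\"";
        return "'" + escapeForSparql(lucene) + "'";
    }
}
```

Calling GraphFilterQuery.textQueryLiteral("+res*", "graph", "http://example.com/graphA") yields the literal used in the SELECT query above, including the doubled backslash that SPARQL needs.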

However, I couldn't figure out how to get the URI of the current graph
at query time so that an explicit "graph:" query part could be avoided.

An ExecutionContext is passed to TextQueryPF methods and it has a
getActiveGraph() method which looks promising. But neither the Graph
interface nor the GraphBase implementation seem to be aware of the URI
(or Node in general) they are identified by. The only (possible,
untested) way that I could think of would be to also call
ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
the result matches the Graph that getActiveGraph() returned. But this
seems awfully inefficient, especially if there are lots of graphs. Is
there a better way to find out the URI of the current graph within
TextQueryPF methods?
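
To illustrate why this scan worries me, here is a rough model of it in plain Java. It does not use the Jena API: the dataset is stood in for by a Map from graph URI to graph object, and the class name and the example URIs are made up. It also assumes that the same graph object is returned for the same name, so that an identity comparison works; a real store may hand out a fresh view on each call, in which case the comparison would need to be something else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Stand-in model of the "scan all graph names" idea; not Jena code. */
final class GraphNameScan {
    private GraphNameScan() {}

    /**
     * Returns every name under which activeGraph is registered.
     * The cost is linear in the number of named graphs, and the result
     * may contain zero, one, or several names.
     */
    static <G> List<String> namesOf(Map<String, G> namedGraphs, G activeGraph) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, G> entry : namedGraphs.entrySet()) {
            // Identity comparison: assumes the dataset hands back the same object.
            if (entry.getValue() == activeGraph) {
                names.add(entry.getKey());
            }
        }
        return names;
    }
}
```

Besides the linear cost on every text:query call, the scan can return more than one name when the same graph is registered several times, and then it is unclear which URI to put in the index query.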

Finally some misc notes:
- TextDocProducerEntities seems to be unused - not touched
- TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
- TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
    when you could directly create a Query programmatically - not touched
- I think get$ was broken anyway because it doesn't take into account
    that the query is tokenized by StandardAnalyzer - but this should now
    be fixed as a side effect of using PerFieldAnalyzerWrapper
- I made similar changes in TextIndexSolr as in TextIndexLucene, but
    have so far tested only the Lucene part

-Osma



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
