Hi,
I'm reposting the message below from the users mailing list, as this
seems to be a more appropriate place to submit new patches.
I'd like to add support to jena-text to store the named graph (URI) of
the indexed triples, to get faster text query performance when the query
is intended for only one named graph.
The attached patch adds this information to the index. What is missing
is proper support for actually using the graph information at query time
- I had some problems implementing that, as detailed in my message below.
Any comments are very welcome!
Best regards
Osma Suominen
-------- Original Message --------
Subject: Re: jena-text limit by language and/or named graph
Date: Fri, 29 Nov 2013 14:02:32 +0200
From: Osma Suominen <[email protected]>
To: [email protected]
Hi Andy!
Should this be per map entry / per predicate? I don't know which is
best - whether an index-wide configuration or whether it might be
that some predicates are indexed one way and some another.
For now, I think this can be global, i.e. not possible to set per
predicate.
(and if there is no lang, presumably "").
Probably yes, though I'll defer the lang discussion for now and
concentrate on getting the graph information into the index first
because that is more critical for me - I have dozens of graphs, but only
a few languages in each graph.
Sounds sane.
Great!
What would the query predicate in SPARQL look like?
For the graph part, I think there is no need to introduce any new
syntax. Simply having the text:query within the context of a specific
graph should be enough, i.e. this should work:
GRAPH <http://example.com/mygraph> {
?s text:query "keyword" .
}
For the language part, I'm not so sure, but I'll defer the discussion
for now.
If it all defaults back to the current mode of operations, we have a
non-disruptive upgrade path, which would be better if possible. It's a
change of disk format, which is always more of an issue for existing
use.
Yes, that is my intent, to not disrupt existing use in any way.
Attached is a first draft patch which is my attempt at adding graph
information to the index, iff graphField has been set in the config
file, as in the attached config file.
With this patch, you can use a query such as this:
SELECT ?s {
?s text:query '+res* +graph:"http\\://example.com/graphA"' .
}
and you will only get results from within the specified graph. This is
obviously a bit awkward since you have to know the name of the graph
field, and also the URI quoting is ugly. But at least it proves that the
graph information was successfully stored in the index and can be used
for retrieval.
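
To show where the awkward quoting comes from, here is a minimal sketch of a helper that builds the query string above. GraphQueryBuilder and its methods are hypothetical names, not part of jena-text or the patch, and the set of characters escaped (backslash, double quote, colon) is only an assumption that reproduces the example; it is not a complete Lucene escaping routine.

```java
/** Hypothetical helper for illustration only; not part of jena-text. */
final class GraphQueryBuilder {
    private GraphQueryBuilder() {}

    /** Prefixes backslash, double quote and colon with a backslash (assumed subset). */
    static String escapeForLucene(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '"' || c == ':') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Appends a required clause on the graph field to the user's query terms. */
    static String luceneQuery(String terms, String graphField, String graphUri) {
        return terms + " +" + graphField + ":\"" + escapeForLucene(graphUri) + "\"";
    }

    /** Wraps a string as a single-quoted SPARQL literal, escaping \ and '. */
    static String sparqlSingleQuoted(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    public static void main(String[] args) {
        String lucene = luceneQuery("+res*", "graph", "http://example.com/graphA");
        // Prints the triple pattern with the same literal as in the query above.
        System.out.println("?s text:query " + sparqlSingleQuoted(lucene) + " .");
    }
}
```

The backslash is doubled because the string is escaped twice: once for the Lucene query syntax and once more for the SPARQL string literal.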
However, I couldn't figure out how to get the URI of the current graph
at query time so that an explicit "graph:" query part could be avoided.
An ExecutionContext is passed to TextQueryPF methods and it has a
getActiveGraph() method which looks promising. But neither the Graph
interface nor the GraphBase implementation seems to be aware of the URI
(or Node in general) they are identified by. The only (possible,
untested) way that I could think of would be to also call
ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
the result matches the Graph that getActiveGraph() returned. But this
seems awfully inefficient, especially if there are lots of graphs. Is
there a better way to find out the URI of the current graph within
TextQueryPF methods?
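
For concreteness, the linear scan described above could look roughly like the sketch below. It is written against generic stand-in types instead of the real Jena classes: the iterator plays the role of DatasetGraph.listGraphNodes(), the function plays the role of DatasetGraph.getGraph(node), and ActiveGraphLookup and findGraphName are hypothetical names. Whether comparing the returned graph objects with equals() is reliable in Jena is an assumption I have not tested.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch with stand-in types; not Jena code. */
final class ActiveGraphLookup {
    private ActiveGraphLookup() {}

    /**
     * Scans all graph names and returns the first one whose graph equals
     * the active graph. Cost grows linearly with the number of graphs.
     */
    static <N, G> Optional<N> findGraphName(Iterator<N> graphNames,
                                            Function<N, G> getGraph,
                                            G activeGraph) {
        while (graphNames.hasNext()) {
            N name = graphNames.next();
            if (Objects.equals(getGraph.apply(name), activeGraph)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A map stands in for the dataset: graph name -> graph object.
        Map<String, Object> dataset = new LinkedHashMap<>();
        Object graphA = new Object();
        Object graphB = new Object();
        dataset.put("http://example.com/graphA", graphA);
        dataset.put("http://example.com/graphB", graphB);

        Optional<String> name =
            findGraphName(dataset.keySet().iterator(), dataset::get, graphB);
        System.out.println(name.orElse("(no named graph matched)"));
    }
}
```

Every text query would pay for one getGraph call per named graph in the worst case, which is why this does not look attractive with dozens of graphs.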
Finally some misc notes:
- TextDocProducerEntities seems to be unused - not touched
- TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
- TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
when you could directly create a Query programmatically - not touched
- I think get$ was broken anyway because it doesn't take into account
that the query is tokenized by StandardAnalyzer - but this should now
be fixed as a side effect of using PerFieldAnalyzerWrapper
- I made similar changes in TextIndexSolr as in TextIndexLucene, but
have so far tested only the Lucene part
-Osma