Hi,
I'm investigating a performance problem we encountered [1] when updating
from Fuseki to 3.8.0. A very frequently executed query generated by
Skosmos is suddenly taking 3-4 orders of magnitude longer than it used
to. I narrowed it down to a combination of VALUES being used with two
nested FILTER NOT EXISTS blocks - don't ask why the query does this, but
this sort of "double negative" clause is necessary for the application
to get the exact results it needs.
(Well you'll ask anyway. It has to do with presenting information about
the narrower concepts for a SKOS concept, grouped by "arrays"
represented using SKOS collections. For example under the concept
"milk", you could have an array <milk by source animal> and the narrower
concept "goat milk" would be shown within that array. But you don't want
to show an unrelated array <goat products>, even though "goat milk" is
also in that array, because it's intended to be used under the concept
"products" even though that is not explicitly stated in the data and has
to be inferred by the fact that it contains concepts such as "goat fur"
that are not narrower concepts of "milk". Phew.)
I simplified the query as much as I could and with this short version,
the difference is not quite as dramatic - around 5 ms vs 1.5 seconds
when executed via Fuseki and all caches are warm - but still noticeable.
(The original query went from <300ms to several minutes or even an
hour.) I tested different Jena versions, and it seems the change
happened somewhere between 3.7.0 and 3.8.0.
This is the simplified query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
FROM <http://www.yso.fi/onto/yso/>
WHERE {
{
?uri skos:narrower ?o .
OPTIONAL {
?x skos:member ?o .
FILTER NOT EXISTS {
?x skos:member ?other .
FILTER NOT EXISTS {
?other skos:broader ?uri
}
}
}
}
}
VALUES (?uri) {
(<http://www.yso.fi/onto/yso/p1251>)
}
When executed on YSO data [2] loaded into TDB,
- using Jena 3.7.0:
$ bin/tdbquery --loc=/tmp/koko --query minimal-slow.rq --time
--------------------------------------------------------------------------------
| uri | o
| x |
================================================================================
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p7098> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p15215> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p25061> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p25358> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p1247> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p10265> | |
--------------------------------------------------------------------------------
Time: 0,156 sec
$ bin/qparse --query minimal-slow.rq --print=opt
(prefix ((skos: <http://www.w3.org/2004/02/skos/core#>))
(sequence
(table (vars ?uri)
(row [?uri <http://www.yso.fi/onto/yso/p1251>])
)
(conditional
(bgp (triple ?uri skos:narrower ?o))
(filter (notexists
(filter (notexists (bgp (triple ?other skos:broader
?uri)))
(bgp (triple ?x skos:member ?other))))
(bgp (triple ?x skos:member ?o))))))
- using Jena 3.9.0:
$ bin/tdbquery --loc=/tmp/koko --query minimal-slow.rq --time
--------------------------------------------------------------------------------
| uri | o
| x |
================================================================================
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p7098> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p15215> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p25061> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p25358> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p1247> | |
| <http://www.yso.fi/onto/yso/p1251> |
<http://www.yso.fi/onto/yso/p10265> | |
--------------------------------------------------------------------------------
Time: 2,249 sec
$ bin/qparse --query minimal-slow.rq --print=opt
(prefix ((skos: <http://www.w3.org/2004/02/skos/core#>))
(sequence
(conditional
(bgp (triple ?uri skos:narrower ?o))
(filter (notexists
(filter (notexists (bgp (triple ?other skos:broader
?uri)))
(bgp (triple ?x skos:member ?other))))
(bgp (triple ?x skos:member ?o))))
(table (vars ?uri)
(row [?uri <http://www.yso.fi/onto/yso/p1251>])
)))
I'm not very good at reading the parsed queries but it seems that the
older version applies the VALUES pattern first, then proceeds with the
rest of the query which is now quite trivial. The new version instead
performs the query with ?uri unbound, leading to lots of intermediate
results, and only applies the VALUES last.
There are at least two ways to modify the query so it runs fast again:
1. Skip the VALUES altogether and just use the actual concept URI
instead of the variable ?uri. It works in this case but more generally,
sometimes the same query is used to retrieve information for a bunch of
URIs at the same time and then using VALUES makes more sense.
2. Change the inner FILTER NOT EXISTS to MINUS. The results appear to be
the the same, but the query runs much faster. Curiously, this is the
only combination that runs fast on 3.9.0; MINUS/MINUS and MINUS/FILTER
NOT EXISTS are also slow.
So I can fix this by modifying the queries Skosmos runs, that's no
problem. My questions:
1. Should this be considered a bug, or is it just an obscure case of the
query optimizer working a bit differently than before?
2. Can you recommend how I should fix the query so it won't blow up again?
Thanks,
Osma
[1] https://github.com/NatLibFi/Skosmos/issues/829
[2] http://finto.fi/rest/v1/yso/data
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi