Hi,

I'm investigating a performance problem we encountered [1] when updating from Fuseki to 3.8.0. A very frequently executed query generated by Skosmos is suddenly taking 3-4 orders of magnitude longer than it used to. I narrowed it down to a combination of VALUES being used with two nested FILTER NOT EXISTS blocks - don't ask why the query does this, but this sort of "double negative" clause is necessary for the application to get the exact results it needs.

(Well you'll ask anyway. It has to do with presenting information about the narrower concepts for a SKOS concept, grouped by "arrays" represented using SKOS collections. For example under the concept "milk", you could have an array <milk by source animal> and the narrower concept "goat milk" would be shown within that array. But you don't want to show an unrelated array <goat products>, even though "goat milk" is also in that array, because it's intended to be used under the concept "products" even though that is not explicitly stated in the data and has to be inferred by the fact that it contains concepts such as "goat fur" that are not narrower concepts of "milk". Phew.)

I simplified the query as much as I could and with this short version, the difference is not quite as dramatic - around 5 ms vs 1.5 seconds when executed via Fuseki and all caches are warm - but still noticeable. (The original query went from <300ms to several minutes or even an hour.) I tested different Jena versions, and it seems the change happened somewhere between 3.7.0 and 3.8.0.


This is the simplified query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
FROM <http://www.yso.fi/onto/yso/>
WHERE {
  {
    ?uri skos:narrower ?o .
    OPTIONAL {
      ?x skos:member ?o .
      FILTER NOT EXISTS {
        ?x skos:member ?other .
        FILTER NOT EXISTS {
          ?other skos:broader ?uri
        }
      }
    }
  }
}
VALUES (?uri) {
  (<http://www.yso.fi/onto/yso/p1251>)
}


When executed on YSO data [2] loaded into TDB,

- using Jena 3.7.0:


$ bin/tdbquery --loc=/tmp/koko --query minimal-slow.rq --time
--------------------------------------------------------------------------------
| uri | o | x |
================================================================================
| <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p7098> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p15215> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p25061> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p25358> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p1247> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p10265> | |
--------------------------------------------------------------------------------
Time: 0,156 sec

$ bin/qparse --query minimal-slow.rq --print=opt
(prefix ((skos: <http://www.w3.org/2004/02/skos/core#>))
  (sequence
    (table (vars ?uri)
      (row [?uri <http://www.yso.fi/onto/yso/p1251>])
    )
    (conditional
      (bgp (triple ?uri skos:narrower ?o))
      (filter (notexists
(filter (notexists (bgp (triple ?other skos:broader ?uri)))
                   (bgp (triple ?x skos:member ?other))))
        (bgp (triple ?x skos:member ?o))))))



- using Jena 3.9.0:


$ bin/tdbquery --loc=/tmp/koko --query minimal-slow.rq --time
--------------------------------------------------------------------------------
| uri | o | x |
================================================================================
| <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p7098> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p15215> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p25061> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p25358> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p1247> | | | <http://www.yso.fi/onto/yso/p1251> | <http://www.yso.fi/onto/yso/p10265> | |
--------------------------------------------------------------------------------
Time: 2,249 sec

$ bin/qparse --query minimal-slow.rq --print=opt
(prefix ((skos: <http://www.w3.org/2004/02/skos/core#>))
  (sequence
    (conditional
      (bgp (triple ?uri skos:narrower ?o))
      (filter (notexists
(filter (notexists (bgp (triple ?other skos:broader ?uri)))
                   (bgp (triple ?x skos:member ?other))))
        (bgp (triple ?x skos:member ?o))))
    (table (vars ?uri)
      (row [?uri <http://www.yso.fi/onto/yso/p1251>])
    )))


I'm not very good at reading the parsed queries but it seems that the older version applies the VALUES pattern first, then proceeds with the rest of the query which is now quite trivial. The new version instead performs the query with ?uri unbound, leading to lots of intermediate results, and only applies the VALUES last.

There are at least two ways to modify the query so it runs fast again:

1. Skip the VALUES altogether and just use the actual concept URI instead of the variable ?uri. It works in this case but more generally, sometimes the same query is used to retrieve information for a bunch of URIs at the same time and then using VALUES makes more sense.

2. Change the inner FILTER NOT EXISTS to MINUS. The results appear to be the the same, but the query runs much faster. Curiously, this is the only combination that runs fast on 3.9.0; MINUS/MINUS and MINUS/FILTER NOT EXISTS are also slow.


So I can fix this by modifying the queries Skosmos runs, that's no problem. My questions:

1. Should this be considered a bug, or is it just an obscure case of the query optimizer working a bit differently than before?

2. Can you recommend how I should fix the query so it won't blow up again?

Thanks,
Osma


[1] https://github.com/NatLibFi/Skosmos/issues/829

[2] http://finto.fi/rest/v1/yso/data


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi

Reply via email to