Hi Julia - Preface: let me be clear when I say that I've wondered about some of this myself, so I don't think I have an answer for you. That being said, I wonder if this is a grouping/data modeling problem: i.e. you have 5,000 aggregations refer to 1 of 3 web resources vs ~7,000 aggregations each with their own distinct web resource.
If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for each web resource (i.e. where there would be only aggregations-to-the-specific-web-resource), would that help with query times at all? It might necessitate a bit of pre-processing in your creation step though. In any event, I hope those random thoughts are helpful in some way. Best, Bridger On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia <j.b...@ub.uni-frankfurt.de> wrote: > Hi, > > first of all: thank you, the fix for [1] did the trick and in 9.2.4 the > query is working as expected. > Today, I come back to you with another challenge in performance which > again seems to have something to do with indexing(?). So here's the > situation: > I have two databases "abc" and "def". "abc" contains 1 xml doc with > about 150.000 nodes and "def" contains 1 xml doc with about 400.000 > nodes. Both are similarly strutured and have their up-to-date text and > attr indexes. The xml docs look both (simplified) like the following: > > <rdf:RDF> > <ore:Aggregation rdf:about="123"> > <edm:object rdf:resource="urn1"/> > <...> > </ore:Aggregation> > <edm:WebResource rdf:about="urn1"> > <...> > </edm:WebResource> > <ore:Aggregation rdf:about="124"> > <edm:object rdf:resource="urn2"/> > <...> > </ore:Aggregation> > <edm:WebResource rdf:about="urn2"> > <...> > </edm:WebResource> > <ore:Aggregation rdf:about="125"> > <edm:object rdf:resource="urn2"/> > <edm:object rdf:resource="urn3"/> > <...> > </ore:Aggregation> > <edm:WebResource rdf:about="urn3"> > <...> > </edm:WebResource> > <...> > </rdf:RDF> > > So one aggregation refers to one (or more) web resources. I boiled down > my original query to the following purpose to keep it simple: for each > aggregation give me the corresponding web resource. > > for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation > return for $urn in $agg/edm:object/@rdf:resource > return (# db:enforceindex #) > {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]} > > For both databases the query gives me the required result and the query > info tells me that the attribute index for $urn is applied in both > cases (this is also the case if I leave out the pragma). However, oddly > enough, for the "larger" database "def" with a larger attribute index > it takes roughly 1 second while the "smaller" database "abc" with a > smaller attribute index takes 20 seconds. This is not very long but the > original query is more complicated and I have bigger databases with the > same structure where it starts to matter. > > The only (and I think important) difference between "abc" and "def" is > that "abc" contains only 3 web resources that all 5.000 aggregations > refer to. While in "def" each aggregation refers to a particular web > resource (== 7.000 aggregations and 7.000 web resources). > > With index:facets I had a look at the facet values and learned that > there is a "maximum number of distinct values to store per name". Is > there a difference in performance because of that? Maybe I do not get > the index structures but it feels strange that it takes longer to find > the correct attribute in a range of 3 different values than in a range > of 7.000. Maybe there is also another problem in my query, databases or > my reasoning that I do not see? Either way, I need help in > understanding this phenomenon :-) > > I hope you could follow, please don't hesitate to ask if you need > anything to reproduce this situation (I am using BaseX 9.2.4). > > Julia > > [1] > https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html >