Hi Julia -

Preface: let me be clear when I say that I've wondered about some of this
myself, so I don't think I have an answer for you. That being said, I
wonder if this is a grouping/data modeling problem: i.e. you have 5,000
aggregations refer to 1 of 3 web resources vs ~7,000 aggregations each with
their own distinct web resource.

If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for
each web resource (i.e. where there would be only
aggregations-to-the-specific-web-resource), would that help with query
times at all? It might necessitate a bit of pre-processing in your creation
step though.

In any event, I hope those random thoughts are helpful in some way.
Best,
Bridger


On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia <j.b...@ub.uni-frankfurt.de>
wrote:

> Hi,
>
> first of all: thank you, the fix for [1] did the trick and in 9.2.4 the
> query is working as expected.
> Today, I come back to you with another challenge in performance which
> again seems to have something to do with indexing(?). So here's the
> situation:
> I have two databases "abc" and "def". "abc" contains 1 xml doc with
> about 150.000 nodes and "def" contains 1 xml doc with about 400.000
> nodes. Both are similarly strutured and have their up-to-date text and
> attr indexes. The xml docs look both (simplified) like the following:
>
> <rdf:RDF>
>   <ore:Aggregation rdf:about="123">
>      <edm:object rdf:resource="urn1"/>
>      <...>
>   </ore:Aggregation>
>   <edm:WebResource rdf:about="urn1">
>      <...>
>   </edm:WebResource>
>   <ore:Aggregation rdf:about="124">
>      <edm:object rdf:resource="urn2"/>
>      <...>
>   </ore:Aggregation>
>   <edm:WebResource rdf:about="urn2">
>      <...>
>   </edm:WebResource>
>   <ore:Aggregation rdf:about="125">
>      <edm:object rdf:resource="urn2"/>
>      <edm:object rdf:resource="urn3"/>
>      <...>
>   </ore:Aggregation>
>   <edm:WebResource rdf:about="urn3">
>      <...>
>   </edm:WebResource>
>   <...>
> </rdf:RDF>
>
> So one aggregation refers to one (or more) web resources. I boiled down
> my original query to the following purpose to keep it simple: for each
> aggregation give me the corresponding web resource.
>
>    for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
>    return for $urn in $agg/edm:object/@rdf:resource
>      return (# db:enforceindex #)
>    {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
>
> For both databases the query gives me the required result and the query
> info tells me that the attribute index for $urn is applied in both
> cases (this is also the case if I leave out the pragma). However, oddly
> enough, for the "larger" database "def" with a larger attribute index
> it takes roughly 1 second while the "smaller" database "abc" with a
> smaller attribute index takes 20 seconds. This is not very long but the
> original query is more complicated and I have bigger databases with the
> same structure where it starts to matter.
>
> The only (and I think important) difference between "abc" and "def" is
> that "abc" contains only 3 web resources that all 5.000 aggregations
> refer to. While in "def" each aggregation refers to a particular web
> resource (== 7.000 aggregations and 7.000 web resources).
>
> With index:facets I had a look at the facet values and learned that
> there is a "maximum number of distinct values to store per name". Is
> there a difference in performance because of that? Maybe I do not get
> the index structures but it feels strange that it takes longer to find
> the correct attribute in a range of 3 different values than in a range
> of 7.000. Maybe there is also another problem in my query, databases or
> my reasoning that I do not see? Either way, I need help in
> understanding this phenomenon :-)
>
> I hope you could follow, please don't hesitate to ask if you need
> anything to reproduce this situation (I am using BaseX 9.2.4).
>
> Julia
>
> [1]
> https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
>

Reply via email to