Hi Giuseppe,

Thanks for the new query.

If you have a look at the query info, you will see that your query
will in fact be rewritten to take advantage of the index structures:

  for $t_2 in document-node {"tlg0001.tlg001.perseus-grc2.xml"}/*:text/*:s/*:t
  return db:text("splitted-db", $t_2/@*:o)
    /parent::*:p/parent::*:d[(*:f = $t_2/text())]

As your input document contains 45,667 texts, however, 45,667 index
lookups will need to be performed, and this can take a while if the
index lookups have low selectivity, i.e. if each lookup returns many
candidate nodes that still need to be filtered by the second test.

However, there’s a chance to speed up your query. You have two
competing index candidates:

  let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]

As it is not possible to statically assess which one will be faster,
the first candidate will be rewritten to an index request. In your
specific case, you will get much better performance by moving the
second comparison to the first place:

  let $match := $lemm//d[./f = $t/text() and ./p = $t/@o]
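
With the f comparison first, the rewritten query should look analogous
to the one shown in the query info, only with the text index lookup on
f instead of p. The following is a hand-written sketch of that form,
not actual optimizer output; it assumes that the elements are in no
namespace and that each t element contains a single word form:

```xquery
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");

for $t in $txts//t
(: one text index lookup per token, on the word form :)
return db:text("splitted-db", string($t))
  (: climb from the indexed text node to its f and d elements,
     then filter the remaining candidates by the POS tag :)
  /parent::f/parent::d[p = $t/@o]
```

Presumably a word form matches far fewer dictionary entries than a POS
tag does, so fewer candidates remain to be filtered per lookup.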

Here is a short version of your query that takes around 10 seconds on
my machine (it doesn’t really matter whether you move the tests into
separate predicates):

  declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
  declare variable $lemm := db:open("splitted-db");
  for $t in $txts//t
  return $lemm//d[f = $t][p = $t/@o]

One obvious alternative (that we already discussed offline) is to
store repeatedly accessed values in a map. This way, you can get
evaluation times of less than a second.
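
A minimal sketch of such a map-based variant follows. It assumes that
each d entry has exactly one p and one f child, that "#" does not occur
in the values (the same separator appears in @v in the dictionary), and
that your BaseX version supports the options argument of map:merge.
The variable name $dict is made up for illustration:

```xquery
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db");

(: build the lookup table once: key = POS tag + "#" + word form :)
let $dict := map:merge(
  for $d in $lemm//d
  return map:entry($d/p || "#" || $d/f, $d),
  (: keep all entries that share the same key :)
  map { "duplicates": "combine" }
)
for $t in $txts//t
(: one map lookup per token; yields an empty sequence if nothing matches :)
return $dict($t/@o || "#" || $t)
```

The dictionary is traversed only once to build the map; each token then
costs a single map lookup instead of an index lookup plus filtering.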

Hope this helps,
Christian



On Thu, Jul 27, 2017 at 2:10 PM, Giuseppe Celano
<cel...@informatik.uni-leipzig.de> wrote:
> Hi Christian,
>
> These are the queries:
>
> (: This works :)
>
> declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
> declare variable $lemm := db:open("splitted-db"); (: see link sent earlier :)
> for $t in $txts//t
> let $match := $lemm//d[./@v = $t/@o || "#" || $t/text()]
> return
> $match
>
>
> (: This does not work :)
>
> declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
> declare variable $lemm := db:open("splitted-db");
> for $t in $txts//t
> let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]
> return
> $match
>
>
>
>
> Universität Leipzig
> Institute of Computer Science, Digital Humanities
> Augustusplatz 10
> 04109 Leipzig
> Deutschland
> E-mail: cel...@informatik.uni-leipzig.de
> E-mail: giuseppegacel...@gmail.com
> Web site 1: http://www.dh.uni-leipzig.de/wo/team/
> Web site 2: https://sites.google.com/site/giuseppegacelano/
>
> On 27 Jul 2017, at 13:48, Giuseppe Celano <cel...@informatik.uni-leipzig.de>
> wrote:
>
> Hi,
>
> I performed join operations between many files and a dictionary. The files
> contain tokenized texts, where one finds word forms + fine-grained POS tags.
> Look at the following file:
>
> https://raw.githubusercontent.com/gcelano/POStaggedAncientGreekXML/master/texts/tlg0001.tlg001.perseus-grc2.xml
>
> The dictionary, which contains word forms + fine-grained POS tags + lemmas,
> can be found here:
>
> https://github.com/gcelano/LemmatizedAncientGreekXML/tree/master/uniqueTokens/values
>
> I created a database for the dictionary and wrote a query (here simplified)
> like the following:
>
> for $t in $s/t (: t are the tokens in the file containing the tokens :)
> let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]
> (: $lemm//d are the single entries in the dictionary :)
> return
> $match
>
> I see that if I use this query, it is slow, as if the processor cannot use
> the database indexes (./p and ./f). The situation does not seem to improve
> with ./p/text() and ./f/text(), which I would assume to be equivalent to the
> former because of atomization. In contrast, if the same information
> contained in ./p and ./f is merged and put in an attribute (see @v in the
> dictionary files) and this is compared against the values in the text
> (after concatenating them properly), the join operation is super fast (i.e.,
> the index for the attribute values is used by BaseX).
>
> Does anyone know why? I have been able to get my results via the above
> (slow) comparison, but I would like to know what the cause of the problem
> was, if possible. Thanks.
>
> Best,
> Giuseppe
