Hi,

I ran join operations between many files and a dictionary. The files contain 
tokenized texts, in which each token has a word form and a fine-grained POS 
tag. See, for example, the following file:

https://raw.githubusercontent.com/gcelano/POStaggedAncientGreekXML/master/texts/tlg0001.tlg001.perseus-grc2.xml

The dictionary, which contains word forms + fine-grained POS tags + lemmas, can 
be found here:

https://github.com/gcelano/LemmatizedAncientGreekXML/tree/master/uniqueTokens/values

I created a database for the dictionary and wrote a query (here simplified) 
like the following:

(: $s/t are the tokens in the file containing the tokens :)
(: $lemm//d are the single entries in the dictionary :)
for $t in $s/t
let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]
return $match

This query is slow, as if the processor could not use the database indexes for 
./p and ./f. Using ./p/text() and ./f/text() does not seem to help; I would 
assume these to be equivalent to the former because of atomization. By 
contrast, if the information contained in ./p and ./f is merged into a single 
attribute (see @v in the dictionary files) and compared against the 
corresponding values in the text (after concatenating them properly), the join 
is very fast, i.e., BaseX uses the index for the attribute values.
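
To make the comparison concrete, here is a minimal sketch of the two variants 
side by side. The inline sample data, the <l> lemma element, and the "form|tag" 
layout of @v are assumptions made up for illustration only; the real @v values 
are the ones in the dictionary files. In-memory fragments like these are not 
indexed, so the sketch only shows that both formulations select the same 
entries; the speed difference appears only against the database.

```xquery
(: Sketch only: hypothetical stand-ins for the token file ($s)
   and the dictionary ($lemm). All values are invented. :)
let $s :=
  <s>
    <t o="tagA">form1</t>
    <t o="tagB">form2</t>
  </s>
let $lemm :=
  <dict>
    <d v="form1|tagA"><f>form1</f><p>tagA</p><l>lemma1</l></d>
    <d v="form2|tagB"><f>form2</f><p>tagB</p><l>lemma2</l></d>
    <d v="form2|tagC"><f>form2</f><p>tagC</p><l>lemma3</l></d>
  </dict>
for $t in $s/t
(: slow variant: two element comparisons joined with "and" :)
let $slow := $lemm//d[./p = $t/@o and ./f = $t/text()]
(: fast variant: one comparison against the merged attribute;
   the '|' separator is an assumption about how @v is built :)
let $key  := concat($t/text(), '|', $t/@o)
let $fast := $lemm//d[@v = $key]
return
  <result token="{$t}" same="{deep-equal($slow, $fast)}">{$fast/l}</result>
```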

Does anyone know why? I was able to get my results via the above (slow) 
comparison, but I would like to know what causes the problem, if possible. 
Thanks.

Best,
Giuseppe

Universität Leipzig
Institute of Computer Science, Digital Humanities
Augustusplatz 10
04109 Leipzig
Deutschland
E-mail: cel...@informatik.uni-leipzig.de
E-mail: giuseppegacel...@gmail.com
Web site 1: http://www.dh.uni-leipzig.de/wo/team/
Web site 2: https://sites.google.com/site/giuseppegacelano/
