Hi Christian,

Thanks for the advice. The BaseX engine is phenomenal, so I realized quickly that the problem was that my query was performing a naive cross product.
Since this query is run only once a month (to serialize XML to CSV) and is applied to new data (a new DB) each time, a BaseX map will likely be the most straightforward solution (I used the same idea for another project with good results). I will not be able to implement and test this for another couple of weeks, but I will summarize my findings for the group as soon as possible.

Best,
Ron

> On Aug 4, 2018, at 6:00 AM, Christian Grün <[email protected]> wrote:
>
> Hi Ron,
>
>> I believe the slow execution may be due to a combinatorial issue: the cross
>> product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not
>> counting synonyms).
>
> Yes, this sounds like a pretty expensive operation. Having maps
> (XQuery, Java) will be much faster indeed.
>
> As Gerrit suggested, and if you will run your query more than once, it
> would definitely be another interesting option to build an auxiliary,
> custom "index database" that allows you to do exact searches (this
> database may still have references to your original data sets). Since
> version 9 of BaseX, volatile hash maps will be created for looped
> string comparisons. See the following example:
>
> let $values1 := (1 to 500000) ! string()
> let $values2 := (500001 to 1000000) ! string()
> return $values1[. = $values2]
>
> Algorithmically, 500'000 * 500'000 string comparisons will need to be
> performed, resulting in a total of 250 billion operations (and no
> results). The runtime is much faster than you might expect (and, as far
> as I can judge, much faster than in any other XQuery processor).
>
> Best,
> Christian
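
For reference, the map-based lookup discussed in this thread could be sketched roughly as follows. This is only a minimal illustration, not the actual query: the element names (drug, name, trial, intervention), the sample records, and the case-insensitive matching are all assumptions made for the example, and the real ClinicalTrials and DrugBank schemas will differ. The idea is to build the map once from the drug list and then do one lookup per trial, instead of comparing every trial against every drug.

```xquery
(: Hypothetical sample data; element names are assumptions, not the real schemas. :)
declare variable $drugs :=
  <drugs>
    <drug id="DB00001"><name>Lepirudin</name></drug>
    <drug id="DB00945"><name>Aspirin</name></drug>
  </drugs>;

declare variable $trials :=
  <trials>
    <trial id="NCT0001"><intervention>Aspirin</intervention></trial>
    <trial id="NCT0002"><intervention>Placebo</intervention></trial>
    <trial id="NCT0003"><intervention>lepirudin</intervention></trial>
  </trials>;

(: Build the lookup map once: lower-cased drug name -> drug id.
   Lower-casing the keys is an assumption to get case-insensitive matches. :)
let $by-name := map:merge(
  for $drug in $drugs/drug
  return map:entry(lower-case($drug/name), string($drug/@id))
)
(: One map lookup per trial replaces one comparison per drug. :)
for $trial in $trials/trial
let $key := lower-case($trial/intervention)
where map:contains($by-name, $key)
(: Emit one CSV-style line: trial id, intervention text, matched drug id. :)
return string-join(
  (string($trial/@id), string($trial/intervention), $by-name($key)),
  ','
)
```

With the sample data above, this should return one line for NCT0001 (matched to DB00945) and one for NCT0003 (matched to DB00001), and skip the placebo trial. Synonyms could be handled by adding further map entries that point to the same drug id.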

