Hi Christian,

Thanks for the advice. The BaseX engine is phenomenal, so I realized quickly 
that the problem was my query performing a naive cross product. 

Since this query is run only once a month (to serialize XML to CSV) and applied 
to new data (DB) each time, a BaseX map will likely be the most straightforward 
solution (I used the same idea for another project with good results).
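
For reference, here is a minimal sketch of the map idea. It is not the actual 
query: the element names, IDs and sample data are made up for illustration, 
and matching on lower-cased names is only an assumption about how drugs and 
interventions would be compared.

```xquery
(: Hypothetical stand-ins for the DrugBank entries and the trial records. :)
let $drugs := (
  <drug id="D1"><name>Acetaminophen</name></drug>,
  <drug id="D2"><name>Aspirin</name></drug>
)
let $trials := (
  <trial id="T1"><intervention>aspirin</intervention></trial>,
  <trial id="T2"><intervention>Placebo</intervention></trial>
)
(: Build the lookup once: lower-cased drug name -> drug id. :)
let $index := map:merge(
  for $drug in $drugs
  return map:entry(lower-case($drug/name), string($drug/@id))
)
(: One map lookup per intervention instead of a scan over all drugs. :)
for $trial in $trials
for $name in $trial/intervention
let $id := $index(lower-case($name))
where exists($id)
return string-join(($trial/@id, $name, $id), ',')
```

With this sample data only T1 should match, giving the CSV line 
"T1,aspirin,D2". Synonyms could presumably be handled by adding one map entry 
per synonym that points to the same drug id.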

I will not be able to implement and test this for another couple of weeks but 
will summarize my findings to the group as soon as possible.

Best,
Ron


> On Aug 4, 2018, at 6:00 AM, Christian Grün <[email protected]> wrote:
> 
> Hi Ron,
> 
>> I believe the slow execution may be due to a combinatorial issue: the cross 
>> product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not 
>> counting synonyms).
> 
> Yes, this sounds like a pretty expensive operation. Having maps
> (XQuery, Java) will be much faster indeed.
> 
> As Gerrit suggested, and if you will run your query more than once, it
> would definitely be another interesting option to build an auxiliary,
> custom "index database" that allows you to do exact searches (this
> database may still have references to your original data sets). Since
> version 9 of BaseX, volatile hash maps will be created for looped
> string comparisons. See the following example:
> 
>  let $values1 := (1 to 500000) ! string()
>  let $values2 := (500001 to 1000000) ! string()
>  return $values1[. = $values2]
> 
> Algorithmically, 500'000 * 500'000 string comparisons will need to be
> performed, resulting in a total of 250 billion operations (and no
> results). The runtime is much faster than you might expect (and, as far
> as I can judge, much faster than in any other XQuery processor).
> 
> Best,
> Christian
