Re: [basex-talk] diacritics sensitive not working

Ron Katriel Fri, 03 Aug 2018 15:45:35 -0700

Christian,

Thanks for sharing that. I assumed all along that this happens
automatically. Anyway, I ran my query (for one drug, to save time) and see
the following in the Info view:


- apply text index for "Lenalidomide"

I believe the slow execution may be due to a combinatorial issue: the cross
product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not
counting synonyms).

I am considering an algorithmic solution that involves storing the DrugBank
information in a hash table (map) and looking it up while iterating through
the CT.gov <http://clinicaltrials.gov> trials.
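
A minimal sketch of that map-based idea, under stated assumptions rather than as a tested solution for the real data: the inline $drugs and $studies values are hypothetical stand-ins for db:open('DrugBank') and db:open('CTGov'), the id attribute and the local:tokens helper are made up for illustration, and synonyms are left out. Lower-casing without stripping accents is meant to approximate "case insensitive, diacritics sensitive" matching.

```xquery
xquery version "3.1";

declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical inline samples; in practice these would come from
   db:open('DrugBank') and db:open('CTGov'). :)
declare variable $drugs :=
  <drugbank>
    <drug><name>Lenalidomide</name></drug>
    <drug><name>Aspirin</name></drug>
  </drugbank>;

declare variable $studies := (
  <clinical_study id="s1">
    <intervention><intervention_name>lenalidomide 25 mg</intervention_name></intervention>
  </clinical_study>,
  <clinical_study id="s2">
    <intervention><intervention_name>Placebo</intervention_name></intervention>
  </clinical_study>
);

(: Hypothetical helper: lower-case, then split on anything that is not a
   letter or digit. Diacritics are preserved, so 'Äpfel' and 'Apfel'
   remain distinct keys. :)
declare function local:tokens($s as xs:string) as xs:string* {
  tokenize(lower-case($s), '[^\p{L}\p{N}]+')[. != '']
};

(: One pass over the drugs: normalized name -> original name. :)
let $lookup := map:merge(
  for $drug in $drugs/drug
  return map:entry(lower-case(normalize-space($drug/name)), string($drug/name))
)
(: One pass over the studies: probe the map with each distinct token. :)
for $study in $studies
for $token in distinct-values(
  $study/intervention/intervention_name ! local:tokens(.)
)
where map:contains($lookup, $token)
return <match study="{ $study/@id }" drug="{ $lookup($token) }"/>
```

With the sample data this should return a single match element for study s1 and Lenalidomide. Each trial is visited once and each token costs one map lookup, instead of one full-text search per drug name. The sketch only matches single-word names; multi-word names and synonyms would need extra keys (for example, word n-grams of the intervention name), which is not shown here.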

Best,
Ron

On August 3, 2018 at 5:49:30 PM, Christian Grün ([email protected])
wrote:

Our documentation should help you here: http://docs.basex.org/wiki/Indexes



Ron Katriel <[email protected]> wrote on Fri, Aug 3, 2018, 23:20:

> Hi Christian,
>
> Yes, I created a full-text index when the databases were loaded (see the
> commands below). I also verified that FTINDEX is true for both databases
> (in the GUI under Database > Open & Manage).
>
> How do I ensure that my query is rewritten for index access?
>
> Thanks,
> Ron
>
>
> SET FTINDEX true; SET TOKENINDEX true; CREATE DB CTGov "/Data Sets/ct.gov/xml"
> SET FTINDEX true; SET TOKENINDEX true; SET STRIPNS true; CREATE DB DrugBank "/Data Sets/DrugBank/drugbank.xml"
>
> On August 3, 2018 at 4:12:43 PM, Christian Grün ([email protected])
> wrote:
>
> Hi Ron,
>
> Did you a) create a full-text index for your data and b) ensure that
> your query is rewritten for index access?
>
> Best,
> Christian
>
>
> On Fri, Aug 3, 2018 at 2:39 PM Ron Katriel <[email protected]> wrote:
> >
> > Christian,
> >
> > Adding diacritics sensitive slows execution by a factor of 3. My script
> > (fragment below), which joins two large databases, namely CT.gov and
> > DrugBank, takes 2 hours without the diacritics sensitive constraint but 6
> > hours with it. Given the combinatorics involved, I am wondering if there is
> > a better way to do this in BaseX.
> >
> > Thanks,
> > Ron
> >
> >
> > for $drug in db:open('DrugBank')/drugbank/drug
> > let $drug_name := $drug/name/text()
> > let $drug_synonyms :=
> >   functx:value-union(normalize-space(lower-case($drug/name)),
> >     local:drug-synonyms($drug_name))
> > for $synonym_name in $drug_synonyms
> > ...
> > for $study in
> >   db:open('CTGov')/clinical_study[intervention/intervention_name contains
> >     text { $synonym_name } using case insensitive using diacritics sensitive]
> > ...
> >
> >
> > Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> > 350 Hudson Street, 7th Floor, New York, NY 10014
> > [email protected] | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | main: +1 212 918 1800
> >
> > On August 1, 2018 at 12:41:26 PM, Ron Katriel ([email protected]) wrote:
> >
> > Thanks, Christian. Strange, prior to contacting you and on a hunch, I
> > tried adding the missing “using” keyword but still got the syntax error.
> > Anyway, everything is good now!
> >
> > Best,
> > Ron
> >
> > On August 1, 2018 at 3:57:51 AM, Christian Grün ([email protected]) wrote:
> >
> > I have fixed the example in the doc.
> > Best, Christian
> >
> >
> > On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > The following from your website (docs.basex.org/wiki/Full-Text)
> > > appears to be syntactically incorrect:
> > >
> > > "'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
> > >
> > > In the BaseX GUI the keyword diacritics is underlined in red and the
> > > following error is reported:
> > >
> > > Unexpected end of query: 'diacritic sens...'.
> > >
> > > This happens in version 8.6.4 and also the latest (9.0.2).
> > >
> > > Thanks,
> > > Ron
> > >
> > >
> > >
> > >
>
>
