Hi Christian,

Yes, this helps. By index rewritings, are you referring to the indices created when FTINDEX is set to true?

Thanks,
Ron

On September 18, 2017 at 11:12:54 AM, Christian Grün ([email protected]) wrote:

Hi Ron,

With mixed content, it can be beneficial if element boundaries are ignored. An example:

  <xml><b>H</b>ello world!</xml> contains text 'hello'

If you set the CHOP option to false before creating a database, whitespace will be included in your database. As Fabrice has pointed out, however, it is usually better to address the text nodes of your database directly; otherwise, you won’t be able to benefit from the index rewritings.

Hope this helps,
Christian

On Mon, Sep 11, 2017 at 4:59 PM, Ron Katriel <[email protected]> wrote:
> Thanks Fabrice and Michael. Solution (1) works great!
>
> A parting question: when querying the textual representation of a
> document, why not make the default behavior preserve critical word
> boundary delimiters instead of “chopping” them away? In the example
> below, it would then return
>
> XQuery
> and XPAth are awesome
>
> The munging together of “XPAth” and “are” seems counterintuitive to me.
>
> Best,
> Ron
>
> On September 11, 2017 at 4:13:54 AM, Michael Seiferle ([email protected]) wrote:
>
> Hi Ron,
> Hi Fabrice,
>
> Your observation w.r.t. element boundaries is right: the document is
> converted to a textual representation, and by default it returns all nodes
> in their string representation:
>
> $doc :=
>
> <doc>
> XQuery
> <_>and XPAth</_>
> <_>are awesome</_>
> </doc>/data()
>
> will turn into:
>
>
> XQuery
> and XPAthare awesome
>
>
> So:
>
> $doc contains text { 'XPath' }
>
>
> will return false.
>
> You have 3.5 options:
>
> 1) As Fabrice showed, query the individual text nodes.
>
> 2) Use the ft:search() function to query the index directly,
> http://docs.basex.org/wiki/Full-Text_Module#ft:search
>
> ft:search(
>   'CTGovDebug',
>   'neoplasms'
> )/.. (: get parent element for the matching text()-node :)
>
>
> 3) Disable chopping when creating the database,
> http://docs.basex.org/wiki/Options#XML_Parsing
>
> db:create(
>   'CTGovDebug',
>   "Path/to/NCT00473512.xml",
>   "NCT00473512.xml",
>   map {
>     'ftindex': true(),
>     'chop': false()
>   })
>
>
> 3.5) Use the xml:space="preserve" attribute to tell the parser not to chop
> child nodes of <clinical_study/> when creating a database:
>
> <clinical_study xml:space="preserve">
> <!-- This xml conforms to an XML Schema at:
> https://clinicaltrials.gov/ct2/html/images/info_public.xsd -->
> <required_header>
> <download_date>ClinicalTrials.gov processed this data on August 31,
> 2017</download_date>
> <link_text>Link to the current ClinicalTrials.gov record.</link_text>
>
>
>
> Hope this helped shed some light :-)
>
> Best from Konstanz
> Michael
> --
> Michael Seiferle, BaseX GmbH,
> http://www.basexgmbh.de
> |-- Registered office: Obere Laube 73, 78462 Konstanz
> |-- Register court Freiburg, HRB: 708285, Managing directors:
> |   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
> `-- Tel: +49 7531 916 82 77
>
> On 11.09.2017 at 09:35, Fabrice ETANCHAUD <[email protected]> wrote:
>
> Hello Ron,
>
> I don’t know how ft operators behave on document nodes.
> Supposing documents are converted to their data() representation, your query
> would yield the same negative answer.
> You should consider applying ft operators to text nodes, like this:
>
> for $trial in db:open('NCT00473512')//text() (:
> [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial[. contains text { 'neoplasms' }]
>
> Best regards,
> Fabrice Etanchaud
>
>
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Ron Katriel
> Sent: Monday, September 11, 2017 00:42
> To: BaseX
> Subject: [basex-talk] Issue with Full Text Retrieval
>
> Hi,
>
> I am seeing strange behavior with Full Text retrieval. The following query
> fails for a number of words that are in the XML document (see attached):
>
> for $trial in db:open('CTGovDebug') (:
> [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial contains text { 'neoplasms' }
>
> It fails on a good number of words including neoplasms, cougar, industry,
> yes, completed, november, 2005, interventional, single, male, female,
> assignment, none, research, principal, primary, secondary, age, years,
> gender, etc. But it matches most of the words in the file.
>
> Observation: The words that fail are located at the beginning and/or end of
> the text and do not occur anywhere else in the middle of any text.
>
> The document is the only one in the database.
> It does not make a difference
> whether full text indexing is on or off. My BaseX version is 8.6.4.
>
> Thanks,
> Ron
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> 350 Hudson Street, 7th Floor, New York, NY 10014
> [email protected] | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
>
>
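
For reference, the element-boundary behaviour that Michael and Christian describe in this thread can be tried out without a database. The following is only a minimal sketch and not part of the original messages. It assumes BaseX (the thread used 8.6.4) and uses the standard boundary-space declaration to mimic the whitespace chopping that normally happens when a database is created. The variable name $doc and the sample document follow Michael's example.

```xquery
(: Drop whitespace-only text between the constructed elements. This mimics
   the effect of CHOP=true on a stored document; it is an assumption made
   for this sketch, since the thread itself worked on a database. :)
declare boundary-space strip;

let $doc :=
  <doc>
    XQuery
    <_>and XPAth</_>
    <_>are awesome</_>
  </doc>
return (
  (: String value of the element: the texts of the two <_> elements
     are joined without any separator. :)
  string($doc),
  (: Full-text test on the whole element. :)
  $doc contains text 'XPath',
  (: Full-text test on the individual text nodes, as Fabrice suggested. :)
  exists($doc//text()[. contains text 'XPath'])
)
```

Going by the explanations in the thread, the second item is expected to be false, because "XPAth" and "are" end up in a single token, and the third is expected to be true, because each text node is tokenized on its own. Tokenization is implementation-defined in XQuery Full Text, so other processors may answer differently.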

