Re: [basex-talk] Issue with Full Text Retrieval

Ron Katriel Mon, 18 Sep 2017 17:01:02 -0700

Hi Christian,

Yes, this helps. By index rewritings, are you referring to the indices
created when FTINDEX is set to true?


Thanks,
Ron

On September 18, 2017 at 11:12:54 AM, Christian Grün (
[email protected]) wrote:

Hi Ron,

With mixed content, it can be beneficial if element boundaries are
ignored. An example:

<xml><b>H</b>ello world!</xml>
contains text 'hello'

If you set the CHOP option to false before creating a database,
whitespace will be preserved in your database. As Fabrice has pointed
out, however, it is usually better to directly address the text nodes
of your database; otherwise, you won’t be able to benefit from the
index rewritings.
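
A minimal sketch of the difference, assuming BaseX's default settings
(boundary whitespace in element constructors is stripped, and full-text
matching is case-insensitive). The <doc> element is only an illustration,
not part of Ron's data:

```xquery
(: Illustrative document: two sibling elements with no whitespace
   text node left between them once boundary whitespace is stripped. :)
let $doc :=
  <doc>
    XQuery
    <_>and XPAth</_>
    <_>are awesome</_>
  </doc>
return (
  (: String value of the whole element: "XPAth" and "are" are merged
     into the single token "XPAthare", so this should be false. :)
  $doc contains text 'XPath',
  (: Each text node is tested on its own, so "and XPAth" should match. :)
  $doc//text() contains text 'XPath'
)
```

Only the second form addresses text nodes directly, which is the shape of
query that can be rewritten for index access on a database with a
full-text index.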

Hope this helps,
Christian



On Mon, Sep 11, 2017 at 4:59 PM, Ron Katriel <[email protected]> wrote:
> Thanks Fabrice and Michael. Solution (1) works great!
>
> A parting question: when querying the textual representation of a
> document, why not make the default behavior preserve critical word
> boundary delimiters instead of “chopping” them away? In the example below,
> it would then return
>
> XQuery
> and XPAth are awesome
>
> The munging together of "XPAth" and "are" seems counterintuitive to me.
>
> Best,
> Ron
>
> On September 11, 2017 at 4:13:54 AM, Michael Seiferle ([email protected]) wrote:
>
> Hi Ron,
> Hi Fabrice,
>
> Your observation w.r.t. element boundaries is right: the document is
> converted to a textual representation; by default, all nodes are returned
> in their string representation:
>
> let $doc :=
>
> <doc>
> XQuery
> <_>and XPAth</_>
> <_>are awesome</_>
> </doc>/data()
>
> This turns into:
>
>
> XQuery
> and XPAthare awesome
>
>
> So:
>
> return $doc contains text { 'XPath' }
>
>
> will return false.
>
> You have 3.5 options:
>
> 1) => as Fabrice showed, query the individual text nodes
>
> 2) use the ft:search() Function to query the index directly,
>
> http://docs.basex.org/wiki/Full-Text_Module#ft:search
>
> ft:search(
> 'CTGovDebug',
> 'neoplasms'
> )/.. (: get parent element for the matching text()-node :)
>
>
> 3) disable chopping when creating the database,
>
> http://docs.basex.org/wiki/Options#XML_Parsing
>
> db:create(
> 'CTGovDebug',
> "Path/to/NCT00473512.xml",
> "NCT00473512.xml",
>
> map {
> 'ftindex': true(),
> 'chop': false()
> })
>
>
> 3.5) use the xml:space="preserve" attribute to tell the parser not to chop
> child nodes of <clinical_study/> when creating a database:
>
> <clinical_study xml:space="preserve">
> <!-- This xml conforms to an XML Schema at:
> https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
> <required_header>
> <download_date>ClinicalTrials.gov processed this data on August 31,
> 2017</download_date>
> <link_text>Link to the current ClinicalTrials.gov record.</link_text>
>
>
>
> Hope this helped shed some light :-)
>
> Best from Konstanz
> Michael
> --
> Michael Seiferle, BaseX GmbH, http://www.basexgmbh.de
> |-- Registered office: Obere Laube 73, 78462 Konstanz
> |-- Register court: Freiburg, HRB: 708285, Managing directors:
> | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
> `-- Tel: +49 7531 916 82 77
>
> On 11.09.2017 at 09:35, Fabrice ETANCHAUD <[email protected]> wrote:
>
> Hello Ron,
>
> I don’t know how ft operators behave on document nodes.
> Supposing documents are converted to their data() representation, your
> query would yield the same negative answer.
> You should consider applying ft operators on text nodes like this:
>
> for $trial in db:open('NCT00473512')//text() (:
> [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial[. contains text { 'neoplasms' }]
>
> Best regards,
> Fabrice Etanchaud
>
>
> From: [email protected]
> [mailto:[email protected]] On behalf of Ron
> Katriel
> Sent: Monday, September 11, 2017 00:42
> To: BaseX
> Subject: [basex-talk] Issue with Full Text Retrieval
>
> Hi,
>
> I am seeing strange behavior with Full Text retrieval. The following query
> fails for a number of words that are in the XML document (see attached):
>
> for $trial in db:open('CTGovDebug') (:
> [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial contains text { 'neoplasms' }
>
> It fails on a good number of words including neoplasms, cougar, industry,
> yes, completed, november, 2005, interventional, single, male, female,
> assignment, none, research, principal, primary, secondary, age, years,
> gender, etc. But it matches most of the words in the file.
>
> Observation: The words that fail are located at the beginning and/or end
> of the text and do not occur anywhere else in the middle of any text.
>
> The document is the only one in the database. It does not make a
> difference whether full text indexing is on or off. My BaseX version is
> 8.6.4.
>
> Thanks,
> Ron
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> 350 Hudson Street, 7th Floor, New York, NY 10014
> [email protected] | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
>
>
