Re: [basex-talk] text() vs string()

Christian Grün Tue, 29 Jan 2013 02:51:57 -0800

Dear Wendell,

If you query structured documents, query [C] will automatically be
optimized to [A]. This does not apply, however, if the addressed
element contains other elements, as is the case for mixed content.
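
To illustrate why that rewrite is only safe without child elements, here
is a small sketch. The variable names are made up for this example, and
the results I expect from the two full-text tests follow from what you
observed for [A] and [C] rather than from a fresh run:

```xquery
let $plain := <p>Apples and trees.</p>
let $mixed := <p>The apple <em>never</em> falls far from the tree.</p>
return (
  count($plain/text()),  (: 1: one text node holds the whole string value :)
  count($mixed/text()),  (: 2: the text is split around the em element :)
  string($mixed),        (: "The apple never falls far from the tree." :)
  (: as in [A]: each child text node is tested on its own :)
  $mixed/text() contains text ('apple' ftand 'tree'),
  (: as in [C]: the element's string value is tested as a whole :)
  $mixed contains text ('apple' ftand 'tree')
)
```

For $plain, text() and the element itself yield the same string, so
replacing one with the other cannot change the result. For $mixed they
differ, so the two tests should disagree, just as [A] and [C] did.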


As Cerstin indicated, the text index is based on text nodes, so the
main reason for using [A] is to take advantage of the full-text index.
Discussion on this list and some more use cases have shown that the
current solution is quite restrictive when it comes to mixed content.
One future option could be to extend the full-text index to also
support queries across element boundaries. Several indexing techniques
exist for that approach, so it’s mainly a question of finding someone
to implement it in our core.

Hope this helps, feel free to ask for more,
Christian
___________________________

On Mon, Jan 28, 2013 at 10:27 PM, Wendell Piez <wap...@wendellpiez.com> wrote:
> Hi,
>
> This may be a question about XQuery Full Text, or only about common
> usage (or misusage?) of XPath; in either case I hope it's on topic.
> Please tell me if not.
>
> In BaseX [A]:
>
> let $test :=
>   <test>
>     <p>The apple <em>never</em> falls far from the tree.</p>
>     <p><!-- comment -->Apples and trees.</p>
>     <p>Trees and <!-- comment --> apples.</p>
>     <p><fruit>Apple</fruit> trees.</p>
>   </test>
>
> return
>   $test/*[text() contains text ('apple' ftand 'tree')
>           using stemming using language 'en']
>
> This returns
>
> <p>
>   <!-- comment -->
>       Apples and trees.</p>
>
> As an experienced XPath user, this is what I expect, assuming
> "contains text" allows a sequence of nodes as its first argument (and
> returns true if any of them satisfies the test). Only the second 'p'
> element has a child text node whose value contains both "apple" and
> "tree".
>
> Of course the problem in the others is the mixed content: in the
> first, an element node 'em' intervenes, while in the third, a comment
> intervenes, so both these cases contain text nodes with either "apple"
> or "tree", but not both. In the case of the fourth 'p', there is no
> text node child containing "apple" at all, only a grandchild.
>
> Assuming I want all four back, I can write either:
>
> [B] return
>   $test/*[string() contains text ('apple' ftand 'tree')
>           using stemming using language 'en']
>
> or
>
> [C] return
>   $test/*[. contains text ('apple' ftand 'tree')
>           using stemming using language 'en']
>
> In the case of [B], the string() function casts the element to a
> string, flattening its structure. [C] passes the element itself to the
> "contains text" operation, which happily has the same effect.
>
> I have several related questions about this:
>
> 1. Unless I learn better, I'm going to prefer [B] or [C], because in
> my world, mixed content is common; is there any reason (performance or
> otherwise) to prefer [A] in cases where I know it will be robust? Is
> there any reason to prefer [B] or prefer [C]?
>
> 2. I see examples like [A] offered frequently in the XQuery
> literature, of "text()" being used apparently to refer to an element's
> string (text) value, not to its text node children. And I see this
> usage in running code. I can only imagine that those who write it are
> simply not aware that mixed content will complicate their queries like
> this; maybe they have just never thought about it, or they don't know
> what text() actually does. In any case, the error is pernicious, since
> nothing tells you the query you gave isn't the one you intended -- it
> even works, until the day it doesn't, and the cases where it gives
> correct but unwanted results may be rare.
>
> But maybe I'm wrong and they just know something about XQuery, XQuery
> FT, or their tools, that I don't.
>
> What do the experts say?
>
> Cheers, Wendell
>
>
> --
> Wendell Piez | http://www.wendellpiez.com
> XML | XSLT | electronic publishing
> Eat Your Vegetables
> _____oo_________o_o___ooooo____ooooooo_^
> _______________________________________________
> BaseX-Talk mailing list
> BaseX-Talk@mailman.uni-konstanz.de
> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk