Re: [basex-talk] Full-text search and mixed content

Christian Grün Wed, 09 May 2012 05:42:52 -0700

Thanks, Cerstin and Michael, for your suggestions.

Yes, all the use cases sound perfectly reasonable to me. When it comes
to the implementation, though, I see quite a number of obstacles to
making this happen. One of the reasons is that the full-text
expression, as currently implemented, discards all elements before
tokenizing the texts. This means that the following queries are
basically equivalent:


  <a>X <b>Y</b> Z</a>[. contains text 'X Y']
  <a>X <b>Y</b> Z</a>[data() contains text 'X Y']
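
To illustrate, here is a minimal sketch that compares the string value
of the element with its direct text nodes. The variable name $a is only
used for illustration, and the results noted in the comments assume the
current behavior described above (elements are discarded before
tokenization):

```xquery
let $a := <a>X <b>Y</b> Z</a>
return (
  (: "X Y Z": the markup is gone, only the text remains :)
  string($a),
  (: true: the phrase is matched against the flattened text :)
  $a contains text 'X Y',
  (: false: the direct text nodes are "X " and " Z", and neither
     of them contains both tokens of the phrase :)
  $a/text() contains text 'X Y'
)
```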

As Cerstin indicated, you'll probably have to process all text nodes
individually:

  ft:mark(//*[text() contains text {'X', 'Y'}])

This simple approach, however, won't work for phrases (multiple
terms) that extend into descendant nodes.
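
The limitation can be reproduced with the example element from above.
This is only a sketch: $a is again an illustrative name, and the
queries return element names instead of calling ft:mark(), so that the
difference between single terms and a phrase is easy to see:

```xquery
let $a := <a>X <b>Y</b> Z</a>
return (
  (: single terms: <a> matches via its text node "X ",
     <b> matches via "Y", so the result is "a", "b" :)
  $a/descendant-or-self::*[text() contains text {'X', 'Y'}]/name(),
  (: phrase: no single text node contains both "X" and "Y",
     so this path returns the empty sequence :)
  $a/descendant-or-self::*[text() contains text 'X Y']/name()
)
```

With single terms, every text node is tested on its own and both
elements are found. With the phrase, "X" and "Y" sit in different text
nodes, so the per-node test never sees them together.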

Christian


>>> While I concede this may be useful in numerous use cases (and may even
>>> seem obvious), it would take quite some time to get implemented, so...
>>> please don't expect too much magic for the moment. There will also be
>>> some conceptual issues that need to be resolved. As an example, which
>>> result would you expect for the following query?
>>>
>>>   ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
>>
>> I think it should be
>>
>> <a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
>>
>> Each token from the search string would be enclosed in a <mark>-element.
>
> Exactly.  While this probably wouldn't cover *all* possible scenarios,
> it would still cover most of the useful ones.  In fact, it would be
> similar to <http://www.raymondhill.net/blog/?p=272>.  It would also be
> applicable when ignoring elements in a search.
>
> For complex applications it may help to get the start and end character
> positions of the matches (essentially standoff markup), and the
> application could then do the highlighting itself on the basis of this
> information.
>
> [...]
>
>>> If you don't need the inner elements, you may as well remove them from
>>> your document before applying ft:mark().
>>
>> This is a great idea if you would like to know whether the search
>> elements are somewhere in your text.
>>
>> However, if you would like to show the results to end users (=
>> humanities people) or to annotate the document further, it's not a
>> good idea to destroy the original structure. Or maybe one would have
>> to come up with some tricky workaround to first replace the
>> hierarchical node with a flat one for searching, then annotate
>> something and somehow replace the original hierarchical one with the
>> annotated one preserving the original hierarchy.
>>
>> And for searching only, the scenario is a TEI-document representing an
>> old printed book with highlighting (e.g., some things in italics),
>> foreign-language words printed in a different font, person names
>> already marked, etc. The TEI rendering is intended to mimic the
>> original printed page. When implementing a full-text search, the end
>> user expects to see the highlighted search tokens within the rendered
>> page. Therefore the "easiest" way is to search in descendant nodes and
>> use ft:mark to highlight the hits, without any need to change the TEI
>> rendering. This would also allow the end user to not only see the node
>> where the search string was found, but scroll up and down to inspect
>> the context of the node.
>
> I fully agree, this is exactly what I need in my application: I don't
> want to retrieve snippets from the document, but I always have to
> display the full document with the hits highlighted.
>
> What I'm going to do now is probably highlight the full paragraph which
> contains the node retrieved by the search, i.e., get the node ID, walk
> up the tree until I encounter a <p> and get its @xml:id, which I can
> then use in a CSS stylesheet.  Or something like this.  But this is
> clearly only an approximation.
>
> Best regards
>
> --
> Dr.-Ing. Michael Piotrowski, M.A. <m...@cl.uzh.ch>
> Institute of Computational Linguistics, University of Zurich
> Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
> * OUT NOW: Systems and Frameworks for Computational Morphology
> *          <http://www.springeronline.com/978-3-642-23137-7>
> _______________________________________________
> BaseX-Talk mailing list
> BaseX-Talk@mailman.uni-konstanz.de
> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk