Re: [basex-talk] Full-text index on mixed content

Christian Grün Mon, 23 Sep 2019 04:19:29 -0700

Hi Daniel,

Thanks for your mail. Just a short while ago, we had thoughts on how
to extend indexing and query rewriting without completely overhauling
our optimization engine, so it might be worth sharing this idea.

At the moment, as you may know, only text nodes and attribute values
end up in the BaseX indexes. This allows us to rewrite as many paths
as possible for index access. Whenever a path expression points to a
text node (or an element that only has text nodes as children), we
know that such a path can be rewritten for index access, no matter what
the exact paths that point to this text node look like. This design
decision turned out to be very powerful for exact searches, and for
full-text queries on arbitrary text nodes, but it is indeed too
inflexible for mixed-content data.
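
To illustrate the limitation, here is a minimal sketch in plain XQuery.
The sample sentence and the "hi" element are made up for illustration
and are not taken from any real document:

```xquery
(: Hypothetical mixed-content paragraph, for illustration only :)
let $p := <p>It was <hi>clearly</hi> popular material.</p>
return (
  (: the p element has two text-node children; "clearly" belongs to hi :)
  $p/text() ! string(.),
  (: the string value concatenates all descendant text nodes :)
  string($p)
)
```

The query returns the two separate text nodes of "p" first and the
complete string value last. Only the former are what currently ends up
in the index, so a phrase that spans the "hi" element cannot be found
via a single indexed text node.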

Over time, we have learned that full flexibility can be
helpful, but is not necessarily required in many TEI use cases: Many
users and developers have a rather small and fixed set of XML elements
that is relevant for full-text processing.

A few years ago, we added features to restrict indexing to the text
nodes of elements with specific names. We could extend this approach to
full-text indexing of mixed content:

1. Index the string value of specific elements, which will be
specified by the user, and
2. Rewrite for index access only those paths that do not address
descendants of the indexed elements.

As an example, a user might want to query the "head" and "p" elements
of a TEI document, and there will be no need to write queries for
descendants of these elements.

<div>
  <head>No. 2, September 2006</head>
  <p>It was clearly popular, for it appears in Peter Stent’s
advertisements of 1654 and 1662, and is still listed in his successor
John Overton’s catalogue of 1673,<note>Alexander Globe, <title
level="m">Peter Stent, London Printseller, c.</title> 1642-65
(Vancouver, 1985), p. 123 (no.*448).</note> yet only the unique
impression in the British Museum's Department of Prints &amp; Drawings
survives - testimony to the great rarity of such popular material.</p>
</div>

The following queries could then be answered via the index:

  /div[head contains text '2006']
  //p[. contains text 'popular']

Queries such as the following would no longer be rewritten for index
access:

  //p[text() contains text 'popular']

It might additionally be desirable to exclude specific elements from
indexing. In the given example, users might want to exclude notes
("note" elements) from being included in the indexed string value.
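
As a rough sketch of what such an exclusion could mean for the indexed
string value, the following plain XQuery computes the string value of an
element while skipping the text of excluded descendants. The function
name local:indexed-string and its $exclude parameter are hypothetical;
this is not an existing BaseX function or option, and the sample
paragraph is a shortened variant of the example above:

```xquery
(: Hypothetical helper: string value of $element without the text of
   descendant elements whose local name is listed in $exclude :)
declare function local:indexed-string(
  $element as element(),
  $exclude as xs:string*
) as xs:string {
  let $excluded := $element//*[local-name() = $exclude]
  (: drop all text nodes located inside an excluded element :)
  return string-join($element//text() except $excluded//text())
};

let $p :=
  <p>still listed in the catalogue of 1673,<note>Alexander Globe,
    <title level="m">Peter Stent</title> (Vancouver, 1985).</note> yet
    only the unique impression survives.</p>
return local:indexed-string($p, 'note')
```

The result contains the running text of the paragraph, but neither the
note nor the title nested inside it. Passing an empty sequence as
$exclude yields the ordinary string value.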

There are numerous other features that could be included. The major
challenge will be to define a simple core functionality that is
flexible enough to be enhanced in the future.

Daniel, what’s your opinion on this, and your first thoughts on what
might be missing?

Thanks in advance,
Christian

On Thu, Sep 19, 2019 at 12:43 AM Schopper, Daniel
<daniel.schop...@oeaw.ac.at> wrote:
>
> Dear all,
> chatting after a session of the ongoing TEI conference (
> https://graz-2019.tei-c.org) I was asked about plans to support
> fulltext indexes on mixed content nodes in BaseX – I did not know of
> any, so I wanted to pass the question on to this list: Is there a plan
> to implement this feature in the near (or not-so-near) future? If not,
> has any of the core devs estimated the effort to get this done?
> (needless to say, it would be an awesome feature to have in BaseX ;-)
> Thanks in advance & best
> Daniel
> (just being curious)
