On Dec 17, 2005, at 2:36 PM, Paul Elschot wrote:
Gentlemen,
While maintaining my bookmarks I ran into this:
"Case Study: Enabling Low-Cost XML-Aware Searching
Capable of Complex Querying":
http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html
Some loose thoughts:
In the system described there, a Lucene document is used for each
low-level XML construct, even when it contains very few characters
of text.
The resulting Lucene indexes are at least 2.5 times the size of the
original document, which is not a surprise given this document
structure.
Normal index size is about one third of the indexed text.
I don't know about the XQuery standard, but I was wondering
whether this unusual document structure and the non-straightforward
fit between Lucene queries and XQuery queries are related.
Seems that a lot of metadata beyond the actual text is stored. For
example, node type, ancestors, parent, number of children, etc., for
each element and attribute. If the fulltext is relatively small, as
is often the case in quite structured XML such as the Shakespeare
collection, that should significantly increase storage space.
For example, Romeo and Juliet goes along the following lines:
<SPEECH>
<SPEAKER>FRIAR LAURENCE</SPEAKER>
<LINE>Not in a grave,</LINE>
<LINE>To lay one in, another out to have.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>ROMEO</SPEAKER>
<LINE>I pray thee, chide not; she whom I love now</LINE>
<LINE>Doth grace for grace and love for love allow;</LINE>
<LINE>The other did not so.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>FRIAR LAURENCE</SPEAKER>
<LINE>O, she knew well</LINE>
<LINE>Thy love did read by rote and could not spell.</LINE>
<LINE>But come, young waverer, come, go with me,</LINE>
<LINE>In one respect I'll thy assistant be;</LINE>
<LINE>For this alliance may so happy prove,</LINE>
<LINE>To turn your households' rancour to pure love.</LINE>
</SPEECH>
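To make the overhead concrete, here is a rough sketch of what a one-record-per-node layout does to a single speech. It is not the schema from the paper: the field names (node_type, parent, ancestors, num_children) are only my guess at the kind of metadata listed above, and counting characters is a crude stand-in for real index size.

```python
import xml.etree.ElementTree as ET

# One speech from the excerpt above.
SAMPLE = """<SPEECH>
<SPEAKER>ROMEO</SPEAKER>
<LINE>I pray thee, chide not; she whom I love now</LINE>
<LINE>Doth grace for grace and love for love allow;</LINE>
<LINE>The other did not so.</LINE>
</SPEECH>"""


def node_records(root):
    """Yield one dict per element, mimicking a one-document-per-node layout.

    The field names are hypothetical, chosen only to illustrate the idea.
    """
    def walk(elem, ancestors):
        yield {
            "node_type": "element",
            "name": elem.tag,
            "parent": ancestors[-1] if ancestors else "",
            "ancestors": "/".join(ancestors),
            "num_children": len(elem),
            "text": (elem.text or "").strip(),
        }
        for child in elem:
            yield from walk(child, ancestors + [elem.tag])

    yield from walk(root, [])


records = list(node_records(ET.fromstring(SAMPLE)))

# Characters of actual text versus characters of structural metadata.
text_chars = sum(len(r["text"]) for r in records)
meta_chars = sum(
    len(str(value))
    for r in records
    for field, value in r.items()
    if field != "text"
)

print(f"{len(records)} records, {text_chars} text chars, {meta_chars} metadata chars")
```

Running this on larger files shows how the metadata share moves as the amount of text per element changes.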
As for the joins and iterations over items from the stream of XML
results: iteration over matching XML constructs should be no problem
in Lucene. Joins in Lucene are normally done via boolean filters,
so I was wondering how XQuery joins fit these.
Much as in SQL. The engine constructs a logical execution plan for
the query, and rewrites it into an optimized physical plan as deemed
appropriate, perhaps guided by statistics, using a nested loop, hash
join, or any other more sophisticated strategy.
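As a minimal sketch of two of those physical strategies (not taken from any particular XQuery engine), the same equi-join can be run as a nested loop or as a hash join. The function names and the tiny speeches/roles data set are made up for illustration.

```python
from collections import defaultdict


def nested_loop_join(left, right, key_left, key_right):
    """Compare every left item with every right item: O(len(left) * len(right))."""
    return [
        (l, r)
        for l in left
        for r in right
        if key_left(l) == key_right(r)
    ]


def hash_join(left, right, key_left, key_right):
    """Build a hash table on the right input, then probe it once per left item."""
    buckets = defaultdict(list)
    for r in right:
        buckets[key_right(r)].append(r)
    return [
        (l, r)
        for l in left
        for r in buckets.get(key_left(l), [])
    ]


# Illustrative data: (speaker, line) pairs joined against (speaker, role) pairs.
speeches = [
    ("FRIAR LAURENCE", "Not in a grave,"),
    ("ROMEO", "The other did not so."),
    ("FRIAR LAURENCE", "O, she knew well"),
]
roles = [
    ("ROMEO", "son to Montague"),
    ("FRIAR LAURENCE", "a Franciscan"),
]


def speaker(item):
    return item[0]


nested = nested_loop_join(speeches, roles, speaker, speaker)
hashed = hash_join(speeches, roles, speaker, speaker)

for speech, role in hashed:
    print(speech[0], "|", role[1], "|", speech[1])
```

Both functions return the same pairs. An optimizer would pick between them based on input sizes and available statistics.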
Wolfgang.