On Nov 13, 2007, at 7:21 AM, Cláudio Fernandes wrote:

Hello all,

I don't know if this is a somewhat naive question, but here we go:

Does Lucene support indexing by sections? For example, take a text document with three sections delimited by XML tags, indexed so that we could search by word and section. Does Lucene itself support this kind of indexing, or
should it be used with other engines like Cocoon?

Thanks in advance for your time,


It depends on what you mean by sections.
If your document divides up simply into fixed fields:
     <title>...</title>, <author>...</author>, <body>...</body>
or:  <part1>...</part1>, <part2>...</part2>, <part3>...</part3>
then you can make those into fields of your Lucene index and
restrict a search to whichever field you want.
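
To make the fixed-fields idea concrete, here is a minimal sketch in plain Python. It is not Lucene code: the names to_fields, tokens and search are made up for illustration, and the sample XML is invented. It only shows the shape of the approach, where each tag becomes a named field and a query names the field it wants to search.

```python
import re
import xml.etree.ElementTree as ET

def tokens(text):
    """Lowercased word tokens; a stand-in for a real analyzer."""
    return re.findall(r"\w+", text.lower())

def to_fields(xml_text):
    """Map each top-level child element to a field named after its tag."""
    root = ET.fromstring(xml_text)
    return {child.tag: "".join(child.itertext()).strip() for child in root}

def search(fields, field, word):
    """True if the word occurs in the named field (hypothetical helper)."""
    return word.lower() in tokens(fields.get(field, ""))

# Invented sample document with three fixed sections.
sample = (
    "<doc><title>Moby Dick</title>"
    "<author>Herman Melville</author>"
    "<body>Call me Ishmael.</body></doc>"
)
fields = to_fields(sample)
```

With this, search(fields, "body", "ishmael") matches while search(fields, "title", "ishmael") does not, which is the "search by word and section" behaviour asked about.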

But if there isn't a fixed number of sections, then fields probably won't
work. Lucene doesn't itself handle nesting or inclusion, so finding
text within some arbitrary div, or finding the div that holds the text,
is not so straightforward. However, Lucene has a flexible notion
of what a 'document' is. (Basically, it's whatever unit you feed
it as a document.) So if this is what you need, you might be able
to make each <div> into a "document" rather than each file.

If you were indexing a large TEI text and wanted to return the particular
chapter where the text was found, you could make each chapter a 'document',
and each document would have indexed fields to store the common header
info as well as the name of the file containing the chapter.
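
As a rough sketch of the chapter-per-document idea, again in plain Python rather than Lucene: the element names below (header, title, div) are simplified stand-ins for real TEI markup, which uses namespaces and a richer header, and the function names are hypothetical. The sketch also assumes the divs are not nested.

```python
import re
import xml.etree.ElementTree as ET

def chapter_documents(xml_text, file_name):
    """Turn each <div> into its own record, copying shared header info
    and the source file name onto every record."""
    root = ET.fromstring(xml_text)
    title = root.findtext("header/title", default="")
    docs = []
    for number, div in enumerate(root.iter("div"), start=1):
        text = " ".join("".join(div.itertext()).split())
        docs.append({
            "file": file_name,    # where the chapter came from
            "title": title,       # common header info, repeated per chapter
            "chapter": number,
            "text": text,
        })
    return docs

def find_chapters(docs, word):
    """Chapter numbers whose text contains the word (hypothetical helper)."""
    word = word.lower()
    return [d["chapter"] for d in docs
            if word in re.findall(r"\w+", d["text"].lower())]

# Invented sample file with two chapters.
sample = (
    "<text><header><title>Sample Text</title></header>"
    "<body><div>The whale surfaced.</div>"
    "<div>The ship sailed on.</div></body></text>"
)
docs = chapter_documents(sample, "sample.xml")
```

A hit then identifies the chapter directly, and the file and title values on the matching record tell you where it lives. The cost is that the header info is duplicated on every chapter record.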

Lucene is great at finding documents, but not quite as good at finding
things IN documents. The index contains pointers to the terms, but they
are pointers to a token in the parsed token stream, so to find a character
index into a file, you have to (I believe) run the text through the
tokenizer again. (But the Lucene API gives you access to everything, even
if it's not simple or easy. I think there are some new features in the
latest version that can make this sort of thing easier, but I haven't yet
figured out how to use them.)
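
To illustrate the difference between a token position and a character offset, here is a small plain-Python sketch. It does not use the Lucene API; the regex tokenizer is a stand-in, and the approach only works if the text is re-tokenized with exactly the same rules that were used at indexing time.

```python
import re

def token_offsets(text):
    """Re-tokenize the text, returning (token, start, end) in stream order."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def char_span(text, position):
    """Map a token position (0-based index into the token stream)
    back to a character span in the original text."""
    _token, start, end = token_offsets(text)[position]
    return start, end

text = "Call me Ishmael. Some years ago"
# Suppose the index reported a hit at token position 2.
start, end = char_span(text, 2)
```

Here text[start:end] recovers the matched word in its original form, which is what you would need for highlighting or for jumping to the spot in the file.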


-- Steve Majewski (Not much of a Lucene expert, but I've spent some time figuring out the difference between document indexers like Lucene and text indexers like xpat/opentext.)






