> Harun Altay wrote:
> 
> Hello Friends,
> 
> I need to search on BOTH --> (1) "XML" data and (2) "Text" data.

We have developed a method for doing this that we posted several months
ago. 

The basic idea is fairly simple: for each element node in a given XML
document, you create one Lucene document that captures at least the
following information:

- Document ID of the containing document (e.g., filename, URL, database
OID, etc.)
- tree location 
- tagname 
- ancestry (list of ancestor tag names)
- directly-contained text, if any
- list of attribute names 
- For each attribute, a field that indexes the value of that attribute
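
To make the list above concrete, here is a minimal sketch of the tree walk
using only the standard Java DOM API. It is not our posted code. The class
name, the field names ("docid", "location", "att_" + name, and so on) and
the dotted tree-location scheme are assumptions made up for illustration.
Each map it returns stands in for one Lucene document; turning the map
entries into Lucene fields is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ElementRecords {

    /** Parses the XML and returns one field map per element, in document order. */
    static List<Map<String, String>> extract(String docId, String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        List<Map<String, String>> records = new ArrayList<>();
        walk(root, docId, "1", "", records);
        return records;
    }

    private static void walk(Element el, String docId, String location,
                             String ancestry, List<Map<String, String>> out) {
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("docid", docId);          // ID of the containing document
        rec.put("location", location);    // hypothetical scheme: "1.2.1" = 1st child of 2nd child of root
        rec.put("tagname", el.getTagName());
        rec.put("ancestry", ancestry);    // space-separated ancestor tag names, root first

        // Attribute name list, plus one field per attribute value.
        StringBuilder attNames = new StringBuilder();
        NamedNodeMap atts = el.getAttributes();
        for (int i = 0; i < atts.getLength(); i++) {
            Attr a = (Attr) atts.item(i);
            if (attNames.length() > 0) attNames.append(' ');
            attNames.append(a.getName());
            rec.put("att_" + a.getName(), a.getValue());
        }
        rec.put("attnames", attNames.toString());

        // Add the record before recursing so the list stays in document order.
        out.add(rec);

        String childAncestry = ancestry.isEmpty()
                ? el.getTagName() : ancestry + " " + el.getTagName();
        StringBuilder text = new StringBuilder();
        int childIndex = 0;
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            short type = n.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                // Directly contained text only; descendants get their own records.
                text.append(n.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                childIndex++;
                walk((Element) n, docId, location + "." + childIndex, childAncestry, out);
            }
        }
        rec.put("text", text.toString().trim());
    }
}
```

Calling ElementRecords.extract("hamlet.xml", xmlString) gives you the list
of per-element records for that one XML document.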

We also create a single Lucene document that indexes the entire PCDATA
content of the XML document.

By this means we can do searches based on tag name, ancestry, attribute
values, etc. The hit list that comes back can then be organized by
document ID to group together all the hits for a single XML document. 
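
The grouping step can be sketched as follows. This is only an illustration:
the Hit class is a hypothetical stand-in for whatever you pull out of each
Lucene hit (here just the document ID and tree location fields), and no
Lucene calls are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HitGrouping {

    /** Hypothetical stand-in for one hit on an element-level Lucene document. */
    static final class Hit {
        final String docId;     // ID of the containing XML document
        final String location;  // tree location of the matching element

        Hit(String docId, String location) {
            this.docId = docId;
            this.location = location;
        }
    }

    /**
     * Groups hits by containing XML document. Groups appear in the order
     * their first hit was seen, and hits keep their original order within
     * each group.
     */
    static Map<String, List<Hit>> groupByDocId(List<Hit> hits) {
        Map<String, List<Hit>> grouped = new LinkedHashMap<>();
        for (Hit h : hits) {
            grouped.computeIfAbsent(h.docId, k -> new ArrayList<>()).add(h);
        }
        return grouped;
    }
}
```

If the incoming hit list is sorted by relevance, the LinkedHashMap keeps
the XML document with the best-ranked hit first.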

The above list of properties is the minimal, default set. For special
applications you could, of course, index other things. For example, you
could index elements in terms of their mappings to some higher-level
data structure rather than just their syntactic structure. It's just a
matter of writing a different indexing process.

We found it easy to write the indexer that does the above using normal
DOM processing. XML parsing and tree walking are such a small part of
the total indexing time that there was no point in stepping up to
SAX-based processing. For example, on my 600 MHz laptop running Win2K
and using the Xerces DOM implementation and parser, it takes about 50
seconds to index Hamlet (from the Jon Bosak Shakespeare collection), of
which 2.5 seconds is DOM processing and the rest is Lucene document
creation (in that example, we create about 15,000 Lucene documents).

This approach can be applied to XML databases as long as you have a way
to get at the XML documents or individual elements.

We've posted sample code, but I can't remember the exact location at
the moment and I don't have a convenient way to look it up from home.

Cheers,

Eliot Kimber
Consultant
ISOGEN International, LLC

