> Harun Altay wrote:
>
> Hello Friends,
>
> I need to search on BOTH --> (1) "XML" data and (2) "Text" data.

We have developed a method for doing this that we posted several months ago. The basic idea is fairly simple: for each element node in a given XML document, you create a single Lucene document that captures at least the following information:

- Document ID of the containing document (e.g., filename, URL, or database OID)
- Tree location
- Tag name
- Ancestry (list of ancestor tag names)
- Directly contained text, if any
- List of attribute names
- For each attribute, a field that indexes the value of that attribute

We also create a single Lucene document that indexes the entire PCDATA content of the XML document.

By this means we can do searches based on tag name, ancestry, attribute values, and so on. The hit list that comes back can then be organized by document ID to group together all the hits for a single XML document.

The above list of properties is the minimal, default set. For special applications you could, of course, index other things. For example, you could index elements in terms of their mappings to some higher-level data structure rather than just their syntactic structure. It's just a matter of writing a different indexing process.

We found it easy to write the indexer that does the above using normal DOM processing. XML parsing and tree walking turned out to be such a small part of the total indexing time that there was no point in moving to SAX-based processing. For example, on my 600 MHz laptop running Win2K and using the Xerces DOM implementation and parser, it takes about 50 seconds to index Hamlet (from the Jon Bosak Shakespeare collection), of which 2.5 seconds is DOM processing and the rest is Lucene document creation (in that example, we create about 15,000 documents).

This approach can be applied to XML databases as long as you have a way to get at the XML documents or individual elements.

We've posted sample code, but I can't remember the exact location at the moment and I don't have a convenient way to find out from home.
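
To make the per-element records described above concrete, here is a minimal sketch in Python using the standard xml.dom.minidom module rather than Xerces and Lucene. It is not the posted sample code. Plain dictionaries stand in for Lucene documents, and the field names (docid, treeloc, tagname, ancestors, text, attnames, content), the att_ prefix, and both function names are assumptions chosen for illustration. The lookup is an exact string match on one field; a real Lucene index would tokenize and rank instead.

```python
from collections import defaultdict
from xml.dom import minidom, Node

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def text_pieces(node):
    """Yield every non-blank text chunk at or below the given node."""
    if node.nodeType in TEXT_TYPES:
        if node.data.strip():
            yield node.data.strip()
    else:
        for child in node.childNodes:
            yield from text_pieces(child)


def element_records(doc_id, xml_text):
    """Return one record per element, followed by one whole-document record."""
    dom = minidom.parseString(xml_text)
    records = []

    def walk(elem, ancestors, location):
        # Text held directly by this element, not by its descendants.
        direct_text = "".join(
            child.data for child in elem.childNodes
            if child.nodeType in TEXT_TYPES
        ).strip()
        attrs = dict(elem.attributes.items())
        record = {
            "docid": doc_id,
            # Tree location as a path of 1-based child-element positions.
            "treeloc": "/".join(str(i) for i in location),
            "tagname": elem.tagName,
            "ancestors": " ".join(ancestors),
            "text": direct_text,
            "attnames": " ".join(sorted(attrs)),
        }
        # One extra field per attribute, holding that attribute's value.
        for name, value in attrs.items():
            record["att_" + name] = value
        records.append(record)

        child_elems = [c for c in elem.childNodes
                       if c.nodeType == Node.ELEMENT_NODE]
        for index, child in enumerate(child_elems, start=1):
            walk(child, ancestors + [elem.tagName], location + [index])

    walk(dom.documentElement, [], [1])
    # The single record covering all character data in the document.
    records.append({
        "docid": doc_id,
        "content": " ".join(text_pieces(dom.documentElement)),
    })
    return records


def group_hits_by_document(records, field, value):
    """Group the records whose field equals value under their document ID."""
    hits = defaultdict(list)
    for record in records:
        if record.get(field) == value:
            hits[record["docid"]].append(record)
    return dict(hits)
```

Calling element_records once per XML document and concatenating the results gives a list that group_hits_by_document can search, for example by "tagname" or by a hypothetical "att_id" field, with the matches for each document collected under its document ID.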

Cheers,

Eliot Kimber
Consultant
ISOGEN International, LCC