Hi We transformed our SGML documents to XML with sgml2xml (part of fedora/redhat linux distribution) and processed XML not SGML. We also use a very very very simple "SAX-like SGML parser" (suitable for our SGML documents; not a general purpose SGML parser) to extract data from SGML. But we use XML for indexing.
Best, czinkos On Thu, Dec 02, 2004 at 05:59:35AM -0500, Erik Hatcher wrote: > On Dec 1, 2004, at 3:43 PM, DES wrote: > >i've got to index some SGML documents and need help or some > >expiriences with it. > >Documents contain a number of different articles with the same > >structure (TITLE, AUTHOR, DATA etc.), but i need to index each article > >as a different document in my lucene index. Is there some reader for > >SGML? I can create different documents, but how can i relate this > >index-documents to my articles within SGML files? > > I don't have SGML parsing experience, so I can't help with that detail. > However, I'm doing something quite similar in my current project (once > it goes live I'll send pointers to it). I've got a directory of XML > files. Each of these XML files has sections. These sections are what > the user wants to search on, so each one of those corresponds to a > Lucene Document. There are a lot more "documents" indexed than there > are XML files. These sections have a unique identifier. I store the > filename and section identifier with each document, allowing me to link > back to the original precisely. > > You'll need to identify the unique values in your data that allow you > to relate back to the original files - perhaps using filename as one > stored field. When storing filenames, be sure to account for the > situation where the original source moves location - I store relative > paths from a base directory, so that the base directory can move > without having to re-index to stay in sync. > > Erik > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
