On Dec 1, 2004, at 3:43 PM, DES wrote:
I've got to index some SGML documents and need help, or to hear about others' experiences with it.
The documents contain a number of different articles with the same structure (TITLE, AUTHOR, DATA etc.), but I need to index each article as a separate document in my Lucene index. Is there a reader for SGML? I can create the separate documents, but how can I relate these index documents back to the articles within the SGML files?

I don't have SGML parsing experience, so I can't help with that detail. However, I'm doing something quite similar in my current project (once it goes live I'll send pointers to it). I've got a directory of XML files. Each of these XML files has sections. These sections are what the user wants to search on, so each one of those corresponds to a Lucene Document. There are a lot more "documents" indexed than there are XML files. These sections have a unique identifier. I store the filename and section identifier with each document, allowing me to link back to the original precisely.
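
To make the one-document-per-section idea concrete, here is a rough sketch in plain Java with no Lucene calls at all. Each field map stands in for a Lucene Document. The class name SectionRecords and the field names "file", "sectionId" and "contents" are assumptions chosen for illustration, not names from my project, and the parsing step that produces the sections is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns the sections of ONE source file into one
// field map per section. Each map stands in for a Lucene Document.
class SectionRecords {

    // sectionsById maps a section's unique identifier to its text.
    // The "file" and "sectionId" fields are the ones to store in the
    // index so a search hit can be traced back to its source.
    static List<Map<String, String>> toRecords(String relativeFile,
                                               Map<String, String> sectionsById) {
        List<Map<String, String>> records = new ArrayList<>();
        for (Map.Entry<String, String> section : sectionsById.entrySet()) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("file", relativeFile);          // which file it came from
            fields.put("sectionId", section.getKey()); // which section in that file
            fields.put("contents", section.getValue()); // the searchable text
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample data: two sections from a single file.
        Map<String, String> sections = new LinkedHashMap<>();
        sections.put("a1", "First article text");
        sections.put("a2", "Second article text");
        for (Map<String, String> record : toRecords("journals/vol1.xml", sections)) {
            System.out.println(record.get("file") + "#" + record.get("sectionId"));
        }
    }
}
```

With real Lucene code, each map would become a Document whose "file" and "sectionId" values are stored fields, so they can be read back from a hit.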


You'll need to identify the unique values in your data that allow you to relate back to the original files - perhaps using filename as one stored field. When storing filenames, be sure to account for the situation where the original source moves location - I store relative paths from a base directory, so that the base directory can move without having to re-index to stay in sync.
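
The relative-path idea can be sketched with the JDK's java.nio.file classes. This is only an illustration under assumptions: the class name SourceLocator, its method names, and the example directories are all hypothetical, and the stored value is normalized to forward slashes so it reads the same on any platform.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for storing source locations relative to a base
// directory, so the base can move without re-indexing.
class SourceLocator {

    // Returns the value to store in the index: the source file's path
    // relative to baseDir, using forward slashes.
    static String toStoredPath(Path baseDir, Path sourceFile) {
        Path base = baseDir.toAbsolutePath().normalize();
        Path file = sourceFile.toAbsolutePath().normalize();
        return base.relativize(file).toString().replace('\\', '/');
    }

    // At search time, rebuilds the full path from wherever the base
    // directory lives now.
    static Path resolve(Path currentBaseDir, String storedPath) {
        return currentBaseDir.resolve(storedPath).normalize();
    }

    public static void main(String[] args) {
        // Made-up locations: indexed under one base, looked up under another.
        String stored = toStoredPath(Paths.get("/data/sgml"),
                                     Paths.get("/data/sgml/journals/vol1.sgml"));
        System.out.println(stored);
        System.out.println(resolve(Paths.get("/mnt/archive"), stored));
    }
}
```

Only the base directory passed to resolve changes when the files move; the values already stored in the index stay valid.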

        Erik

