Some months ago I created an index from the Reuters collection. I converted the SGML files to XML using a tool that I found somewhere on the net (just google for it), then parsed the files with a standard DOM parser to create the index. If you have problems parsing the SGML files, I think you should consider converting them to XML first. Otherwise, post a sketch of your indexing code to get some help.
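
A minimal sketch of the DOM parsing step described above, in Python for brevity (the actual indexing would go through Lucene's Java API, which is not shown here). It assumes the converted XML wraps the articles in a single root element (hypothetically named COLLECTION here) and keeps the REUTERS, TITLE and BODY element names and the NEWID attribute from the original SGML. The function names are illustrative only.

```python
from xml.dom import minidom


def text_of(parent, tag):
    """Return the stripped text of the first <tag> under parent, or ''."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes:
        return ""
    return "".join(
        child.data
        for child in nodes[0].childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    ).strip()


def extract_articles(xml_text):
    """Parse converted Reuters XML and return one dict per article."""
    dom = minidom.parseString(xml_text)
    articles = []
    for node in dom.getElementsByTagName("REUTERS"):
        articles.append({
            "id": node.getAttribute("NEWID"),
            "title": text_of(node, "TITLE"),
            "body": text_of(node, "BODY"),
        })
    return articles


# Tiny hand-written sample; the COLLECTION wrapper is an assumption about
# what the SGML-to-XML conversion produces, not part of the original data.
SAMPLE = """<COLLECTION>
<REUTERS NEWID="1"><TEXT><TITLE>First title</TITLE><BODY>First body text.</BODY></TEXT></REUTERS>
<REUTERS NEWID="2"><TEXT><TITLE>Second title</TITLE></TEXT></REUTERS>
</COLLECTION>"""

if __name__ == "__main__":
    for article in extract_articles(SAMPLE):
        # Report articles whose body came back empty before indexing them.
        if not article["body"]:
            print("warning: empty body for article", article["id"])
        print(article)
```

Printing the extracted fields before handing them to the indexer is one way to tell whether an empty index comes from the parsing step or from the indexing step.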

Lorenzo

On 4/21/06, Malcolm Clark <[EMAIL PROTECTED]> wrote:
>
> Hi all,
> I didn't know whether to add this to the thread asking about TREC indexing
> or start a new one.
> Anyway, has anyone attempted to index/search the Reuters collection which
> consists of SGML?
> Mine seems to run through the process okay but alas I'm left with nothing
> in the index when I check with Luke or my own Search Engine.
> Anyone got any hints (apart from don't do it)?
> cheers,
> MC
>