On Fri, 02 Jun 2006 00:33:23 +0100, Matthew Toseland wrote:

> Firstly, why do we need two index formats? I'm the first to admit that the
> current Librarian index format is limited - way too limited - but why do
> we need two? The main changes I would make to the librarian format right
> now would be:
> - Support splitting. (This is relevant to file indexes)
> - Include word indexes to allow for adjacent word searches. (This is
>   relevant to file indexes too, because you may want to search for
>   adjacent words in a title).
> - Maybe include some amount of metadata - functional (mime type), or
>   theoretical (category, dublin core...), or other (activelinks?). (This
>   is definitely relevant to file indexes).
> - Include the filename in the index. Possibly using negative word
>   indexes to indicate "in the filename" words; it must be possible to
>   distinguish between matches in the page title and matches in the
>   content. (This is also relevant to both web page indexes and file
>   indexes, though especially to the latter).
>
> I am quite happy to change the format. Indeed it needs significant
> changes.
>
> Indexes, like all files, are automatically compressed, so don't worry too
> much about it being overly verbose.
>
> Now, you are proposing additional fields: firstly, the size of the content
> (this isn't especially relevant to web page indexes), and the length of
> the file if it is audio or video. Both are perfectly reasonable extensions
> IMHO. If we are going to support metadata we should support a range of
> metadata; we will need support for a category, (probably tied to a
> specific site), at least, and this is a very woolly and arbitrary thing.
>
> An explicit aim of your index format is to be able to index the contents
> of text-based files by words. This is a good thing, but if you are going
> to do that, then you should define a format, (preferably with some of the
> details of splitting indexes worked out), and make Librarian and Spider
> use it. Metadata can be shown next to matches, or it can be used to narrow
> down searches.
>
> And I honestly don't care whether it is XML. I see no reason to take
> strenuous efforts to keep back compatibility, but filters can be written
> easily enough if need be.
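
To make the quoted word-index idea a bit more concrete, here is a rough sketch of how positional entries could support adjacent word searches, with negative positions marking "in the filename" words. Everything in it is an assumption for illustration: the in-memory layout, the names `build_index` and `adjacent`, and the choice of numbering filename words -1, -2, ... and content words 1, 2, ... are hypothetical and not part of the Librarian format or any agreed proposal.

```python
from collections import defaultdict


def build_index(docs):
    """Build a positional index (hypothetical layout).

    docs maps doc_id -> (filename_words, content_words).
    The result maps word -> list of (doc_id, position).
    """
    index = defaultdict(list)
    for doc_id, (filename_words, content_words) in docs.items():
        # Assumption: filename words get negative positions -1, -2, ...
        for i, word in enumerate(filename_words, start=1):
            index[word.lower()].append((doc_id, -i))
        # Assumption: content words get positive positions 1, 2, ...
        for i, word in enumerate(content_words, start=1):
            index[word.lower()].append((doc_id, i))
    return index


def adjacent(index, first, second):
    """Return ids of documents where `second` directly follows `first`
    within the same field (filename or content)."""
    second_hits = set(index.get(second.lower(), []))
    hits = set()
    for doc_id, pos in index.get(first.lower(), []):
        # Content positions count up, filename positions count down,
        # so a match can never straddle the two fields.
        following = pos + 1 if pos > 0 else pos - 1
        if (doc_id, following) in second_hits:
            hits.add(doc_id)
    return hits
```

Because position 0 is never used, the sign of a position is enough to tell a filename match from a content match; a separate range or flag would be needed if page titles must be distinguished as well.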
I recall Frost having a nasty bug that caused it to crash whenever it encountered a message malformed in a special way due to the parser not handling error cases correctly. Using XML allows one to use existing XML libraries for parsing instead of having to write a new parser, making it much less likely that such unpleasantness occurs again. This is especially important for non-Java programs, since they can easily develop far more serious symptoms than simply crashing. It also allows for trivial backwards-compatible extension: simply state that a program should ignore all tags and attributes it doesn't understand, and you can extend the format as needed while the old programs will still keep on working.
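
As a rough illustration of the "ignore what you don't understand" rule, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`index`, `file`, `word`, `name`, `size`, `mime`, `duration`, `category`) are made up for the example and are not a proposed schema; the point is only that a reader written against the old format keeps working when newer fields appear, and that malformed input surfaces as an ordinary library exception rather than a hand-rolled parser bug.

```python
import xml.etree.ElementTree as ET


def parse_index(xml_text):
    """Read only the parts of a (hypothetical) index this reader knows about.

    Unknown elements and attributes are never looked at, so they are
    silently ignored. Malformed XML raises ET.ParseError.
    """
    root = ET.fromstring(xml_text)
    files = []
    for f in root.findall("file"):
        files.append({
            "name": f.get("name"),
            "words": [w.text for w in f.findall("word")],
        })
    return files


# An index in the "old" hypothetical format.
OLD = '<index><file name="a.txt"><word>freenet</word></file></index>'

# The same index with later extensions: extra attributes and extra tags.
NEW = (
    '<index version="2">'
    '<file name="a.txt" size="123" mime="text/plain">'
    '<word>freenet</word><duration>90</duration>'
    '</file>'
    '<category>docs</category>'
    '</index>'
)

# The old reader returns the same result for both documents.
old_result = parse_index(OLD)
new_result = parse_index(NEW)
```

A truncated or otherwise broken document, for example `parse_index("<index><file")`, raises `ET.ParseError`, which the caller can catch and treat as a bad index.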
