On Thu, 2005-04-07 at 17:41 +0100, Jamie McCracken wrote:
> > Let me illustrate with an example:
> > "To index a 1 gigabyte file, do I need 1 gigabyte of memory?"
> > Clearly if your answer is `yes', then you are not the most astute
> > programmer, nor the sharpest knife in the drawer.
>
> No but depending on how its implemented you still have to filter the
> file into plain text and then generate a unique word list from it. This
> word list can potentially be quite large for large files and would
> occupy a fair amount of memory.

My lq-text package can index multiple gigabytes of text without needing
to have all the words from any file in memory at any one time, and it
was written (mostly) in 1989, so is hardly new technology. The
algorithms have been published and the code is available.

I do have a limitation that you need to be able to fit all occurrences
of a single word in memory during indexing (although not during
retrieval), so if the record for "the" doesn't fit, you may have to
resort to using a stopword. Zipf's law applies remarkably well, so it's
very rare to need more than a few stopwords even on small systems.

Merely recording which words are in which files leads to what the
information retrieval researchers call low precision -- if you're
searching for the New York Times you don't want the times that there
was news about York Minster in England.

The more documents you have, the more you need search services, and the
more you need high precision. Note that Google is also subtly sensitive
to word order, and can match phrases.

Arguments about technology really ought to come after arguments about
use cases and needs.

It might be that there is a lot of merit in integrating some sort of
indexing framework into the desktop -- indexing services sometimes work
best if they are told *before* a file is deleted or renamed, for
example, so they can "unindex" it efficiently.

An API for this might benefit other applications, especially if it
helps people to find out "which application made this file and why".
"This data file is needed by the game of Empire you've been running for
12 years. If you delete it, your game will be lost. Continue?" is
clearer than "really delete emp3016.dat?"

So I think there might be useful things to consider, but at the
interoperability level, not at the specific implementation level.

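To make the precision point above concrete, here is a minimal sketch in
Python of the difference between recording only which words are in
which files and recording word positions as well. It is not lq-text's
code or its published algorithm; every name in it (tokens, build_index,
all_words_search, phrase_search) and the two sample documents are
invented for illustration. The input text is read one line at a time,
but for brevity the index itself is held in memory, which a real
indexer working on gigabytes of text would avoid by writing postings
out to disk.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def tokens(lines):
    """Yield (position, word) pairs from an iterable of lines, one at a time."""
    pos = 0
    for line in lines:
        for match in WORD.finditer(line.lower()):
            yield pos, match.group()
            pos += 1


def build_index(docs):
    """Build word -> document name -> list of word positions.

    docs maps a (hypothetical) document name to an iterable of lines,
    such as an open file, so the text is never loaded whole.
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, lines in docs.items():
        for pos, word in tokens(lines):
            index[word][name].append(pos)
    return index


def all_words_search(index, query):
    """Low precision: documents containing every query word, in any order."""
    words = WORD.findall(query.lower())
    if not words or any(w not in index for w in words):
        return []
    found = set(index[words[0]])
    for w in words[1:]:
        found &= set(index[w])
    return sorted(found)


def phrase_search(index, phrase):
    """Higher precision: documents where the words occur consecutively."""
    words = WORD.findall(phrase.lower())
    hits = []
    for name in all_words_search(index, phrase):
        later = [set(index[w][name]) for w in words[1:]]
        starts = index[words[0]][name]
        # A match needs word i+1 of the phrase at position start + i + 1.
        if any(all(s + i + 1 in p for i, p in enumerate(later)) for s in starts):
            hits.append(name)
    return hits


docs = {
    "paper.txt": ["I read the New York Times today."],
    "minster.txt": ["News from York: in recent times the Minster",
                    "got a new roof."],
}
index = build_index(docs)
print(all_words_search(index, "New York Times"))  # both documents
print(phrase_search(index, "New York Times"))     # only paper.txt
```

The word-only search returns the York Minster document as well, because
"new", "york" and "times" all occur somewhere in it; the positional
check is what rules it out.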
Liam

-- 
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin
Pictures from old books: http://www.holoweb.net/~liam/pictures/oldbooks/
IRC (chat) programs: www.ircreviews.org/clients/
_______________________________________________
gnome-devel-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/gnome-devel-list
