OK, I haven't heard any comments on the requirements, so I'm going to consider this a closed deal. I'm including some comments on what was listed. I have some good ideas on implementation of everything here, but I'm sure there may be better ones. In other words, I don't know what *should* come after a list of requirements and I'm sure Andrew will fill us in. In the meantime, I'm going to start trying stuff out. :-)

Backend Requirements:
(numbered for easy reference, no particular order)

1. Phrase searching
2. Fuzzy searching (basically as it is now)
3. Use of "+" or "-" as prefix to search words (a la AltaVista)
4. Use of "near" as a method to determine relations between search words
5. Cross platform (Unix, NT)
6. Ability to search only in specific areas of documents (title, headers, etc.)
7. Better relevance ranking
8. Faster results generation for searches returning many hits
9. Collections of databases
10. Parallel indexing and searching (no need for alternate files or htmerge)
11. Multithreading support (some sort of locking for writes)
12. Detection and/or removal of duplicate documents
13. Referer links (e.g. AltaVista-style link:)
14. Search for "more like" or "similar to" (a la Excite)
15. On-the-fly editing of search factors (without needing to rebuild the db)
16. Flexible backend (use Berkeley DB, *SQL, Oracle, etc.)
17. Internationalization (e.g. Chinese support, probably through Unicode)

Notes:

3) I think this can be done in the parser. I'm going to try a naive hack and see if it works.
6) I think we can do this and #15 by using flags for the markup on a word. Scoring might be a bit slower than currently, but the flexibility gain is huge.
7) This actually goes with the above. If you can change the factors on-the-fly, it's easier to get a better ranking function. BTW, I may be doing some statistical analysis on my site later this semester. :-)
9) Simple looping. Slows things down, but eliminates requirements of merging.
10) Andrew gave me the key on this--Berkeley DB allows duplicate keys. So the word database would have multiple word entries, rather than a list.
13) Rather than storing the seldom-used (ever-used?) DocDescriptions field, which is a list of strings, and the DocBacklink *count*, we could store a list of DocIDs for backlinks...

Thoughts? Rants?

-Geoff
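To make notes 6, 7 and 15 a bit more concrete, here is a rough sketch of the "flags for the markup on a word" idea. It is only an illustration in Python, not ht://Dig code: the `Markup` flag names, the `(doc_id, flags)` posting layout, and the `score` and `restrict` helpers are all made up for the example. The point is that the weights live outside the index, so they can be changed per query without rebuilding anything.

```python
from enum import IntFlag


class Markup(IntFlag):
    """Hypothetical flags recording where a word occurred in a document."""
    TEXT = 1
    TITLE = 2
    HEADING = 4
    KEYWORDS = 8


def score(postings, weights):
    """Sum per-document scores from (doc_id, flags) postings.

    `weights` maps a Markup flag to a factor. It is supplied at query
    time, so tuning it does not touch the stored postings.
    """
    totals = {}
    for doc_id, flags in postings:
        # Add the weight of every markup flag set on this occurrence.
        hit = sum(w for flag, w in weights.items() if flags & flag)
        totals[doc_id] = totals.get(doc_id, 0) + hit
    return totals


def restrict(postings, area):
    """Keep only occurrences found in the given area (requirement 6)."""
    return [(doc_id, flags) for doc_id, flags in postings if flags & area]


# Assumed sample data: occurrences of one search word.
postings = [
    (1, Markup.TITLE | Markup.TEXT),
    (2, Markup.TEXT),
    (1, Markup.HEADING),
]
weights = {Markup.TEXT: 1, Markup.TITLE: 5, Markup.HEADING: 3}
```

Calling `score(postings, weights)` and then again with a larger `Markup.TITLE` factor re-ranks the same postings, which is the on-the-fly part. The cost is the extra loop over flags for every occurrence, which matches the "a bit slower" caveat in note 6.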

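Notes 9 and 10 can be sketched the same way. The class below is an in-memory stand-in for a store that allows duplicate keys. It does not use the real Berkeley DB API, and `DupKeyStore`, `put`, `get_all` and `search_collections` are names invented for the example. Each (word, DocID) pair is its own record, so an indexer only ever adds records and never has to read, extend and rewrite a per-word list.

```python
import bisect


class DupKeyStore:
    """Toy sorted store that permits duplicate keys (assumption: this
    mimics the behaviour described in note 10, not any real library)."""

    def __init__(self):
        self._records = []  # kept sorted as (word, doc_id) tuples

    def put(self, word, doc_id):
        # One record per occurrence; existing records are left untouched.
        bisect.insort(self._records, (word, doc_id))

    def get_all(self, word):
        # Jump to the first record for `word`, then walk its duplicates.
        start = bisect.bisect_left(self._records, (word,))
        doc_ids = []
        for key, doc_id in self._records[start:]:
            if key != word:
                break
            doc_ids.append(doc_id)
        return doc_ids


def search_collections(stores, word):
    """Note 9: loop over every database in a collection, no merge step."""
    hits = []
    for name, store in stores.items():
        for doc_id in store.get_all(word):
            hits.append((name, doc_id))
    return hits


# Assumed sample data: two separately built databases.
site_a = DupKeyStore()
site_a.put("dig", 7)
site_a.put("dig", 3)
site_a.put("dog", 1)

site_b = DupKeyStore()
site_b.put("dig", 42)

stores = {"a": site_a, "b": site_b}
```

Here `search_collections(stores, "dig")` visits each database in turn, so the cost grows with the number of databases, but nothing like htmerge is needed to combine them first.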