According to Ravi Starzl:
> I have been working with htDig for almost a year now, and see a dramatic
> difference in the performance between version 3.1.x and 3.2.x. I am
> currently forced to use 3.2.x, because phrase searching is a necessity
> for the projects that I work on, however, the performance handicaps of
> 3.2.x (indexing times and htmerge instability) make using it very time
> consuming for large-scale indexes (over 50,000 documents). How
> difficult or how far away is the addition of phrase searching to the
> stable release? That is the key feature for several people I have talked
> to that would be the biggest improvement in the stable release. I've
> worked somewhat-extensively with other information retrieval systems -
> would it be easier to use the phrase search code of another public
> system as a template for implementing phrase capability?
>
> I would love to contribute to the development effort if I could get some
> specific direction on what would need changing in htDig to enable
> phrases.

I think the question would more appropriately be posed as "how far away
is the 3.2.x code from being stable?" Backporting phrase matching to
the 3.1.x branch is absolutely out of the question. In order to support
phrase matching, the database structure had to be completely revamped in
the transition from 3.1 to 3.2. Much of the inefficiency in 3.2 is a
direct result of those changes.

So, if you'd like to contribute to 3.2 development, the most pressing
need is to merge in the latest mifluz code, which supports the new word
database format. That should take care of some of the instability and
probably some of the inefficiency. Next would be to optimize htmerge's
handling of the wordlist, using the new database walking capabilities in
the latest mifluz, so that it doesn't try to store the whole database in
memory at once (that's the main reason htmerge is unstable right now).
Finally, we need to port many of 3.1.6's new features over to 3.2, and
resume work on the to-do list for 3.2.

Probably the best way to get up to speed on all this would be to study
the latest 3.2.0b4 development snapshot to see how things work right
now, and to review the htdig-dev mailing list archives to see what
issues the developers have discussed recently.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930