I'm keeping this on-list so it's archived. I'm far from the only active
member of the group (especially judging by the number of CVS commits...).

On Tue, 8 Jan 2002, Neal Richter wrote:

> > implemented regardless of when I have time to finish wiring in Quim's new
> > htsearch framework.
>
> Need any help? I'd be glad to throw in with you.. this would be a
> useful feature for us by allowing some databasish type functionality.

If this is what you want, I'd suggest working on the indexing side.
Someone needs to propose how users could configure specific metadata to be
wired into the unused word index "flags." Only 8 of the 32 bits are
currently defined, for things like Titles, Headings, Keywords, etc. (See
htcommon/HtWordReference.h.) So if a user is defining custom data types,
they may want to "alias" these slots, or use some of the remaining unused
bits. The parsers will need to be changed to treat these user-defined word
flags appropriately.

And this is all assuming that there's a global allocation of the flags.
This should probably be more fine-grained (e.g. some documents have DTDs
which suggest certain metadata or XML tags...), but then something needs
to be stored in the document DB.

So if this is interesting to you, think about it, work out a proposal for
how you think it might best be implemented, and write to the list--we'll
give feedback and go from there.

I'll definitely let everyone know when the htsearch code is ready. Expect
some things to break at first, and I'm sure there will be bugs.

> > While the key documentation (attrs.html, cf_byname.html, etc.) is updated
> > as appropriate patches are made, files such as main.html, RELEASE.html,
> > TODO.html, etc. are only updated just before a release.
>
> I was mostly curious about what things are going to be included in
> the final 3.2. I can't seem to find anything that clearly defines what
> will be included in 3.2 or when the release might happen.

The release will happen when it's ready. The two main "chunks" that
absolutely, positively need to be finished are a sync with the current
mifluz code and the new htsearch framework.
Beyond that, many things are up in the air--if people come along to do
things like Unicode and they can be tested thoroughly, great. If not, they
get pushed back.

> Do you have anything to add to Gilles' answer to Question 3 (Unicode)? I
> think he may have missed the point. At some level the engine must support
> multi-byte matching or the engine wouldn't work.
>
> One commercial company's search engine claims to be 'binary' at the core,
> so Unicode is basically a higher-level API code problem... not a problem
> with the base functionality of the index.

At the moment, there is support only for 8-bit character chunks. Gilles
told you the current state: if you have an 8-bit locale that supports the
character set you want, you're probably going to be OK. Significant parts
of the code rely on the 8-bit char assumption, which means that there will
be some rewriting work beyond just the backend.

> Anything you can add to enlighten me on the basics of how the DB &
> Matching algorithms work at a high level would be useful. Ultimately, I
> think I'm going to have to do lots of code-reading with our Unicode GURU
> to figure this out.

The word database is currently keyed by the word itself--though this will
change when the new mifluz code is imported. (The new code assigns word
IDs, AFAICT.) In either case, I don't think it matters much whether it's
8-bit characters or not--the database treats it as a binary key anyway.

My guess is that the biggest challenges for moving to Unicode are going to
be removing the 8-bit char assumption and parsing "words." Updating the
String class to be Unicode-aware (perhaps via UTF-8) is probably going to
help a lot with the first problem.

-Geoff

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
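
To make the point about binary keys and the 8-bit char assumption a bit
more concrete, here is a minimal sketch. It is not ht://Dig code: WordIndex,
add_word and utf8_length are hypothetical names made up for illustration,
a std::map stands in for the real word database, and it assumes words are
stored as UTF-8.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not part of ht://Dig): count UTF-8 code points by
// skipping continuation bytes, which always have the form 10xxxxxx.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical stand-in for a word index keyed by the raw bytes of the
// word. std::string compares byte by byte, so UTF-8 keys need no special
// handling at this level.
using WordIndex = std::map<std::string, int>;

// Record one more occurrence of a word and return its running count.
int add_word(WordIndex& index, const std::string& word) {
    return ++index[word];
}
```

With the word "naïve" encoded as UTF-8 ("na\xC3\xAFve"), the key works
fine in the map, but size() reports 6 bytes while utf8_length() reports 5
characters. Storage and lookup don't care about the encoding; any code
that assumes one char equals one character (lengths, case folding,
deciding where a word ends) is where the rewriting work would be.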
