Re: [htdig-dev] Incremental Index Efficiency, Unicode, & 2GIG limit

Geoff Hutchison Tue, 08 Jan 2002 20:50:28 -0800


I'm keeping this on-list so it's archived. I'm far from the only active
member in the group (especially judging by the number of CVS commits...).

On Tue, 8 Jan 2002, Neal Richter wrote:

> > implemented regardless of when I have time to finish wiring in Quim's new
> > htsearch framework.
> 
>       Need any help?  I'd be glad to throw in with you.. this would be a
> useful feature for us by allowing some databasish type functionality.

If this is what you want, I'd suggest working on the indexing
side. Someone needs to propose how users could configure specific metadata
to be wired into the unused word index "flags." There are 8 of 32 bits
currently defined for things like Titles, Headings, Keywords, etc. (See
htcommon/HtWordReference.h) So if a user is defining custom data-types,
then they may want to "alias" these slots, or use some of the remaining
unused bits.
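
As a rough illustration of the "alias or allocate" idea above, here is a
minimal C++ sketch. The flag names, the FlagRegistry class, and the
assumption that the 8 defined flags occupy bits 0-7 are all hypothetical;
the real definitions are in htcommon/HtWordReference.h and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for built-in flags. The real names and values
// are defined in htcommon/HtWordReference.h and may differ.
constexpr std::uint32_t kFlagTitle    = 1u << 0;
constexpr std::uint32_t kFlagHeading  = 1u << 1;
constexpr std::uint32_t kFlagKeywords = 1u << 2;

// Maps user-configured metadata names to one bit of a 32-bit flags word.
class FlagRegistry {
public:
    // Alias a user-defined name onto an already defined flag.
    void alias(const std::string& name, std::uint32_t existing) {
        assert(existing != 0 && "alias target must be a defined flag");
        flags_[name] = existing;
    }

    // Claim the next unused bit. Returns 0 once all 32 bits are taken.
    std::uint32_t allocate(const std::string& name) {
        if (next_bit_ >= 32) return 0;
        const std::uint32_t flag = 1u << next_bit_++;
        flags_[name] = flag;
        return flag;
    }

    // Look up a configured name. Returns 0 if the name is unknown.
    std::uint32_t lookup(const std::string& name) const {
        const auto it = flags_.find(name);
        return it == flags_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, std::uint32_t> flags_;
    unsigned next_bit_ = 8;  // assumption: bits 0-7 hold the 8 defined flags
};
```

A parser could then call lookup() with the metadata name it encounters and
OR the result into the word's flags. This sketch uses one global registry,
so it does not address per-document allocation.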

The parsers will need to be changed to treat these user-defined word flags
appropriately. And this is all assuming that there's a global allocation
of the flags. This should probably be more fine-grained (e.g. some
documents have DTDs which suggest certain meta-data or XML tags...), but
then something needs to be stored in the document DB.

So if this is interesting to you, think about it, work out a proposal of
how you think it might best be implemented and write to the list--we'll
give feedback and go from there.

I'll definitely let everyone know when the htsearch code is ready. In
particular, some things may break and I'm sure there will be bugs.

> > While the key documentation (attrs.html, cf_byname.html, etc.) is updated
> > as appropriate patches are made, files such as main.html, RELEASE.html,
> > TODO.html, etc. are only updated just before a release.
> 
>       I was mostly curious about what things are going to be included in
> the final 3.2.  I can't seem to find anything that clearly defines what
> will be included in 3.2 or when the release might happen.

The release will happen when it's ready. The two main "chunks" that
absolutely, positively need to be finished are a sync with the current
mifluz code and the new htsearch framework. Beyond that, many things are
up in the air--if people come along to do things like Unicode and they can
be tested thoroughly, great. If not, it moves back.

> Do you have anything to add to Gilles' answer to Question 3 (Unicode)?  I
> think he may have missed the point.  At some level the engine must support
> multi-byte matching or the engine wouldn't work.  
> 
> One commercial company's search engine claims to be 'binary' at the core,
> so Unicode is basically a higher-level API code problem... not a problem
> with the base functionality of the index.

At the moment, there is support only for 8-bit character
chunks. Gilles told you the current state: if you have an 8-bit locale
that supports the character set you want, you're probably going to be
OK. Significant parts of the code rely on the 8-bit char assumption, which
means that there will be some rewriting work beyond just the backend.

> Anything you can add to enlighten me on the basics of how the DB &
> Matching algorithms work at a high level would be useful.  Ultimately, I
> think I'm going to have to do lots of code-reading with our Unicode GURU
> to figure this out.

The word database is currently keyed by the word itself--though this will
change when the new mifluz code is imported. (This assigns word ids,
AFAICT.) In either case, I don't think it matters much whether it's 8-bit
characters or not--the database treats it as a binary key anyway.
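
To make the "binary key" point concrete, here is a small C++ sketch of a
word index keyed by raw bytes. WordIndex, add_word and find_word are
hypothetical names, and std::map is only an in-memory stand-in for the
real database; the point is just that the store never interprets the key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Order keys as raw bytes, the way a binary-keyed store would.
struct ByteLess {
    bool operator()(const std::string& a, const std::string& b) const {
        const std::size_t n = std::min(a.size(), b.size());
        const int c = std::memcmp(a.data(), b.data(), n);
        return c != 0 ? c < 0 : a.size() < b.size();
    }
};

// Hypothetical in-memory stand-in for the word database:
// key = the word's bytes, value = ids of documents containing it.
using WordIndex = std::map<std::string, std::vector<int>, ByteLess>;

// Record that doc_id contains the word; the encoding is never inspected.
void add_word(WordIndex& index, const std::string& word_bytes, int doc_id) {
    assert(!word_bytes.empty());
    index[word_bytes].push_back(doc_id);
}

// Exact byte-for-byte lookup. Returns nullptr when the key is absent.
const std::vector<int>* find_word(const WordIndex& index,
                                  const std::string& word_bytes) {
    const auto it = index.find(word_bytes);
    return it == index.end() ? nullptr : &it->second;
}
```

With this layout, a UTF-8 word and its Latin-1 spelling are simply two
different keys, so the indexer and the searcher have to agree on one
encoding before words reach the database.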

My guess is that the biggest challenges for moving to Unicode are going to
be removing the 8-bit char assumption and parsing "words." Updating the
String class to be Unicode-aware (perhaps via UTF-8) would probably go a
long way toward solving the first problem.
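
The following C++ sketch shows where the 8-bit assumption breaks once
strings hold UTF-8: byte counts and character counts diverge, so indexing
and truncation have to step by code point. It uses std::string rather than
the ht://Dig String class, the helper names are hypothetical, and the
input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Number of code points in a string assumed to hold valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if (!is_continuation(byte)) ++count;
    }
    return count;
}

// Byte offset of the code point at `index`, or s.size() if out of range.
std::size_t utf8_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i]))) continue;
        if (seen == index) return i;
        ++seen;
    }
    return s.size();
}

// First n code points of s, never cutting a multi-byte sequence in half.
std::string utf8_prefix(const std::string& s, std::size_t n) {
    const std::size_t end = utf8_offset(s, n);
    assert(end <= s.size());
    return s.substr(0, end);
}
```

Code that truncates or indexes by byte position would need helpers along
these lines. Deciding what counts as a "word" character is a separate
problem that this sketch does not touch.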

-Geoff

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev