Hi,

       I changed the WordList, rewrote part of the WordReference class
and introduced the new WordKey class. This was the core of the modification.
The idea behind the WordKey class is the following.

       The key in the word database is made of a word + integers carrying
information. In the former format written by Geoff, it was word + document
id. In the new format it's word + document id + flags + location.
       Why would we want to do that? Mainly because it allows us to
sort the word occurrence list using multiple criteria:

       word ascending, then for each occurrence of the same word by
       document id ascending, then for each occurrence of the word in
       the same document, group together the words that have the same flags,
       then sort them ascending according to their location in the document.
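
A minimal sketch of that ordering, assuming a simplified key: the Key
struct, its field names and its integer widths are hypothetical stand-ins,
not the actual WordKey layout. Comparing the fields in the listed order is
enough to get the grouping by flags for free.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a word database key; field names and
// integer widths are assumptions, not the actual WordKey layout.
struct Key {
    std::string word;
    unsigned docid;
    unsigned flags;
    unsigned location;
};

// Lexicographic order: word, then document id, then flags, then location.
inline bool operator<(const Key& a, const Key& b) {
    return std::tie(a.word, a.docid, a.flags, a.location)
         < std::tie(b.word, b.docid, b.flags, b.location);
}

// Sort a list of occurrences in place using the order above.
inline void sort_occurrences(std::vector<Key>& keys) {
    std::sort(keys.begin(), keys.end());
}
```

After sort_occurrences(), all entries for one word in one document are
adjacent, and within each flags group they appear by increasing location.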

       It is then quite easy to find the occurrences of a word that are
in the same document. It is even easier to find the word that occurs after a
given location in a given document (think phrase search).
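
To illustrate the "next occurrence after a given location" lookup, here is
a sketch that uses a std::set as a stand-in for the sorted word database.
The Occurrence type and the next_occurrence() helper are hypothetical, and
the real database access goes through Berkeley DB rather than a std::set.
Because flags sort before location, the lookup below is done within one
flags group.

```cpp
#include <set>
#include <string>
#include <tuple>

// Hypothetical key type using the field order described above;
// not the actual WordKey class.
struct Occurrence {
    std::string word;
    unsigned docid;
    unsigned flags;
    unsigned location;
    bool operator<(const Occurrence& o) const {
        return std::tie(word, docid, flags, location)
             < std::tie(o.word, o.docid, o.flags, o.location);
    }
};

// Return the first occurrence of `word` in document `docid` with the given
// flags whose location is strictly greater than `after`, or nullptr if
// there is none.
inline const Occurrence* next_occurrence(const std::set<Occurrence>& index,
                                         const std::string& word,
                                         unsigned docid, unsigned flags,
                                         unsigned after) {
    Occurrence probe{word, docid, flags, after};
    // upper_bound() yields the first key that sorts strictly after the probe.
    auto it = index.upper_bound(probe);
    if (it == index.end() || it->word != word || it->docid != docid ||
        it->flags != flags)
        return nullptr;
    return &*it;
}
```

A single ordered lookup replaces a scan of every occurrence of the word,
which is what makes this key layout attractive for phrase search.
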
       I did not modify the search mechanism to take advantage of this
key structure (yet).
       Encapsulating all that in a class (WordKey) makes it relatively
transparent to the application. I designed the WordKey class
so that it can be generated based on an ascii specification of the key
structure. If we want (afterwards) to add new fields to the search key,
it will be an easy task, as far as the WordKey class is concerned.
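
As a rough illustration of a key structure driven by an ASCII
specification, the sketch below parses a one-line description into a list
of fields. The spec syntax ("Name bits/Name bits/..."), the FieldSpec type
and parse_key_spec() are all assumptions made for this example, and it
parses at run time, whereas the text above describes generating the class
from the specification.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One numeric field of the key: a name and a width in bits.
// Hypothetical type, not part of the actual WordKey class.
struct FieldSpec {
    std::string name;
    int bits;
};

// Parse a hypothetical one-line spec such as "DocID 32/Flags 8/Location 16".
// Items that do not match "name bits" are skipped.
inline std::vector<FieldSpec> parse_key_spec(const std::string& spec) {
    std::vector<FieldSpec> fields;
    std::istringstream parts(spec);
    std::string part;
    while (std::getline(parts, part, '/')) {
        std::istringstream item(part);
        FieldSpec f;
        f.bits = 0;
        if (item >> f.name >> f.bits)
            fields.push_back(f);
    }
    return fields;
}
```

With this kind of design, adding a field means appending one item to the
specification string; the code that walks the field list stays unchanged.
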
       For various reasons too long to explain here (I will explain if
someone is interested), it is very important that the WordKey class is
structured this way.

       In order to make that work properly and write the regression
tests, I had to debug and clean up a large number of things in very basic
classes:
                . DB2_db + Database (DB2_hash does not exist anymore)
                  The interface is simpler and has hooks for prefix and
                  compare functions.
                . Dictionary, Configuration classes were modified to
                  use 'const' where appropriate. 
                . Configuration operator [] and Find now return a String
                  instead of a char*. This is much safer. Some
                  pieces of code were dangerously returning the content
                  of a static String to prevent deallocation. Most of
                  the tests dealing with configuration parameters had to
                  test for both a null pointer and an empty string. Worse,
                  some did not check for a null pointer, a source of ugly
                  core dumps.
                . String was modified and enhanced for 'const' and
                  conversion (cast + as_double). Because it's very confusing
                  and error-prone to do the following:
                      String foo = fct();
                      if(foo) {
                      }
                  the operator int() now aborts with an error message
                  that says: either use as_integer or the new empty()
                  method, which returns true if the length of the string is 0.

                . Other classes were modified slightly for constness and
                  use of String instead of char*.
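
The Configuration and String points above can be sketched as follows.
This is only an illustration under stated assumptions: Config and is_set()
are hypothetical names, and std::string stands in for the project's own
String class. The point is that a lookup returning a string by value lets
callers replace the "null pointer or empty string" pair of tests with a
single explicit empty() call.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a configuration class; std::string replaces
// the project's own String class here.
class Config {
public:
    void Add(const std::string& name, const std::string& value) {
        values_[name] = value;
    }
    // Returns a copy of the value, or an empty string when the attribute
    // is not set. The caller never sees a null pointer.
    std::string Find(const std::string& name) const {
        std::map<std::string, std::string>::const_iterator it =
            values_.find(name);
        return it == values_.end() ? std::string() : it->second;
    }
private:
    std::map<std::string, std::string> values_;
};

// One explicit check covers both "missing" and "set to nothing".
inline bool is_set(const Config& config, const std::string& name) {
    return !config.Find(name).empty();
}
```

Returning a copy costs an allocation per lookup, but it removes the need
for a static buffer whose content changes under the caller.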

        I commented all the changes in detail in the ChangeLog. The fact
is that I've modified a *lot* of things and, since the tests are not
complete yet, I'm not 100% sure I did not break anything. I'm going to
spend the next few days adding tests and running all that through Purify.
Don't hesitate to bash me if something is going wrong.

        In addition, I've made a script (htdoc/cf_generate.pl) that
generates attrs.html, cf_byprog.html and cf_byname.html from the 
htlib/defaults.cc file. For that I've changed the structure of the 
configuration defaults to add fields containing the needed information.
This will make it a *lot* easier to document attributes. I did that
because I spent too much time adding the word_dump attribute last time
and noticed that around 10 attributes were not documented.
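
To give an idea of what a defaults table carrying its own documentation
can look like, here is a sketch. The ConfigDefault fields, to_html() and
undocumented() are hypothetical: the real structure in htlib/defaults.cc
and the output of cf_generate.pl may differ, and no HTML escaping is done.

```cpp
#include <string>
#include <vector>

// Hypothetical layout: one entry per attribute, with the documentation
// stored next to the default value.
struct ConfigDefault {
    std::string name;
    std::string value;
    std::string used_by;
    std::string description;
};

// Render the table as a simple HTML definition list (values not escaped).
inline std::string to_html(const std::vector<ConfigDefault>& defaults) {
    std::string out = "<dl>\n";
    for (const ConfigDefault& d : defaults) {
        out += "<dt>" + d.name + "</dt>\n";
        out += "<dd>default: " + d.value + "; used by: " + d.used_by +
               "; " + d.description + "</dd>\n";
    }
    out += "</dl>\n";
    return out;
}

// List the attributes that have no description yet.
inline std::vector<std::string> undocumented(
        const std::vector<ConfigDefault>& defaults) {
    std::vector<std::string> names;
    for (const ConfigDefault& d : defaults)
        if (d.description.empty())
            names.push_back(d.name);
    return names;
}
```

Keeping the description in the same table as the default makes a missing
entry easy to detect mechanically, as undocumented() does here.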

       Cheers,

-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
                e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/


