Hi,

> Wow!  I wish you had been able to respond earlier!  I think we can

  Yes, the mail was delayed for no reason... too bad.

 > definitely do some work together.  I'm currently approaching the problem
 > from the other direction.  Rather than writing new code that replicates what
 > htdig already does, I'm recoding parts of htsearch to be more
 > object-oriented, and was planning on hooking into the classes from XS.  It
 > sounds like you're getting much more low-level than I am right now.

 Yes again. It will be nice to have both ends meet and work together.
I strongly think we have to make plans together. I'm going to work on the
structure of the word database and the classes to manipulate it until we
have something that fulfills the search needs.
 We must first define where we are going, to prevent incompatible or
redundant code. Could you please describe what you've done, what you'll
do and how long it will take you? Apart from planning, we must also agree
on the data format. Separating the parse and search phases requires a
well-defined data structure to communicate between the two. Talking on the
list will be useful but code will talk too :-) I suggest (comments, Geoff?)
creating a temporary CVS branch for that purpose only. We will be able to
commit ugly and non-working code without disturbing the main branch until
we're done. Committing daily will keep us from stepping on each other's
toes.

 Here is a small description of the current situation (I hope you have
the latest update; I changed the WordReference class a lot last
week, and there are many comments in the code that explain why and how)
and my next changes.

-----------------------------
 Data:

 . The word database
   Key: containing Word, DocID, Flags, Location
   Record: containing Anchor

   I DO: Add an entry for each distinct word that contains statistics
         about the word. Most important: the word's overall frequency.
         This is critical for search performance.

 . The document database
   DO NOTHING: (do you agree?) If you plan to work on that too, be aware
               that the next evolution is to use an SQL database to store
               this. Using Berkeley DB is a pain for this purpose. It fits
               the needs of the word inverted index *very* well but is
               definitely not what is needed for the document database.
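
 To make the word database layout above concrete, here is a minimal
sketch in Python (chosen only for brevity; this is not the htdig C++
code). It assumes keys sort in the order Word, DocID, Flags, Location and
that the record holds the Anchor. The WordIndex class, its method names
and the separate frequency counter are hypothetical stand-ins for the
planned per-word statistics entry.

```python
import bisect
from collections import Counter
from typing import NamedTuple


class WordKey(NamedTuple):
    # Field order mirrors the key layout described above, so that
    # tuple comparison sorts by word first, then by document id.
    word: str
    doc_id: int
    flags: int
    location: int


class WordIndex:
    """Toy in-memory stand-in for the sorted word database (hypothetical)."""

    def __init__(self):
        self.keys = []               # WordKey entries, kept sorted
        self.anchors = {}            # WordKey -> Anchor (the record)
        self.frequency = Counter()   # word -> overall number of occurrences

    def insert(self, key, anchor=0):
        if key in self.anchors:      # update of an existing entry
            self.anchors[key] = anchor
            return
        bisect.insort(self.keys, key)
        self.anchors[key] = anchor
        self.frequency[key.word] += 1    # statistics updated dynamically

    def delete(self, key):
        if key not in self.anchors:
            return
        del self.anchors[key]
        del self.keys[bisect.bisect_left(self.keys, key)]
        self.frequency[key.word] -= 1
        if self.frequency[key.word] == 0:
            del self.frequency[key.word]
```

 Keeping the frequency up to date on every insert and delete is what
would let the statistics stay purely dynamic, with no separate merge pass.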

 Functions:

 a The word insertion/update/delete (indexing)
   I DO: write tests, fix the delete, make sure htmerge is not needed
         anymore (everything purely dynamic), update statistics.
 b The document parsing
   DO NOTHING (do you agree?)
 c The search query parsing (building a query syntax tree)
   YOU DO: ? Which syntax? Which structure for the syntax tree? I strongly
           advocate AltaVista syntax + a syntax tree able to contain
           everything needed for AltaVista syntax (simple + advanced) to
           work. htdig syntax can/should be translated to the syntax tree.
           If the syntax tree is not powerful enough to handle AltaVista
           syntax we lose a big opportunity.
 d The query resolution (using the syntax tree to match words)
   WE DO: There are a number of
          constraints we definitely want to meet here: the memory space,
          CPU time and I/O used to resolve a query must grow linearly as
          a function of the number of terms and complexity of the query.
          The linear factor must be as small as possible. To achieve that
          I basically have *two* ideas: all search terms must be searched
          in parallel and the least frequent terms must always be
          considered first. The WordList::Walk method allows traversal of
          the list and must be used instead of Find, which returns the
          whole list of matching words. Using Find is a killer for big
          indexes (think of retrieving all the occurrences of 'the' in a
          1 million document database :-). The new structure of the
          WordKey class also helps. It's fast and easy to say: search this
          word in this document, because the document id is now part of
          the key.
 e The information retrieval (given top N matches for a query
   retrieve the relevant document information)
   DO NOTHING (do you agree?)
 f The information display
   DO NOTHING (do you agree?)
--------------------------------------
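
 The "least frequent term first" idea for query resolution can be
sketched as follows. This is only an illustration in Python under
simplifying assumptions: keys are reduced to sorted (word, doc_id) pairs
with non-negative document ids, and walk(), has_word_in_doc() and
match_all() are hypothetical names, not the WordList API. The walk()
function imitates a Walk-style traversal with a callback that can stop
early, as opposed to a Find that builds the whole list.

```python
import bisect


def walk(keys, word, callback):
    """Visit the (word, doc_id) keys of `word` in order.

    Stops as soon as the callback returns False, so the full list of
    occurrences is never materialized.
    """
    i = bisect.bisect_left(keys, (word, 0))   # assumes doc ids >= 0
    while i < len(keys) and keys[i][0] == word:
        if callback(keys[i]) is False:
            break
        i += 1


def has_word_in_doc(keys, word, doc_id):
    """Direct probe: possible because the document id is part of the key."""
    i = bisect.bisect_left(keys, (word, doc_id))
    return i < len(keys) and keys[i] == (word, doc_id)


def match_all(keys, frequency, terms, limit=None):
    """Return ids of documents containing every term (an AND query)."""
    ordered = sorted(terms, key=lambda t: frequency.get(t, 0))
    if not ordered or frequency.get(ordered[0], 0) == 0:
        return []                    # an unknown term can never match
    rarest, others = ordered[0], ordered[1:]
    matches = []

    def visit(key):
        doc_id = key[1]
        # Only documents holding the rarest term are ever probed.
        if all(has_word_in_doc(keys, w, doc_id) for w in others):
            matches.append(doc_id)
        return limit is None or len(matches) < limit

    walk(keys, rarest, visit)
    return matches
```

 With this ordering a query such as 'the' AND 'hat' walks the few
occurrences of 'hat' and probes 'the' only for those documents, instead
of retrieving every occurrence of 'the'. The sketch does not attempt the
parallel traversal of all terms mentioned in item d.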

 I've committed the hardest part for 'a' and hope to finish it by the end
of the week (understand end of next week :-). I'm still concerned about the
fact that WordReference is still tied to a specific database structure;
it will eventually switch to an abstract implementation. But not for this
version.

 To summarize we have to :

 . Make a rough plan of action (the list above may be a start)
 . Define the structure of the syntax tree (I'm ready to write a 
   proposal in my next mail. IMHO we have to think about a structure
   that will map well to Perl). 
 . Create a branch on CVS to share partial work
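
 As a purely illustrative starting point for the syntax tree discussion
(not the proposal itself), one shape that would map easily to Perl is a
tree of nested lists whose first element names the node type. The node
names "word", "and", "or" and "not", and the collect_words() helper, are
assumptions made for this sketch, written in Python.

```python
def collect_words(node):
    """Return the search words of a query tree, in left-to-right order.

    A node is a tuple: ("word", text) for a leaf, or
    (operator, child, child, ...) for an inner node.
    """
    if node[0] == "word":
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(collect_words(child))
    return words


# Hypothetical tree for: search AND (engine OR index) AND NOT spam
tree = ("and",
        ("word", "search"),
        ("or", ("word", "engine"), ("word", "index")),
        ("not", ("word", "spam")))
```

 The query resolution would need exactly this kind of traversal to find
the terms whose frequencies it has to compare.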

> As guess as far as source, I'll show you mine if you'll show me yours!
> Mine's not really ready for even casual viewing yet, because as you stated,
> the intermingling is pretty ugly and it's a bitch getting the functionality
> separated out into classes.

  Well, you have all the source I wrote in the CVS tree, let's see yours :-)

  Cheers,

-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
                e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/


