Geoff Hutchison <[EMAIL PROTECTED]> writes:
> First off, Tom? Where did you hear it wasn't fast enough?
Before installing Ht://Dig I spent about 4 hours reviewing the "word 
on the street" comments on Usenet and another 4 hours reading web 
sites that review search engines, as well as the documentation for a 
few of the engines themselves.

Here's one message remarking on the slowness:

        From: [EMAIL PROTECTED] (James A. Treacy)
        Subject: Re: Search Engine
        Date: 23 Sep 1999 00:00:00 GMT
        Newsgroups: linux.debian.www

        Here is what I've looked at so far:
        htdig - can't index locally, too slow
        mg - still evaluating
        namazu - haven't really looked at
        psearch - variant of isearch. promising, but still under
         development
        swish++ - can't merge separate indices. great for straight
         html though
        glimpse - non-free

And I've been meaning to ask you guys about the "can't index locally" 
issue. Recently while browsing through the config directives, I ran 
across:

http://www.htdig.org/attrs.html#local_urls
  local_urls 
  ...
     description: 
          Set this to tell ht://Dig to access certain URLs
          through local filesystems. At first ht://Dig will try to
          access pages with URLs matching the patterns
          through the filesystems specified. If it cannot find the
          file, it will try the URL through HTTP instead. Note
          the example--the equal sign and the final slashes in
          both the URL and the directory path are critical. 
     example: 
           local_urls: http://www.foo.com/=/usr/www/htdocs/ 

Does this work? (I haven't tried it.) If so, why isn't it mentioned 
in some intro document, and why doesn't configure prompt the user to 
set it up? My guess is that 90% or more of Ht://Dig installations 
occur on
the machine containing the HTML to be indexed, so if this works as an 
efficiency improvement, I'd think most people would want to use it.


And for this related directive:

http://www.htdig.org/attrs.html#local_default_doc
  local_default_doc 
  ...
     default: 
          index.html 
     description: 
          Set this to the default document in a directory used
          by the server. This is used for local filesystem access
          to translate URLs like http://foo.com/ into something
          like /home/foo.com/index.html 
     example: 
          local_default_doc: default.html 

I'd suggest that this be a string list, just as remove_default_doc 
is; the same rationale applies.
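
For example, I'd picture something like this (purely hypothetical; 
the current directive takes only a single value):

          local_default_doc: index.html index.htm default.htm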

Ideally, configure would ask for the path to an Apache config file 
and extract this sort of information from it automatically. (Though 
that's understandably not trivial to do.)
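
As a rough illustration of what I mean, a helper along these lines 
could generate the local_urls line from an existing httpd.conf. This 
is only a sketch (the default path and the parsing are guesses); it 
ignores <VirtualHost> blocks, Include files, and everything else a 
real tool would have to handle:

  #!/usr/bin/perl -w
  # Sketch only: pull ServerName and DocumentRoot out of an Apache
  # config file and emit a matching local_urls directive.
  use strict;

  my $conf = shift || '/etc/httpd/conf/httpd.conf';
  my ($server, $root);

  open(CONF, $conf) or die "Can't read $conf: $!\n";
  while (<CONF>) {
      next if /^\s*#/;                              # skip comments
      $server = $1 if /^\s*ServerName\s+(\S+)/i;
      $root   = $1 if /^\s*DocumentRoot\s+"?([^"\s]+)"?/i;
  }
  close(CONF);

  die "Couldn't find ServerName and DocumentRoot in $conf\n"
      unless defined $server && defined $root;

  # Per attrs.html, both the URL and the directory need trailing
  # slashes.
  $root .= '/' unless $root =~ m{/$};
  print "local_urls: http://$server/=$root\n";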


> We may want to start splitting into different threads.
That may make sense. It is also possible that you might just want to 
concede that Ht://Dig isn't going to be optimized for speed and 
instead concentrate on other attributes, like ease of installation 
and administration. Define your market and stick with it, so to speak.

The site I applied Ht://Dig to was under 10 MB of HTML and PDF files, 
and it was indexed via HTTP in a matter of a few minutes. Running 
once a day at 3 AM, that's insignificant.

If a threaded version required more effort to compile or administer, 
it wouldn't be worth it for such a simple application. Most web sites 
probably fall within Ht://Dig's performance range.

> As far as the indexer, I'd guess the main slowdown comes in database 
> operations. String optimizations wouldn't hurt, but database lookups 
> kill us, esp on large databases.
Database size is the other "word on the street" complaint, but I'm 
sure you're well aware of that, as even the Ht://Dig documentation 
reflects this.

By the way, when did you switch from GDBM to Berkeley DB2? I tried 
using one of the user-contributed Perl scripts that was set up for 
GDBM and ended up fiddling with it for an hour. Perl's GDBM_File 
module is notoriously bad at reporting GDBM library errors, so I 
finally threw together a test program in C and got back "bad magic 
number." Argh. Further examination of the Ht://Dig source and more 
digging in the contrib directory showed you are using Berkeley DB2 
now. (And lucky me, the target system's Berkeley DB2 library is too 
old to support Perl's Berkeley DB2 interface...cascading upgrade.)

If I can get things to work, I'll update those Perl scripts to 
Berkeley DB2 and submit them to you guys. They appear to be pretty 
close to the reporting functionality I asked about in an earlier 
email, and I think they deserve mention in some of the introductory 
documentation.
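
For what it's worth, I expect the core change to be small. Here's a 
minimal sketch of the direction I have in mind (untested; it assumes 
the document database is a B-tree keyed by URL, which I haven't 
verified against the current code):

  #!/usr/bin/perl -w
  # Rough sketch: open Ht://Dig's document database with the
  # BerkeleyDB module instead of GDBM_File. The values are packed
  # DocumentRef records, so a real script still has to decode them
  # before it can report anything useful.
  use strict;
  use BerkeleyDB;

  my $docdb = shift || 'db.docdb';   # path to the docs database

  my %docs;
  tie %docs, 'BerkeleyDB::Btree',
      -Filename => $docdb,
      -Flags    => DB_RDONLY
    or die "Can't open $docdb: $! $BerkeleyDB::Error\n";

  # Assumes the keys are the document URLs.
  while (my ($url, $record) = each %docs) {
      printf "%s (%d bytes of record data)\n", $url, length($record);
  }

  untie %docs;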

 -Tom

-- 
Tom Metro
Venture Logic                                     [EMAIL PROTECTED]
Newton, MA, USA

