[EMAIL PROTECTED] wrote:
> 1) Switch to automake + libtool
I'm not so sure about using automake, though libtool has been on my list
for a while. It would be nice to have htlib and htcommon be shared since
users may have multiple copies of htdig or htsearch running at the same
time.
My only complaint about automake is that the generated Makefiles seem so
massive compared to hand-coded ones. It seems like killing a gnat with a
bazooka.
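For the libtool half, a shared htlib could be declared with a few lines of
Makefile.am (this is just a sketch; the file names and version-info numbers
here are my assumptions, not the actual htlib layout):

```makefile
# Hypothetical Makefile.am fragment for building htlib as a shared library
lib_LTLIBRARIES = libht.la
libht_la_SOURCES = String.cc Dictionary.cc Configuration.cc
# libtool interface versioning: current:revision:age (placeholder values)
libht_la_LDFLAGS = -version-info 0:0:0
```

With that in place, htdig and htsearch would link against the one shared
copy instead of each carrying their own, which is the point of the exercise.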
> 2) Use an SQL backend (encapsulate the backend-specific things in a shared
> lib module and implement a DBI-like interface is roughly the idea).
At the moment, I was thinking along the lines of an ODBC subclass of
Database, but I think people would be quite happy to see some sort of
SQL support, regardless of implementation.
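To make the shape of the idea concrete, here's a rough sketch (in Python,
purely for brevity) of a Database interface with a pluggable SQL backend.
The class and method names are my assumptions, not htdig's actual C++ API,
and I'm using SQLite as a stand-in for whatever ODBC connection you'd use:

```python
import sqlite3

class Database:
    """Minimal key/value interface, roughly what the indexer needs."""
    def put(self, key, value):
        raise NotImplementedError
    def get(self, key):
        raise NotImplementedError

class SQLDatabase(Database):
    """Backend that maps put/get onto INSERT/SELECT statements."""
    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS docs (k TEXT PRIMARY KEY, v TEXT)")
    def put(self, key, value):
        self.conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (key, value))
    def get(self, key):
        row = self.conn.execute("SELECT v FROM docs WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

# An ODBC/DBI connection could be swapped in here without touching callers:
db = SQLDatabase(sqlite3.connect(":memory:"))
db.put("http://www.example.com/", "Example page")
print(db.get("http://www.example.com/"))  # -> Example page
```

The point is that the rest of the code only sees put/get, so the choice of
Berkeley DB vs. SQL becomes an implementation detail of one subclass.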
> . The list of starting-point URLs is in the configuration file.
> Our search engine has 150,000 starting-point URLs, which is hard to
> manage in a configuration file.
You can include files in config attributes. Commonly people store URLs
in a separate file.
e.g.
start_url: `path/to/my/urls`
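where `path/to/my/urls` is just a plain text file listing one URL per line
(these URLs are placeholders, of course):

```
http://www.example.com/
http://news.example.org/index.html
```

The backquotes tell the config parser to substitute the file's contents, so
your 150,000 starting points never have to live in the config file itself.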
> (loaded, not modified, not found). I even want to specify a
> different update strategy for every site, if appropriate (daily
> for newspapers, monthly for archives, etc.).
At the moment the config file doesn't support per-URL or per-site
configuration. So this isn't much of an option. You could certainly hack
it together with the 3.2 feature of -m (file of URLs). This will only
index and/or update those URLs. A better solution for your class of
problems is to have a continuously running indexer that keeps a queue of
URLs to update based on access time and/or other parameters. This
requires the database updates to keep the data in a consistent state,
which isn't true in 3.1.x--you need to stop the indexer to run htmerge.
There are also performance and scalability issues. But you asked
initially whether it could handle millions of URLs. It *can*, but that
doesn't mean it can't be improved. ;-)
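To sketch what I mean by a queue-driven indexer: keep URLs in a priority
queue ordered by next-update time, with a per-site refresh interval. This
is just an illustration of the scheduling idea (the names and intervals are
made up; nothing like this exists in htdig today):

```python
import heapq

class UpdateQueue:
    """Priority queue of URLs keyed on their next scheduled refresh time."""
    def __init__(self):
        self._heap = []  # entries are (next_update_time, url, interval)

    def add(self, url, interval, now):
        heapq.heappush(self._heap, (now, url, interval))

    def pop_due(self, now):
        """Return the next URL due for refresh (rescheduling it), or None."""
        if not self._heap or self._heap[0][0] > now:
            return None
        due, url, interval = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (due + interval, url, interval))
        return url

q = UpdateQueue()
q.add("http://daily.example.com/", interval=86400, now=0)        # newspaper
q.add("http://monthly.example.org/", interval=30 * 86400, now=0)  # archive
```

The indexer loop would just call pop_due() forever, sleeping when nothing
is due; the "daily vs. monthly" policy falls out of the per-URL interval.
As noted above, this only works if the database stays consistent during
updates, which is the 3.1.x sticking point.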
> Of course this (and many other things) depend on the fact that you have
> a real database in the back-end, not just a hash table.
I don't know whether you're pointing out that we *do* actually have a real
database (unlike some programs), or whether you're questioning Berkeley
DB's ability to serve as "a real database."
> energy. I strongly believe a project like ht://dig needs at least two or
> three full time, motivated, computer geeks.
I doubt you'll get any disagreement on this list. But for the record
I'll state that at the present time *I'm* not interested in job offers.
Even if I was, I wouldn't want to discuss it on a public list. ;-)
-Geoff
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.