Toivo Pedaste writes:
>
> I was able to index about 100000 pages in less than a day on a
> machine with 512meg of memory; on a 256meg machine it had only
> done 50000 pages after two days. The indexing process does
> seem very memory intensive if you want decent performance. I'm
> not sure what can be done about it, though; it seems to be
> just a lack of locality of reference into the db.words.db file.
No locality of reference, indeed.
> I believe there are plans to checksum pages so as to reject
> aliases (duplicates); how is that going? It is really something
> of an administrative nightmare to deal with a large site without it.
>
> I'm also getting close to the 2Gig file size limit on my
> words.db file; is there any structural reason that it
> couldn't be split into multiple files?
Four solutions:

 1. activate compression in WordList.cc,
 2. run db_dump + db_load, which would reduce the size of the file by half,
 3. implement a dynamic repacker in Berkeley DB,
 4. implement autosplitting into multiple files in WordList.cc, based on a
    key calculated from the word (a sketch follows below).

Of all these we are working on 1 and 2.

What is the size of your original data?
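
For what it's worth, here is a minimal sketch of what solution 4 could look
like: hash the word to pick one of N smaller database files, so that no single
file grows past the 2Gig limit. This is not ht://Dig code; the bucket count,
the hash, and the names (kBuckets, WordBucket, WordDbFile, db.words.NN.db)
are just illustrative assumptions.

// Sketch of solution 4 (autosplit), hypothetical names, not WordList.cc code:
// derive a bucket from a hash of the word and keep one Berkeley DB file per
// bucket, so no single file hits the 2Gig limit.
#include <string>
#include <cstdio>

static const unsigned int kBuckets = 16;   // number of files (assumption)

// Map a word to a bucket with a simple multiplicative hash.
static unsigned int WordBucket(const std::string &word)
{
    unsigned int h = 0;
    for (std::string::size_type i = 0; i < word.size(); i++)
        h = h * 31 + (unsigned char)word[i];
    return h % kBuckets;
}

// Name of the database file that holds a given word, e.g. db.words.07.db.
static std::string WordDbFile(const std::string &word)
{
    char name[32];
    std::snprintf(name, sizeof(name), "db.words.%02u.db", WordBucket(word));
    return std::string(name);
}

The searcher would apply the same hash at lookup time, so only prefix or
fuzzy searches that span several buckets would need to merge results from
more than one file.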
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 10 85
e-mail: [EMAIL PROTECTED]
URL: http://www.senga.org/