According to Owen Boyle:
> Bill Akins wrote:
> > After running the following script, I can only find a very few
> > documents...
> >
> > /usr/bin/htdig -v -a -s -u user:password
> > /usr/bin/htmerge -v -a -s
> > cp /var/lib/htdig/db.docdb.work /var/lib/htdig/db.docdb
> > mv /var/lib/htdig/db.docs.index.work /var/lib/htdig/db.docs.index
> > mv /var/lib/htdig/db.words.db.work /var/lib/htdig/db.words.db
>
> This is all very well, but what does it say in the output? Does it look
> like it is pushing files (i.e. loading them into the DB)? Increase the
> debug level to check (i.e. -vv or -vvv). Are you getting a lot of "URL
> rejected" messages? Do they look sensible?
>
> The important thing to check is that your start_url is well-chosen and
> that htdig is succeeding in crawling down through your site. Also check
> your limit_normalized and limit_urls_to directives.

Given the disproportionate sizes of db.docdb[.work] and db.docs.index[.work],
it seems likely that a lot of documents did get indexed by htdig (into
db.docdb[.work] and db.wordlist[.work]), but that some error prevented
htmerge from creating a usable db.docs.index[.work] from db.docdb[.work].

> > Am I doing something wrong here?
> >
> > File sizes:
> > 1522201600 db.docdb
> > 1522201600 db.docdb.work
> > 6144 db.docs.index <==== Isn't this way too small for 500,000+
> > documents?!
> > 2072429267 db.wordlist
> > 1830162432 db.words.db
> >
>
> This looks pretty odd... I would junk the whole lot and start again. If
> you run htdig with the "-i" option it will regenerate the DB every time
> - you could do this until you are sure it is working properly.

Before junking the whole thing, it may be worth rerunning htmerge -a -vvv
to see if it can successfully recreate db.docs.index.work, or give some
meaningful debugging info as to what's going wrong. If that doesn't help,
then yes, you should probably try starting again from scratch. Also, if
you're not currently running 3.1.6, you probably should upgrade.

I do find it odd that you have a db.wordlist in the listing above, not
a db.wordlist.work. htdig -a should create the latter, and you don't
show any commands above that would move or copy the db.wordlist.work file
to db.wordlist.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
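
As a rough illustration of the size check discussed above, the script
could refuse to move the .work files into place when db.docs.index.work
looks far too small for the db.docdb.work it was built from. This is only
a sketch: the helper names file_size and index_looks_sane are made up for
this example, and the 1 MiB and 64 KiB thresholds are arbitrary guesses,
not limits documented by ht://Dig.

```shell
#!/bin/sh
# Hypothetical helper: print the size of a file in bytes (0 if it is missing).
file_size() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Hypothetical sanity check: fail when the document DB is large but the
# index built from it is tiny. Thresholds are illustrative assumptions.
#   $1 = docdb file, $2 = index file,
#   $3 = docdb size above which the check applies (default 1 MiB),
#   $4 = minimum acceptable index size (default 64 KiB)
index_looks_sane() {
    docdb=$1
    index=$2
    min_docdb=${3:-1048576}
    min_index=${4:-65536}
    docdb_size=$(file_size "$docdb")
    index_size=$(file_size "$index")
    if [ "$docdb_size" -ge "$min_docdb" ] && [ "$index_size" -lt "$min_index" ]; then
        echo "suspicious: $index is ${index_size} bytes, $docdb is ${docdb_size} bytes" >&2
        return 1
    fi
    return 0
}

# Possible use in the update script (paths taken from the original post):
#   /usr/bin/htmerge -a -vvv &&
#   index_looks_sane /var/lib/htdig/db.docdb.work \
#                    /var/lib/htdig/db.docs.index.work &&
#   mv /var/lib/htdig/db.docs.index.work /var/lib/htdig/db.docs.index
```

With the sizes reported above (a 1522201600-byte db.docdb.work and a
6144-byte index), a check along these lines would stop before the mv and
leave the previous db.docs.index untouched.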

