Hello all,
 
I've recently had some success tackling some problems that I had created for myself, and my guess is that one or two of you may benefit from my suffering <grin>. I had a fairly large development project that required a search engine capable of handling approximately 150,000 internal pages and 10,000+ pages on thousands of external web servers. Despite being warned by the FAQ that htdig wasn't built for this task, I couldn't resist giving it a go, especially since the source code for the entire engine was C++ and readily available.
 
My first problem was that I was encapsulating the output of htsearch within my own CGI engine, a group collaboration tool written entirely in C that was driving this particular online community. So I had to comment out the code that emitted the "Content-type: text/html" header in the relevant locations within the htsearch source (Display.cc?). Whew, that was easy.
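 
For anyone trying the same trick, the change amounts to something like the snippet below. This is purely illustrative (the real code in Display.cc is organized differently), so treat it as a sketch of the idea rather than a patch:
 
    // Before: htsearch emits the CGI header itself.
    cout << "Content-type: text/html\r\n\r\n";
 
    // After: header line commented out, because my wrapping CGI
    // engine prints its own Content-type header.
    // cout << "Content-type: text/html\r\n\r\n";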
 
My second problem was related to the first. Since my CGI engine needed its own query string to work its magic, and only indirectly controlled htsearch, I had to modify the htsearch source so that it didn't detect the presence of the query string (I renamed REQUEST_METHOD in htlib/cgi.cc). In this manner I could pass the goods directly as arguments to an external call to htsearch during the execution of my CGI (e.g. htsearch -c /my/custom.conf "page=2&words=woah&cmd=command&searchtype=mysearch"). I then had to modify the portion of Display.cc that built the hrefs for the next, previous and page-number links so that my own special query-string name/value pairs were piggybacked onto them. Whew, that was pretty easy too. After whipping up my own set of HTML templates for htsearch (simple ones really, since the interface framework was actually being provided by my group collaboration engine), I was ready to start the real fun stuff.
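 
The external call itself is nothing fancy. In my engine it boils down to a popen() along the lines of the sketch below (the htsearch path is just where mine happens to live, and a real version would escape the query before handing it to a shell):
 
    #include <stdio.h>
 
    /* Run htsearch with my custom config and copy its output (which,
     * after the Display.cc change above, contains no header) straight
     * into the page my CGI is already building. */
    static void run_htsearch(const char *query)
    {
        char cmd[2048];
        char buf[4096];
        size_t n;
        FILE *p;
 
        /* NOTE: sanitize 'query' before doing this for real;
         * it goes through /bin/sh. */
        snprintf(cmd, sizeof(cmd),
                 "/usr/local/bin/htsearch -c /my/custom.conf \"%s\"", query);
 
        if ((p = popen(cmd, "r")) == NULL)
            return;
 
        while ((n = fread(buf, 1, sizeof(buf), p)) > 0)
            fwrite(buf, 1, n, stdout);
 
        pclose(p);
    }
 
    /* e.g. run_htsearch("page=2&words=woah&cmd=command&searchtype=mysearch"); */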
 
My third problem was figuring out how to hand 150,000 unique URLs to htdig without having it spider the web site to find them. I already knew the URLs I wanted it to index and, in fact, I didn't want it to go any further than the URLs I specified. Luckily it lets you hand it a file of proper URLs to get it going, so I wrote a program to create the list file for me. I then specified a maximum hop count of zero in the htdig configuration file so it would never follow a link off the list. Lest some of you disbelievers think that htdig can't handle the big stuff, I can testify that it did just fine when I force-fed it a fourteen-megabyte text file containing no fewer than one hundred forty-nine thousand, nine hundred seventy-four unique URLs! It swallowed them all in well under four hours on a puny little 256MB PII-400 running Red Hat Linux 5.0 and Apache, using an insignificant amount of CPU time.
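 
For anyone who wants to duplicate that setup, the relevant part of my configuration boils down to something like this (the paths are mine, and if memory serves the backquotes are what tell htdig to read the value from the named file rather than treat it as a literal URL):
 
    # /my/custom.conf (relevant excerpt)
    # The backquotes make htdig read its starting URLs from the
    # 150,000-line file my little generator program wrote out.
    start_url:     `/my/url-list.txt`
    # Zero hops: index exactly the URLs in the list, follow nothing.
    max_hop_count: 0
    # The connect timeout I mention below.
    timeout:       20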
 
My fourth problem was the 10,000+ pages on 2,000+ external web servers. Again, I only wanted the pages I specified, so I built a program to create the list and then force-fed it to htdig. That wasn't the problem. The problem was that htdig would go to sleep during the indexing process and, seemingly, never wake up. I ran it in debug mode and saw that it would eventually hit a web page on a new server and stall. And stall. And stall. After about twenty or thirty minutes htdig would finally time out and continue until it eventually hit another web page on a dead server and stalled again. When you're dealing with 2,000 web servers there are bound to be dozens of dead machines (likely in direct correlation with the number of NT servers. Heh heh). I tried without luck to find an explanation of why the "timeout: 20" (seconds) in my htdig configuration file was being translated into thirty-minute stalls. I spent an entire day researching on the Net to uncover possible causes. The author of htdig indicated in an earlier mailing list digest that he couldn't recreate the problem and wasn't sure why the timeout setting wasn't working for some people. This was NOT encouraging, but I'm stubborn, so I kept hammering away.
 
During a stall, netstat showed that the htdig process owned a socket connection stuck in the SYN_SENT state. I went searching for info on that (I'm not an IP guru) and found some Linux kernel tweaking notes. I peeked at the value in my /proc/sys/net/ipv4/tcp_syn_retries file and found "10". I peeked at the value in my /proc/sys/net/ipv4/tcp_fin_timeout file and found "180" (seconds). Using my superior math skills (heh), I determined that 10 retries at 180 seconds each is 1,800 seconds, or 30 minutes, which was pretty close to how long each htdig stall lasted. So I crossed my fingers, changed tcp_fin_timeout to 30 seconds and tcp_syn_retries to 2, and voila! The htdig index process still stalled, but each stall only lasted a minute or so and the entire index was built quite quickly.
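 
For the record, the tweak itself was just the following (run as root; the values live in /proc, so they don't survive a reboot unless you re-apply them from a boot script):
 
    echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
    echo 2  > /proc/sys/net/ipv4/tcp_syn_retries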
 
I'd be interested in comments from any IP / Linux gurus regarding my tcp_fin_timeout / tcp_syn_retries tweaking. Are 30 seconds and 2 retries too limiting or too dangerous for a production machine?
 
All the best,
 
Sean.
# Digital Spinner, Inc.
# Web Design, Development and Consulting.
# Phone: 802.948.2020
# Fax: 802.948.2749
# http://www.digitalspinner.com
