According to Emma Jane Hogbin:
> On Tue, Feb 18, 2003 at 12:57:30PM -0600, Gilles Detillieux wrote:
> > Actually, in the values above, max_doc_size is 4 times the value of
> > max_head_length, not a smaller value. That makes sense, though, as
> > the document "head" (i.e. the chunk of plain text extracted from the
> > start of the document) would never be larger than max_doc_size.
>
> *laugh* I appear to have issues in all my emails today. I might as well
> just head back to bed as the extra coffee doesn't seem to have helped!

I hope you don't feel I'm picking on you or anything! :-) It's just that
you had a lot of responses out there that I thought deserved clarification
and/or supplementary info. I do the same to Jim and Geoff after they post
a flurry of responses too.

> I think in one instance of troubleshooting I
> actually made my "head" larger than the "doc_size" to force the pages into
> submission. In the end it was manually deleting databases that got the
> results to appear the way I wanted them to.
>
> > The rule of thumb is that max_doc_size should be at least as large as
> > the largest file you want to index, and max_head_length is somewhere
> > between 0 and max_doc_size, depending on how important it is to you
> > to make sure an excerpt containing the search word can be displayed
> > (vs. how much disk space you're willing to use up to accomplish that).
>
> Because I'm indexing a discussion forum finding the max_doc_size and the
> max_head_length are moving targets...

Yeah, that's usually the case when indexing very dynamic content. On the
plus side, though, there isn't really any penalty associated with setting
these attributes much larger than you need - it's just that if things are
going to blow up you can sometimes contain the mess a bit more with
tighter restrictions on these.

> And while I'm here making absurd statements about The Way Things
> Work...does max_head_length count any content that is included in
> htdig_noindex bits? What about max_doc_size?
>
> The config file info says:
> http://htdig.org/attrs.html#max_doc_size
> "This is the upper limit to the amount of data retrieved for documents."
> Including noindex? Or omitting noindex?

max_doc_size is an absolute limit, imposed on data as it's retrieved, and
before it's parsed. So, whatever the file contains, it gets abruptly
truncated at max_doc_size bytes even before anything gets stripped out of
it. On the other hand, max_head_length is just for plain text that's
retained after parsing.

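To make that ordering concrete, here is a rough Python sketch. It is not
ht://Dig's actual code, only an illustration of the order of operations
described above. The function name build_head and the regex-based
stripping are hypothetical stand-ins for the real parser, and the noindex
markers are assumed to be the <!--htdig_noindex--> and
<!--/htdig_noindex--> comment pair:

```python
import re

# Hypothetical, simplified patterns standing in for a real HTML parser.
NOINDEX = re.compile(r"<!--htdig_noindex-->.*?<!--/htdig_noindex-->", re.S)
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I)
COMMENT = re.compile(r"<!--.*?-->", re.S)
TAG = re.compile(r"<[^>]*>")


def build_head(raw: bytes, max_doc_size: int, max_head_length: int) -> str:
    """Illustrate where each limit applies; not the real implementation."""
    # Stage 1: hard cut on the raw bytes as retrieved, before any parsing.
    data = raw[:max_doc_size]
    # latin-1 is used only because it can decode any truncated byte stream.
    html = data.decode("latin-1")
    # Stage 2: strip everything that is not indexable text.
    # NOINDEX must run before COMMENT, or its markers would vanish first.
    for pattern in (NOINDEX, SCRIPT_STYLE, COMMENT, TAG):
        html = pattern.sub(" ", html)
    text = " ".join(html.split())
    # Stage 3: keep at most max_head_length characters as the "head".
    return text[:max_head_length]
```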
This means that all noindex blocks, script blocks, style blocks, HTML
comments, and even all HTML tags that don't contain indexable text, get
stripped out before the "head" is built up (to a maximum of
max_head_length characters). In practice, if your HTML pages have a fair
bit of comments, style sheets or JavaScript, the actual indexable text
may be just a small fraction of the total document size. So, you might be
able to get by with setting max_head_length to a quarter of max_doc_size
without losing any actual indexed text. However, if your goal is to keep
all text in the head, so that excerpt highlighting always works, then
there's no harm in setting

    max_head_length: ${max_doc_size}

> And you know what else I'd like? A library of config files (start_urls
> probably ought to be omitted so that people don't index other people's
> sites by mistake). I learn a LOT from reading other people's configs...

Yeah, there are a few sample config files in the contributed works on the
www.htdig.org web site, but not a whole lot. (There are quite a few fine
examples in the docs too.) It would be nice to have more examples of how
the various attributes can be used to do all sorts of things others may
not have thought about.

However, you also have to look out for misinformation that's there in
some config files. For example, the htdig.conf file in the Mandrake Linux
ht://Dig RPM package includes an example of the usage of the
local_urls_only attribute, with a comment above it that's so misleading
that it's caused several Mandrake users on this list a lot of grief when
they took it at face value.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

