According to Emma Jane Hogbin:
> On Tue, Feb 18, 2003 at 12:57:30PM -0600, Gilles Detillieux wrote:
> > Actually, in the values above, max_doc_size is 4 times the value of
> > max_head_length, not a smaller value.  That makes sense, though, as
> > the document "head" (i.e. the chunk of plain text extracted from the
> > start of the document) would never be larger than max_doc_size.
> 
> *laugh* I appear to have issues in all my emails today. I might as well
> just head back to bed as the extra coffee doesn't seem to have helped!

I hope you don't feel I'm picking on you or anything!  :-)  It's just that
you had a lot of responses out there that I thought deserved clarification
and/or supplementary info.  I do the same to Jim and Geoff after they
post a flurry of responses too.

> I think in one instance of troubleshooting I
> actually made my "head" larger than the "doc_size" to force the pages into
> submission. In the end it was manually deleting databases that got the
> results to appear the way I wanted them to.
> 
> > The rule of thumb is that max_doc_size should be at least as large as
> > the largest file you want to index, and max_head_length is somewhere
> > between 0 and max_doc_size, depending on how important it is to you
> > to make sure an excerpt containing the search word can be displayed
> > (vs. how much disk space you're willing to use up to accomplish that).
> 
> Because I'm indexing a discussion forum finding the max_doc_size and the
> max_head_length are moving targets...

Yeah, that's usually the case when indexing very dynamic content.  On the
plus side, though, there isn't really any penalty associated with setting
these attributes much larger than you need - it's just that if things
are going to blow up you can sometimes contain the mess a bit more with
tighter restrictions on these.

> And while I'm here making absurd statements about The Way Things
> Work...does max_head_length count any content that is included in 
> htdig_noindex bits? What about max_doc_size?
> 
> The config file info says:
> http://htdig.org/attrs.html#max_doc_size
> "This is the upper limit to the amount of data retrieved for documents."
> Including noindex? Or omitting noindex?

max_doc_size is an absolute limit, imposed on data as it's retrieved,
and before it's parsed.  So, whatever the file contains, it gets abruptly
truncated at max_doc_size bytes even before anything gets stripped out
of it.

On the other hand, max_head_length is just for plain text that's
retained after parsing.  This means that all noindex blocks, script
blocks, style blocks, HTML comments, and even all HTML tags that don't
contain indexable text, get stripped out before the "head" is built up
(to a maximum of max_head_length characters).
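
To make that ordering concrete, here is a rough Python sketch. It is only
a toy model, not ht://Dig's actual code: the function name build_head,
the regular expressions and the latin-1 decoding are all assumptions made
up for illustration, and the real parser handles far more cases.

```python
import re

# Simplified stand-ins for what the HTML parser discards. These patterns
# are illustrative assumptions, not ht://Dig's real parsing rules.
NOINDEX = re.compile(
    r"<!--\s*htdig_noindex\s*-->.*?<!--\s*/htdig_noindex\s*-->", re.S | re.I
)
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I)
COMMENT = re.compile(r"<!--.*?-->", re.S)
TAG = re.compile(r"<[^>]*>")


def build_head(raw: bytes, max_doc_size: int, max_head_length: int) -> str:
    """Model the two limits: a byte cut before parsing, a text cut after."""
    # Step 1: max_doc_size is a hard cut on the raw bytes as retrieved,
    # before anything is parsed or stripped out.
    data = raw[:max_doc_size]
    # latin-1 maps every byte to a character, so decoding cannot fail here.
    text = data.decode("latin-1")
    # Step 2: strip noindex blocks, script/style blocks, comments and tags.
    for pattern in (NOINDEX, SCRIPT_STYLE, COMMENT, TAG):
        text = pattern.sub(" ", text)
    text = " ".join(text.split())
    # Step 3: max_head_length caps only the plain text that is left.
    return text[:max_head_length]
```

With a small page that has a style block, a noindex block and a comment
around a single paragraph of text, only the paragraph text survives into
the head. Shrinking max_doc_size until the cut lands inside the style
block shows the "abrupt truncation" effect: the unterminated block is no
longer recognized in this toy model, and its contents leak through as
text.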

In practice, if your HTML pages have a fair bit of comments, style sheets
or JavaScript, the actual indexable text may be just a small fraction of
the total document size.  So, you might be able to get by with setting
max_head_length to a quarter of max_doc_size without losing any actual
indexed text.  However, if your goal is to keep all text in the head,
so that excerpt highlighting always works, then there's no harm in setting

max_head_length: ${max_doc_size}

> And you know what else I'd like? A library of config files (start_urls
> probably ought to be omitted so that people don't index other peoples'
> sites by mistake). I learn a LOT from reading other peoples configs...

Yeah, there are a few sample config files in the contributed works
on the www.htdig.org web site, but not a whole lot.  (There are quite
a few fine examples in the docs too.)  It would be nice to have more
examples of how the various attributes can be used to do all sorts of
things others may not have thought about.

However, you also have to look out for misinformation that's there in
some config files.  For example, the htdig.conf file in the Mandrake
Linux ht://Dig RPM package includes an example of the usage of the
local_urls_only attribute, with a comment above it that's so misleading
that it's caused several Mandrake users on this list a lot of grief when
they took it at face value.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

