According to Reich, Stefan:
> Hi, sorry but I thought the information might be enough. As it says in the
> subject we are using 3.1.5.
Oops, I missed that. I tend to focus on the message body, and I don't
see the subject when composing the reply in elm.
> You could try it on
> http://www.yavivo.de/Expertenrat/Forum/Reisemedizin/985878142/message.html.
> It's a German site, but you should be able to find the link to the print
> version in the lower left corner where it says "Druckversion" with the
> printer symbol. It will lead you to
> http://www.yavivo.de/Expertenrat/Forum/Reisemedizin/985878142/message.html?pp=1
OK, I got this file, and I think I know what the problem is. There's a
bug in how htdig's HTML parser turns indexing and following on and off.
It's very indiscriminate about which tags can do it, so the closing tag
of one element can override an unrelated opening tag that turned off
indexing.
This is compounded by the fact that the handling of <style> and </style>
is done just like <noindex> and </noindex>, so when the parser sees the
</style> tag, it turns indexing and following back on, regardless of
which tag turned one or both of these off before. Unfortunately, fixing
the source isn't trivial, and I don't have the time to do it right now.
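To make the failure mode concrete, here is a rough Python sketch. It is
not htdig's actual C++ code: the function name, the track_owner switch,
and the sample page are all made up for illustration, and only indexing
(not following) is modelled. With track_owner=False it behaves the way
described above, using one shared flag that any </style> or </noindex>
switches back on. With track_owner=True it remembers what turned
indexing off, which is roughly what a proper fix would have to do.

```python
import re

# Very rough tokenizer: either a tag (optional "/", name, attributes) or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z]+)([^>]*)>|([^<]+)")

def indexed_words(html, track_owner):
    """Return the words that would be indexed (hypothetical model, not htdig code)."""
    off_reasons = set()   # constructs currently suppressing indexing (tracked model)
    indexing = True       # single shared flag (naive model)
    words = []
    for closing, name, attrs, text in TOKEN.findall(html):
        name = name.lower()
        if text:
            active = (not off_reasons) if track_owner else indexing
            if active:
                words.extend(text.split())
        elif name == "meta" and "robots" in attrs and "noindex" in attrs:
            off_reasons.add("robots")
            indexing = False
        elif name in ("style", "noindex"):
            if closing:
                off_reasons.discard(name)
                indexing = True   # naive model: clobbers the robots setting too
            else:
                off_reasons.add(name)
                indexing = False
    return words

PAGE = ('<meta name="robots" content="noindex,follow">'
        '<style type="text/css">p { color: red }</style>'
        '<p>private text</p>')

print(indexed_words(PAGE, track_owner=False))  # body text leaks into the index
print(indexed_words(PAGE, track_owner=True))   # robots noindex is respected
```

In the naive model the body text after </style> gets indexed even though
the meta robots tag said noindex; in the tracked model nothing is.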
As a workaround, if you can sacrifice the use of the current noindex_start
and noindex_end values, which cut out code between <!--htdig_noindex-->
and <!--/htdig_noindex--> by default, you can define these as:
noindex_start: <style
noindex_end: </style>
This will cut out the style tags and everything in between them, before
the parser has a chance to see them, so they won't interfere with the
meta robots tag setting. The next 3.1.x release of htdig will hopefully
have an overhauled HTML parser, backported from 3.2.0b3 but with this
bug also fixed.
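For what it's worth, the effect of that workaround can be sketched in
Python as follows. This only illustrates the idea of cutting out
everything between a start and an end string before parsing; the
function name is invented, and the case-insensitive matching and the
handling of an unterminated section are my assumptions here, not a
description of how htdig implements noindex_start and noindex_end.

```python
def strip_sections(html, start="<style", end="</style>"):
    """Remove every span from `start` through `end` (hypothetical pre-filter)."""
    lower = html.lower()          # assumption: markers match case-insensitively
    out = []
    pos = 0
    while True:
        s = lower.find(start, pos)
        if s == -1:
            out.append(html[pos:])    # no more sections: keep the remainder
            break
        out.append(html[pos:s])       # keep the text before the section
        e = lower.find(end, s)
        if e == -1:
            break                     # assumption: unterminated section drops the rest
        pos = e + len(end)
    return "".join(out)

page = ('<meta name="robots" content="noindex">'
        '<style type="text/css">p {}</style><p>hi</p>')
print(strip_sections(page))
```

Since the parser never sees <style> or </style> in the filtered text, it
has nothing that could switch indexing back on behind the robots tag.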
Another workaround would be to patch htdig/HTML.cc to ignore style tags
altogether, as older versions did, but then you'd need to put any inline
stylesheets inside HTML comment delimiters so that htdig won't try to
index them.
...
> start_url: http://www.yavivo.de/Expertenrat/Forum/Reisemedizin/985878142/message.html
> restrict_urls: http://www.yavivo.de/
Shouldn't that be limit_urls_to above instead of restrict_urls? There
is no restrict_urls attribute in an unpatched 3.1.5 htdig.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930