Gilles Detillieux wrote:
> 
(snip)
> >
> > It finds the first occurrence of -->, so don't nest comments. Anyway,
> > it works on my htdig system.
> 
> This isn't quite right.  We had a big discussion about this two weeks ago.
> The HTML standard allows white space (even newlines) between the closing
> "--" and ">" of a comment.  The trick is to gobble up any extra dashes
> after the first two, and then skip white space.  If that doesn't leave
> you at a ">", I think you have to start over again, scanning for the next
> "--".

You're right about that, but HTML.cc did miss the closing > anyway.
Personally, I don't see why white space should be allowed between -- and >.
Now we get to cases where the user writes ---. You have to scan for --,
but you must not skip past both dashes, because that leaves one - behind
and htdig will miss the >. Okay, the user didn't write good HTML, but I
don't want to miss links for indexing because of some "programming" error.

Somewhere in that piece of code there is a position update:

position = q+2 (to get past the --). Maybe changing it to
position = q+1 will do the trick.
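
To make the rule from the quoted paragraph concrete, here is a rough sketch
of the scan: find "--", gobble any extra dashes, skip white space, and accept
only if a ">" follows; otherwise start over. This is not the actual HTML.cc
code. The function name skip_comment, the use of std::string, and the
convention that start points just past the opening "<!--" are all my own
assumptions for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT htdig's real HTML.cc code.
// `start` is assumed to be the index just past the opening "<!--".
// Returns the index just past the closing '>' of the comment,
// or std::string::npos if the comment is never closed.
std::size_t skip_comment(const std::string& html, std::size_t start)
{
    std::size_t pos = start;
    for (;;) {
        // Look for the next candidate terminator "--".
        std::size_t q = html.find("--", pos);
        if (q == std::string::npos)
            return std::string::npos;

        // Step past the two dashes, then gobble any extra ones ("--->").
        std::size_t p = q + 2;
        while (p < html.size() && html[p] == '-')
            ++p;

        // White space (even newlines) may sit between "--" and ">".
        while (p < html.size() &&
               std::isspace(static_cast<unsigned char>(html[p])))
            ++p;

        // Only a '>' at this point really ends the comment.
        if (p < html.size() && html[p] == '>')
            return p + 1;

        // Not a terminator: start over, scanning for the next "--".
        pos = p;
    }
}
```

With this, "<!-- a -->", "<!-- a --->" and a comment whose ">" sits on the
next line after "--" should all end at the ">", while the "--" in
"<!-- a -- b -->" is passed over. Gobbling the extra dashes has the same
effect on "--->" as the q+1 change, without rescanning.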

> > Another problem is that M$ FrontPage 98 in combination with the FrontPage
> > Server Extensions doesn't do <AREA> tags. It creates a webbot (inside a
> > comment). If the webbot has links, these links don't get indexed. Of
> > course this is an M$ / user problem; it's just so that you know about it.
> 
> Yes, M$ server extensions pose a problem (as does JavaScript).  If anyone
> can enhance the HTML parser to deal with these webbot links reliably,
> without breaking anything else, go for it.  Otherwise, it'll remain a
> problem, until M$ learns to adhere to standards other than their own.  ;-)

You'll need to parse the comments to do that.
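
As a very rough sketch of what "parsing the comments" could mean: once the
comment text has been isolated, scan it for href="..." values and hand them
to the indexer. I am assuming here that the webbot links show up as ordinary
double-quoted href attributes inside the comment; I have not checked that
against real FrontPage output. The name hrefs_in_comment is made up, and
nothing like it exists in htdig as far as I know.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of htdig.
// Collects the values of double-quoted href="..." attributes found in the
// text of a comment. ASSUMPTION: webbot links appear in this form.
std::vector<std::string> hrefs_in_comment(const std::string& body)
{
    // Lower-cased copy so that HREF, Href and href all match;
    // indices stay aligned with the original text.
    std::string lower(body);
    for (char& c : lower)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

    std::vector<std::string> links;
    const std::string key = "href=\"";
    std::size_t pos = 0;
    while ((pos = lower.find(key, pos)) != std::string::npos) {
        std::size_t begin = pos + key.size();     // first char of the URL
        std::size_t end = body.find('"', begin);  // closing quote
        if (end == std::string::npos)
            break;                                // unterminated value
        links.push_back(body.substr(begin, end - begin));
        pos = end + 1;
    }
    return links;
}
```

Unquoted or single-quoted values are ignored by this sketch, so a real
version would need to go through the same attribute parsing as normal tags.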

Greetz from nighty Holland,
--jesse