From: "Jeff Szuhay" <[EMAIL PROTECTED]>
> >  1. remove all CR and LF characters.
> > 2 remove all </p>
> > 3 change all <p> to CR/LF
> > 4 change all <br> to CR/LF
>
> While I recognize this is a "first stab" heuristic, it fails because of
> too many assumptions.
> For line endings:
>    Windows/DOS use CR/LF
>    Unix/Linux/Mac OS X use  LF
>    Mac classic uses CR
<snip>
> Also HTML should be _parsed_ and not just willy-nilly remove </p> info.

I think you are reading more into this than was intended. As far as I
understood the problem, we were looking at a program that is trying to
recognise and extract abc from web pages. CR and LF would have to be removed
separately for the reasons you gave, but I don't see that it would have
mattered which line ending was used as long as the program recognised it (I
think abc programs are supposed to cater for all 3, BTW). As long as the abc
portions turned out OK with consistent endings, I don't really see the need
for full parsing of the HTML, nor do I see mangled HTML as an issue: no-one
would be viewing it.
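
For illustration only, the four quoted steps could be sketched in Python
roughly as below. The function name html_to_abc_text and the eol parameter
are made up for this example, and matching the tags case-insensitively and
with optional attributes is an assumption, not part of the original steps.

```python
import re


def html_to_abc_text(html: str, eol: str = "\r\n") -> str:
    """Hypothetical helper: apply the four-step heuristic to a page."""
    # Step 1: remove CR and LF separately, so CR/LF, LF-only and
    # CR-only pages are all handled the same way.
    text = html.replace("\r", "").replace("\n", "")
    # Step 2: remove every closing </p> tag.
    text = re.sub(r"</p\s*>", "", text, flags=re.IGNORECASE)
    # Steps 3 and 4: turn every <p> and <br> (including <br/> and
    # tags with attributes) into one consistent line ending.
    text = re.sub(r"<(?:p|br)\b[^>]*>", eol, text, flags=re.IGNORECASE)
    return text


page = "<p>X:1<br>\r\nT:Test<br>\nK:D</p>\r"
print(repr(html_to_abc_text(page, eol="\n")))
```

Any other tags are left in place, mangled or not, which fits the point
above: only the abc portions need to come out with consistent endings.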

To subscribe/unsubscribe, point your browser to: http://www.tullochgorm.com/lists.html
