From: "Jeff Szuhay" <[EMAIL PROTECTED]>

> > 1. remove all CR and LF characters.
> > 2. remove all </p>
> > 3. change all <p> to CR/LF
> > 4. change all <br> to CR/LF
>
> While I recognize this is a "first stab" heuristic, it fails because of
> too many assumptions.
> For line endings:
> Windows/DOS use CR/LF
> Unix/Linux/Mac OS X use LF
> Mac classic uses CR

<snip>

> Also HTML should be _parsed_ and not just willy-nilly remove </p> info.
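For reference, the four quoted steps could be sketched roughly as below. This is only an illustration of the heuristic under discussion, not code from any existing abc program: the function name `html_to_text`, the `newline` parameter and the regular expressions are my own assumptions. CR and LF are stripped as separate characters, so it should not matter which of the three line-ending conventions the page used.

```python
import re


def html_to_text(html: str, newline: str = "\r\n") -> str:
    """Hypothetical sketch of the four-step heuristic (not a real HTML parser)."""
    # Step 1: remove CR and LF individually, so CR/LF, LF-only and
    # CR-only sources are all handled the same way.
    text = html.replace("\r", "").replace("\n", "")
    # Step 2: remove all closing </p> tags.
    text = re.sub(r"</p\s*>", "", text, flags=re.IGNORECASE)
    # Step 3: change each opening <p> (with or without attributes) to a
    # line ending. The \s after "p" keeps tags such as <pre> untouched.
    text = re.sub(r"<p(\s[^>]*)?>", newline, text, flags=re.IGNORECASE)
    # Step 4: change <br>, <br/> and <br /> to a line ending.
    text = re.sub(r"<br\s*/?>", newline, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    page = "<p>X:1<br>\r\nT:Test<br>\nK:G</p>\r<p>X:2</p>"
    print(repr(html_to_text(page)))
```

Passing `newline="\n"` (or `"\r"`) gives the same text with a different but still consistent line ending. Any other tags in the page are left in place, mangled or not.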
I think you are reading more into this than was intended. As far as I understood the problem, we were looking at a program that tries to recognise and extract abc from web pages. CR and LF would have to be removed separately, for the reasons you gave, but I don't see that it would matter which line ending was used as long as the program recognised it (I think abc programs are supposed to cater for all 3, BTW). As long as the abc portions turned out OK with consistent endings, I don't really see the need for full parsing of the HTML, nor do I see mangled HTML as an issue: no-one would be viewing it.

To subscribe/unsubscribe, point your browser to: http://www.tullochgorm.com/lists.html