| From: "Jon Freeman" <[EMAIL PROTECTED]>
| > I've just had another thought. Every tune starts X: A further replace of X:
| > for CR/LF X: might do the trick as I don't suppose you care how many blank
| > lines end up between tunes.
| >
| > Still, whatever you did, I'm sure you are right that there will be cases
| > that fail and a simple approach like this would never deal with something
| > like the http://www.ibiblio.org/fiddlers/AA_ABEL.htm example you gave.
|
| And yet another thought. I can't claim to understand it but I gather you
| work with Perl. How about something like this
| http://www.perldoc.com/perl5.6/lib/HTML/Parser.html?

Yeah; that's one of several HTML parsers that I've looked at. The problem with all of them is that after they do a parse, you get back an "object" that is the parsed version of the HTML. Getting the data out of it is MUCH more difficult than just attacking the original text. It's a fully tree-structured version of the document, and the data you're looking for will be at a different place in the tree for every document. You can't just look for some giveaway char strings; your code has to understand the structure of the document. And no two are ever the same.

I've played around with a number of these parsers. Each time I've given up and written my own semi-parser in less time than I'd already wasted trying to understand the parser's output.

After all, if what you're after is just the plain text that's hidden in the HTML, the problem isn't all that difficult. Most tags you just discard. You look for a few that are line terminators and replace them with one or more newlines. You reduce strings of white space to a single space. The result is usually readable.

The main problem with ABC that has been HTMLized is that the newlines often come out wrong. You get run-together lines, or the text ends up double-spaced. Sometimes both. (There's one goofy site where you get the X: and T: lines single-spaced, and the rest of the tune double-spaced. ;-)

And this brings us back to the original question of what you do about ABC that is double-spaced. The best I've found so far is to send a note to the site's owner describing the problem, and if they fix it, fine. If not, then their tunes will require human editing when you want to feed them to any software.
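
The semi-parser approach described above (discard most tags, turn a few line-terminating tags into newlines, squeeze white space) might look something like the following. This is only a rough sketch, written in Python rather than Perl, and it is not the author's actual code. The function name html_to_text, the choice of which tags count as line terminators, and the sample input are all assumptions made up for illustration.

```python
import html
import re

# Tags treated as line terminators. This list is an assumption; a real
# page may need more (or fewer) entries.
BREAK_TAG = re.compile(r"</?(?:br|p|div|tr|li|h[1-6])\b[^>]*>", re.IGNORECASE)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
ANY_TAG = re.compile(r"<[^>]*>")


def html_to_text(source: str) -> str:
    """Reduce HTML to plain text without building a parse tree."""
    text = COMMENT.sub("", source)
    # Newlines in the HTML source are treated as ordinary white space,
    # so every run of white space collapses to a single space.
    text = re.sub(r"\s+", " ", text)
    # A few tags end a line; all remaining tags are simply discarded.
    text = BREAK_TAG.sub("\n", text)
    text = ANY_TAG.sub("", text)
    # Decode entities last, so a literal "&gt;" is not mistaken for a tag.
    text = html.unescape(text)
    # Tidy each line and drop the empty ones left by adjacent break tags.
    lines = (re.sub(r"[ \t\xa0]+", " ", ln).strip() for ln in text.split("\n"))
    return "\n".join(ln for ln in lines if ln)


# Made-up sample input, not taken from any real site.
sample = (
    "<html><body><p>X:1<br>T:Example  Tune<br>\n"
    "K:D</p><p>|:DFA dAF:|</p></body></html>"
)
print(html_to_text(sample))
```

Note that this sketch deliberately ignores <pre> blocks, where the newlines in the source are significant, and it throws away blank lines, so the gaps between tunes are lost. Those are the sort of cases where the newlines come out wrong, and a per-site tweak or hand editing would still be needed.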