| From: "Jon Freeman" <[EMAIL PROTECTED]>
| > I've just had another thought. Every tune starts X: A further replace of
| > X: for CR/LF X: might do the trick as I don't suppose you care how many
| > blank lines end up between tunes.
| >
| > Still, whatever you did, I'm sure you are right that there will be cases
| > that fail and a simple approach like this would never deal with something
| > like the http://www.ibiblio.org/fiddlers/AA_ABEL.htm example you gave.
|
| And yet another thought. I can't claim to understand it but I gather you
| work with Perl. How about something like this
| http://www.perldoc.com/perl5.6/lib/HTML/Parser.html?

Yeah; that's one of several HTML parsers that I've  looked  at.   The
problem  with all of them is that after they do a parse, you get back
an "object" that is the parsed version of the HTML.  Getting the data
out  of  it  is  MUCH more difficult than just attacking the original
text.  It's a fully tree-structured version of the document, and  the
data  you're looking for will be at a different place in the tree for
every document.  You can't just look for some giveaway char  strings;
your code has to understand the structure of the document. And no two
are ever the same.

I've played around with a number of these parsers.  I've always given
up,  and  written  my  own  semi-parser in less time than I'd already
wasted trying to understand the parser's output.  After all, if  what
you're  after  is  just the plain text that's hidden in the HTML, the
problem isn't all that difficult.  Most tags you just  discard.   You
look for a few that are line terminators and replace them with one or
more newlines.  You reduce strings of white space to a single  space.
The result is usually readable.
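
For what it's worth, here's a rough sketch of that kind of semi-parser,
in Python rather than Perl. It's only an illustration, not my actual
code: the name html_to_text and the list of "line terminator" tags are
assumptions made up for the example, and it will be fooled by a ">"
inside an attribute value.

```python
import re
from html import unescape

# Tags treated as line terminators. This list is an assumption for the
# sketch; real sites will need it adjusted.
LINE_BREAK_TAGS = "br|p|div|tr|li|h[1-6]"

def html_to_text(html):
    """Crude HTML-to-text: no parse tree, just pattern replacement."""
    # Source newlines are only white space in HTML, so fold every run
    # of white space into a single space first.
    text = re.sub(r"\s+", " ", html)
    # Replace the line-terminating tags (opening or closing) with a newline.
    text = re.sub(r"</?(?:%s)\b[^>]*>" % LINE_BREAK_TAGS, "\n", text,
                  flags=re.IGNORECASE)
    # Discard every other tag.
    text = re.sub(r"<[^>]*>", "", text)
    # Decode entities such as &amp; and &nbsp;.
    text = unescape(text)
    # Reduce runs of spaces (including non-breaking ones) to one space.
    text = re.sub(r"[ \t\xa0]+", " ", text)
    # Trim each line and drop the empty ones.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    page = "<html><body><p>X:1<br>\nT:The  <b>Tune</b><br>K:D</p></body></html>"
    print(html_to_text(page))
```

Note that this throws away the newlines inside a <pre> block along
with all the others, and it drops blank lines entirely, so ABC that
was posted as preformatted text would need those steps changed.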

The main problem with ABC that has been HTMLized is that the newlines
often come out wrong.  You get run-together lines, or the text ends up
double spaced. Sometimes both.  (There's one goofy site where you get
the  X:  and  T: lines single spaced, and the rest of the tune double
spaced.  ;-)

And this brings us back to the original question of what you do about
ABC  that  is double spaced.  The best I've found so far is to send a
note to the site's owner describing the problem, and if they fix  it,
fine.   If  not, then their tunes will require human editing when you
want to feed them to any software.

