I'm parsing HTML using HTML::TokeParser, and occasionally it splits a
single short bit of text into two tokens.  The manual page doesn't say
whether this is to be expected.

I tried to produce a minimal test case, but the problem is hard to
reproduce; stripping out HTML further up the page makes the problem
go away.  This is the offending section of HTML:

<B>The Movie Chart Show</B>

I'm just calling get_token() and printing out any tokens which have type
'T'.  The above line is split into two tokens:

'The'
' Movie Chart Show'
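
In outline, the loop is something like the following.  This is a minimal
sketch rather than my exact script: text_tokens() is a name made up for
illustration, and the inline snippet stands in for the real page.  Run on
the snippet alone it will probably not show the split, since the problem
only appears with the rest of the page present.

```perl
use strict;
use HTML::TokeParser;

# Return the text of every 'T' (text) token found in $source, which may
# be a file name or a reference to a string of HTML.
# text_tokens() is a hypothetical helper, not part of HTML::TokeParser.
sub text_tokens {
    my ($source) = @_;
    my $p = HTML::TokeParser->new($source)
        or die "Can't open HTML source: $!";
    my @texts;
    while (my $token = $p->get_token) {
        # Each token is an array ref; text tokens are ['T', $text, ...]
        push @texts, $token->[1] if $token->[0] eq 'T';
    }
    return @texts;
}

# Placeholder input: just the offending fragment, not the whole page.
my $html = '<B>The Movie Chart Show</B>';
print "'$_'\n" for text_tokens(\$html);
```

To run it against the full page, pass the file name to text_tokens()
instead of a string reference.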

If you want, you can see the files at
<http://www.doc.ic.ac.uk/~epa98/toke/>.  But given that the problem is
hard to reproduce cleanly, it seems unlikely you'll see it unless you
have exactly the same versions of everything.  My versions are these:

% perl -v
 
This is perl, version 5.005_03 built for i386-linux

[snip]

% perl -MHTML::TokeParser -e 'print "$HTML::TokeParser::VERSION\n"'
2.19
% perl -MHTML::Parser -e 'print "$HTML::Parser::VERSION\n"'
3.10

Is HTML::TokeParser meant to break text into multiple tokens?  Curiously,
it seems to split after the word 'The', though that's probably just a
coincidence.

Please cc: replies to me since I don't read this list.  TIA,

-- 
Ed Avis
[EMAIL PROTECTED]
