I'm parsing HTML using HTML::TokeParser, and occasionally it splits a single short run of text into two tokens. The manual page doesn't say whether this is to be expected.

I tried to build a test case, but the problem is hard to reproduce: stripping out HTML further up the page makes it go away. This is the offending section of HTML:

    <B>The Movie Chart Show</B>

I'm just calling get_token() and printing out any tokens which have type 'T'. The above line is split into two tokens:

    'The'
    ' Movie Chart Show'

If you want, you can see the files at <http://www.doc.ic.ac.uk/~epa98/toke/>. But given that the problem is hard to reproduce cleanly, it seems unlikely you'll see it unless you have exactly the same versions of everything. My versions are these:

    % perl -v
    This is perl, version 5.005_03 built for i386-linux
    [snip]
    % perl -MHTML::TokeParser -e 'print "$HTML::TokeParser::VERSION\n"'
    2.19
    % perl -MHTML::Parser -e 'print "$HTML::Parser::VERSION\n"'
    3.10

Is HTML::TokeParser meant to break text into multiple tokens? Curiously, it seems to do it after the word 'The', though that's probably just a coincidence.

Please cc: replies to me since I don't read this list.

TIA,

-- 
Ed Avis
[EMAIL PROTECTED]
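
For reference, the loop described above (calling get_token() and printing every token of type 'T') might look roughly like the sketch below. This is not the original script: the helper name text_tokens is hypothetical, and the parser is fed an in-memory string rather than the real page. Since the split reportedly disappears when the surrounding HTML is removed, this fragment on its own may well come back as a single token.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the original post): return the text of
# every 'T' (text) token found in the given HTML string, in order.
sub text_tokens {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $p = HTML::TokeParser->new(\$html)
        or die "cannot create parser\n";

    my @texts;
    while (my $token = $p->get_token) {
        # Each token is an array ref whose first element is the type;
        # for text tokens the second element is the text itself.
        my ($type, $text) = @$token;
        push @texts, $text if $type eq 'T';
    }
    return @texts;
}

# Print each text token in quotes so that any split is visible.
print "'$_'\n" for text_tokens('<B>The Movie Chart Show</B>');
```

Quoting each token separately makes it easy to see whether the text arrived in one piece or several; to exercise the real case, the same loop would need to be pointed at the full page from the URL above.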
