I said I'd do this ages ago, but I've only just got around to doing it…
Also, the unit tests don't run on a fresh checkout, which makes me
reluctant to try changing what is listed below. If you can get them
working, I'll have a shot at fixing some of the issues below (many of
which apply all over the place).
Tokenizer.php:
[…]
>
> public function __construct($data) {
>
> // XXX this is actually parsing stuff
>
> /* Given an encoding, the bytes in the input stream must be
> converted to Unicode characters for the tokeniser, as
> described by the rules for that encoding, except that the
> leading U+FEFF BYTE ORDER MARK character, if any, must not
> be stripped by the encoding layer (it is stripped by the
> rule below).
>
> Bytes or sequences of bytes in the original byte stream that
> could not be converted to Unicode characters must be converted
> to U+FFFD REPLACEMENT CHARACTER code points. */
>
> // XXX currently assuming input data is UTF-8; once we
> // build encoding detection this will no longer be the case
> if (function_exists('mb_convert_encoding')) {
> $orig = ini_get('mbstring.substitute_character');
> ini_set('mbstring.substitute_character', "\xEF\xBF\xBD");
> $data = mb_convert_encoding($data, 'UTF-8', 'UTF-8');
> ini_set('mbstring.substitute_character', $orig);
> } elseif (function_exists('iconv')) {
> // non-conforming
> $data = iconv('UTF-8', 'UTF-8//IGNORE', $data);
> } else {
> // we can make a conforming native implementation
> throw new Exception('Not implemented, please install
> mbstring or iconv');
> }
iconv is correctly noted as being non-conforming, but the mbstring
implementation is equally non-conforming. Consider the following:
<?php
ini_set('mbstring.substitute_character', "\xEF\xBF\xBD");
$data = "\xFC\x84\x80\x80\x80\x80";
var_dump(mb_convert_encoding($data, 'UTF-8', 'UTF-8') ===
ini_get('mbstring.substitute_character'));
$data = "\x80";
var_dump(mb_convert_encoding($data, 'UTF-8', 'UTF-8') ===
ini_get('mbstring.substitute_character'));
$data = "\xbf";
var_dump(mb_convert_encoding($data, 'UTF-8', 'UTF-8') ===
ini_get('mbstring.substitute_character'));
$data = "\xfe";
var_dump(mb_convert_encoding($data, 'UTF-8', 'UTF-8') ===
ini_get('mbstring.substitute_character'));
$data = "\xff";
var_dump(mb_convert_encoding($data, 'UTF-8', 'UTF-8') ===
ini_get('mbstring.substitute_character'));
$data = "\xc0\xaf";
var_dump(mb_convert_encoding($data, 'UTF-8', 'UTF-8') ===
ini_get('mbstring.substitute_character'));
$data = "\xe0\x80\x80";
var_dump(mb_convert_encoding($data, 'UTF-8', 'UTF-8') ===
ini_get('mbstring.substitute_character'));
?>
This should output bool(true) for all. It returns bool(false) for all.
<http://hg.gsnedders.com/unicode/> has a UTF-8 decoder that can be
used as a base for our own implementation.
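To give an idea, a conforming pass could be sketched in pure PHP along
these lines (a sketch only: utf8_replace_invalid is a name I've made up,
it emits one U+FFFD per undecodable byte, and it hasn't been profiled
against the mbstring path):

```php
<?php
// Sketch of a conforming UTF-8 clean-up pass: well-formed sequences are
// kept, and every byte that cannot start or continue one is replaced
// with U+FFFD. (utf8_replace_invalid is a hypothetical helper name.)
function utf8_replace_invalid($data) {
    // One alternative per well-formed sequence length, excluding
    // overlong forms, surrogates, and anything above U+10FFFF.
    $valid = '[\x00-\x7F]'
           . '|[\xC2-\xDF][\x80-\xBF]'
           . '|\xE0[\xA0-\xBF][\x80-\xBF]'
           . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
           . '|\xED[\x80-\x9F][\x80-\xBF]'
           . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
           . '|[\xF1-\xF3][\x80-\xBF]{3}'
           . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';
    $out = '';
    $len = strlen($data);
    $pos = 0;
    while ($pos < $len) {
        // \G anchors the match at the current offset
        if (preg_match("/\G(?:$valid)/", $data, $m, 0, $pos)) {
            $out .= $m[0];
            $pos += strlen($m[0]);
        } else {
            // undecodable byte: emit U+FFFD and resynchronise
            $out .= "\xEF\xBF\xBD";
            $pos++;
        }
    }
    return $out;
}
?>
```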
> /* One leading U+FEFF BYTE ORDER MARK character must be
> ignored if any are present. */
> if (strlen($data) >= 3 && substr($data, 0, 3) === "\xEF\xBB\xBF") {
> $data = substr($data, 3);
> }
The call to strlen() is unneeded: if $data has fewer than three bytes,
substr() simply returns a shorter string (or false) and the ===
comparison fails anyway.
> /* Any occurrences of any characters in the ranges U+0001 to
> U+0008, U+000B, U+000E to U+001F, U+007F to U+009F,
> U+D800 to U+DFFF , U+FDD0 to U+FDEF, and
> characters U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF,
> U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE,
> U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF,
> U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE,
> U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
> U+10FFFF are parse errors. (These are all control characters
> or permanently undefined Unicode characters.) */
> // XXX not implemented!
> // code for this exists later below, factor it out
The cheapest way to do this, if PCRE is compiled with Unicode support,
would just be to use a regex. I guess we could equally do this with
strtr(), though that would require an array several megabytes in size.
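For instance (a sketch only; count_invalid_chars is my name, not
anything in the tree, and the surrogate range is deliberately left out,
as well-formed UTF-8 can't contain surrogates, so those have to be
caught at the decoding stage anyway):

```php
<?php
// Sketch: count the characters the spec calls parse errors, using a
// single Unicode-mode character class built at runtime.
// count_invalid_chars is a hypothetical helper name.
function count_invalid_chars($data) {
    $class = '\x{0001}-\x{0008}\x{000B}\x{000E}-\x{001F}'
           . '\x{007F}-\x{009F}\x{FDD0}-\x{FDEF}';
    // the two noncharacters at the end of each of the 17 planes
    for ($plane = 0; $plane <= 0x10; $plane++) {
        $class .= sprintf('\x{%04X}\x{%04X}',
                          $plane * 0x10000 + 0xFFFE,
                          $plane * 0x10000 + 0xFFFF);
    }
    return preg_match_all('/[' . $class . ']/u', $data, $matches);
}
?>
```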
> // XXX maybe convert this into an iterator? regardless, this function
> // and the save function should go into a Parser facade of some sort
> /**
> * Performs the actual parsing of the document.
> */
> public function parse() {
> while($this->state !== null) {
> //echo $this->state . "\n";
> if (!isset($this->noConsumeStates[$this->state])) {
> // consume a character, and assign it to $this->c
> $this->c = (++$this->char < $this->EOF) ? $this->data[$this->char] : false;
>
> // UTF8COL
> if ($this->skipBytes) {
> $this->skipBytes--;
> } else {
> if ($this->c === "\n") {
> $this->line++;
> $this->col = 0;
> } else {
> $this->col++;
> $ord = ord($this->c);
> if ($ord >= 0xC2 && $ord <= 0xF4) {
> if ($ord <= 0xDF) $this->skipBytes = 1;
> elseif ($ord <= 0xEF) $this->skipBytes = 2;
> else $this->skipBytes = 3;
> }
> }
> }
> }
> // OOO get rid of this concatentation
> $this->{$this->state.'State'}();
>
> }
> }
Is there any reason to actually track the column normally? I can
understand wanting it on parse errors, but in that case I'd rather
just calculate it on error, rather than take the cost of calculating it
on every character. Also, due to the issues with mbstring (above), you
cannot assume you have a valid UTF-8 string.
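Calculating it on error could look something like this (a sketch;
column_at is a hypothetical helper, and it counts characters by
counting non-continuation bytes, so it degrades gracefully on invalid
UTF-8):

```php
<?php
// Sketch: compute the column lazily, only when a parse error is being
// reported, instead of maintaining it on every character consumed.
// column_at is a hypothetical helper name.
function column_at($data, $offset) {
    $before = substr($data, 0, $offset);
    $nl = strrpos($before, "\n");
    $start = ($nl === false) ? 0 : $nl + 1;
    $line = substr($before, $start);
    // every character has exactly one byte outside 0x80-0xBF, so
    // counting those bytes counts characters
    return preg_match_all('/[^\x80-\xBF]/', $line, $matches);
}
?>
```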
> } else {
> /* Anything else
> THIS IS AN OPTIMIZATION: Get as many character that
> otherwise would also be treated as a character token and
> emit it
> as a single character token. Stay in the data state. */
> // XXX The complexity of the extra code we would need
> // to insert for this optimization makes it suspicious.
> // At the very least, we will need to think twice before
> // applying this optimization to other cases.
It is certainly cheaper: you otherwise end up getting bogged down with
countless function calls all over the place, and the code to deal with
just one character would not be much simpler.
>
> $mask = '->';
> if ($amp_cond) $mask .= '&';
> if ($lt_cond) $mask .= '<';
>
> $len = strcspn($this->data, $mask, $this->char + 1);
> $char = substr($this->data, $this->char + 1, $len);
>
> // OOO It might be faster to just iterate through the
> // string; performance testing is necessary
I assume this refers to what is below it. You'd have to implement
something almost identical to what is already there, just at the
interpreted level rather than inside the interpreter, so there's no way
it'll be quicker. You'd also need to be very careful to avoid running
UTF8COL on every single line, likely by checking whether there was
another \n character ahead; that would mean repeatedly looping from the
current position, which alone would make it more expensive.
>
> // calculate number of newlines
> $nl_cnt = substr_count($char, "\n");
> $this->line += $nl_cnt;
> // calculate last newline
> if ($nl_cnt) {
> $i = strrpos($char, "\n") + 1;
> $this->skipBytes = 0;
> $this->col = 0;
> } else {
> $i = 0;
> }
> // UTF8COL
> // We assume there are no newlines in the string we are
> // iterating over.
> for (; $i < $len; $i++) {
> if ($this->skipBytes) {
> $this->skipBytes--;
> } else {
> $this->col++;
> $ord = ord($char[$i]);
> if ($ord >= 0xC2 && $ord <= 0xF4) {
> if ($ord <= 0xDF) $this->skipBytes = 1;
> elseif ($ord <= 0xEF) $this->skipBytes = 2;
> else $this->skipBytes = 3;
> }
> }
> }
> // -- end line number code
>
> $this->char += $len;
>
> $this->emitToken(array(
> 'type' => self::CHARACTER,
> 'data' => $this->c . $char
> ));
>
> $this->state = 'data';
> }
[…]
> } elseif(preg_match('/^[A-Za-z]$/', $char)) {
> // possible optimization: ctype_alpha() in C locale,
> // or ctype_upper()/ctype_lower(). This could also
> // be applied to later code.
I guess this would work if you could just get away with setting the
locale in Tokenizer::parse and then just changing it back (how? — I
see no way to get the initial value) at the end.
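For what it's worth, setlocale() does let you read the current value:
passing the string "0" queries the setting without changing it, so the
save/restore could be sketched as:

```php
<?php
// setlocale() called with "0" returns the current locale for the
// category without changing it, so it can be saved before parse()
// and restored afterwards.
$old = setlocale(LC_CTYPE, '0');   // query only
setlocale(LC_CTYPE, 'C');
// ... tokenize using ctype_*() here ...
setlocale(LC_CTYPE, $old);         // restore the initial value
?>
```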
> /* U+0041 LATIN LETTER A through to U+005A LATIN LETTER Z
> Create a new start tag token, set its tag name to the lowercase
> version of the input character (add 0x0020 to the character's
> code point), then switch to the tag name state. (Don't emit the
> token yet; further details will be filled in before it is
> emitted.) */
> /* U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL
> LETTER Z
> Create a new start tag token, set its tag name to the input
> character, then switch to the tag name state. (Don't emit the
> token yet; further details will be filled in before it is
> emitted.) */
> // optimization: run strtolower regardless of case.
> // this probably is faster than two preg_matches
> $this->token = array(
> 'name' => strtolower($char),
> 'type' => self::STARTTAG,
> 'attr' => array()
> );
strtolower() is surprisingly expensive. preg_match() is expensive.
Some basic numbers, for 1000000 calls (setting $bar was outside the
loop, setlocale(LC_CTYPE, 'C') was called outside the loop for all),
average of three runs.
$bar = 'a';
$l = strtolower($bar);
16.67 ± 0.03 CPU seconds.
$bar = 'A';
$l = strtolower($bar);
16.9 ± 0.2 CPU seconds.
$bar = 'a';
if (ctype_upper($bar))
$l = strtolower($bar);
16.36 ± 0.06 CPU seconds.
$bar = 'A';
if (ctype_upper($bar))
$l = strtolower($bar);
32.4 ± 0.2 CPU seconds.
$bar = 'a';
if ($bar < 'a')
$l = strtolower($bar);
0.552 ± 0.005 CPU seconds.
$bar = 'A';
if ($bar < 'a')
$l = strtolower($bar);
17.00 ± 0.03 CPU seconds.
By far the most expensive thing appears to be the function call itself.
Putting ord() into the final two increases the cost by around 17s. The
obvious solution seems to be to just compare whether the character is
less than 'a' (the comparison is done byte-by-byte on ordinal values,
so it is reliable here), which makes a _big_ difference when the input
is lowercase. Also, "In some unknown number of pages from dmoz.org with
five million things that look kind of like tags, I see 11723600
lowercase characters and 1547593 uppercase (and 59370 digits)", which
makes it seem as if the right thing to do is optimize for the lowercase
scenario.
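So the cheap path could be sketched like this (only a sketch: the byte
comparison against 'a' relies on 'A'-'Z' sorting below it, and anything
else below 'a' that could reach this code, such as digits, is left
untouched by strtolower() anyway):

```php
<?php
// Sketch: skip the strtolower() call entirely for the common lowercase
// case, paying for it only when the byte sorts below 'a' (which covers
// 'A'-'Z'; strtolower() is a no-op on digits and punctuation).
function tag_char_to_lower($char) {
    return ($char < 'a') ? strtolower($char) : $char;
}
?>
```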
> private function closeTagOpenState() {
> $next_node = strtolower($this->characters('A-Za-z', $this->char));
Um, what is this doing? Getting what the next token will be?
> if(preg_match('/^[\t\n\x0c ]$/', $char)) {
To randomly think aloud, how does strspn($char, "\t\n\x0c ") compare
to in_array($char, array("\t", …))? I expect the latter will be slower…
$char = 'a';
for ($i = 0; $i < 1000000; $i++)
{
if (preg_match('/^[\t\n\x0c ]$/', $char))
$foo = 1;
}
17.95 ± 0.03 CPU seconds.
$char = 'a';
for ($i = 0; $i < 1000000; $i++)
{
if (strspn($char, "\t\n\x0c "))
$foo = 1;
}
16.46 ± 0.01 CPU seconds.
$char = 'a';
for ($i = 0; $i < 1000000; $i++)
{
if (strspn($char, "\t\n\x0c ", 0, 1))
$foo = 1;
}
16.80 ± 0.01 CPU seconds.
$char = 'a';
for ($i = 0; $i < 1000000; $i++)
{
if (in_array($char, array("\t", "\n", "\x0c", ' ')))
$foo = 1;
}
18.14 ± 0.04 CPU seconds.
There was no difference between $char = 'a' and $char = ' '.
> // U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN
> CAPITAL LETTER Z
> // can be found in the Anything else condition
[…]
> } else {
> // optimization: lowercase all
> // possible optimization: glob further
> /* U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN
> CAPITAL LETTER Z
> Append the lowercase version of the current input
> character (add 0x0020 to the character's code point) to
> the current tag token's tag name. Stay in the tag name
> state. */
> /* Anything else
> Append the current input character to the current tag
> token's tag name.
> Stay in the tag name state. */
> $this->token['name'] .= strtolower($char);
> $this->state = 'tagName';
> }
> }
See above about this "optimization". Also, noting that lowercasing is
+0x20 is irrelevant when you use strtolower().
> /* When the user agent leaves the attribute name state
> (and before emitting the tag token, if appropriate), the
> complete attribute's name must be compared to the other
> attributes on the same token; if there is already an
> attribute on the token with the exact same name, then this
> is a parse error and the new attribute must be dropped, along
> with the value that gets associated with it (if any). */
> // this is implemented in the emitToken method
Where? I see it not.
> private function bogusCommentState() {
> /* (This can only happen if the content model flag is set to
> the PCDATA state.) */
Can we add an assert to check such statements?
> /* Otherwise if the next seven characters are a case-insensitive
> match for the word "DOCTYPE", then consume those characters and
> switch to the DOCTYPE state. */
> } elseif(strtoupper($this->character($this->char, 7)) === 'DOCTYPE') {
> $this->char += 6;
> $this->col += 6;
> $this->state = 'doctype';
This one, I suppose, it does make sense to just always call
strtoupper() on.
Finally, generally:
Ideally we'd have no regex in the tokenizer, but that'll take a fair
amount of work. (I see no reason to have any, as we should be able to
do everything quicker.)
Also, from memory, the Python implementation splits character tokens
and space-character tokens (the latter being just "space characters",
as defined in HTML 5), at least in places. This makes tree building a
lot cheaper (as a result, there is no need to check character tokens to
see whether they are whitespace).
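A sketch of how that split could look in the data state (the
SPACECHARACTER token type is my name for it, not something already in
the tree, and CR is assumed to have been normalised away earlier):

```php
<?php
// Sketch: split a run of characters into a space-character token and an
// ordinary character token, as the Python implementation does, so the
// tree builder never has to inspect character data for whitespace.
// SPACECHARACTER is a hypothetical token type.
function split_space_run($data, $pos) {
    $tokens = array();
    // length of the leading run of tab, LF, FF and space
    $spaces = strspn($data, "\x09\x0A\x0C\x20", $pos);
    if ($spaces) {
        $tokens[] = array('type' => 'SPACECHARACTER',
                          'data' => substr($data, $pos, $spaces));
    }
    $rest = substr($data, $pos + $spaces);
    if ($rest !== '' && $rest !== false) {
        $tokens[] = array('type' => 'CHARACTER', 'data' => $rest);
    }
    return $tokens;
}
?>
```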
We should create elements in the HTML namespace.
utf8chr() is kinda whacky, referring to codepoints as decimal values
(who does that?). It also returns an empty string on some invalid
codepoints, not the expected U+FFFD (on others it returns the UTF-8
encoding of the invalid codepoint).
--
Geoffrey Sneddon
<http://gsnedders.com/>
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"html5lib-discuss" group.
-~----------~----~----~----~------~----~------~--~---