Oh, Ok. I misunderstood your original message to say that the project as a whole had reached this conclusion, which certainly isn't the case, rather than that you personally had reached that conclusion.
As for the long-term direction of the HTML parser, my guess is that the optimum design will be to deliver the network bytes to the parser directly on the parser thread. On the parser thread, we can merge charset decoding, input stream pre-processing, and tokenization to move directly from network bytes to CompactHTMLTokens. That approach removes a number of copies, as well as 8-bit-to-16-bit and 16-bit-to-8-bit conversions. Parsing directly into CompactHTMLTokens also means we won't have to do any copies or conversions at all for well-known strings (e.g., "div" and friends from HTMLNames).

If you're about to reply complaining about the above, please save your complaints for another time. I realize that some parts of that design will be difficult or impossible to implement on some ports due to limitations on how they interact with their networking stack. In any case, I don't plan to implement that design anytime soon, and I'm sure we'll have plenty of time to discuss its merits in the future.

Adam

On Mon, Mar 11, 2013 at 8:56 AM, Michael Saboff <msab...@apple.com> wrote:
> Maciej,
>
> *I* deemed using a character type template for the HTMLTokenizer as being
> unwieldy. Given there was the existing SegmentedString input abstraction,
> it made logical sense to put the 8/16 bit coding there. If I had moved
> the 8/16 logic into the tokenizer itself, we might have needed to do
> 8->16 up conversions when a SegmentedString had mixed bit-ness in the
> contained substrings. Even if that wasn't the case, the patch would have
> been far larger and would likely have included tricky code for escapes.
>
> As I got into the middle of the 8-bit strings, I realized that not only
> could I keep performance parity, but some of the techniques I came up with
> offered a good performance improvement. The HTMLTokenizer ended up being one
> of those cases. This patch required a couple of reworks for performance
> reasons and garnered a lot of discussion from various parts of the WebKit
> community.
> See https://bugs.webkit.org/show_bug.cgi?id=90321 for the trail.
> Ryosuke noted that this patch was responsible for a 24% improvement in the
> url-parser test in their bots (comment 47). My final performance results
> are in comment 43 and show between 1 and 9% progression on the various HTML
> parser tests.
>
> Adam, if you believe there is more work to be done in the HTMLTokenizer,
> file a bug and cc me. I'm interested in hearing your thoughts.
>
> - Michael
>
> On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak <m...@apple.com> wrote:
>
> On Mar 9, 2013, at 3:05 PM, Adam Barth <aba...@webkit.org> wrote:
>
> In retrospect, I think what I was reacting to was msaboff's statement
> that an unnamed group of people had decided that the HTML tokenizer
> was too unwieldy to have a dedicated 8-bit path. In particular, it's
> unclear to me who made that decision. I certainly do not consider the
> matter decided.
>
> It would be good to find out who it was that said that (or more
> specifically: "Using a character type template approach was deemed to be too
> unwieldy for the HTML tokenizer.") so you can talk to them about it.
>
> Michael?
>
> Regards,
> Maciej
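
For illustration only, here is a minimal sketch of the single-pass idea described at the top of this message: going from raw network bytes to compact tokens in one loop, with well-known names resolved to shared strings so they are never copied or converted. It is written in Python rather than WebKit's C++, and every name in it (CompactToken, WELL_KNOWN_NAMES, tokenize_bytes) is a hypothetical stand-in, not WebKit's actual API. It assumes UTF-8 input and ignores attributes, comments, and character references.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for an HTMLNames-style table: one shared string
# object per well-known tag name, keyed by its raw byte spelling.
WELL_KNOWN_NAMES = {
    name.encode("ascii"): name for name in ("div", "span", "p", "a", "body")
}


@dataclass(frozen=True)
class CompactToken:
    kind: str  # "start", "end", or "text"
    data: str  # tag name or text content


def _decode_text(raw: bytes) -> str:
    # Charset decoding (assumed UTF-8) and newline normalization in one step.
    text = raw.decode("utf-8", errors="replace")
    return text.replace("\r\n", "\n").replace("\r", "\n")


def tokenize_bytes(data: bytes) -> List[CompactToken]:
    """Tokenize raw bytes directly, without building an intermediate string."""
    tokens: List[CompactToken] = []
    i, n = 0, len(data)
    while i < n:
        if data[i] == 0x3C:  # '<'
            end = data.find(b">", i)
            if end == -1:
                # Unterminated tag: treat the rest of the input as text.
                tokens.append(CompactToken("text", _decode_text(data[i:])))
                break
            raw = data[i + 1:end]
            kind = "start"
            if raw.startswith(b"/"):
                kind = "end"
                raw = raw[1:]
            # Keep only the tag name; attributes are ignored in this sketch.
            parts = raw.split(None, 1)
            name_bytes = parts[0].lower() if parts else b""
            # Well-known names resolve to the shared string with no decoding.
            name = WELL_KNOWN_NAMES.get(name_bytes)
            if name is None:
                name = name_bytes.decode("utf-8", errors="replace")
            tokens.append(CompactToken(kind, name))
            i = end + 1
        else:
            end = data.find(b"<", i)
            if end == -1:
                end = n
            tokens.append(CompactToken("text", _decode_text(data[i:end])))
            i = end
    return tokens
```

Calling tokenize_bytes(b"<div>hi</div>") yields a start token, a text token, and an end token, and both tag tokens refer to the same shared "div" string object. Only text content and unknown tag names pay for a decode.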