Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support

Niels Dossche Fri, 29 Sep 2023 14:55:15 -0700

Hi Dennis

On 9/29/23 23:38, Dennis Snell wrote:
>> Just chiming in here to say that while we don't offer a createFragment() in 
>> this proposal, it's possible to parse fragments by passing the 
>> LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer 
>> innerHTML which you could use then in conjunction with 
>> createDocumentFragment().
> 
> 
> It’s not my understanding that this is right here, because fragment parsing 
> implies more than having or not having the HTML and BODY elements implicitly.


Right. I plan on adding innerHTML/outerHTML in the near future; this RFC is a 
prerequisite for that. Since those properties invoke the HTML fragment parser, 
they would cover part of what you're asking for.
In the future we might also expose the fragment parser through a lower-level 
API, depending on user demand and other feature requests that come in.

> 
> 
>>  Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of 
>> implied html/body... elements.
> 
> 
> The HTML5 spec defines fragment parsing as starting within a context node 
> which exists within a broader document. For example, many people will parse a 
> string of HTML that should form the contents of an LI element. They are 
> grabbing that HTML from a database somewhere, from user input. If that HTML 
> contains “</li>” then our behavior diverges. In a fragment parser it would 
> close out the list we started with but in full document parsing mode the end 
> tag would be ignored, a parse error. If the goal is to ensure that user input 
> doesn’t break out and change the page, then it’s important to use fragment 
> parsing and grab the inner contents of that LI context node.
> 
> 
> This can be valuable to have as a tool to guard against injection attacks or 
> against accidentally breaking the page someone is building, because the 
> fragment parser is aware of its environment. It becomes even more important 
> when parsing within RCDATA or RAWTEXT sections. For example, if wanting to 
> parse and analyze or manipulate a web page’s title then the parser should 
> treat everything as plaintext until it reaches the end or encounters a 
> closing TITLE tag. If trying to do this with `createFromString()` then it’s 
> up to the caller to remember to prepend and then remove the environment, 
> `createFromString( '<title>' . $page_title . '</title>' )`. The fragment 
> parser would be similar in practice, but more explicit and hard to 
> misunderstand in these circumstances.
> 

You're right, it is indeed dangerous to place the burden of wrapping and 
unwrapping on the user: mistakes are bound to happen, and they could result in 
serious injection attacks.
innerHTML would help, and a low-level fragment parser API maybe even more. I'd 
have to think about that, but that's for future work.
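
As a rough illustration of the wrap-and-unwrap hazard, here is a minimal sketch 
using the existing DOMDocument::loadHTML() (the libxml2-based parser, not the 
HTML5 parser proposed in this RFC, whose tree building may differ in detail). 
The helper name count_list_items() is made up for this example. It wraps 
untrusted input in a list item and counts how many LI elements the parser 
actually produced:

```php
<?php
// Hypothetical helper, for illustration only: wrap untrusted HTML in a single
// list item, parse the result, and count the <li> elements in the tree.
function count_list_items(string $untrusted): int
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<ul><li>' . $untrusted . '</li></ul>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->getElementsByTagName('li')->length;
}

// Well-behaved input stays inside the one list item the caller created.
var_dump(count_list_items('plain <b>text</b>'));

// Input containing "</li>" closes the caller's item and opens another one,
// so the tree no longer matches what the caller intended.
var_dump(count_list_items('a</li><li>injected'));
```

A caller that only reads back the first LI would silently lose the "injected" 
part, and a caller that serializes the whole list would emit an extra item. 
Avoiding that kind of bookkeeping is the point of a context-aware fragment 
parser.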

> 
> This is complicated stuff. I understand that the spec provides for a wide 
> variety of use-cases and needs, and that it’s hard to pin down exactly what a 
> spec-compliant parser is supposed to do in all situations (it depends), so 
> I’m only wanting to share from the perspective of people doing a lot of small 
> HTML manipulation. There’s not much code out there using the fragment parser, 
> but I can’t help but think that part of the reason is because it’s not 
> exposed where it ought to be.
> 
> 
> Have a great weekend!
> Dennis Snell
>>
> 

Thanks for the discussion and sharing your insight.

Likewise, have a great weekend.

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
