HTML sanitization, CSP, nsIContentPolicy, ServiceWorkers (was: Re: How to efficiently walk the DOM tree and its strings)
On 03/05/2014 01:52 AM, nsm.nik...@gmail.com wrote: On Tuesday, March 4, 2014 1:26:15 AM UTC-8, somb...@gmail.com wrote: While we have a defense-in-depth strategy (CSP and iframe sandbox should be protecting us from the worst possible scenarios) and we're hopeful that Service Workers will eventually let us provide nsIContentPolicy-level protection, the quality of the HTML parser is of course fairly important[1] to the operation of the HTML sanitizer.

Sorry to go off-topic, but how are ServiceWorkers different from normal Workers here? I'm asking without context, so forgive me if I misunderstood.

Context in short: Thunderbird does not use an HTML sanitizer in the default case when displaying HTML emails because it can turn off JavaScript execution, network accesses, and other stuff via nsIContentPolicy. iframe sandboxes let the Gaia email app (which runs in content) turn off JavaScript but do nothing to stop remote image fetches, etc. We want to be able to stop network fetches for both bandwidth and privacy reasons.

I am referring to the dream of being able to skip sanitization and instead just enforce greater control over the iframe through either CSP or ServiceWorkers. ServiceWorker's onfetch capability doesn't actually work for this purpose because of the origin restrictions, but the mooted allowConnectionsTo CSP 1.1 API from Alex Russell's blog post http://infrequently.org/2013/05/use-case-zero/ (about CSP and an early NavigationController/ServiceWorker proposal) would have been perfect. In the event CSP grew an API like that again in the future, I assume ServiceWorker is where it would end up. It doesn't seem super likely, since CSP 1.1 generally covers the required use-cases. If we are (eventually) able to specify a stricter CSP on an iframe than the CSP in which the e-mail app already lives, we may be able to use img-src/media-src/etc. for our fairly simple "stop this iframe from accessing any resources" control purposes.
More context is available in the https://groups.google.com/d/msg/mozilla.dev.webapi/wDFM_T9v7Tc/_yTofMrjBk4J thread.

Andrew

___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
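The "stop this iframe from accessing any resources" idea above can be made concrete as a policy-composition sketch. This is a hypothetical helper, not an API from the thread: it only builds the CSP header value that a message iframe would need; whether an embedder can actually impose a stricter policy on a child iframe than its own is exactly the open question Andrew raises.

```javascript
// Sketch (hypothetical helper): compose a Content-Security-Policy value
// intended to block all network fetches from a sandboxed message iframe.
// 'data:' images are optionally allowed so inlined/cid-converted images
// could still render without any remote fetch.
function buildNoFetchCsp({ allowDataImages = true } = {}) {
  const imgSrc = allowDataImages ? "img-src data:" : "img-src 'none'";
  const directives = [
    "default-src 'none'",        // deny everything not explicitly allowed
    "script-src 'none'",         // no JS, mirroring the iframe sandbox
    imgSrc,                      // no remote image fetches
    "media-src 'none'",          // no audio/video fetches
    "style-src 'unsafe-inline'", // inline styles only, no external sheets
  ];
  return directives.join("; ");
}
```

Usage would be serving the sanitized message document with this value in a Content-Security-Policy header (or, speculatively, attaching it to the iframe if such a mechanism ever materializes).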
Re: How to efficiently walk the DOM tree and its strings
On Mon, Mar 3, 2014 at 10:19 PM, Boris Zbarsky bzbar...@mit.edu wrote: How feasible is just doing .innerHTML to do that, then doing some sort of async parse (e.g. XHR or DOMParser) to get a DOM snapshot?

Seems more efficient to write the walk in C++, since the innerHTML getter already includes the walk in C++. How important is it to avoid C++?

On Mon, Mar 3, 2014 at 10:45 PM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote: There's https://github.com/google/gumbo-parser which can be compiled to js.

The parser we use in Gecko can be compiled to JS using GWT. However, the current glue code assumes the parser is running in the context of a browser window object and a browser DOM. Writing glue code that assumes something else about the environment should be easy. Also, David Flanagan has implemented the HTML parsing algorithm (pre-template; not sure if updated since) directly in JS.

On Tue, Mar 4, 2014 at 1:57 AM, Andrew Sutherland asutherl...@asutherland.org wrote: The Gaia e-mail app has a streaming HTML parser in its worker-friendly sanitizer at https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js.

On Tue, Mar 4, 2014 at 7:14 AM, Wesley Johnston wjohns...@mozilla.com wrote: Android also ships a parser that we wrote for Reader mode: http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/JSDOMParser.js

It saddens me that we are using non-compliant ad hoc parsers when we already have two spec-compliant (at least at some point in time) ones.

-- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/
Re: How to efficiently walk the DOM tree and its strings
Chrome imports a JS script into the webpage and this script does all the translation work.

Felipe

On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.com wrote: On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote: Hi everyone, I'm working on a feature to offer webpage translation in Firefox. Translation involves, quite unsurprisingly, a lot of DOM and strings manipulation. Since DOM access only happens in the main thread, it brings the question of how to do it properly without causing jank.

What does Chrome do?

-Jeff
Re: How to efficiently walk the DOM tree and its strings
Thanks for the feedback so far!

If I go with the clone route (to work on the snapshotted version of the data), how can I later associate the cloned nodes with the original nodes from the document? One way that I thought of is to set a userdata on the DOM nodes and then use the clone handler callback to associate the cloned node with the original one (through weak refs or a WeakMap). That would mean iterating first through all nodes to add the handlers, but that's probably fine (I don't need to analyze anything or visit text nodes).

I think serializing and re-parsing everything in the worker is not the ideal solution unless we can find a way to also keep accurate associations with the original nodes from content. Anything that introduces a possibly lossy data aspect will probably hurt translation, which is already an inaccurate science.

On Tue, Mar 4, 2014 at 6:26 AM, Andrew Sutherland asutherl...@asutherland.org wrote: On 03/04/2014 03:13 AM, Henri Sivonen wrote: It saddens me that we are using non-compliant ad hoc parsers when we already have two spec-compliant (at least at some point in time) ones.

Interesting! I assume you are referring to: https://github.com/davidflanagan/html5/blob/master/html5parser.js which seems to be (explicitly) derived from: https://github.com/aredridel/html5 which in turn actually includes a few parser variants.

Per the discussion with you on https://groups.google.com/d/msg/mozilla.dev.webapi/wDFM_T9v7Tc/Nr9Df4FUwuwJ for the Gaia e-mail app, we initially ended up using an in-page data document mechanism for sanitization. We later migrated to using a worker-based parser. There were some coordination hiccups with this migration (https://bugzil.la/814257) and some B2G time-pressure, so a comprehensive survey of HTML parsers did not happen so much.
While we have a defense-in-depth strategy (CSP and iframe sandbox should be protecting us from the worst possible scenarios) and we're hopeful that Service Workers will eventually let us provide nsIContentPolicy-level protection, the quality of the HTML parser is of course fairly important[1] to the operation of the HTML sanitizer. If you'd like to bless a specific implementation for workers to perform streaming HTML parsing, or some other explicit strategy, I'd be happy to file a bug for us to go in that direction.

Because we are using a white-list based mechanism and are fairly limited and arguably fairly luddite in what we whitelist, it's my hope that our errors are on the side of safety (and breaking adventurous HTML email :), but that is indeed largely hope. Your input is definitely appreciated, especially as it relates to prioritizing such enhancements and the potential risk from our current strategy.

Andrew

1: understatement
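The clone-to-original association Felipe describes can also be built without per-node clone handlers: since a deep clone preserves tree shape, a parallel walk over both trees pairs nodes positionally and records the pairs in a WeakMap. A DOM-free sketch, using plain objects with a `children` array as hypothetical stand-ins for DOM nodes:

```javascript
// Sketch: associate each node of a deep clone with its original via a
// parallel walk, relying on the clone having identical tree shape.
// Plain {children: [...]} objects stand in for DOM nodes here; with real
// DOM the same walk would run over node.childNodes after cloneNode(true).
function mapCloneToOriginal(original, clone, map = new WeakMap()) {
  map.set(clone, original); // clone -> original, garbage-collectable
  const a = original.children || [];
  const b = clone.children || [];
  for (let i = 0; i < a.length; i++) {
    mapCloneToOriginal(a[i], b[i], map);
  }
  return map;
}
```

A WeakMap keyed on the clones means the associations don't keep either tree alive once the translation pass is done, which matches the weak-reference requirement in the message above.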
Re: How to efficiently walk the DOM tree and its strings
On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote: If I go with the clone route (to work on the snapshotted version of the data), how can I later associate the cloned nodes with the original nodes from the document? One way that I thought of is to set a userdata on the DOM nodes and then use the clone handler callback to associate the cloned node with the original one (through weak refs or a WeakMap). That would mean iterating first through all nodes to add the handlers, but that's probably fine (I don't need to analyze anything or visit text nodes). I think serializing and re-parsing everything in the worker is not the ideal solution unless we can find a way to also keep accurate associations with the original nodes from content. Anything that introduces a possibly lossy data aspect will probably hurt translation, which is already an inaccurate science.

Maybe you can do the translation incrementally, and just annotate the DOM with custom attributes (or userdata) to record the progress of the translation? Plus a reference to the last translated node (subtree) to speed up finding the next subtree to translate. I assume it would be OK to translate one CSS block at a time.

Rob

-- Jtehsauts tshaei dS,o n Wohfy Mdaon yhoaus eanuttehrotraiitny eovni le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o Whhei csha iids teoa stiheer :p atroa lsyazye,d 'mYaonu,r sGients uapr,e tfaokreg iyvoeunr, 'm aotr atnod sgaoy ,h o'mGee.t uTph eann dt hwea lmka'n? gBoutt uIp waanndt wyeonut thoo mken.o w
Re: How to efficiently walk the DOM tree and its strings
The actual translation needs to happen at once, but it's OK if I can work on the chunks incrementally and only send everything off to the translation service when it's all ready. What I need to find then is a good (and fast) partitioning algorithm that will give me a list of several blocks to translate. A CSS block is a good start, but I need something more detailed than that, for some of these reasons:

- I can't skip invisible or display:none nodes, because websites have navigation menus etc. that have text on them and need to be translated (I don't know the correct definition of the CSS block that you mention, to know if it covers this or not).
- In direct opposition to the first point, I can't blindly consider all nodes (including invisible ones) with text content on them, because websites have <script>, <style>, etc. tags in the body which should be skipped.

Also, some other properties that I'd like this algorithm to have:

- It would be nice if it can treat a <ul><li>a</li><li>b</li></ul> as one individual block instead of one per <li> (and other similar constructs) -- [not a major req].
- It should only give me blocks that have useful text content inside to be translated. For example, for sites with a lot of <div><div><div><div> (or worse, <table><tbody><tr><td><table>... ad infinitum) nesting, I'm only interested in the blocks that have actual text content on them (which can probably be defined as having at least one non-whitespace-only child text node). The more junk (useless nodes) that this algorithm can skip, the better.

Then I imagine we'll be good in performance if I implement it in C++ and have all other handling and further filtering be done in JS, one chunk at a time.

Felipe

On Tue, Mar 4, 2014 at 6:02 PM, Robert O'Callahan rob...@ocallahan.org wrote: On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote: If I go with the clone route (to work on the snapshot'ed version of the data), how can I later associate the cloned nodes to the original nodes from the document?
One way that I thought of is to set a userdata on the DOM nodes and then use the clone handler callback to associate the cloned node with the original one (through weak refs or a WeakMap). That would mean iterating first through all nodes to add the handlers, but that's probably fine (I don't need to analyze anything or visit text nodes). I think serializing and re-parsing everything in the worker is not the ideal solution unless we can find a way to also keep accurate associations with the original nodes from content. Anything that introduces a possibly lossy data aspect will probably hurt translation, which is already an inaccurate science.

Maybe you can do the translation incrementally, and just annotate the DOM with custom attributes (or userdata) to record the progress of the translation? Plus a reference to the last translated node (subtree) to speed up finding the next subtree to translate. I assume it would be OK to translate one CSS block at a time.

Rob
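The partitioning requirements Felipe lists above can be sketched as a recursive walk. This is a hypothetical illustration over plain `{tag, text, children}` objects standing in for DOM nodes (a real version would walk elements and text nodes, in C++ or JS): skip `<script>`/`<style>` subtrees, treat lists and tables as atomic blocks, and only emit nodes with a non-whitespace-only child text node.

```javascript
// Sketch of the block-partitioning walk. Node shape is a stand-in:
// elements are {tag, children}, text nodes are {text}.
const SKIP = new Set(["script", "style", "noscript"]); // junk subtrees
const ATOMIC = new Set(["ul", "ol", "table"]);         // one block, not per-<li>

function hasText(node) {
  if (node.text !== undefined) return /\S/.test(node.text);
  return (node.children || []).some(hasText);
}

function collectBlocks(node, out = []) {
  if (SKIP.has(node.tag)) return out;      // skip <script>/<style> entirely
  if (ATOMIC.has(node.tag)) {              // lists/tables: one block each
    if (hasText(node)) out.push(node);
    return out;
  }
  // keep only nodes that directly own real (non-whitespace) text
  const ownText = (node.children || []).some(
    (c) => c.text !== undefined && /\S/.test(c.text)
  );
  if (ownText) out.push(node);
  for (const child of node.children || []) {
    if (child.text === undefined) collectBlocks(child, out);
  }
  return out;
}
```

Deeply nested wrapper divs with no text of their own are walked through but never emitted, which addresses the "skip the junk" requirement.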
Re: How to efficiently walk the DOM tree and its strings
On Tuesday, March 4, 2014 1:26:15 AM UTC-8, somb...@gmail.com wrote: While we have a defense-in-depth strategy (CSP and iframe sandbox should be protecting us from the worst possible scenarios) and we're hopeful that Service Workers will eventually let us provide nsIContentPolicy-level protection, the quality of the HTML parser is of course fairly important[1] to the operation of the HTML sanitizer.

Sorry to go off-topic, but how are ServiceWorkers different from normal Workers here? I'm asking without context, so forgive me if I misunderstood.

Thanks,
Nikhil
How to efficiently walk the DOM tree and its strings
Hi everyone, I'm working on a feature to offer webpage translation in Firefox. Translation involves, quite unsurprisingly, a lot of DOM and strings manipulation. Since DOM access only happens in the main thread, it brings the question of how to do it properly without causing jank.

This is the use case that I'm dealing with in bug 971043: when the user decides to translate a webpage, we want to build a tree that is a cleaned-up version of the page's DOM tree (to remove nodes that do not contain any useful content for translation; more details in the bug for the curious). To do this we must visit all elements and text nodes once and decide which ones to keep and which ones to throw away.

One idea suggested is to perform the task in chunks to let the event loop breathe in between. The problem is that the page can dynamically change, and then a tree representation of the page may no longer exist. A possible solution to that is to only pause the page that is being translated (with, say, EnterModalState) until we can finish working on it, while letting other pages and the UI work normally. That sounds like a reasonable option to me, but I'd like to hear opinions.

Another option exists if it's possible to make a fast copy of the whole DOM, and then work on this snapshotted copy, which is not live. Better yet if we can send this copy with a non-copy move to a Worker thread. But that raises the question of whether the snapshotting itself will cause jank, and whether the extra memory usage for this is worth the trade-off.

Even if we properly chunk the task, it is still bounded by the size of the strings on the page. To decide if a text node should be kept or thrown away we need to run a regexp on it, and there's no way to pause that midway through. And after we have our tree representation, it must be serialized and encodeURIComponent'ed to be sent to the translation service.
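The chunked-walk idea above can be sketched with a generator that yields control back to the event loop after a fixed node budget. This is an illustrative pattern, not the bug's actual implementation; the budget value and node shape are hypothetical, and a real version would also need to detect page mutations between chunks (the hazard discussed in the message).

```javascript
// Sketch: chunked tree walk that pauses every BUDGET nodes so a large
// page doesn't jank the main thread.
const BUDGET = 500; // nodes per chunk (hypothetical tuning value)

function* walkInChunks(root, visit) {
  const stack = [root];
  let processed = 0;
  while (stack.length) {
    const node = stack.pop();
    visit(node); // e.g. keep/discard decision for translation
    for (const child of node.children || []) stack.push(child);
    if (++processed % BUDGET === 0) yield processed; // pause point
  }
  return processed;
}

// Drive the generator, letting the event loop breathe between chunks:
function runChunked(root, visit, done) {
  const it = walkInChunks(root, visit);
  (function step() {
    const { done: finished, value } = it.next();
    if (finished) done(value);
    else setTimeout(step, 0);
  })();
}
```

Note this only bounds the per-chunk node count; as the message says, a single huge text node still can't be split mid-regexp.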
Re: How to efficiently walk the DOM tree and its strings
On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote: Hi everyone, I'm working on a feature to offer webpage translation in Firefox. Translation involves, quite unsurprisingly, a lot of DOM and strings manipulation. Since DOM access only happens in the main thread, it brings the question of how to do it properly without causing jank.

What does Chrome do?

-Jeff
Re: How to efficiently walk the DOM tree and its strings
On 3/3/14 2:28 PM, Felipe G wrote: A possible solution to that is to only pause the page that is being translated (with, say, EnterModalState) until we can finish working on it, while letting other pages and the UI work normally.

The other pages can still modify the DOM of the page in question, right? It'll just be a bit more rare...

Another option exists if it's possible to make a fast copy of the whole DOM, and then work on this snapshot'ed copy which is not live.

How feasible is just doing .innerHTML to do that, then doing some sort of async parse (e.g. XHR or DOMParser) to get a DOM snapshot? That said, this would mean that you end up with a snapshot that's actual DOM stuff, on the main thread.

Better yet if we can send this copy with a non-copy move to a Worker thread.

You could send the string to a worker (if you do this in C++ you won't even need to copy it, afaict), but then on the worker you have to parse the HTML... That said, there might be JS implementations of an HTML5 parser out there. There's definitely some tradeoff here for the string and the parsed representation and all the parsing code, though. :(

-Boris
Re: How to efficiently walk the DOM tree and its strings
During the translation phase, Chrome imports a JS script into the webpage and this script does all the translation work. There's also the language detection phase (another use case that I plan to ask about in a separate e-mail), in which Chrome does a .textContent and runs the language detection off of that on the page's renderer thread.

On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.com wrote: On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote: Hi everyone, I'm working on a feature to offer webpage translation in Firefox. Translation involves, quite unsurprisingly, a lot of DOM and strings manipulation. Since DOM access only happens in the main thread, it brings the question of how to do it properly without causing jank.

What does Chrome do? -Jeff
Re: How to efficiently walk the DOM tree and its strings
On 2014-03-03, 3:30 PM, Felipe G wrote: During the translation phase, Chrome imports a JS script into the webpage and this script does all the translation work. There's the language detection phase (another use case that I plan to ask about in a separate e-mail) in which Chrome does a .textContent and runs the language detection off of it on that page's renderer thread.

Note that Chrome might get away with janking the content process, but that's not necessarily going to be acceptable for us.

Cheers,
Ehsan

On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.com wrote: On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote: Hi everyone, I'm working on a feature to offer webpage translation in Firefox. Translation involves, quite unsurprisingly, a lot of DOM and strings manipulation. Since DOM access only happens in the main thread, it brings the question of how to do it properly without causing jank.

What does Chrome do? -Jeff
Re: How to efficiently walk the DOM tree and its strings
On 2014-03-03, 3:19 PM, Boris Zbarsky wrote: Better yet if we can send this copy with a non-copy move to a Worker thread. You could send the string to a worker (if you do this in C++ you won't even need to copy it, afaict), but then on the worker you have to parse the HTML... That said, there might be JS implementations of an HTML5 parser out there. There's definitely some tradeoff here for the string and the parsed representation and all the parsing code, though. :(

There's https://github.com/google/gumbo-parser which can be compiled to js.

Cheers,
Ehsan
Fwd: How to efficiently walk the DOM tree and its strings
On Mon, Mar 3, 2014 at 5:19 PM, Boris Zbarsky bzbar...@mit.edu wrote: On 3/3/14 2:28 PM, Felipe G wrote: A possible solution to that is to only pause the page that is being translated (with, say, EnterModalState) until we can finish working on it, while letting other pages and the UI work normally. The other pages can still modify the DOM of the page in question, right? It'll just be a bit more rare...

That is true... although given the rarity of this case, I think I could watch for changes and just bail out of the work if that happens. Keeping the page's scripts/events from running is probably the main thing to cover.

Another option exists if it's possible to make a fast copy of the whole DOM, and then work on this snapshot'ed copy which is not live. How feasible is just doing .innerHTML to do that, then doing some sort of async parse (e.g. XHR or DOMParser) to get a DOM snapshot? That said, this would mean that you end up with a snapshot that's actual DOM stuff, on the main thread.

Hmm, ideally I need to keep (weak) references to the text nodes from the page in order to replace their text when the translation is ready. If I do an .innerHTML conversion back and forth, I'll lose the refs. In theory they can be found again, but that would require (again) walking the tree and running heuristics on it, which is potentially worse.
Re: How to efficiently walk the DOM tree and its strings
Hi, translating DOM is a bit funky. Generally, you can probably translate block elements one by one, but you need to persist inline elements. You should mark up the inline elements in the string that you send to the translation engine, such that you can support inline markup changing order. Something like "You would think the <a href="foo">funkyness</a> would <strong>rule</strong>." could translate into "<strong>Ruling</strong> would be the <a href="foo">funkyness</a>, you would think."

Are you intending to also localize tooltips and the like?

Axel

On 3/3/14, 8:28 PM, Felipe G wrote: Hi everyone, I'm working on a feature to offer webpage translation in Firefox. Translation involves, quite unsurprisingly, a lot of DOM and strings manipulation. Since DOM access only happens in the main thread, it brings the question of how to do it properly without causing jank. This is the use case that I'm dealing with in bug 971043: When the user decides to translate a webpage, we want to build a tree that is a cleaned-up version of the page's DOM tree (to remove nodes that do not contain any useful content for translation; more details in the bug for the curious). To do this we must visit all elements and text nodes once and decide which ones to keep and which ones to throw away. One idea suggested is to perform the task in chunks to let the event loop breathe in between. The problem is that the page can dynamically change and then a tree representation of the page may no longer exist. A possible solution to that is to only pause the page that is being translated (with, say, EnterModalState) until we can finish working on it, while letting other pages and the UI work normally. That sounds like a reasonable option to me, but I'd like to hear opinions. Another option exists if it's possible to make a fast copy of the whole DOM, and then work on this snapshot'ed copy which is not live. Better yet if we can send this copy with a non-copy move to a Worker thread.
But it brings the question if the snapshot'ing itself won't cause jank, and if the extra memory usage for this is worth the trade-off. Even if we properly chunk the task, it is still bounded by the size of the strings on the page. To decide if a text node should be kept or thrown away we need to run a regexp on it, and there's no way to pause that midway through. And after we have our tree representation, it must be serialized and encodeURIComponent'ed to be sent to the translation service.
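Axel's inline-markup point above can be sketched as a placeholder scheme: mask each inline tag with a numbered token before sending the block to the translation engine, then restore the tags afterwards, so the engine can reorder them freely. The `{0}`, `{1}` token syntax here is a hypothetical convention for illustration, not any real engine's API.

```javascript
// Sketch: protect inline markup with numbered placeholders so a
// translation engine can reorder them without corrupting the tags.
function protectInline(html) {
  const tags = [];
  const masked = html.replace(/<[^>]+>/g, (tag) => {
    tags.push(tag);              // remember the tag verbatim
    return `{${tags.length - 1}}`; // leave a reorderable token behind
  });
  return { masked, tags };
}

function restoreInline(translated, tags) {
  // map each {n} token back to its original tag
  return translated.replace(/\{(\d+)\}/g, (_, i) => tags[Number(i)]);
}
```

Running Axel's example through `protectInline` yields `You would think the {0}funkyness{1} would {2}rule{3}.`, and a reordered translation such as `{2}Ruling{3} would be the {0}funkyness{1}, you would think.` restores to valid markup.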
Re: How to efficiently walk the DOM tree and its strings
On Tue, Mar 4, 2014 at 9:19 AM, Boris Zbarsky bzbar...@mit.edu wrote: How feasible is just doing .innerHTML to do that, then doing some sort of async parse (e.g. XHR or DOMParser) to get a DOM snapshot? That said, this would mean that you end up with a snapshot that's actual DOM stuff, on the main thread.

Wouldn't a deep clone of the root element be more efficient?

Rob
Re: How to efficiently walk the DOM tree and its strings
On 3/3/14 9:07 PM, Boris Zbarsky wrote:
document.documentElement.cloneNode(true): ~18ms
document.cloneNode(true): ~8ms

Oh, and the difference between these two is that in the former, clones of img elements try to do image loads, which takes about 70% of the cloning time, but in the latter we're in a data document, so the image load start attempts only take about 15% of the time.

-Boris
Re: How to efficiently walk the DOM tree and its strings
Android also ships a parser that we wrote for Reader mode: http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/JSDOMParser.js

We've talked about extending it to also do phone number/address detection, but haven't tried (Reader mode doesn't need to modify the original DOM, unlike the examples here). Memory use (during the parse) isn't great, so the streaming parser actually sounds interesting...

Thanks :)
- Wes

- Original Message - From: Andrew Sutherland asutherl...@asutherland.org To: dev-platform@lists.mozilla.org Sent: Monday, March 3, 2014 3:57:04 PM Subject: Re: How to efficiently walk the DOM tree and its strings

On 03/03/2014 03:19 PM, Boris Zbarsky wrote: That said, there might be JS implementations of an HTML5 parser out there.

The Gaia e-mail app has a streaming HTML parser in its worker-friendly sanitizer at https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js. It's derived from jresig's http://ejohn.org/blog/pure-javascript-html-parser/

Note: there are probably better options out there, just thought I'd call it out.

Andrew