HTML sanitization, CSP, nsIContentPolicy, ServiceWorkers (was: Re: How to efficiently walk the DOM tree and its strings)

2014-03-05 Thread Andrew Sutherland

On 03/05/2014 01:52 AM, nsm.nik...@gmail.com wrote:

On Tuesday, March 4, 2014 1:26:15 AM UTC-8, somb...@gmail.com wrote:

While we have a defense-in-depth strategy (CSP and iframe sandbox should
be protecting us from the worst possible scenarios) and we're hopeful
that Service Workers will eventually let us provide
nsIContentPolicy-level protection, the quality of the HTML parser is of
course fairly important[1] to the operation of the HTML sanitizer.

Sorry to go off-topic, but how are ServiceWorkers different from normal Workers 
here? I'm asking without context, so forgive me if I misunderstood.


Context in short: Thunderbird does not use an HTML sanitizer in the 
default case when displaying HTML emails because it can turn off 
JavaScript execution, network accesses, and other stuff via 
nsIContentPolicy.  iframe sandboxes let the Gaia email app, which runs in 
content, turn off JavaScript, but do nothing to stop remote image 
fetches/etc.  We want to be able to stop network fetches for both 
bandwidth and privacy reasons.


I am referring to the dream of being able to skip sanitization and 
instead just enforce greater control over the iframe through either CSP 
or ServiceWorkers.  ServiceWorker's onfetch capability doesn't actually 
work for this purpose because of the origin restrictions, but the mooted 
allowConnectionsTo CSP 1.1 API from Alex Russell's blog post 
http://infrequently.org/2013/05/use-case-zero/ (about CSP and an early 
NavigationController/ServiceWorker proposal) would have been perfect.


In the event CSP grew an API like that again in the future, I assume 
ServiceWorker is where it would end up.  That doesn't seem super likely, 
though, since CSP 1.1 generally covers the required use-cases.  If 
we are (eventually) able to specify a stricter CSP on an iframe than the 
CSP in which the e-mail app already lives, we may be able to use 
img-src/media-src/etc. for our fairly simple "stop this iframe from 
accessing any resources" control purposes.
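
(For concreteness, the shape of what we'd want is roughly the sketch 
below.  This assumes a meta-delivered CSP would be honored inside the 
sandboxed frame -- a CSP 1.1 behavior, not something I'm claiming works 
today -- and the variable name is hypothetical:)

  const frame = document.createElement("iframe");
  frame.setAttribute("sandbox", "");   // no script, unique origin
  // A document-level policy that forbids all resource loads:
  frame.srcdoc =
    "<meta http-equiv='Content-Security-Policy'" +
    " content=\"default-src 'none'\">" +
    sanitizedMessageHtml;              // hypothetical: our sanitizer's output
  document.body.appendChild(frame);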


More context is available in the 
https://groups.google.com/d/msg/mozilla.dev.webapi/wDFM_T9v7Tc/_yTofMrjBk4J 
thread.


Andrew


Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Henri Sivonen
On Mon, Mar 3, 2014 at 10:19 PM, Boris Zbarsky bzbar...@mit.edu wrote:
 How feasible is just doing .innerHTML to do that, then doing some sort of
 async parse (e.g. XHR or DOMParser) to get a DOM snapshot?

Seems more efficient to write the walk in C++, since the innerHTML
getter already includes the walk in C++. How important is it to avoid
C++?

On Mon, Mar 3, 2014 at 10:45 PM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:
 There's https://github.com/google/gumbo-parser which can be compiled to js.

The parser we use in Gecko can be compiled to JS using GWT. However,
the current glue code assumes the parser is running in the context of
a browser window object and a browser DOM. Writing the glue code that
assumes something else about the environment should be easy.

Also, David Flanagan has implemented the HTML parsing algorithm
(pre-template; not sure if updated since) directly in JS.

On Tue, Mar 4, 2014 at 1:57 AM, Andrew Sutherland
asutherl...@asutherland.org wrote:
 The Gaia e-mail app has a streaming HTML parser in its worker-friendly
 sanitizer at
 https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js.

On Tue, Mar 4, 2014 at 7:14 AM, Wesley Johnston wjohns...@mozilla.com wrote:
 Android also ships a parser that we wrote for Reader mode:

 http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/JSDOMParser.js

It saddens me that we are using non-compliant ad hoc parsers when we
already have two spec-compliant (at least at some point in time) ones.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Felipe G
Chrome imports a JS script into the webpage and this script does all the
translation work.

Felipe

On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.com wrote:


 On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote:

  Hi everyone, I'm working on a feature to offer webpage translation in
  Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
  strings manipulation. Since DOM access only happens in the main thread,
 it
  brings the question of how to do it properly without causing jank.

 What does Chrome do?

 -Jeff



Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Felipe G
Thanks for the feedback so far!

If I go with the clone route (to work on the snapshot'ed version of the
data), how can I later associate the cloned nodes to the original nodes
from the document?  One way I thought of is to set userdata on the
DOM nodes and then use the clone handler callback to associate the cloned
node with the original one (through weak refs or a WeakMap).  That would
mean iterating first through all nodes to add the handlers, but that's
probably fine (I don't need to analyze anything or visit text nodes).
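
(Sketching the association pass to make sure I'm not missing something; 
this assumes the clone is taken synchronously, so both trees are 
isomorphic and a parallel walk works without userdata at all:)

  const clone = document.documentElement.cloneNode(true);
  const originalFor = new WeakMap();   // clone node -> original node
  (function link(orig, copy) {
    originalFor.set(copy, orig);
    for (let o = orig.firstChild, c = copy.firstChild;
         o !== null;
         o = o.nextSibling, c = c.nextSibling) {
      link(o, c);
    }
  })(document.documentElement, clone);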

I think serializing and re-parsing everything in the worker is not the
ideal solution unless we can find a way to also keep accurate associations
with the original nodes from content. Anything that introduces a possibly
lossy data aspect will probably hurt translation, which is already an
inaccurate science.


On Tue, Mar 4, 2014 at 6:26 AM, Andrew Sutherland 
asutherl...@asutherland.org wrote:

 On 03/04/2014 03:13 AM, Henri Sivonen wrote:

 It saddens me that we are using non-compliant ad hoc parsers when we
 already have two spec-compliant (at least at some point in time) ones.


 Interesting!  I assume you are referring to:
 https://github.com/davidflanagan/html5/blob/master/html5parser.js

 Which seems to be (explicitly) derived from:
 https://github.com/aredridel/html5

 Which in turn actually seems to include a few parser variants.

 Per the discussion with you on https://groups.google.com/d/
 msg/mozilla.dev.webapi/wDFM_T9v7Tc/Nr9Df4FUwuwJ for the Gaia e-mail app
 we initially ended up using an in-page data document mechanism for
 sanitization.  We later migrated to using a worker based parser.  There
 were some coordination hiccups with this migration (
 https://bugzil.la/814257) and some B2G time-pressure, so a comprehensive
 survey of HTML parsers never really happened.

 While we have a defense-in-depth strategy (CSP and iframe sandbox should
 be protecting us from the worst possible scenarios) and we're hopeful that
 Service Workers will eventually let us provide nsIContentPolicy-level
 protection, the quality of the HTML parser is of course fairly important[1]
 to the operation of the HTML sanitizer.  If you'd like to bless a specific
 implementation for workers to perform streaming HTML parsing, or some
 other explicit strategy, I'd be happy to file a bug for us to go in that
 direction.  Because we are using a white-list based mechanism and are
 fairly limited and arguably fairly luddite in what we whitelist, it's my
 hope that our errors are on the side of safety (and breaking adventurous
 HTML email :), but that is indeed largely hope.  Your input is definitely
 appreciated, especially as it relates to prioritizing such enhancements and
 potential risk from our current strategy.
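
 (For the curious, the gist of the whitelist approach, as a DOM-flavored
 sketch rather than our actual streaming implementation -- the tag and
 attribute lists here are illustrative, not our real lists:)

   const ALLOWED_TAGS  = new Set(["a", "b", "i", "em", "strong", "p", "br"]);
   const ALLOWED_ATTRS = new Set(["href", "title"]);

   function sanitize(node) {
     for (let child = node.firstChild; child; ) {
       const next = child.nextSibling;      // grab before we mutate
       if (child.nodeType === Node.ELEMENT_NODE) {
         if (!ALLOWED_TAGS.has(child.localName)) {
           child.remove();                  // drop unknown elements wholesale
         } else {
           for (const attr of Array.from(child.attributes)) {
             if (!ALLOWED_ATTRS.has(attr.name)) {
               child.removeAttribute(attr.name);
             }
           }
           sanitize(child);
         }
       }
       child = next;
     }
   }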

 Andrew


 1: understatement



Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Robert O'Callahan
On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote:

 If I go with the clone route (to work on the snapshot'ed version of the
 data), how can I later associate the cloned nodes to the original nodes
 from the document?  One way that I thought is to set a a userdata on the
 DOM nodes and then use the clone handler callback to associate the cloned
 node with the original one (through weak refs or a WeakMap).  That would
 mean iterating first through all nodes to add the handlers, but that's
 probably fine (I don't need to analyze anything or visit text nodes).

 I think serializing and re-parsing everything in the worker is not the
 ideal solution unless we can find a way to also keep accurate associations
 with the original nodes from content. Anything that introduces a possibly
 lossy data aspect will probably hurt translation which is already an
 innacurate science.


Maybe you can do the translation incrementally, and just annotate the DOM
with custom attributes (or userdata) to record the progress of the
translation? Plus a reference to the last translated node (subtree) to speed
up finding the next subtree to translate. I assume it would be OK to
translate one CSS block at a time.
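
Something like this sketch, say (the helper names are made up):

  let lastTranslated = null;   // remembered so we don't rescan from the top

  function translateNextBlock() {
    const next = findNextBlockAfter(lastTranslated);  // hypothetical helper
    if (!next) return;                                // all done
    translateBlock(next);                             // hypothetical per-block call
    next.setAttribute("data-translated", "true");     // progress lives in the DOM
    lastTranslated = next;
    setTimeout(translateNextBlock, 0);                // yield between blocks
  }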

Rob
-- 
Jtehsauts  tshaei dS,o n Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o  Whhei csha iids  teoa
stiheer :p atroa lsyazye,d  'mYaonu,r  sGients  uapr,e  tfaokreg iyvoeunr,
'm aotr  atnod  sgaoy ,h o'mGee.t  uTph eann dt hwea lmka'n?  gBoutt  uIp
waanndt  wyeonut  thoo mken.o w


Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Felipe G
The actual translation needs to happen all at once, but that's OK if I can
work on the chunks incrementally and only send everything off to the
translation service once it's all ready.  What I need to find, then, is a
good (and fast) partitioning algorithm that will give me a list of several
blocks to translate. A CSS block is a good start, but I need something more
detailed than that, for some of these reasons:

- I can't skip invisible or display:none nodes, because websites have
navigation menus etc. that have text in them and need to be translated
(I don't know the exact definition of the CSS block you mention, so I
can't tell whether it covers this case)
- In direct opposition to the first point, I can't blindly consider all
nodes with text content (including invisible ones), because websites have
<script>, <style>, ... tags in the body which should be skipped


Also, some other properties that I'd like this algorithm to have:

- It would be nice if it could treat a <ul> <li>a</li> <li>b</li> </ul> as
one individual block instead of one-per-<li>   (and other similar constructs)
-- [not a major req]

- It should only give me blocks that have useful text content inside to be
translated. For example, for sites with a lot of <div><div><div><div> (or
worse, <table><tbody><tr><td><table>... ad infinitum) nesting, I'm only
interested in the blocks that have actual text content in them (which can
probably be defined as having at least one non-whitespace-only child text
node).


The more junk (useless nodes) that this algorithm can skip, the better.
Then I imagine performance will be fine if I implement it in C++ and have
all other handling and further filtering done in JS, one chunk at a time.
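
(For the JS side, the filter would look something like this rough sketch;
the element list is illustrative, and the real walk would live in C++:)

  const JUNK = new Set(["script", "style", "noscript", "template", "iframe"]);

  const walker = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT,
    {
      acceptNode(node) {
        if (node.nodeType === Node.ELEMENT_NODE) {
          return JUNK.has(node.localName)
            ? NodeFilter.FILTER_REJECT    // prune the whole subtree
            : NodeFilter.FILTER_SKIP;     // descend without reporting
        }
        // Only surface text nodes with some non-whitespace content.
        return /\S/.test(node.data)
          ? NodeFilter.FILTER_ACCEPT
          : NodeFilter.FILTER_REJECT;
      }
    });

  let node;
  while ((node = walker.nextNode())) {
    // group `node` into a block keyed on its nearest block-level ancestor
  }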


Felipe



On Tue, Mar 4, 2014 at 6:02 PM, Robert O'Callahan rob...@ocallahan.org wrote:

 On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote:

 If I go with the clone route (to work on the snapshot'ed version of the
 data), how can I later associate the cloned nodes to the original nodes
 from the document?  One way I thought of is to set userdata on the
 DOM nodes and then use the clone handler callback to associate the cloned
 node with the original one (through weak refs or a WeakMap).  That would
 mean iterating first through all nodes to add the handlers, but that's
 probably fine (I don't need to analyze anything or visit text nodes).

 I think serializing and re-parsing everything in the worker is not the
 ideal solution unless we can find a way to also keep accurate associations
 with the original nodes from content. Anything that introduces a possibly
 lossy data aspect will probably hurt translation, which is already an
 inaccurate science.


 Maybe you can do the translation incrementally, and just annotate the DOM
 with custom attributes (or userdata) to record the progress of the
 translation? Plus a reference to the last translated node (subtree) to speed
 up finding the next subtree to translate. I assume it would be OK to
 translate one CSS block at a time.

 Rob
 --
 Jtehsauts  tshaei dS,o n Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
 le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o  Whhei csha iids  teoa
 stiheer :p atroa lsyazye,d  'mYaonu,r  sGients  uapr,e  tfaokreg iyvoeunr,
 'm aotr  atnod  sgaoy ,h o'mGee.t  uTph eann dt hwea lmka'n?  gBoutt  uIp
 waanndt  wyeonut  thoo mken.o w



Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread nsm . nikhil
On Tuesday, March 4, 2014 1:26:15 AM UTC-8, somb...@gmail.com wrote:
 While we have a defense-in-depth strategy (CSP and iframe sandbox should 
 be protecting us from the worst possible scenarios) and we're hopeful 
 that Service Workers will eventually let us provide 
 nsIContentPolicy-level protection, the quality of the HTML parser is of 
 course fairly important[1] to the operation of the HTML sanitizer.

Sorry to go off-topic, but how are ServiceWorkers different from normal Workers 
here? I'm asking without context, so forgive me if I misunderstood.

Thanks,
Nikhil


How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Felipe G
Hi everyone, I'm working on a feature to offer webpage translation in
Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
strings manipulation. Since DOM access only happens in the main thread, it
brings the question of how to do it properly without causing jank.

This is the use case that I'm dealing with in bug 971043:

When the user decides to translate a webpage, we want to build a tree that
is a cleaned-up version of the page's DOM tree (to remove nodes that do not
contain any useful content for translation; more details in the bug for the
curious). To do this we must visit all elements and text nodes once and
decide which ones to keep and which ones to throw away.

One idea suggested is to perform the task in chunks to let the event loop
breathe in between. The problem is that the page can dynamically change and
then a tree representation of the page may no longer exist. A possible
solution to that is to only pause the page that is being translated (with,
say, EnterModalState) until we can finish working on it, while letting
other pages and the UI work normally. That sounds like a reasonable option to me
but I'd like to hear opinions.

Another option exists if it's possible to make a fast copy of the whole
DOM, and then work on this snapshot'ed copy which is not live. Better yet
if we can send this copy with a non-copy move to a Worker thread. But it
raises the question of whether the snapshot'ing itself will cause jank, and
whether the extra memory usage for this is worth the trade-off.

Even if we properly chunk the task, it is still bounded by the size of the
strings on the page. To decide if a text node should be kept or thrown away
we need to run a regexp on it, and there's no way to pause that midway
through. And after we have our tree representation, it must be serialized
and encodeURIComponent'ed to be sent to the translation service.
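
(To make the chunking idea concrete, the shape would be something like the
sketch below; the live DOM can of course mutate between slices, which is
exactly the problem described above:)

  function walkInChunks(root, visit, done, budgetMs) {
    const walker = document.createTreeWalker(
      root, NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT);
    (function pump() {
      const start = performance.now();
      let node;
      while ((node = walker.nextNode())) {
        visit(node);
        if (performance.now() - start > budgetMs) {
          setTimeout(pump, 0);   // let the event loop breathe
          return;
        }
      }
      done();
    })();
  }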


Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Jeff Muizelaar

On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote:

 Hi everyone, I'm working on a feature to offer webpage translation in
 Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
 strings manipulation. Since DOM access only happens in the main thread, it
 brings the question of how to do it properly without causing jank.

What does Chrome do?

-Jeff


Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Boris Zbarsky

On 3/3/14 2:28 PM, Felipe G wrote:

A possible solution to that is to only pause the page that is being translated 
(with,
say, EnterModalState) until we can finish working on it, while letting
other pages and the UI work normally.


The other pages can still modify the DOM of the page in question, right? 
 It'll just be a bit more rare...



Another option exists if it's possible to make a fast copy of the whole
DOM, and then work on this snapshot'ed copy which is not live.


How feasible is just doing .innerHTML to do that, then doing some sort 
of async parse (e.g. XHR or DOMParser) to get a DOM snapshot?  That 
said, this would mean that you end up with a snapshot that's actual DOM 
stuff, on the main thread.
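
(Concretely, something like the following; note that DOMParser itself 
parses synchronously, so a truly async parse would go through XHR instead:)

  // Serialize the live page, then reparse into an inert data document.
  const html = document.documentElement.outerHTML;
  const snapshot = new DOMParser().parseFromString(html, "text/html");
  // `snapshot` is a disconnected Document: scripts in it don't run,
  // but it still lives on the main thread.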



Better yet if we can send this copy with a non-copy move to a Worker thread.


You could send the string to a worker (if you do this in C++ you won't 
even need to copy it, afaict), but then on the worker you have to parse 
the HTML...  That said, there might be JS implementations of an HTML5 
parser out there.  There's definitely some tradeoff here for the string 
and the parsed representation and all the parsing code, though.  :(
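
(Sketch of the worker hand-off; the parser file and its API below are 
made up, standing in for whatever JS HTML5 parser one would pick:)

  // Main thread: ship the serialized page to a worker.
  const worker = new Worker("html-parse-worker.js");
  worker.postMessage(document.documentElement.outerHTML);

  // html-parse-worker.js (hypothetical):
  importScripts("html5parser.js");           // some JS HTML5 parser (assumed)
  onmessage = function (e) {
    const tree = HTML5Parser.parse(e.data);  // made-up API
    // ...walk `tree` off the main thread...
  };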


-Boris



Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Felipe G
During the translation phase, Chrome imports a JS script into the webpage
and this script does all the translation work.

There's also the language-detection phase (another use case that I plan to
ask about in a separate e-mail), in which Chrome does a .textContent and
runs the language detection off of that on the page's renderer thread.


On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.comwrote:


 On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote:

  Hi everyone, I'm working on a feature to offer webpage translation in
  Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
  strings manipulation. Since DOM access only happens in the main thread,
 it
  brings the question of how to do it properly without causing jank.

 What does Chrome do?

 -Jeff



Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Ehsan Akhgari

On 2014-03-03, 3:30 PM, Felipe G wrote:

During the translation phase, Chrome imports a JS script into the webpage
and this script does all the translation work.

There's also the language-detection phase (another use case that I plan to
ask about in a separate e-mail), in which Chrome does a .textContent and
runs the language detection off of that on the page's renderer thread.


Note that Chrome might get away with janking the content process, but 
that's not necessarily going to be acceptable for us.


Cheers,
Ehsan


On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.comwrote:



On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote:


Hi everyone, I'm working on a feature to offer webpage translation in
Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
strings manipulation. Since DOM access only happens in the main thread,

it

brings the question of how to do it properly without causing jank.


What does Chrome do?

-Jeff




Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Ehsan Akhgari

On 2014-03-03, 3:19 PM, Boris Zbarsky wrote:

Better yet if we can send this copy with a non-copy move to a Worker
thread.


You could send the string to a worker (if you do this in C++ you won't
even need to copy it, afaict), but then on the worker you have to parse
the HTML...  That said, there might be JS implementations of an HTML5
parser out there.  There's definitely some tradeoff here for the string
and the parsed representation and all the parsing code, though.  :(


There's https://github.com/google/gumbo-parser which can be compiled to js.

Cheers,
Ehsan



Fwd: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Felipe G
On Mon, Mar 3, 2014 at 5:19 PM, Boris Zbarsky bzbar...@mit.edu wrote:

 On 3/3/14 2:28 PM, Felipe G wrote:

 A possible solution to that is to only pause the page that is being
 translated (with,
 say, EnterModalState) until we can finish working on it, while letting
 other pages and the UI work normally.


 The other pages can still modify the DOM of the page in question, right?
  It'll just be a bit more rare...


That is true...  Although given the rarity of this case, I think I could
watch for changes and just bail out of the work if that happens.  Keeping
the page's script/events from running is probably the main thing to cover.
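
(Watching for changes should be cheap with a MutationObserver; rough
sketch:)

  let dirty = false;
  const observer = new MutationObserver(function () { dirty = true; });
  observer.observe(document.documentElement, {
    childList: true, subtree: true,
    characterData: true, attributes: true
  });
  // ...between chunks of work: if (dirty) bail out and give up...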




  Another option exists if it's possible to make a fast copy of the whole
 DOM, and then work on this snapshot'ed copy which is not live.


 How feasible is just doing .innerHTML to do that, then doing some sort of
 async parse (e.g. XHR or DOMParser) to get a DOM snapshot?  That said, this
 would mean that you end up with a snapshot that's actual DOM stuff, on the
 main thread.

 Hmm, ideally I need to keep (weak) references to the text nodes from the
page in order to replace their text when translation is ready.  If I do a
.innerHTML conversion back and forth I'll lose the refs.  In theory they
can be found again, but that would require (again) walking the tree and
running heuristics on it, which is potentially worse...


Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Axel Hecht

Hi,

translating DOM is a bit funky. Generally, you can probably translate 
block elements one by one, but you need to persist inline elements.


You should mark up the inline elements in the string that you send to 
the translation engine, such that you can support inline markup changing 
the order.


Something like

You would think the <a href="foo">funkyness</a> would <strong>rule</strong>.

could translate into

<strong>Ruling</strong> would be the <a href="foo">funkyness</a>, you 
would think.
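
(One common trick is to replace the inline elements with numbered 
placeholders before sending the string out, so the engine is free to 
reorder them; a rough sketch:)

  function encodeInline(block) {
    const saved = [];
    let text = "";
    for (const node of block.childNodes) {
      if (node.nodeType === Node.TEXT_NODE) {
        text += node.data;
      } else {
        const i = saved.push(node) - 1;      // remember the element
        text += "<" + i + ">" + node.textContent + "</" + i + ">";
      }
    }
    // The engine returns text with the <i>...</i> spans reordered;
    // `saved[i]` says which element each placeholder maps back to.
    return { text: text, saved: saved };
  }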


Are you intending to also localize tooltips and the like?

Axel


On 3/3/14, 8:28 PM, Felipe G wrote:

Hi everyone, I'm working on a feature to offer webpage translation in
Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
strings manipulation. Since DOM access only happens in the main thread, it
brings the question of how to do it properly without causing jank.

This is the use case that I'm dealing with in bug 971043:

When the user decides to translate a webpage, we want to build a tree that
is a cleaned-up version of the page's DOM tree (to remove nodes that do not
contain any useful content for translation; more details in the bug for the
curious). To do this we must visit all elements and text nodes once and
decide which ones to keep and which ones to throw away.

One idea suggested is to perform the task in chunks to let the event loop
breathe in between. The problem is that the page can dynamically change and
then a tree representation of the page may no longer exist. A possible
solution to that is to only pause the page that is being translated (with,
say, EnterModalState) until we can finish working on it, while letting
other pages and the UI work normally. That sounds like a reasonable option to me
but I'd like to hear opinions.

Another option exists if it's possible to make a fast copy of the whole
DOM, and then work on this snapshot'ed copy which is not live. Better yet
if we can send this copy with a non-copy move to a Worker thread. But it
raises the question of whether the snapshot'ing itself will cause jank, and
whether the extra memory usage for this is worth the trade-off.

Even if we properly chunk the task, it is still bounded by the size of the
strings on the page. To decide if a text node should be kept or thrown away
we need to run a regexp on it, and there's no way to pause that midway
through. And after we have our tree representation, it must be serialized
and encodeURIComponent'ed to be sent to the translation service.





Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Robert O'Callahan
On Tue, Mar 4, 2014 at 9:19 AM, Boris Zbarsky bzbar...@mit.edu wrote:

 How feasible is just doing .innerHTML to do that, then doing some sort of
 async parse (e.g. XHR or DOMParser) to get a DOM snapshot?  That said, this
 would mean that you end up with a snapshot that's actual DOM stuff, on the
 main thread.


Wouldn't a deep clone of the root element be more efficient?

Rob
-- 
Jtehsauts  tshaei dS,o n Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o  Whhei csha iids  teoa
stiheer :p atroa lsyazye,d  'mYaonu,r  sGients  uapr,e  tfaokreg iyvoeunr,
'm aotr  atnod  sgaoy ,h o'mGee.t  uTph eann dt hwea lmka'n?  gBoutt  uIp
waanndt  wyeonut  thoo mken.o w


Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Boris Zbarsky

On 3/3/14 9:07 PM, Boris Zbarsky wrote:

   document.documentElement.cloneNode(true): ~18ms
   document.cloneNode(true): ~8ms


Oh, and the difference between these two is that in the former clones of 
img elements try to do image loads, which takes about 70% of the 
cloning time, but in the latter we're in a data document, so the image 
load start attempts only take about 15% of the time.
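
I.e., roughly:

  // Clone of the root element: the copy belongs to the live document,
  // so <img> clones try to kick off loads (~18ms total here).
  const elemClone = document.documentElement.cloneNode(true);

  // Clone of the Document itself: the copy is a data document, so the
  // image load start attempts are mostly short-circuited (~8ms total).
  const docClone = document.cloneNode(true);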


-Boris


Re: How to efficiently walk the DOM tree and its strings

2014-03-03 Thread Wesley Johnston
Android also ships a parser that we wrote for Reader mode:

http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/JSDOMParser.js

We've talked about extending it to do phone number/address detection as 
well, but haven't tried (reader mode doesn't need to modify the original DOM, 
unlike the examples here). Memory use (during the parse) isn't great, so the 
streaming parser actually sounds interesting... Thanks :)

- Wes

- Original Message -
From: Andrew Sutherland asutherl...@asutherland.org
To: dev-platform@lists.mozilla.org
Sent: Monday, March 3, 2014 3:57:04 PM
Subject: Re: How to efficiently walk the DOM tree and its strings

On 03/03/2014 03:19 PM, Boris Zbarsky wrote:
 That said, there might be JS implementations of an HTML5 parser out there.

The Gaia e-mail app has a streaming HTML parser in its worker-friendly 
sanitizer at 
https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js.
 
It's derived from jresig's 
http://ejohn.org/blog/pure-javascript-html-parser/

Note: There are probably better options out there, just thought I'd call 
it out.

Andrew