Re: How to efficiently walk the DOM tree and its strings
On Mon, Mar 3, 2014 at 10:19 PM, Boris Zbarsky bzbar...@mit.edu wrote:
> How feasible is just doing .innerHTML to do that, then doing some sort of
> async parse (e.g. XHR or DOMParser) to get a DOM snapshot?

Seems more efficient to write the walk in C++, since the innerHTML getter
already includes the walk in C++. How important is it to avoid C++?

On Mon, Mar 3, 2014 at 10:45 PM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:
> There's https://github.com/google/gumbo-parser which can be compiled to js.

The parser we use in Gecko can be compiled to JS using GWT. However, the
current glue code assumes the parser is running in the context of a browser
window object and a browser DOM. Writing glue code that assumes something
else about the environment should be easy.

Also, David Flanagan has implemented the HTML parsing algorithm
(pre-template; not sure if updated since) directly in JS.

On Tue, Mar 4, 2014 at 1:57 AM, Andrew Sutherland asutherl...@asutherland.org wrote:
> The Gaia e-mail app has a streaming HTML parser in its worker-friendly
> sanitizer at
> https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js

On Tue, Mar 4, 2014 at 7:14 AM, Wesley Johnston wjohns...@mozilla.com wrote:
> Android also ships a parser that we wrote for Reader mode:
> http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/JSDOMParser.js

It saddens me that we are using non-compliant ad hoc parsers when we already
have two spec-compliant (at least at some point in time) ones.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
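For reference, the innerHTML-plus-reparse idea Boris describes would look
roughly like the sketch below. This is only an illustration (the function name
is made up, nothing here is code from the tree): the serialization walk happens
in C++ inside the outerHTML getter, but DOMParser itself parses synchronously,
so a truly async variant would hand the string to a worker-side parser instead.

    // Hypothetical sketch: snapshot the live document as a string and re-parse
    // it into a detached document that can be walked without touching the page.
    function snapshotDocument(doc) {
      // The outerHTML getter does the tree walk and serialization in C++.
      var html = doc.documentElement.outerHTML;
      // Note: this parse is synchronous; it isolates the walk from the live
      // DOM, but it does not move the work off the main thread.
      return new DOMParser().parseFromString(html, "text/html");
    }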
Re: How to efficiently walk the DOM tree and its strings
Chrome imports a JS script into the webpage and this script does all the
translation work.

Felipe

On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.com wrote:
> On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote:
>> Hi everyone, I'm working on a feature to offer webpage translation in
>> Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
>> string manipulation. Since DOM access only happens on the main thread,
>> this raises the question of how to do it properly without causing jank.
>
> What does Chrome do?
>
> -Jeff
Re: How to efficiently walk the DOM tree and its strings
Thanks for the feedback so far!

If I go with the clone route (to work on the snapshotted version of the data),
how can I later associate the cloned nodes with the original nodes from the
document? One way I thought of is to set a userdata on the DOM nodes and then
use the clone handler callback to associate the cloned node with the original
one (through weak refs or a WeakMap). That would mean iterating first through
all nodes to add the handlers, but that's probably fine (I don't need to
analyze anything or visit text nodes).

I think serializing and re-parsing everything in the worker is not the ideal
solution unless we can find a way to also keep accurate associations with the
original nodes from content. Anything that introduces a possibly lossy data
aspect will probably hurt translation, which is already an inaccurate science.

On Tue, Mar 4, 2014 at 6:26 AM, Andrew Sutherland asutherl...@asutherland.org wrote:
> On 03/04/2014 03:13 AM, Henri Sivonen wrote:
>> It saddens me that we are using non-compliant ad hoc parsers when we
>> already have two spec-compliant (at least at some point in time) ones.
>
> Interesting! I assume you are referring to:
> https://github.com/davidflanagan/html5/blob/master/html5parser.js
> which seems to be (explicitly) derived from:
> https://github.com/aredridel/html5
> which in turn seems to actually include a few parser variants.
>
> Per the discussion with you on
> https://groups.google.com/d/msg/mozilla.dev.webapi/wDFM_T9v7Tc/Nr9Df4FUwuwJ
> for the Gaia e-mail app we initially ended up using an in-page data
> document mechanism for sanitization. We later migrated to using a
> worker-based parser. There were some coordination hiccups with this
> migration (https://bugzil.la/814257) and some B2G time pressure, so a
> comprehensive survey of HTML parsers did not happen so much.
>
> While we have a defense-in-depth strategy (CSP and iframe sandbox should be
> protecting us from the worst possible scenarios) and we're hopeful that
> Service Workers will eventually let us provide nsIContentPolicy-level
> protection, the quality of the HTML parser is of course fairly important[1]
> to the operation of the HTML sanitizer. If you'd like to bless a specific
> implementation for workers to perform streaming HTML parsing or some other
> explicit strategy, I'd be happy to file a bug for us to go in that
> direction.
>
> Because we are using a white-list based mechanism and are fairly limited
> and arguably fairly luddite in what we whitelist, it's my hope that our
> errors are on the side of safety (and breaking adventurous HTML email :),
> but that is indeed largely hope. Your input is definitely appreciated,
> especially as it relates to prioritizing such enhancements and potential
> risk from our current strategy.
>
> Andrew
>
> 1: understatement
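As an illustration of the clone-to-original association Felipe describes, here
is a minimal, hypothetical sketch that walks the original subtree and its deep
clone in lockstep and records the mapping in a WeakMap, instead of relying on
userdata clone handlers; the function name is made up and this is not code
from anywhere in the tree.

    // Hypothetical sketch: clone a subtree and remember, for every cloned
    // node, which original node it came from. cloneNode(true) preserves tree
    // shape, so a lockstep walk of both trees lines the nodes up.
    function buildCloneMap(originalRoot) {
      var clonedRoot = originalRoot.cloneNode(true);
      var map = new WeakMap();  // cloned node -> original node
      (function walk(orig, clone) {
        map.set(clone, orig);
        for (var o = orig.firstChild, c = clone.firstChild;
             o && c;
             o = o.nextSibling, c = c.nextSibling) {
          walk(o, c);
        }
      })(originalRoot, clonedRoot);
      return { clone: clonedRoot, map: map };
    }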
Re: We live in a memory-constrained world
On Tue, Mar 04, 2014 at 09:48:33AM +0200, Henri Sivonen wrote:
> On Fri, Feb 28, 2014 at 8:09 PM, L. David Baron dba...@dbaron.org wrote:
>> In other words, whenever you have a pointer in a static data structure
>> pointing to some other data, that pointer needs to get fixed up when the
>> library loads, which makes the memory that pointer is in less likely to be
>> shared across processes (depending, I guess, on how many processes are
>> able to load the library at its default address, which may in turn depend
>> on security features that try to randomize library base addresses). This
>> also slows down loading of shared libraries.
>
> So all things considered, do we want things like static atoms and the HTML
> parser's pre-interned element name and attribute name objects (which have
> pointers to static atoms and a virtual method) to move from the heap to
> POD-like syntax even if it results in relocations or, with MSVC, static
> initializers?

It's generally gcc that does dumb things with static initializers, but that
can generally be fixed with liberal use of constexpr. Anyway, I suspect the
real answer is "it's complicated" ;) but it's probably a good idea at least
for things on the startup path; adding a relocation and saving a call to
malloc before we can parse HTML is probably a win on its own.

>> Shouldn't be an issue with Nuwa-cloned processes on B2G, though.
>
> Are static atoms and the HTML parser's pre-interned element name and
> attribute name objects that are on the heap shared between processes under
> Nuwa already? I.e. is the heap cloned with copy-on-write sharing? On the
> memory page granularity, right?

AIUI yes, the heap is made copy-on-write at the time fork(2) is called to
create the Nuwa process, and presumably we don't have KSM turned on, so only
the initial heap will ever be shared. I'd guess the HTML parser stuff you
mentioned is created before we fork the Nuwa process and so it's included.

> Do we know if the stuff we heap-allocate at startup pack nicely into memory
> pages so that they won't have free spots that the allocator would use after
> cloning?

Probably not perfectly, but it seems like in practice it does fairly well;
otherwise Nuwa wouldn't help.

Trev
Re: How to efficiently walk the DOM tree and its strings
On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote:
> If I go with the clone route (to work on the snapshotted version of the
> data), how can I later associate the cloned nodes with the original nodes
> from the document? One way I thought of is to set a userdata on the DOM
> nodes and then use the clone handler callback to associate the cloned node
> with the original one (through weak refs or a WeakMap). That would mean
> iterating first through all nodes to add the handlers, but that's probably
> fine (I don't need to analyze anything or visit text nodes).
>
> I think serializing and re-parsing everything in the worker is not the
> ideal solution unless we can find a way to also keep accurate associations
> with the original nodes from content. Anything that introduces a possibly
> lossy data aspect will probably hurt translation, which is already an
> inaccurate science.

Maybe you can do the translation incrementally, and just annotate the DOM
with custom attributes (or userdata) to record the progress of the
translation? Plus a reference to the last translated node (subtree) to speed
up finding the next node subtree to translate. I assume it would be OK to
translate one CSS block at a time.

Rob
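A minimal sketch of the incremental annotation Rob suggests, assuming a
hypothetical data-translated marker attribute, a caller-supplied translate
function, and a chunk size; none of these names are part of any real API.

    // Hypothetical sketch: translate the live DOM a slice at a time,
    // recording progress with an attribute and keeping a cursor to the last
    // handled element so the next slice resumes where this one stopped.
    var lastTranslated = null;

    function translateNextChunk(root, translateFn, chunkSize) {
      var doc = root.ownerDocument || root;
      var walker = doc.createTreeWalker(root, NodeFilter.SHOW_ELEMENT, null);
      if (lastTranslated) {
        walker.currentNode = lastTranslated;  // resume from the cursor
      }
      var done = 0, node = null;
      while (done < chunkSize && (node = walker.nextNode())) {
        if (node.hasAttribute("data-translated")) {
          continue;  // already handled in an earlier slice
        }
        translateFn(node);                          // caller-supplied work
        node.setAttribute("data-translated", "true");
        lastTranslated = node;
        done++;
      }
      return node !== null;  // false once the whole tree has been visited
    }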
Re: Column numbers appended to URLs recently
On 3/3/14, 12:54 PM, Jan Honza Odvarko wrote:
> URLs in stack traces for exception objects have been recently changed.
> There is a column number appended at the end (I am seeing this in Nightly,
> but it could also be in Aurora). Code that parses error.stack now needs to
> handle the case where both a line number and a column number appear after
> the URL.

This should be easy to fix. What's affected? Just Firebug?

-j
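For anyone updating such parsing code, something along the lines of the sketch
below should cope with both the old url:line and the new url:line:column frame
shapes; the function and field names are purely illustrative.

    // Hypothetical sketch: parse one frame of a Gecko error.stack string,
    // accepting "functionName@url:line" or "functionName@url:line:column".
    function parseStackFrame(frame) {
      var match = /^(.*)@(.*?):(\d+)(?::(\d+))?$/.exec(frame);
      if (!match) {
        return null;
      }
      return {
        functionName: match[1],
        url: match[2],
        line: parseInt(match[3], 10),
        column: match[4] ? parseInt(match[4], 10) : null  // null on old builds
      };
    }

    // e.g. err.stack.trim().split("\n").map(parseStackFrame)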
Re: We live in a memory-constrained world
On Mon, Mar 3, 2014 at 11:48 PM, Henri Sivonen hsivo...@hsivonen.fi wrote:
> Are static atoms and the HTML parser's pre-interned element name and
> attribute name objects that are on the heap shared between processes under
> Nuwa already? I.e. is the heap cloned with copy-on-write sharing? On the
> memory page granularity, right? Do we know if the stuff we heap-allocate at
> startup pack nicely into memory pages so that they won't have free spots
> that the allocator would use after cloning?

https://bugzilla.mozilla.org/show_bug.cgi?id=948648 is open for investigating
this. Preliminary results seem to indicate that things work pretty well at
the moment -- i.e. not much sharing is lost -- but I may be misinterpreting.

Nick
Re: W3C Proposed Recommendations: WAI-ARIA (accessibility)
On Tuesday 2014-02-25 09:43 -0500, david bolter wrote:
> I support this W3C Recommendation.

Yep. While I wasn't entirely happy with some of the history that led to the
current state, I agree we should support it.

(In particular, in the early days of ARIA I was told, in private
conversations, that it was intended as a temporary measure until HTML5 was
ready and had enough of the semantics needed. But I never asked the people
telling me that to document that intent publicly, and as far as I know
there's no public record of it, and it probably didn't represent any
consensus at the time. I'd probably have preferred that the semantics needed
for accessibility were part of the semantics of the language rather than a
separate add-on that can be inconsistent, but that's also not the way the Web
platform works today. I did learn something about the value of working in
public, though.)

So I'll submit a review in support of advancing to Recommendation as-is.

-David

On Tue, Feb 11, 2014 at 2:22 PM, L. David Baron dba...@dbaron.org wrote:
> W3C recently published the following proposed recommendations (the stage
> before W3C's final stage, Recommendation):
>
> Accessible Rich Internet Applications (WAI-ARIA) 1.0
> http://www.w3.org/TR/2014/PR-wai-aria-20140206/
>
> WAI-ARIA 1.0 User Agent Implementation Guide
> http://www.w3.org/TR/2014/PR-wai-aria-implementation-20140206/

--
L. David Baron                  http://dbaron.org/
Mozilla                         https://www.mozilla.org/
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
  - Robert Frost, Mending Wall (1914)
Re: How to efficiently walk the DOM tree and its strings
The actual translation needs to happen at once, but it's OK if I can work on
the chunks incrementally and only send everything off to the translation
service when it is ready. What I need to find, then, is a good (and fast)
partitioning algorithm that will give me a list of several blocks to
translate. A CSS block is a good start, but I need something more detailed
than that, for some of these reasons:

- I can't skip invisible or display:none nodes, because websites have
navigation menus etc. that have text on them and need to be translated (I
don't know what the correct definition of the CSS block you mention is, so
I'm not sure whether it covers this or not)

- In direct opposition to the first point, I can't blindly consider all nodes
(including invisible ones) with text content on them, because websites have
script, style, ... tags in the body which should be skipped

Also, some other properties that I'd like this algorithm to have:

- It would be nice if it could treat a <ul> <li>a</li> <li>b</li> </ul> as one
individual block instead of one block per <li> (and other similar constructs)
-- [not a major req]

- It should only give me blocks that have useful text content inside to be
translated. For example, for sites with a lot of <div><div><div><div> (or
worse, <table><tbody><tr><td><table>... ad infinitum) nesting, I'm only
interested in the blocks that have actual text content on them (which can
probably be defined as having at least one non-whitespace-only child text
node). The more junk (useless nodes) that this algorithm can skip, the
better.

Then I imagine we'll be good on performance if I implement it in C++ and have
all other handling and further filtering done in JS, one chunk at a time.

Felipe

On Tue, Mar 4, 2014 at 6:02 PM, Robert O'Callahan rob...@ocallahan.org wrote:
> On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote:
>> If I go with the clone route (to work on the snapshotted version of the
>> data), how can I later associate the cloned nodes with the original nodes
>> from the document? [...]
>
> Maybe you can do the translation incrementally, and just annotate the DOM
> with custom attributes (or userdata) to record the progress of the
> translation? Plus a reference to the last translated node (subtree) to
> speed up finding the next node subtree to translate. I assume it would be
> OK to translate one CSS block at a time.
>
> Rob
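To make the requirements Felipe lists above concrete, here is a rough,
hypothetical JS sketch of such a partitioner (the real thing would presumably
be C++): it walks elements with a TreeWalker, rejects <script>/<style>
subtrees, deliberately ignores visibility, and keeps only elements that
directly own non-whitespace text. It does not attempt the <ul>/<li> grouping
wish, and the names are made up.

    // Hypothetical sketch: collect candidate translation blocks, i.e.
    // elements with at least one non-whitespace-only child text node,
    // skipping <script>/<style>/<noscript> subtrees but keeping invisible
    // (display:none) content such as navigation menus.
    var SKIP_TAGS = { SCRIPT: true, STYLE: true, NOSCRIPT: true };

    function collectTranslationBlocks(root) {
      var doc = root.ownerDocument || root;
      var blocks = [];
      var walker = doc.createTreeWalker(
          root, NodeFilter.SHOW_ELEMENT, {
            acceptNode: function(node) {
              // FILTER_REJECT skips the element and its whole subtree.
              return SKIP_TAGS[node.tagName] ? NodeFilter.FILTER_REJECT
                                             : NodeFilter.FILTER_ACCEPT;
            }
          });
      for (var node = walker.currentNode; node; node = walker.nextNode()) {
        for (var child = node.firstChild; child; child = child.nextSibling) {
          if (child.nodeType === Node.TEXT_NODE && /\S/.test(child.data)) {
            blocks.push(node);  // owns real text; junk wrapper divs never match
            break;
          }
        }
      }
      return blocks;
    }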
Re: Consensus sought - when to reset try repository?
On 2/28/14, 5:24 PM, Hal Wine wrote:
> tl;dr: what is the balance point between pushes to try taking too long and
> losing repository history of recent try pushes?
>
> Summary: As most developers have experienced, pushing to try can sometimes
> take a long time. Once it takes too long (as measured by screams of pain in
> #releng), a try [repository] reset is scheduled. This hurts productivity
> and increases frustration for everyone involved (devs, IT, RelEng). We
> don't want to do this anymore.
>
> A reset of the try repository deletes the existing contents and replaces
> them with a fresh clone of mozilla-central. While the tbpl information will
> remain valid for any completed build, any attempt to view the diffs for a
> try build will fail (unless you already had them in your local repository).
>
> Progress on resolution of the root cause:
> - IT has made tremendous progress in reducing the occurrence of long push
> times, but they still are not predictable. Various attempts at
> monitoring[1] and auto correction[2] have not been successful in improving
> the situation. Work continues on additional changes that should improve the
> situation[3].
>
> The most recent mitigation strategy is to trade the unknown timing
> disruption of the push times increasing to a pain threshold for a known
> timing of resetting the try repository every TCW (tree closing window -
> every 6 wks currently). However, we heard from some folks that this is too
> often. The most recent try-reset-triggered-by-pain came after a duration of
> 6 months[4]. There was at least one report of problems just 3 months after
> a reset[5].
>
> So, the question is - what say developers -- what's the balance point
> between:
> - too often, making collaborating on try pushes hard
> - too infrequent, introducing increasing push times

I wouldn't have such a big issue with Try resets if we didn't lose
information in the process. I believe every time there's been a Try reset,
I've lost data from a recent (1 week) Try push and I needed to re-run that
job - incurring extra cost to Mozilla and wasting my time. I also
periodically find myself wanting to answer questions like what percentage of
tree closures are due to pushes that didn't go to Try first. Data loss
stinks.

I'd say the goal should be no data loss. I have an idea that will enable us
to achieve this.

Let's expose every newly-reset instance of the Try repo as a separate URL. We
would still push to ssh://hg.mozilla.org/try, but the URLs printed and the
URLs used by automation would be URLs to repos that would never go away, e.g.
https://hg.mozilla.org/tries/try1/rev/840f122d1286 (try1 being the important
bit in there). When we reset Try, you'd hand out URLs to try2. You could
reset the writable Try repo as frequently as you desired, and aside from a
slightly different repo URL being given out, nobody should notice.

The main drawbacks of this approach that I can think of are all in
automation: parts of automation are very repo/URL centric, and having
effectively dynamic URLs might break assumptions. But making automation work
against arbitrary URLs is a good thing, as it allows automation to be more
flexible, and this allows people to experiment with alternate repo hosting,
landing tools, landing-integrated code review tools, etc. without requiring
special involvement from RelEng. Everything is a web service and is
self-service, etc.
Re: How to efficiently walk the DOM tree and its strings
On Tuesday, March 4, 2014 1:26:15 AM UTC-8, somb...@gmail.com wrote:
> While we have a defense-in-depth strategy (CSP and iframe sandbox should be
> protecting us from the worst possible scenarios) and we're hopeful that
> Service Workers will eventually let us provide nsIContentPolicy-level
> protection, the quality of the HTML parser is of course fairly important[1]
> to the operation of the HTML sanitizer.

Sorry to go off-topic, but how are ServiceWorkers different from normal
Workers here? I'm asking without context, so forgive me if I misunderstood.

Thanks,
Nikhil