Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Henri Sivonen
On Mon, Mar 3, 2014 at 10:19 PM, Boris Zbarsky bzbar...@mit.edu wrote:
 How feasible is just doing .innerHTML to do that, then doing some sort of
 async parse (e.g. XHR or DOMParser) to get a DOM snapshot?

Seems more efficient to write the walk in C++, since the innerHTML
getter already includes the walk in C++. How important is it to avoid
C++?
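
(For reference, the snapshot approach amounts to something like the sketch
below. This is hypothetical code, assuming chrome-privileged JS with access
to the content document, not an existing helper; names like snapshotDocument
and targetDocument are placeholders. The serialization is the one C++ walk,
and the re-parse yields a detached tree that can be examined without touching
the live DOM again.)

    // Hypothetical sketch: serialize the live document once, then
    // re-parse it into a detached document that can be walked freely.
    function snapshotDocument(contentDoc) {
      var html = contentDoc.documentElement.outerHTML;  // one C++ walk
      return new DOMParser().parseFromString(html, "text/html");
    }

    // Walking the snapshot never touches the live DOM.
    var snapshot = snapshotDocument(targetDocument);  // the page being examined
    var walker = snapshot.createTreeWalker(snapshot.body,
                                           NodeFilter.SHOW_TEXT);
    var texts = [];
    while (walker.nextNode()) {
      texts.push(walker.currentNode.data);
    }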

On Mon, Mar 3, 2014 at 10:45 PM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:
 There's https://github.com/google/gumbo-parser which can be compiled to js.

The parser we use in Gecko can be compiled to JS using GWT. However,
the current glue code assumes the parser is running in the context of
a browser window object and a browser DOM. Writing glue code that
makes different assumptions about the environment should be easy.

Also, David Flanagan has implemented the HTML parsing algorithm
(pre-template; not sure if updated since) directly in JS.

On Tue, Mar 4, 2014 at 1:57 AM, Andrew Sutherland
asutherl...@asutherland.org wrote:
 The Gaia e-mail app has a streaming HTML parser in its worker-friendly
 sanitizer at
 https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js.

On Tue, Mar 4, 2014 at 7:14 AM, Wesley Johnston wjohns...@mozilla.com wrote:
 Android also ships a parser that we wrote for Reader mode:

 http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/JSDOMParser.js

It saddens me that we are using non-compliant ad hoc parsers when we
already have two spec-compliant (at least at some point in time) ones.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Felipe G
Chrome imports a JS script into the webpage and this script does all the
translation work.

Felipe

On Mon, Mar 3, 2014 at 4:31 PM, Jeff Muizelaar jmuizel...@mozilla.com wrote:


 On Mar 3, 2014, at 2:28 PM, Felipe G fel...@gmail.com wrote:

  Hi everyone, I'm working on a feature to offer webpage translation in
  Firefox. Translation involves, quite unsurprisingly, a lot of DOM and
  string manipulation. Since DOM access only happens on the main thread,
  this raises the question of how to do it properly without causing jank.

 What does Chrome do?

 -Jeff



Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Felipe G
Thanks for the feedback so far!

If I go with the clone route (to work on the snapshotted version of the
data), how can I later associate the cloned nodes with the original nodes
from the document?  One way I thought of is to set userdata on the
DOM nodes and then use the clone handler callback to associate the cloned
node with the original one (through weak refs or a WeakMap).  That would
mean iterating through all nodes first to add the handlers, but that's
probably fine (I don't need to analyze anything or visit text nodes).

I think serializing and re-parsing everything in the worker is not the
ideal solution unless we can find a way to also keep accurate associations
with the original nodes from content. Anything that introduces a possibly
lossy aspect to the data will probably hurt translation, which is already an
inaccurate science.
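
For what it's worth, a minimal sketch of one way to keep that association,
using a parallel walk over the original and the clone plus a WeakMap rather
than userdata clone handlers (the helper name and the approach are purely
illustrative):

    // Hypothetical sketch: clone a subtree and map each cloned node back
    // to its original node via a WeakMap.
    function cloneWithMapping(root) {
      var clone = root.cloneNode(true);
      var originalFor = new WeakMap();
      (function walk(orig, copy) {
        originalFor.set(copy, orig);
        for (var o = orig.firstChild, c = copy.firstChild;
             o && c;
             o = o.nextSibling, c = c.nextSibling) {
          walk(o, c);
        }
      })(root, clone);
      return { clone: clone, originalFor: originalFor };
    }

    // Usage: translate text in snap.clone off the live tree, then write the
    // results back through snap.originalFor.get(clonedNode).
    var snap = cloneWithMapping(document.body);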


On Tue, Mar 4, 2014 at 6:26 AM, Andrew Sutherland 
asutherl...@asutherland.org wrote:

 On 03/04/2014 03:13 AM, Henri Sivonen wrote:

 It saddens me that we are using non-compliant ad hoc parsers when we
 already have two spec-compliant (at least at some point in time) ones.


 Interesting!  I assume you are referring to:
 https://github.com/davidflanagan/html5/blob/master/html5parser.js

 Which seems to be (explicitly) derived from:
 https://github.com/aredridel/html5

 Which in turn seems to actually include a few parser variants.

 Per the discussion with you on
 https://groups.google.com/d/msg/mozilla.dev.webapi/wDFM_T9v7Tc/Nr9Df4FUwuwJ
 for the Gaia e-mail app, we initially ended up using an in-page data
 document mechanism for sanitization.  We later migrated to using a
 worker-based parser.  There were some coordination hiccups with this
 migration (https://bugzil.la/814257) and some B2G time pressure, so a
 comprehensive survey of HTML parsers did not really happen.

 While we have a defense-in-depth strategy (CSP and iframe sandbox should
 be protecting us from the worst possible scenarios) and we're hopeful that
 Service Workers will eventually let us provide nsIContentPolicy-level
 protection, the quality of the HTML parser is of course fairly important[1]
 to the operation of the HTML sanitizer.  If you'd like to bless a specific
 implementation for workers to perform streaming HTML parsing or some
 other explicit strategy, I'd be happy to file a bug for us to go in that
 direction.  Because we are using a white-list based mechanism and are
 fairly limited and arguably fairly luddite in what we whitelist, it's my
 hope that our errors are on the side of safety (and breaking adventurous
 HTML email :), but that is indeed largely hope.  Your input is definitely
 appreciated, especially as it relates to prioritizing such enhancements and
 potential risk from our current strategy.

 Andrew


 1: understatement




Re: We live in a memory-constrained world

2014-03-04 Thread Trevor Saunders
On Tue, Mar 04, 2014 at 09:48:33AM +0200, Henri Sivonen wrote:
 On Fri, Feb 28, 2014 at 8:09 PM, L. David Baron dba...@dbaron.org wrote:
  In other words, whenever you have a pointer in a static data
  structure pointing to some other data, that pointer needs to get
  fixed up when the library loads, which makes the memory that pointer
  is in less likely to be shared across processes (depending, I guess,
  on how many processes are able to load the library at its default
  address, which may in turn depend on security features that try to
  randomize library base addresses).  This also slows down loading of
  shared libraries.
 
 So all things considered, do we want things like static atoms and the
 HTML parser's pre-interned element name and attribute name objects
 (which have pointers to static atoms and a virtual method) to move
 from the heap to POD-like syntax even if it results in relocations or,
 with MSVC, static initializers?

It's generally gcc that does dumb things with static initializers, but
that can generally be fixed with liberal use of constexpr.

Anyway, I suspect the real answer is "it's complicated" ;) but it's
probably a good idea, at least for things on the startup path: adding a
relocation and saving a call to malloc before we can parse HTML is
probably a win on its own.

  Shouldn't be an issue with Nuwa-cloned processes on B2G, though.
 
 Are static atoms and the HTML parser's pre-interned element name and
 attribute name objects that are on the heap shared between processes
 under Nuwa already? I.e. is the heap cloned with copy-on-write
 sharing? On the memory page granularity, right? Do we know if the

AIUI, yes: the heap is made copy-on-write at the time fork(2) is called to
create the Nuwa process, and presumably we don't have KSM turned on, so
only the initial heap will ever be shared.  I'd guess that the HTML parser
stuff you mentioned is created before we fork the Nuwa process, so it's
included.

 stuff we heap-allocate at startup pack nicely into memory pages so
 that they won't have free spots that the allocator would use after
 cloning?

Probably not perfectly, but it seems like in practice it does fairly
well; otherwise Nuwa wouldn't help.

Trev

 
 -- 
 Henri Sivonen
 hsivo...@hsivonen.fi
 https://hsivonen.fi/




Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Robert O'Callahan
On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote:

 If I go with the clone route (to work on the snapshotted version of the
 data), how can I later associate the cloned nodes with the original nodes
 from the document?  One way I thought of is to set userdata on the
 DOM nodes and then use the clone handler callback to associate the cloned
 node with the original one (through weak refs or a WeakMap).  That would
 mean iterating through all nodes first to add the handlers, but that's
 probably fine (I don't need to analyze anything or visit text nodes).

 I think serializing and re-parsing everything in the worker is not the
 ideal solution unless we can find a way to also keep accurate associations
 with the original nodes from content. Anything that introduces a possibly
 lossy aspect to the data will probably hurt translation, which is already an
 inaccurate science.


Maybe you can do the translation incrementally, and just annotate the DOM
with custom attributes (or userdata) to record the progress of the
translation? Plus a reference to the last translated node (subtree) to speed
up finding the next subtree to translate. I assume it would be OK to
translate one CSS block at a time.
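
(A rough sketch of what that bookkeeping could look like, assuming
element-granularity progress tracking; the data-translated attribute and the
helper names are made up for illustration:)

    // Hypothetical sketch: mark finished elements and resume the walk from
    // the last one translated.
    var lastTranslated = null;

    function markTranslated(element, translatedText) {
      element.textContent = translatedText;
      element.setAttribute("data-translated", "true");  // made-up marker
      lastTranslated = element;
    }

    function nextUntranslated(root) {
      var walker = root.ownerDocument.createTreeWalker(
          root, NodeFilter.SHOW_ELEMENT);
      if (lastTranslated) {
        walker.currentNode = lastTranslated;  // skip what's already done
      }
      var node;
      while ((node = walker.nextNode())) {
        if (!node.hasAttribute("data-translated")) {
          return node;
        }
      }
      return null;  // nothing left to translate
    }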

Rob
-- 
Jtehsauts  tshaei dS,o n Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
le atrhtohu gthot sf oirng iyvoeu rs ihnesa.rt sS?o  Whhei csha iids  teoa
stiheer :p atroa lsyazye,d  'mYaonu,r  sGients  uapr,e  tfaokreg iyvoeunr,
'm aotr  atnod  sgaoy ,h o'mGee.t  uTph eann dt hwea lmka'n?  gBoutt  uIp
waanndt  wyeonut  thoo mken.o w


Re: Column numbers appended to URLs recently

2014-03-04 Thread Jason Orendorff
On 3/3/14, 12:54 PM, Jan Honza Odvarko wrote:
 URLs in stack traces for exception objects have recently been changed. There
 is a column number appended at the end (I am seeing this in Nightly, but it
 could also be in Aurora).

Code that parses error.stack now needs to handle the case where both a
line number and a column number appear after the URL.

This should be easy to fix. What's affected? Just Firebug?
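
For anyone updating such parsing code, here is a sketch of a frame regex
that accepts both the old func@url:line form and the new func@url:line:column
form (illustrative only; not necessarily how Firebug does it):

    // Hypothetical sketch: parse one error.stack frame; column is optional.
    function parseStackFrame(frame) {
      var m = /^(.*)@(.*?):(\d+)(?::(\d+))?$/.exec(frame);
      if (!m) {
        return null;
      }
      return {
        functionName: m[1],
        url: m[2],
        line: parseInt(m[3], 10),
        column: m[4] ? parseInt(m[4], 10) : null
      };
    }

    // parseStackFrame("doIt@https://example.org/app.js:10:5")
    //   -> { functionName: "doIt", url: "https://example.org/app.js",
    //        line: 10, column: 5 }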

-j



Re: We live in a memory-constrained world

2014-03-04 Thread Nicholas Nethercote
On Mon, Mar 3, 2014 at 11:48 PM, Henri Sivonen hsivo...@hsivonen.fi wrote:

 Are static atoms and the HTML parser's pre-interned element name and
 attribute name objects that are on the heap shared between processes
 under Nuwa already? I.e. is the heap cloned with copy-on-write
 sharing? On the memory page granularity, right? Do we know if the
 stuff we heap-allocate at startup pack nicely into memory pages so
 that they won't have free spots that the allocator would use after
 cloning?

https://bugzilla.mozilla.org/show_bug.cgi?id=948648 is open for
investigating this. Preliminary results seem to indicate that things
work pretty well at the moment -- i.e. not much sharing is lost -- but
I may be misinterpreting.

Nick


Re: W3C Proposed Recommendations: WAI-ARIA (accessibility)

2014-03-04 Thread L. David Baron
On Tuesday 2014-02-25 09:43 -0500, david bolter wrote:
 I support this W3C Recommendation.

Yep.

While I wasn't entirely happy with some of the history that led to
the current state, I agree we should support it.

(In particular, in the early days of ARIA I was told, in private
conversations, that it was intended as a temporary measure until
HTML5 was ready and had enough of the semantics needed.  But I never
asked the people telling me this to document that intent publicly,
and as far as I know there's no public record of it, and it probably
didn't represent any consensus at the time.
I'd probably have preferred that the semantics needed for
accessibility were part of the semantics of the language rather than
a separate add-on that can be inconsistent, but that's also not the
way the Web platform works today.  I did learn something about the
value of working in public, though.)

So I'll submit a review in support of advancing to Recommendation
as-is.

-David

 On Tue, Feb 11, 2014 at 2:22 PM, L. David Baron dba...@dbaron.org wrote:
  W3C recently published the following proposed recommendations (the
  stage before W3C's final stage, Recommendation):
 
Accessible Rich Internet Applications (WAI-ARIA) 1.0
http://www.w3.org/TR/2014/PR-wai-aria-20140206/
 
WAI-ARIA 1.0 User Agent Implementation Guide
http://www.w3.org/TR/2014/PR-wai-aria-implementation-20140206/

-- 
L. David Baron                         http://dbaron.org/
Mozilla                                https://www.mozilla.org/
 Before I built a wall I'd ask to know
 What I was walling in or walling out,
 And to whom I was like to give offense.
   - Robert Frost, Mending Wall (1914)




Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread Felipe G
The actual translation needs to happen all at once, but it's OK if I can work
on the chunks incrementally and only send everything off to the translation
service once it's all ready.  What I need to find, then, is a good (and
fast) partitioning algorithm that will give me a list of blocks to
translate. A CSS block is a good start, but I need something more detailed
than that, for these reasons:

- I can't skip invisible or display:none nodes, because websites have
navigation menus etc. that have text in them and need to be translated
(I don't know the exact definition of the CSS block you mention, so I'm
not sure whether it covers this or not).
- In direct opposition to the first point, I can't blindly consider
all nodes (including invisible ones) with text content, because
websites have <script>, <style>... tags in the body, which should be skipped.


Also, some other properties that I'd like this algorithm to have:

- It would be nice if it could treat a <ul><li>a</li><li>b</li></ul> as one
individual block instead of one block per <li> (and other similar
constructs) -- [not a major requirement]

- It should only give me blocks that have useful text content inside to be
translated. For example, for sites with a lot of <div><div><div><div> (or
worse, <table><tbody><tr><td><table>... ad infinitum) nesting, I'm only
interested in the blocks that have actual text content in them (which can
probably be defined as having at least one non-whitespace-only child text
node).


The more junk (useless nodes) this algorithm can skip, the better.
I imagine performance will then be fine if I implement it in C++ and
have all other handling and further filtering done in JS, one chunk at a
time.
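
To make the criteria above concrete, here is a rough sketch of the filtering
step. The skip list and the non-whitespace text-node test are simplifications
for illustration, not a proposed final algorithm, and it ignores the ul/li
grouping wish entirely:

    // Hypothetical sketch: collect elements that directly contain real text,
    // skipping script/style subtrees.
    var SKIP = { SCRIPT: true, STYLE: true, NOSCRIPT: true };

    function collectTranslationBlocks(root) {
      var blocks = [];
      (function visit(element) {
        if (SKIP[element.nodeName]) {
          return;  // never descend into script/style
        }
        var hasText = false;
        for (var child = element.firstChild; child; child = child.nextSibling) {
          if (child.nodeType === Node.TEXT_NODE && /\S/.test(child.data)) {
            hasText = true;
          } else if (child.nodeType === Node.ELEMENT_NODE) {
            visit(child);
          }
        }
        if (hasText) {
          blocks.push(element);  // wrapper-only div/td nesting never gets here
        }
      })(root);
      return blocks;
    }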


Felipe



On Tue, Mar 4, 2014 at 6:02 PM, Robert O'Callahan rob...@ocallahan.org wrote:

 On Wed, Mar 5, 2014 at 8:47 AM, Felipe G fel...@gmail.com wrote:

 If I go with the clone route (to work on the snapshotted version of the
 data), how can I later associate the cloned nodes with the original nodes
 from the document?  One way I thought of is to set userdata on the
 DOM nodes and then use the clone handler callback to associate the cloned
 node with the original one (through weak refs or a WeakMap).  That would
 mean iterating through all nodes first to add the handlers, but that's
 probably fine (I don't need to analyze anything or visit text nodes).

 I think serializing and re-parsing everything in the worker is not the
 ideal solution unless we can find a way to also keep accurate associations
 with the original nodes from content. Anything that introduces a possibly
 lossy aspect to the data will probably hurt translation, which is already an
 inaccurate science.


 Maybe you can do the translation incrementally, and just annotate the DOM
 with custom attributes (or userdata) to record the progress of the
 translation? Plus a reference to the last translated node (subtree) to speed
 up finding the next subtree to translate. I assume it would be OK to
 translate one CSS block at a time.

 Rob



Re: Consensus sought - when to reset try repository?

2014-03-04 Thread Gregory Szorc

On 2/28/14, 5:24 PM, Hal Wine wrote:

tl;dr: what is the balance point between pushes to try taking too long
and losing repository history of recent try pushes?

Summary:


As most developers have experienced, pushing to try can sometimes take a
long time. Once it takes too long (as measured by screams of pain in
#releng), a try [repository] reset is scheduled. This hurts productivity and
increases frustration for everyone involved (devs, IT, RelEng). We don't
want to do this anymore.

A reset of the try repository deletes the existing contents and replaces
them with a fresh clone from mozilla-central. While the tbpl
information will remain valid for any completed build, any attempt to
view the diffs for a try build will fail (unless you already had them in
your local repository).

Progress on resolution of the root cause:
-----------------------------------------

IT has made tremendous progress in reducing the occurrence of long push
times, but they still are not predictable. Various attempts at
monitoring[1] and auto correction[2] have not been successful in
improving the situation. Work continues on additional changes that
should improve the situation[3].

The most recent mitigation strategy is to trade the unpredictable
disruption of push times increasing to a pain threshold for the known
timing of resetting the try repository every TCW (tree closing
window; currently every 6 weeks). However, we heard from some folks that
this is too often.

The most recent pain-triggered try reset came after a duration of 6
months[4]. There was at least one report of problems just 3 months after a
reset[5].

So, the question is - what say developers -- what's the balance point
between:
  - too often, making collaborating on try pushes hard
  - too infrequent, introducing increasing push times


I wouldn't have such a big issue with Try resets if we didn't lose 
information in the process. I believe every time there's been a Try 
reset, I've lost data from a recent (1 week) Try push and I needed to 
re-run that job - incurring extra cost to Mozilla and wasting my time. I 
also periodically find myself wanting to answer questions like what 
percentage of tree closures are due to pushes that didn't go to Try 
first. Data loss stinks.


I'd say the goal should be no data loss. I have an idea that will 
enable us to achieve this.


Let's expose every newly-reset instance of the Try repo as a separate 
URL. We would still push to ssh://hg.mozilla.org/try, but the URLs 
printed and the URLs used by automation would be URLs to repos that 
would never go away. e.g. 
https://hg.mozilla.org/tries/try1/rev/840f122d1286 (try1 being the 
important bit in there). When we reset Try, you'd hand out URLs to 
try2. You could reset the writable Try repo as frequently as you 
desired, and aside from a slightly different repo URL being given out, 
nobody should notice.


The main drawbacks of this approach that I can think of are all in 
automation: parts of automation are very repo/URL-centric, and having 
effectively dynamic URLs might break assumptions. But making automation 
work against arbitrary URLs is a good thing, as it allows automation to 
be more flexible and lets people experiment with alternate repo hosting, 
landing tools, landing-integrated code review tools, etc., without 
requiring special involvement from RelEng. Everything is a web service 
and is self-service, etc.



Re: How to efficiently walk the DOM tree and its strings

2014-03-04 Thread nsm . nikhil
On Tuesday, March 4, 2014 1:26:15 AM UTC-8, somb...@gmail.com wrote:
 While we have a defense-in-depth strategy (CSP and iframe sandbox should
 be protecting us from the worst possible scenarios) and we're hopeful
 that Service Workers will eventually let us provide
 nsIContentPolicy-level protection, the quality of the HTML parser is of
 course fairly important[1] to the operation of the HTML sanitizer.

Sorry to go off-topic, but how are ServiceWorkers different from normal Workers 
here? I'm asking without context, so forgive me if I misunderstood.

Thanks,
Nikhil