GWicke has submitted this change and it was merged.
Change subject: Handle exts. in tokenizer and remove ExtensionContentCollector.
......................................................................
Handle exts. in tokenizer and remove ExtensionContentCollector.
* Since the tokenizer can now identify extension content and
  that strategy proved successful, this patch pushes it one step
  further: the tokenizer now matches start and end tags and
  encapsulates the content in the same kind of self-closing tag
  that the ExtensionContentCollector produced thus far. Since
  extension content is entirely encapsulated in this token, no
  collection is involved anymore, and the ExtensionContentCollector
  is no longer needed.
* If an xml tag is not an html5 tag, an older html tag, a known
  installed extension, an include directive, or a native parsoid
  extension tag (<ref>, <references>), the xml tag is now
  converted to plain text. This fixes parsing of text like
  "x<y\n\na>b", which is no longer parsed as an xml tag.
* Fixed Util.arrayToHash to record "e: true" for each element 'e'
  of the array rather than '1'. Also fixed all users to explicitly
  check for 'true' rather than a truthy value.
  Ex: var h = Util.arrayToHash(a);
      if (h[candidate] === true) instead of if (h[candidate])
  This is because if candidate is 'prototype' or 'constructor',
  h[candidate] will be a truthy value (an inherited function), but
  not true. I realized this yesterday, courtesy marktraceur.
* A couple minor fixes and simplifications
* Change in parser test results:
- 2 wt2wt tests green
- 1 wt2wt test red (to be investigated)
- 1 html2html test green
* This also speeds up parser tests since all tokens (and their attrs)
are no longer repeatedly examined for matching extension tags.
Notes for next steps
--------------------
* There are 3 different ways in which the non-html tags are handled:
- natively-supported extensions like <ref>
* tokenized in tokenizer in new scope and inserted back into
top-level scope
* go through stages 1 & 2 of the main pipeline
* handled in ext.Cite.js where they get pulled out of stage 3.
    * separately post-processed in a new stage 3 pipeline.
- php-parser-supported extensions like <math>
* content pulled out of tokenizer and hidden in a SelfclosingTag
of type mw:Object/Extension.
* processed by ext.core.TemplateHandler.js during stage 2 of the
main pipeline:
- content sent to php parser to generate html
- html retokenized to tags and inserted back into token stream
- <noinclude>, <includeonly>, <onlyinclude> directives
* tokenized in a new scope and inserted back into the token
stream if matched by the extension-tag production.
* tokenized normally if matched by include_limits production
* handled by TokenAndAttrCollector and processed by
ext.core.NoIncludeOnly.js
* Next step: uniform handling of both natively-supported and
  php-parser supported extensions that effectively mimics a full
  expansion right up to DOM building (either natively or via the
  php-parser). In a later step, possibly fold *include* directives
  into this abstraction if it lends itself to that.
Change-Id: Iecac76c27518a4690242e584c8c95ef363c749af
---
D js/lib/ext.ExtensionContentCollector.js
M js/lib/ext.core.Sanitizer.js
M js/lib/mediawiki.DOMPostProcessor.js
M js/lib/mediawiki.Util.js
M js/lib/mediawiki.parser.js
M js/lib/pegTokenizer.pegjs.txt
6 files changed, 96 insertions(+), 197 deletions(-)
Approvals:
GWicke: Verified; Looks good to me, approved
diff --git a/js/lib/ext.ExtensionContentCollector.js b/js/lib/ext.ExtensionContentCollector.js
deleted file mode 100644
index 4935d84..0000000
--- a/js/lib/ext.ExtensionContentCollector.js
+++ /dev/null
@@ -1,147 +0,0 @@
-"use strict";
-
-var Collector = require( './ext.util.TokenAndAttrCollector.js' ).TokenAndAttrCollector,
- Util = require( './mediawiki.Util.js' ).Util;
-
-// SSS FIXME: Since we sweep the entire token stream in TokenAndAttrCollector
-// and since we add a new collector for each entry below, this is an expensive way
-// to collect extension content. We should probably use a single collector and
-// match up against all these tags.
-//
-// List of supported extensions
-var supportedExtensions = [
- 'categorytree', 'charinsert', 'gallery', 'hiero', 'imagemap',
- 'inputbox', 'math', 'poem', 'syntaxhighlight', 'tag', 'timeline'
-];
-
-/**
- * Simple token collector for extensions
- */
-function ExtensionContent ( manager, options ) {
- this.manager = manager;
- this.options = options;
- for (var i = 0; i < supportedExtensions.length; i++) {
- var ext = supportedExtensions[i];
- new Collector(
- manager,
- this.handleExtensionTag.bind(this, ext),
-                       true, // match the end-of-input if closing tag is missing
-                       // *NEVER* register several independent transformers with the
-                       // same rank, as deregistration will *not* work otherwise.
- // This gives us a few thousand extensions.
- this.rank + i * 0.00001,
- ext);
- }
-}
-
-ExtensionContent.prototype.rank = 0.04;
-
-function defaultNestedDelimiterHandler(tokens, nestedDelimiterInfo) {
- // Always clone the container token before modifying it
- var token = nestedDelimiterInfo.token.clone();
- var i = nestedDelimiterInfo.attrIndex;
- var delimiter = nestedDelimiterInfo.delimiter;
-
- // Strip the delimiter token wherever it is nested
- // and strip upto/from the delimiter depending on the
- // token type and where in the stream we are.
- if (delimiter.constructor === TagTk) {
- token.attribs.splice(i+1);
- if (nestedDelimiterInfo.k >= 0) {
- token.attribs[i].k.splice(nestedDelimiterInfo.k);
- token.attribs[i].ksrc = undefined;
- } else {
- token.attribs[i].v.splice(nestedDelimiterInfo.v);
- token.attribs[i].vsrc = undefined;
- }
-
- tokens.push(delimiter);
- tokens.push(token);
- } else { // stripUpto
-
- // Since we are stripping upto the delimiter,
- // change the token to a simple span.
-               // SSS FIXME: For sure in the case of table tags (tr,td,th,etc.) but, always??
- token.name = 'span';
- token.attribs.splice(0, i);
- if (nestedDelimiterInfo.k >= 0) {
- token.attribs[0].k.splice(0, nestedDelimiterInfo.k);
- token.attribs[0].ksrc = undefined;
- } else {
- token.attribs[0].v.splice(0, nestedDelimiterInfo.v);
- token.attribs[0].vsrc = undefined;
- }
-
- tokens.push(token);
- tokens.push(delimiter);
- }
-}
-
-ExtensionContent.prototype.handleExtensionTag = function(extension, collection) {
- var wrapTemplates = this.options.wrapTemplates;
-
- function wrappedExtensionContent(env, startTag, tagTsr) {
- var dp = {}, content = '';
-
- if (wrapTemplates && tagTsr[0] !== null && tagTsr[1] !== null) {
- dp.tsr = [tagTsr[0], tagTsr[1]];
- }
-
- content = startTag.dataAttribs.src;
- dp.src = content;
-
- var nt = new SelfclosingTagTk('extension', [
- new KV('typeof', 'mw:Object/Extension'),
- new KV('name', extension),
- new KV('about', "#" + env.newObjectId()),
- new KV('content', content)
- ], dp);
-
-
- return { tokens: [nt] };
- }
-
- var tokens = [], start = collection.start, end = collection.end;
-
- // Handle self-closing tag case specially!
- if (start.constructor === SelfclosingTagTk) {
- var tsr = (start.dataAttribs || {}).tsr || [null, null];
- return wrappedExtensionContent(this.manager.env, start, tsr);
- }
-
- // Deal with nested opening delimiter found in another token
- if (start.constructor !== TagTk) {
- defaultNestedDelimiterHandler(tokens, start);
- } else {
- tokens.push(start);
- }
-
- tokens = tokens.concat(collection.tokens);
-
- // Deal with nested closing delimiter found in another token
- if (end && end.constructor !== EndTagTk) {
- defaultNestedDelimiterHandler(tokens, end);
- } else if (end) {
- tokens.push(end);
- }
-
- // We can only use tsr if we are the top-level
- // since env. only stores top-level wikitext and
- // not template wikitext.
- if (tokens.length > 1) {
- // Discard tokens and just create a span with text content
- // with span typeof set to mw:Object/Extension/Content
- var st = tokens[0],
- et = tokens.last(),
- sTsr = (st.dataAttribs || {}).tsr || [null,null],
- eTsr = (et.dataAttribs || {}).tsr || [null,null];
-
-               return wrappedExtensionContent(this.manager.env, st, [sTsr[0], eTsr[1]]);
- }
-
- return { tokens: tokens };
-};
-
-if (typeof module === "object") {
- module.exports.ExtensionContent = ExtensionContent;
-}
diff --git a/js/lib/ext.core.Sanitizer.js b/js/lib/ext.core.Sanitizer.js
index 0a31eb0..f26d1cc 100644
--- a/js/lib/ext.core.Sanitizer.js
+++ b/js/lib/ext.core.Sanitizer.js
@@ -660,7 +660,7 @@
noEndTagHash = this.constants.noEndTagHash;
if (token.isHTMLTag && token.isHTMLTag() &&
- ( !tagWLHash[token.name] ||
+ ( tagWLHash[token.name] !== true ||
( token.constructor === EndTagTk &&
noEndTagHash[token.name] )
)
)
@@ -896,7 +896,7 @@
}
// Allow any attribute beginning with "data-", if in HTML5 mode
- if (!(html5Mode && k.match(/^data-/i)) && !wlist[k]) {
+ if (!(html5Mode && k.match(/^data-/i)) && wlist[k] !== true) {
newAttrs[k] = [null, origV, origK];
continue;
}
diff --git a/js/lib/mediawiki.DOMPostProcessor.js b/js/lib/mediawiki.DOMPostProcessor.js
index 8426339..264c6f9 100644
--- a/js/lib/mediawiki.DOMPostProcessor.js
+++ b/js/lib/mediawiki.DOMPostProcessor.js
@@ -669,7 +669,7 @@
}
// 2. Process 'elt' itself after -- skip literal-HTML nodes
-       if (nodesToMigrateFrom[elt.nodeName.toLowerCase()] && !DU.isLiteralHTMLNode(elt)) {
+       if (nodesToMigrateFrom[elt.nodeName.toLowerCase()] === true && !DU.isLiteralHTMLNode(elt)) {
var firstEltToMigrate = null,
partialContent = false,
n = elt.lastChild;
diff --git a/js/lib/mediawiki.Util.js b/js/lib/mediawiki.Util.js
index ba85028..028bba6 100644
--- a/js/lib/mediawiki.Util.js
+++ b/js/lib/mediawiki.Util.js
@@ -765,7 +765,7 @@
arrayToHash: function(a) {
var h = {};
for (var i = 0, n = a.length; i < n; i++) {
- h[a[i]] = 1;
+ h[a[i]] = true;
}
return h;
},
@@ -820,7 +820,7 @@
'inputbox', 'math', 'poem', 'syntaxhighlight',
'tag', 'timeline'
]);
}
- return this.installedExts[name] !== undefined;
+ return this.installedExts[name] === true;
}
};
diff --git a/js/lib/mediawiki.parser.js b/js/lib/mediawiki.parser.js
index 94c97ec..411be4e 100644
--- a/js/lib/mediawiki.parser.js
+++ b/js/lib/mediawiki.parser.js
@@ -1,4 +1,3 @@
-"use strict";
/**
* This module assembles parser pipelines from parser stages with
* asynchronous communnication between stages based on events. Apart from the
@@ -9,6 +8,7 @@
* http://www.mediawiki.org/wiki/Parsoid/Token_stream_transformations
* for illustrations of the pipeline architecture.
*/
+"use strict";
// make this global for now
// XXX: figure out a way to get away without a global for PEG actions!
@@ -26,7 +26,6 @@
IncludeOnly = NoIncludeOnly.IncludeOnly,
NoInclude = NoIncludeOnly.NoInclude,
OnlyInclude = NoIncludeOnly.OnlyInclude,
-       ExtensionContent = require('./ext.ExtensionContentCollector.js').ExtensionContent,
        QuoteTransformer = require('./ext.core.QuoteTransformer.js').QuoteTransformer,
        TokenStreamPatcher = require('./ext.core.TokenStreamPatcher.js').TokenStreamPatcher,
PreHandler = require('./ext.core.PreHandler.js').PreHandler,
@@ -119,7 +118,6 @@
OnlyInclude, // 0.01
IncludeOnly, // 0.02
NoInclude, // 0.03
- ExtensionContent, // 0.04
// Preprocess behavior switches
BehaviorSwitchPreprocessor, // 0.05
diff --git a/js/lib/pegTokenizer.pegjs.txt b/js/lib/pegTokenizer.pegjs.txt
index 4551667..fbeae09 100644
--- a/js/lib/pegTokenizer.pegjs.txt
+++ b/js/lib/pegTokenizer.pegjs.txt
@@ -297,9 +297,12 @@
};
// Current extension tag being parsed.
- var extParseInfo = {
- currTag: null,
- };
+ var currExtTag = null;
+
+       // SSS FIXME: Temporary hack till the next round of cleanup and refactoring.
+ var nativeParsoidExts = Util.arrayToHash([
+ "includeonly", "noinclude", "onlyinclude", "ref", "references"
+ ]);
// text start position
var textStart = 0;
@@ -333,7 +336,13 @@
"title", "tr", "track", "u", "ul", "var", "video", "wbr"
]);
- var html_old_names = Util.arrayToHash([ "center", "font", "tt" ]);
+ // From http://www.w3.org/TR/html5-diff/#obsolete-elements
+ // SSS FIXME: basefont is missing here, but looks like the PHP parser
+ // does not support it anyway and treats it as plain text. So, skipping
+ // this one in Parsoid as well.
+ var html_old_names = Util.arrayToHash([
+ "strike", "big", "center", "font", "tt"
+ ]);
var self = this;
@@ -637,7 +646,8 @@
inline_element
= //& { dp('inline_element enter' + input.substr(pos, 10)); return true; }
- & '<' xmlish_tag
+ & '<' nowiki
+ / & '<' xmlish_tag
/ & '<' comment
/// & '{' ( & '{{{{{' template / tplarg / template )
/ & '{' tplarg_or_template
@@ -1311,21 +1321,40 @@
* outer templates.
* ----------------------------------------------------------------------- */
-xmlish_tag = nowiki
- / t2:(t:generic_tag {
- var tagName = t.name;
+xmlish_tag = t2:(t:generic_tag {
+ var tagName = t.name.toLowerCase(),
+ dp = t.dataAttribs,
+       isHtmlTag = html5_tag_names[tagName] === true || html_old_names[tagName] === true,
+ isInstalledExt = Util.extensionInstalled(pegArgs.env, tagName),
+ supportedTag = nativeParsoidExts[tagName] === true;
- // TagTk and SelfclosingTagTk
-   if (t.constructor !== EndTagTk && !html5_tag_names[tagName] && !html_old_names[tagName]) {
+ if (!isHtmlTag && !isInstalledExt && !supportedTag) {
+ // convert tag to text, but convert "\n" and "\r\n" NlTk tokens
+ var toks = input.substring(dp.tsr[0], dp.tsr[1]).split(/\n|\r\n/),
+ ret = [],
+ tsr = dp.tsr[0];
+
+ // Add one NlTk between each pair, hence toks.length-1
+ for (var i = 0, n = toks.length-1; i < n; i++) {
+ ret.push(toks[i]);
+ tsr += toks[i].length;
+ ret.push(new NlTk(tsr, tsr+1));
+ }
+ ret.push(toks[i]);
+
+ return ret;
+ }
+
+ if (t.constructor !== EndTagTk && !isHtmlTag) {
if (t.constructor === TagTk) {
- var tsr0 = t.dataAttribs.tsr[0],
+ var tsr0 = dp.tsr[0],
restOfInput = input.substring(tsr0),
            tagContent = restOfInput.match(new RegExp("^(.|\n)*?(</\s*" + tagName + ">)", "m")),
extSrc = null;
if (tagContent) {
extSrc = tagContent[0];
- } else if (Util.extensionInstalled(pegArgs.env, tagName)) {
+ } else if (isInstalledExt) {
extSrc = restOfInput;
}
@@ -1334,46 +1363,56 @@
            extContentLen = extSrc.length - startTagLen - (tagContent ? tagContent[2].length : 0),
            extContent = extSrc.substring(startTagLen, startTagLen + extContentLen);
- t.dataAttribs.src = extSrc;
+        // If the xml-tag is a known installed extension, skip the end-tag as well.
+ var skipLen = extContentLen;
+ if (isInstalledExt && tagContent) {
+ skipLen += tagContent[2].length;
+ }
+
+        // Replace extension content (and possibly the end tag, as well) with
+        // dummy content so it matches the rule following this match and
+        // can be tokenized independently (if required). This is just a trick
+        // to tokenize ref content with higher priority.
+        input = input.slice(0,pos) + Util.charSequence('', '#', skipLen) + input.slice(pos+skipLen);
+
+ // Extension content source
+ dp.src = extSrc;
// Temporary state
- t.dataAttribs.isExt = true;
- t.dataAttribs.extContent = extContent;
- t.dataAttribs.extContentOffset = pos;
- t.dataAttribs.origInput = input;
-
-        // Replace extension content with dummy content so it matches the
-        // rule following this match and can be tokenized independently (if required).
-        // This is just a trick to tokenize ref content with higher priority.
-        input = input.slice(0,pos) + Util.charSequence('', '#', extContentLen) + input.slice(pos+extContentLen);
+ dp.extLikeTag = true;
+ dp.isInstalledExt = isInstalledExt;
+ dp.extContent = extContent;
+ dp.extContentOffset = pos;
+ dp.origInput = input;
// console.warn("input: " + input);
}
} else {
-      t.dataAttribs.src = input.substring(t.dataAttribs.tsr[0], t.dataAttribs.tsr[1]);
+ dp.src = input.substring(dp.tsr[0], dp.tsr[1]);
}
}
- extParseInfo.currTag = t;
- // console.warn("curr: " + JSON.stringify(extParseInfo.currTag));
+ currExtTag = t;
+ // console.warn("curr: " + JSON.stringify(currExtTag));
return t;
}) (
dummyText:'#'* {
- // Should only match if we tricked the tokenizer
- var currExtTag = extParseInfo.currTag;
- return currExtTag.dataAttribs.isExt &&
- currExtTag.constructor !== SelfclosingTagTk &&
- dummyText.length === currExtTag.dataAttribs.extContent.length;
+ // Should only match if currExtTag is an extension
+ var dp = currExtTag ? currExtTag.dataAttribs : null;
+    return dp && dp.extLikeTag && dummyText.length === dp.extContent.length;
}
/ & {
- // Should not match if we tricked the tokenizer
- var currExtTag = extParseInfo.currTag;
- return !currExtTag.dataAttribs.isExt ||
- currExtTag.constructor === SelfclosingTagTk;
+ // Should not match if currExtTag is an extension
+ return !currExtTag || !currExtTag.dataAttribs.extLikeTag;
}
) {
- var ret = t2, dp = t2.dataAttribs;
- if (dp.isExt) {
+ if (t2.constructor === Array) {
+ return t2;
+ }
+
+ var ret = t2,
+ dp = t2.dataAttribs;
+ if (dp.extLikeTag) {
// If not a known installed extension, parse content as wikitext.
// - include-directives: <noinclude>, <includeonly>, ...
// - a non-html5 tag like <big>
@@ -1383,13 +1422,21 @@
// of cite extension refactoring for supporting DOM rerendering,
    // incremental parsing, etc. Till then, we'll let this go through
// the same codepaths as before.
- var name = t2.name.toLowerCase();
- if (!Util.extensionInstalled(name)) {
+ var tagName = t2.name.toLowerCase();
+ if (dp.isInstalledExt) {
+ // update tsr[1] to span the start and end tags.
+ dp.tsr[1] = pos;
+ ret = new SelfclosingTagTk('extension', [
+ new KV('typeof', 'mw:Object/Extension'),
+ new KV('name', tagName),
+ new KV('about', "#" + pegArgs.env.newObjectId()),
+ new KV('content', dp.src)
+ ], dp);
+ } else {
// Parse ref-content, strip eof, and shift tsr
        var extContentToks = (new PegTokenizer(pegArgs.env)).tokenize(dp.extContent);
extContentToks = Util.stripEOFTkfromTokens(extContentToks);
Util.shiftTokenTSR(extContentToks, dp.extContentOffset);
-
ret = [t2].concat(extContentToks);
}
@@ -1397,14 +1444,15 @@
input = dp.origInput;
// Clear temporary state
- dp.isExt = undefined;
+ dp.extLikeTag = undefined;
+ dp.isInstalledExt = undefined;
dp.extContent = undefined;
dp.extContentOffset = undefined;
dp.origInput = undefined;
}
// console.warn("RET: " + JSON.stringify(ret));
- extParseInfo.currTag = null;
+ currExtTag = null;
return ret;
}
@@ -1629,7 +1677,7 @@
selfclose:"/"?
">" {
var lcName = name.toLowerCase();
- if (block_names[lcName]) {
+ if (block_names[lcName] === true) {
     return [buildXMLTag(name, lcName, attribs, end, selfclose, [pos0, pos])];
} else {
// abort match if tag is not block-level
--
To view, visit https://gerrit.wikimedia.org/r/52561
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: Iecac76c27518a4690242e584c8c95ef363c749af
Gerrit-PatchSet: 4
Gerrit-Project: mediawiki/extensions/Parsoid
Gerrit-Branch: master
Gerrit-Owner: Subramanya Sastry <[email protected]>
Gerrit-Reviewer: GWicke <[email protected]>
Gerrit-Reviewer: jenkins-bot
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits