Subramanya Sastry has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/52561


Change subject: Handle exts. in tokenizer and removed ExtensionContentCollector.
......................................................................

Handle exts. in tokenizer and removed ExtensionContentCollector.

* Since the tokenizer can now identify the extension content and
  that strategy proved successful, this patch pushes that one step
  further by matching start and end tags and encapsulating the
  content in a self-closing tag, as the ExtensionContentCollector
  was doing thus far. There is no collection involved anymore since
  extension content is entirely encapsulated in this token.  So,
  there is no use for the ExtensionContentCollector anymore.

* Other unrelated fixes and simplifications that do get one
  more wt2wt test green.

* This also speeds up parser tests since all tokens (and their attrs)
  are no longer repeatedly examined for matching extension tags.

Notes for next steps
--------------------
* There are 3 different ways in which the non-html tags are handled:
  - natively-supported extensions like <ref>
    * tokenized in tokenizer in new scope and inserted back into
      top-level scope
    * go through stages 1 & 2 of the main pipeline
    * handled in ext.Cite.js where they get pulled out of stage 3.
    * separately post-processed in a new stage 3 pipeline.

  - php-parser-supported extensions like <math>
    * content pulled out of tokenizer and hidden in a SelfclosingTag
      of type mw:Object/Extension.
    * processed by ext.core.TemplateHandler.js during stage 2 of the
      main pipeline:
      - content sent to php parser to generate html
      - html retokenized to tags and inserted back into token stream

  - <noinclude>, <includeonly>, <onlyinclude> directives
     * tokenized in a new scope and inserted back into the token
       stream if matched by the extension-tag production.
     * tokenized normally if matched by include_limits production
     * handled by TokenAndAttrCollector and processed by
       ext.core.NoIncludeOnly.js

* Next step: uniform handling of both natively-supported and
  php-parser supported extensions that effectively mimics a full
  expansion right up to DOM building (either natively or via
  php-parser).  In a later step, possibly fold *include* directives
  into this abstraction if it lends itself to that.

Change-Id: Iecac76c27518a4690242e584c8c95ef363c749af
---
D js/lib/ext.ExtensionContentCollector.js
M js/lib/mediawiki.parser.js
M js/lib/pegTokenizer.pegjs.txt
3 files changed, 43 insertions(+), 176 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Parsoid 
refs/changes/61/52561/1

diff --git a/js/lib/ext.ExtensionContentCollector.js 
b/js/lib/ext.ExtensionContentCollector.js
deleted file mode 100644
index 4935d84..0000000
--- a/js/lib/ext.ExtensionContentCollector.js
+++ /dev/null
@@ -1,147 +0,0 @@
-"use strict";
-
-var Collector = require( './ext.util.TokenAndAttrCollector.js' 
).TokenAndAttrCollector,
-       Util = require( './mediawiki.Util.js' ).Util;
-
-// SSS FIXME: Since we sweep the entire token stream in TokenAndAttrCollector
-// and since we add a new collector for each entry below, this is an expensive 
way
-// to collect extension content.  We should probably use a single collector and
-// match up against all these tags.
-//
-// List of supported extensions
-var supportedExtensions = [
-       'categorytree', 'charinsert', 'gallery', 'hiero', 'imagemap',
-       'inputbox', 'math', 'poem', 'syntaxhighlight', 'tag', 'timeline'
-];
-
-/**
- * Simple token collector for extensions
- */
-function ExtensionContent ( manager, options ) {
-       this.manager = manager;
-       this.options = options;
-       for (var i = 0; i < supportedExtensions.length; i++) {
-               var ext = supportedExtensions[i];
-               new Collector(
-                       manager,
-                       this.handleExtensionTag.bind(this, ext),
-                       true, // match the end-of-input if closing tag is 
missing
-                       // *NEVER* register several independent transformers 
with the
-                       // same rank, as deregistration will *not* work 
otherwise.
-                       // This gives us a few thousand extensions.
-                       this.rank + i * 0.00001,
-                       ext);
-       }
-}
-
-ExtensionContent.prototype.rank = 0.04;
-
-function defaultNestedDelimiterHandler(tokens, nestedDelimiterInfo) {
-       // Always clone the container token before modifying it
-       var token = nestedDelimiterInfo.token.clone();
-       var i = nestedDelimiterInfo.attrIndex;
-       var delimiter = nestedDelimiterInfo.delimiter;
-
-       // Strip the delimiter token wherever it is nested
-       // and strip upto/from the delimiter depending on the
-       // token type and where in the stream we are.
-       if (delimiter.constructor === TagTk) {
-               token.attribs.splice(i+1);
-               if (nestedDelimiterInfo.k >= 0) {
-                       token.attribs[i].k.splice(nestedDelimiterInfo.k);
-                       token.attribs[i].ksrc = undefined;
-               } else {
-                       token.attribs[i].v.splice(nestedDelimiterInfo.v);
-                       token.attribs[i].vsrc = undefined;
-               }
-
-               tokens.push(delimiter);
-               tokens.push(token);
-       } else { // stripUpto
-
-               // Since we are stripping upto the delimiter,
-               // change the token to a simple span.
-               // SSS FIXME: For sure in the case of table tags 
(tr,td,th,etc.) but, always??
-               token.name = 'span';
-               token.attribs.splice(0, i);
-               if (nestedDelimiterInfo.k >= 0) {
-                       token.attribs[0].k.splice(0, nestedDelimiterInfo.k);
-                       token.attribs[0].ksrc = undefined;
-               } else {
-                       token.attribs[0].v.splice(0, nestedDelimiterInfo.v);
-                       token.attribs[0].vsrc = undefined;
-               }
-
-               tokens.push(token);
-               tokens.push(delimiter);
-       }
-}
-
-ExtensionContent.prototype.handleExtensionTag = function(extension, 
collection) {
-       var wrapTemplates = this.options.wrapTemplates;
-
-       function wrappedExtensionContent(env, startTag, tagTsr) {
-               var dp = {}, content = '';
-
-               if (wrapTemplates && tagTsr[0] !== null && tagTsr[1] !== null) {
-                       dp.tsr = [tagTsr[0], tagTsr[1]];
-               }
-
-               content = startTag.dataAttribs.src;
-               dp.src = content;
-
-               var nt = new SelfclosingTagTk('extension', [
-                               new KV('typeof', 'mw:Object/Extension'),
-                               new KV('name', extension),
-                               new KV('about', "#" + env.newObjectId()),
-                               new KV('content', content)
-                       ], dp);
-
-
-               return { tokens: [nt] };
-       }
-
-       var tokens = [], start = collection.start, end = collection.end;
-
-       // Handle self-closing tag case specially!
-       if (start.constructor === SelfclosingTagTk) {
-               var tsr = (start.dataAttribs || {}).tsr || [null, null];
-               return wrappedExtensionContent(this.manager.env, start, tsr);
-       }
-
-       // Deal with nested opening delimiter found in another token
-       if (start.constructor !== TagTk) {
-               defaultNestedDelimiterHandler(tokens, start);
-       } else {
-               tokens.push(start);
-       }
-
-       tokens = tokens.concat(collection.tokens);
-
-       // Deal with nested closing delimiter found in another token
-       if (end && end.constructor !== EndTagTk) {
-               defaultNestedDelimiterHandler(tokens, end);
-       } else if (end) {
-               tokens.push(end);
-       }
-
-       // We can only use tsr if we are the top-level
-       // since env. only stores top-level wikitext and
-       // not template wikitext.
-       if (tokens.length > 1) {
-               // Discard tokens and just create a span with text content
-               // with span typeof set to mw:Object/Extension/Content
-               var st = tokens[0],
-                       et = tokens.last(),
-                       sTsr = (st.dataAttribs || {}).tsr || [null,null],
-                       eTsr = (et.dataAttribs || {}).tsr || [null,null];
-
-               return wrappedExtensionContent(this.manager.env, st, [sTsr[0], 
eTsr[1]]);
-       }
-
-       return { tokens: tokens };
-};
-
-if (typeof module === "object") {
-       module.exports.ExtensionContent = ExtensionContent;
-}
diff --git a/js/lib/mediawiki.parser.js b/js/lib/mediawiki.parser.js
index 94c97ec..411be4e 100644
--- a/js/lib/mediawiki.parser.js
+++ b/js/lib/mediawiki.parser.js
@@ -1,4 +1,3 @@
-"use strict";
 /**
  * This module assembles parser pipelines from parser stages with
  * asynchronous communnication between stages based on events. Apart from the
@@ -9,6 +8,7 @@
  * http://www.mediawiki.org/wiki/Parsoid/Token_stream_transformations
  * for illustrations of the pipeline architecture.
  */
+"use strict";
 
 // make this global for now
 // XXX: figure out a way to get away without a global for PEG actions!
@@ -26,7 +26,6 @@
        IncludeOnly = NoIncludeOnly.IncludeOnly,
        NoInclude = NoIncludeOnly.NoInclude,
        OnlyInclude     = NoIncludeOnly.OnlyInclude,
-       ExtensionContent = 
require('./ext.ExtensionContentCollector.js').ExtensionContent,
        QuoteTransformer = 
require('./ext.core.QuoteTransformer.js').QuoteTransformer,
        TokenStreamPatcher = 
require('./ext.core.TokenStreamPatcher.js').TokenStreamPatcher,
        PreHandler = require('./ext.core.PreHandler.js').PreHandler,
@@ -119,7 +118,6 @@
                                OnlyInclude,    // 0.01
                                IncludeOnly,    // 0.02
                                NoInclude,              // 0.03
-                               ExtensionContent, // 0.04
 
                                // Preprocess behavior switches
                                BehaviorSwitchPreprocessor, // 0.05
diff --git a/js/lib/pegTokenizer.pegjs.txt b/js/lib/pegTokenizer.pegjs.txt
index 4551667..01e83d7 100644
--- a/js/lib/pegTokenizer.pegjs.txt
+++ b/js/lib/pegTokenizer.pegjs.txt
@@ -297,9 +297,7 @@
     };
 
     // Current extension tag being parsed.
-    var extParseInfo = {
-        currTag: null,
-    };
+    var currExtTag = null;
 
     // text start position
     var textStart = 0;
@@ -333,7 +331,9 @@
         "title", "tr", "track", "u", "ul", "var", "video", "wbr"
     ]);
 
-    var html_old_names = Util.arrayToHash([ "center", "font", "tt" ]);
+    var html_old_names = Util.arrayToHash([
+        "strike", "big", "center", "font", "tt"
+    ]);
 
     var self = this;
 
@@ -637,7 +637,8 @@
 
 inline_element
   = //& { dp('inline_element enter' + input.substr(pos, 10)); return true; }
-      & '<' xmlish_tag
+      & '<' nowiki
+    / & '<' xmlish_tag
     / & '<' comment
     /// & '{' ( & '{{{{{' template / tplarg / template )
     / & '{' tplarg_or_template
@@ -1311,8 +1312,7 @@
  * outer templates.
  * ----------------------------------------------------------------------- */
 
-xmlish_tag = nowiki
-    / t2:(t:generic_tag {
+xmlish_tag = t2:(t:generic_tag {
         var tagName = t.name;
 
         // TagTk and SelfclosingTagTk
@@ -1321,12 +1321,15 @@
                 var tsr0 = t.dataAttribs.tsr[0],
                     restOfInput = input.substring(tsr0),
                     tagContent = restOfInput.match(new 
RegExp("^(.|\n)*?(</\s*" + tagName + ">)", "m")),
-                    extSrc = null;
+                    extSrc = null,
+                    isInstalledExt = Util.extensionInstalled(pegArgs.env, 
tagName);
 
                 if (tagContent) {
                     extSrc = tagContent[0];
-                } else if (Util.extensionInstalled(pegArgs.env, tagName)) {
+                    t.dataAttribs.matchEndTag = isInstalledExt;
+                } else if (isInstalledExt) {
                     extSrc = restOfInput;
+                    t.dataAttribs.matchEndTag = false;
                 }
 
                 if (extSrc) {
@@ -1337,7 +1340,8 @@
                     t.dataAttribs.src = extSrc;
 
                     // Temporary state
-                    t.dataAttribs.isExt = true;
+                    t.dataAttribs.extLikeTag = true;
+                    t.dataAttribs.isInstalledExt = isInstalledExt;
                     t.dataAttribs.extContent = extContent;
                     t.dataAttribs.extContentOffset = pos;
                     t.dataAttribs.origInput = input;
@@ -1354,26 +1358,28 @@
             }
         }
 
-        extParseInfo.currTag = t;
-        // console.warn("curr: " + JSON.stringify(extParseInfo.currTag));
+        currExtTag = t;
+        // console.warn("curr: " + JSON.stringify(currExtTag));
         return t;
     }) (
       dummyText:'#'* {
         // Should only match if we tricked the tokenizer
-        var currExtTag = extParseInfo.currTag;
-        return currExtTag.dataAttribs.isExt &&
-            currExtTag.constructor !== SelfclosingTagTk &&
-            dummyText.length === currExtTag.dataAttribs.extContent.length;
+        var dp = currExtTag.dataAttribs;
+        return dp.extLikeTag && dummyText.length === dp.extContent.length;
       }
       / &  {
         // Should not match if we tricked the tokenizer
-        var currExtTag = extParseInfo.currTag;
-        return !currExtTag.dataAttribs.isExt ||
-            currExtTag.constructor === SelfclosingTagTk;
+        return !currExtTag.dataAttribs.extLikeTag;
       }
+    ) (
+      & { var dp = currExtTag.dataAttribs; return dp.extLikeTag && 
dp.matchEndTag; }
+      // swallow end-tag as well
+      "</" [0-9a-zA-Z]+ (space / newline)* ">"
+      / { return true; }
     ) {
-        var ret = t2, dp = t2.dataAttribs;
-        if (dp.isExt) {
+        var ret = t2,
+            dp = t2.dataAttribs;
+        if (dp.extLikeTag) {
             // If not a known installed extension, parse content as wikitext.
             // - include-directives: <noinclude>, <includeonly>, ...
             // - a non-html5 tag like <big>
@@ -1383,13 +1389,21 @@
             //   of cite extension refactoring for supporting DOM rerendering,
             //   incremental parsing, etc.  Till then, we'll let this go 
through
             //   the same codepaths as before.
-            var name = t2.name.toLowerCase();
-            if (!Util.extensionInstalled(name)) {
+            var tagName = t2.name.toLowerCase();
+            if (dp.isInstalledExt) {
+                // update tsr[1] to span the start and end tags.
+                dp.tsr[1] = pos;
+                ret = new SelfclosingTagTk('extension', [
+                    new KV('typeof', 'mw:Object/Extension'),
+                    new KV('name', tagName),
+                    new KV('about', "#" + pegArgs.env.newObjectId()),
+                    new KV('content', dp.src)
+                ], dp);
+            } else {
                 // Parse ref-content, strip eof, and shift tsr
                 var extContentToks = (new 
PegTokenizer(pegArgs.env)).tokenize(dp.extContent);
                 extContentToks = Util.stripEOFTkfromTokens(extContentToks);
                 Util.shiftTokenTSR(extContentToks, dp.extContentOffset);
-
                 ret = [t2].concat(extContentToks);
             }
 
@@ -1397,14 +1411,16 @@
             input = dp.origInput;
 
             // Clear temporary state
-            dp.isExt = undefined;
+            dp.extLikeTag = undefined;
+            dp.isInstalledExt = undefined;
+            dp.matchEndTag = undefined;
             dp.extContent = undefined;
             dp.extContentOffset = undefined;
             dp.origInput = undefined;
         }
         // console.warn("RET: " + JSON.stringify(ret));
 
-        extParseInfo.currTag = null;
+        currExtTag = null;
 
         return ret;
     }

-- 
To view, visit https://gerrit.wikimedia.org/r/52561
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Iecac76c27518a4690242e584c8c95ef363c749af
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/Parsoid
Gerrit-Branch: master
Gerrit-Owner: Subramanya Sastry <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

Reply via email to