GWicke has submitted this change and it was merged.
Change subject: Handle exts. in tokenizer and remove ExtensionContentCollector.
......................................................................
Handle exts. in tokenizer and remove ExtensionContentCollector.
* Since the tokenizer can now identify extension content and
  that strategy proved successful, this patch pushes it one step
  further: the tokenizer now matches start and end tags and
  encapsulates the content in the same kind of self-closing tag
  that the ExtensionContentCollector produced thus far. Since
  extension content is entirely encapsulated in this token, no
  collection is involved anymore, and the ExtensionContentCollector
  is no longer needed.
* If an xml tag is not an html5 tag, an older html tag, a known
  installed extension, an include directive, or a native parsoid
  extension tag (<ref>, <references>), the xml tag is now
  converted to plain text. This fixes parsing of text like
  "x<y\n\na>b", which is no longer parsed as an xml tag.
* Fixed Util.arrayToHash to record "e: true" for each element 'e'
  of the array rather than '1'. Also fixed all users to explicitly
  check for 'true' rather than a truthy value.
  Ex: var h = Util.arrayToHash(a);
      if (h[candidate] === true) instead of if (h[candidate])
  This is because if candidate is 'prototype' or 'constructor',
  h[candidate] will be a truthy value (an inherited function), but
  not true. I realized this yesterday, courtesy marktraceur.
* A couple minor fixes and simplifications
* Change in parser test results:
- 2 wt2wt tests green
- 1 wt2wt test red (to be investigated)
- 1 html2html test green
* This also speeds up parser tests since all tokens (and their attrs)
are no longer repeatedly examined for matching extension tags.
Notes for next steps
--------------------
* There are 3 different ways in which the non-html tags are handled:
- natively-supported extensions like <ref>
* tokenized in tokenizer in new scope and inserted back into
top-level scope
* go through stages 1 & 2 of the main pipeline
* handled in ext.Cite.js where they get pulled out of stage 3.
    * separately post-processed in a new stage 3 pipeline.
- php-parser-supported extensions like <math>
* content pulled out of tokenizer and hidden in a SelfclosingTag
of type mw:Object/Extension.
* processed by ext.core.TemplateHandler.js during stage 2 of the
main pipeline:
- content sent to php parser to generate html
- html retokenized to tags and inserted back into token stream
- <noinclude>, <includeonly>, <onlyinclude> directives
* tokenized in a new scope and inserted back into the token
stream if matched by the extension-tag production.
* tokenized normally if matched by include_limits production
* handled by TokenAndAttrCollector and processed by
ext.core.NoIncludeOnly.js
* Next step: uniform handling of both natively-supported and
  php-parser supported extensions that effectively mimics a full
  expansion right up to DOM building (either natively or via the
  php-parser). In a later step, possibly fold *include* directives
  into this abstraction if it lends itself to that.
Change-Id: Iecac76c27518a4690242e584c8c95ef363c749af
---
D js/lib/ext.ExtensionContentCollector.js
M js/lib/ext.core.Sanitizer.js
M js/lib/mediawiki.DOMPostProcessor.js
M js/lib/mediawiki.Util.js
M js/lib/mediawiki.parser.js
M js/lib/pegTokenizer.pegjs.txt
6 files changed, 96 insertions(+), 197 deletions(-)
Approvals:
GWicke: Verified; Looks good to me, approved
diff --git a/js/lib/ext.ExtensionContentCollector.js b/js/lib/ext.ExtensionContentCollector.js
deleted file mode 100644
index 4935d84..0000000
--- a/js/lib/ext.ExtensionContentCollector.js
+++ /dev/null
@@ -1,147 +0,0 @@
-"use strict";
-
-var Collector = require( './ext.util.TokenAndAttrCollector.js' ).TokenAndAttrCollector,
- Util = require( './mediawiki.Util.js' ).Util;
-
-// SSS FIXME: Since we sweep the entire token stream in TokenAndAttrCollector
-// and since we add a new collector for each entry below, this is an expensive way
-// to collect extension content. We should probably use a single collector and
-// match up against all these tags.
-//
-// List of supported extensions
-var supportedExtensions = [
- 'categorytree', 'charinsert', 'gallery', 'hiero', 'imagemap',
- 'inputbox', 'math', 'poem', 'syntaxhighlight', 'tag', 'timeline'
-];
-
-/**
- * Simple token collector for extensions
- */
-function ExtensionContent ( manager, options ) {
- this.manager = manager;
- this.options = options;
- for (var i = 0; i < supportedExtensions.length; i++) {
- var ext = supportedExtensions[i];
- new Collector(
- manager,
- this.handleExtensionTag.bind(this, ext),
-                       true, // match the end-of-input if closing tag is missing
-                       // *NEVER* register several independent transformers with the
-                       // same rank, as deregistration will *not* work otherwise.
- // This gives us a few thousand extensions.
- this.rank + i * 0.00001,
- ext);
- }
-}
-
-ExtensionContent.prototype.rank = 0.04;
-
-function defaultNestedDelimiterHandler(tokens, nestedDelimiterInfo) {
- // Always clone the container token before modifying it
- var token = nestedDelimiterInfo.token.clone();
- var i = nestedDelimiterInfo.attrIndex;
- var delimiter = nestedDelimiterInfo.delimiter;
-
- // Strip the delimiter token wherever it is nested
- // and strip upto/from the delimiter depending on the
- // token type and where in the stream we are.
- if (delimiter.constructor === TagTk) {
- token.attribs.splice(i+1);
- if (nestedDelimiterInfo.k >= 0) {
- token.attribs[i].k.splice(nestedDelimiterInfo.k);
- token.attribs[i].ksrc = undefined;
- } else {
- token.attribs[i].v.splice(nestedDelimiterInfo.v);
- token.attribs[i].vsrc = undefined;
- }
-
- tokens.push(delimiter);
- tokens.push(token);
- } else { // stripUpto
-
- // Since we are stripping upto the delimiter,
- // change the token to a simple span.
-               // SSS FIXME: For sure in the case of table tags (tr,td,th,etc.) but, always??
- token.name = 'span';
- token.attribs.splice(0, i);
- if (nestedDelimiterInfo.k >= 0) {
- token.attribs[0].k.splice(0, nestedDelimiterInfo.k);
- token.attribs[0].ksrc = undefined;
- } else {
- token.attribs[0].v.splice(0, nestedDelimiterInfo.v);
- token.attribs[0].vsrc = undefined;
- }
-
- tokens.push(token);
- tokens.push(delimiter);
- }
-}
-
-ExtensionContent.prototype.handleExtensionTag = function(extension, collection) {
- var wrapTemplates = this.options.wrapTemplates;
-
- function wrappedExtensionContent(env, startTag, tagTsr) {
- var dp = {}, content = '';
-
- if (wrapTemplates && tagTsr[0] !== null && tagTsr[1] !== null) {
- dp.tsr = [tagTsr[0], tagTsr[1]];
- }
-
- content = startTag.dataAttribs.src;
- dp.src = content;
-
- var nt = new SelfclosingTagTk('extension', [
- new KV('typeof', 'mw:Object/Extension'),
- new KV('name', extension),
- new KV('about', "#" + env.newObjectId()),
- new KV('content', content)
- ], dp);
-
-
- return { tokens: [nt] };
- }
-
- var tokens = [], start = collection.start, end = collection.end;
-
- // Handle self-closing tag case specially!
- if (start.constructor === SelfclosingTagTk) {
- var tsr = (start.dataAttribs || {}).tsr || [null, null];
- return wrappedExtensionContent(this.manager.env, start, tsr);
- }
-
- // Deal with nested opening delimiter found in another token
- if (start.constructor !== TagTk) {
- defaultNestedDelimiterHandler(tokens, start);
- } else {
- tokens.push(start);
- }
-
- tokens = tokens.concat(collection.tokens);
-
- // Deal with nested closing delimiter found in another token
- if (end && end.constructor !== EndTagTk) {
- defaultNestedDelimiterHandler(tokens, end);
- } else if (end) {
- tokens.push(end);
- }
-
- // We can only use tsr if we are the top-level
- // since env. only stores top-level wikitext and
- // not template wikitext.
- if (tokens.length > 1) {
- // Discard tokens and just create a span with text content
- // with span typeof set to mw:Object/Extension/Content
- var st = tokens[0],
- et = tokens.last(),
- sTsr = (st.dataAttribs || {}).tsr || [null,null],
- eTsr = (et.dataAttribs || {}).tsr || [null,null];
-
-               return wrappedExtensionContent(this.manager.env, st, [sTsr[0], eTsr[1]]);
- }
-
- return { tokens: tokens };
-};
-
-if (typeof module === "object") {
- module.exports.ExtensionContent = ExtensionContent;
-}
diff --git a/js/lib/ext.core.Sanitizer.js b/js/lib/ext.core.Sanitizer.js
index 0a31eb0..f26d1cc 100644
--- a/js/lib/ext.core.Sanitizer.js
+++ b/js/lib/ext.core.Sanitizer.js
@@ -660,7 +660,7 @@
noEndTagHash = this.constants.noEndTagHash;
if (token.isHTMLTag && token.isHTMLTag() &&
- ( !tagWLHash[token.name] ||
+ ( tagWLHash[token.name] !== true ||
( token.constructor === EndTagTk &&
noEndTagHash[token.name] )
)
)
@@ -896,7 +896,7 @@
}
// Allow any attribute beginning with "data-", if in HTML5 mode
- if (!(html5Mode && k.match(/^data-/i)) && !wlist[k]) {
+ if (!(html5Mode && k.match(/^data-/i)) && wlist[k] !== true) {
newAttrs[k] = [null, origV, origK];
continue;
}
diff --git a/js/lib/mediawiki.DOMPostProcessor.js b/js/lib/mediawiki.DOMPostProcessor.js
index 8426339..264c6f9 100644
--- a/js/lib/mediawiki.DOMPostProcessor.js
+++ b/js/lib/mediawiki.DOMPostProcessor.js
@@ -669,7 +669,7 @@
}
// 2. Process 'elt' itself after -- skip literal-HTML nodes
-       if (nodesToMigrateFrom[elt.nodeName.toLowerCase()] && !DU.isLiteralHTMLNode(elt)) {
+       if (nodesToMigrateFrom[elt.nodeName.toLowerCase()] === true && !DU.isLiteralHTMLNode(elt)) {
var firstEltToMigrate = null,
partialContent = false,
n = elt.lastChild;
diff --git a/js/lib/mediawiki.Util.js b/js/lib/mediawiki.Util.js
index ba85028..028bba6 100644
--- a/js/lib/mediawiki.Util.js
+++ b/js/lib/mediawiki.Util.js
@@ -765,7 +765,7 @@
arrayToHash: function(a) {
var h = {};
for (var i = 0, n = a.length; i < n; i++) {
- h[a[i]] = 1;
+ h[a[i]] = true;
}
return h;
},
@@ -820,7 +820,7 @@
'inputbox', 'math', 'poem', 'syntaxhighlight',
'tag', 'timeline'
]);
}
- return this.installedExts[name] !== undefined;
+ return this.installedExts[name] === true;
}
};
diff --git a/js/lib/mediawiki.parser.js b/js/lib/mediawiki.parser.js
index 94c97ec..411be4e 100644
--- a/js/lib/mediawiki.parser.js
+++ b/js/lib/mediawiki.parser.js
@@ -1,4 +1,3 @@
-"use strict";
/**
* This module assembles parser pipelines from parser stages with
* asynchronous communnication between stages based on events. Apart from the
@@ -9,6 +8,7 @@
* http://www.mediawiki.org/wiki/Parsoid/Token_stream_transformations
* for illustrations of the pipeline architecture.
*/
+"use strict";
// make this global for now
// XXX: figure out a way to get away without a global for PEG actions!
@@ -26,7 +26,6 @@
IncludeOnly = NoIncludeOnly.IncludeOnly,
NoInclude = NoIncludeOnly.NoInclude,
OnlyInclude = NoIncludeOnly.OnlyInclude,
-       ExtensionContent = require('./ext.ExtensionContentCollector.js').ExtensionContent,
        QuoteTransformer = require('./ext.core.QuoteTransformer.js').QuoteTransformer,
        TokenStreamPatcher = require('./ext.core.TokenStreamPatcher.js').TokenStreamPatcher,
PreHandler = require('./ext.core.PreHandler.js').PreHandler,
@@ -119,7 +118,6 @@
OnlyInclude, // 0.01
IncludeOnly, // 0.02
NoInclude, // 0.03
- ExtensionContent, // 0.04
// Preprocess behavior switches
BehaviorSwitchPreprocessor, // 0.05
diff --git a/js/lib/pegTokenizer.pegjs.txt b/js/lib/pegTokenizer.pegjs.txt
index 4551667..fbeae09 100644
--- a/js/lib/pegTokenizer.pegjs.txt
+++ b/js/lib/pegTokenizer.pegjs.txt
@@ -297,9 +297,12 @@
};
// Current extension tag being parsed.
- var extParseInfo = {
- currTag: null,
- };
+ var currExtTag = null;
+
+       // SSS FIXME: Temporary hack till the next round of cleanup and refactoring.
+ var nativeParsoidExts = Util.arrayToHash([
+ "includeonly", "noinclude", "onlyinclude", "ref", "references"
+ ]);
// text start position
var textStart = 0;
@@ -333,7 +336,13 @@
"title", "tr", "track", "u", "ul", "var", "video", "wbr"
]);
- var html_old_names = Util.arrayToHash([ "center", "font", "tt" ]);
+ // From http://www.w3.org/TR/html5-diff/#obsolete-elements
+ // SSS FIXME: basefont is missing here, but looks like the PHP parser
+ // does not support it anyway and treats it as plain text. So, skipping
+ // this one in Parsoid as well.
+ var html_old_names = Util.arrayToHash([
+ "strike", "big", "center", "font", "tt"
+ ]);
var self = this;
@@ -637,7 +646,8 @@
inline_element
= //& { dp('inline_element enter' + input.substr(pos, 10)); return true; }
- & '<' xmlish_tag
+ & '<' nowiki
+ / & '<' xmlish_tag
/ & '<' comment
/// & '{' ( & '{{{{{' template / tplarg / template )
/ & '{' tplarg_or_template
@@ -1311,21 +1321,40 @@
* outer templates.
* ----------------------------------------------------------------------- */
-xmlish_tag = nowiki
- / t2:(t:generic_tag {
- var tagName = t.name;
+xmlish_tag = t2:(t:generic_tag {
+ var tagName = t.name.toLowerCase(),
+ dp = t.dataAttribs,
+       isHtmlTag = html5_tag_names[tagName] === true || html_old_names[tagName] === true,
+ isInstalledExt = Util.extensionInstalled(pegArgs.env, tagName),
+ supportedTag = nativeParsoidExts[tagName] === true;
- // TagTk and SelfclosingTagTk
-   if (t.constructor !== EndTagTk && !html5_tag_names[tagName] && !html_old_names[tagName]) {
+ if (!isHtmlTag && !isInstalledExt && !supportedTag) {
+ // convert tag to text, but convert "\n" and "\r\n" NlTk tokens
+ var toks = input.substring(dp.tsr[0], dp.tsr[1]).split(/\n|\r\n/),
+ ret = [],
+ tsr = dp.tsr[0];
+
+ // Add one NlTk between each pair, hence toks.length-1
+ for (var i = 0, n = toks.length-1; i < n; i++) {
+ ret.push(toks[i]);
+ tsr += toks[i].length;
+ ret.push(new NlTk(tsr, tsr+1));
+ }
+ ret.push(toks[i]);
+
+ return ret;
+ }
+
+ if (t.constructor !== EndTagTk && !isHtmlTag) {
if (t.constructor === TagTk) {
- var tsr0 = t.dataAttribs.tsr[0],
+ var tsr0 = dp.tsr[0],
restOfInput = input.substring(tsr0),
            tagContent = restOfInput.match(new RegExp("^(.|\n)*?(</\s*" + tagName + ">)", "m")),
extSrc = null;
if (tagContent) {
extSrc = tagContent[0];
- } else if (Util.extensionInstalled(pegArgs.env, tagName)) {
+ } else if (isInstalledExt) {
extSrc = restOfInput;
}
@@ -1334,46 +1363,56 @@
            extContentLen = extSrc.length - startTagLen - (tagContent ? tagContent[2].length : 0),
            extContent = extSrc.substring(startTagLen, startTagLen + extContentLen);
- t.dataAttribs.src = extSrc;
+        // If the xml-tag is a known installed extension, skip the end-tag as well.
+ var skipLen = extContentLen;
+ if (isInstalledExt && tagContent) {
+ skipLen += tagContent[2].length;
+ }
+
+        // Replace extension content (and possibly the end tag, as well) with
+        // dummy content so it matches the rule following this match and
+        // can be tokenized independently (if required). This is just a trick
+        // to tokenize ref content with higher priority.
+        input = input.slice(0,pos) + Util.charSequence('', '#', skipLen) + input.slice(pos+skipLen);
+
+ // Extension content source
+ dp.src = extSrc;
// Temporary state
- t.dataAttribs.isExt = true;
- t.dataAttribs.extContent = extContent;
- t.dataAttribs.extContentOffset = pos;
- t.dataAttribs.origInput = input;
-
-        // Replace extension content with dummy content so it matches the
-        // rule following this match and can be tokenized independently (if required).
-        // This is just a trick to tokenize ref content with higher priority.
-        input = input.slice(0,pos) + Util.charSequence('', '#', extContentLen) + input.slice(pos+extContentLen);
+ dp.extLikeTag = true;
+ dp.isInstalledExt = isInstalledExt;
+ dp.extContent = extContent;
+ dp.extContentOffset = pos;
+ dp.origInput = input;
// console.warn("input: " + input);
}
} else {
-      t.dataAttribs.src = input.substring(t.dataAttribs.tsr[0], t.dataAttribs.tsr[1]);
+ dp.src = input.substring(dp.tsr[0], dp.tsr[1]);
}
}
- extParseInfo.currTag = t;
- // console.warn("curr: " + JSON.stringify(extParseInfo.currTag));
+ currExtTag = t;
+ // console.warn("curr: " + JSON.stringify(currExtTag));
return t;
}) (
dummyText:'#'* {
- // Should only match if we tricked the tokenizer
- var currExtTag = extParseInfo.currTag;
- return currExtTag.dataAttribs.isExt &&
- currExtTag.constructor !== SelfclosingTagTk &&
- dummyText.length === currExtTag.dataAttribs.extContent.length;
+ // Should only match if currExtTag is an extension
+ var dp = currExtTag ? currExtTag.dataAttribs : null;
+    return dp && dp.extLikeTag && dummyText.length === dp.extContent.length;
}
/ & {
- // Should not match if we tricked the tokenizer
- var currExtTag = extParseInfo.currTag;
- return !currExtTag.dataAttribs.isExt ||
- currExtTag.constructor === SelfclosingTagTk;
+ // Should not match if currExtTag is an extension
+ return !currExtTag || !currExtTag.dataAttribs.extLikeTag;
}
) {
- var ret = t2, dp = t2.dataAttribs;
- if (dp.isExt) {
+ if (t2.constructor === Array) {
+ return t2;
+ }
+
+ var ret = t2,
+ dp = t2.dataAttribs;
+ if (dp.extLikeTag) {
// If not a known installed extension, parse content as wikitext.
// - include-directives: <noinclude>, <includeonly>, ...
// - a non-html5 tag like <big>
@@ -1383,13 +1422,21 @@
// of cite extension refactoring for supporting DOM rerendering,
    // incremental parsing, etc. Till then, we'll let this go through
// the same codepaths as before.
- var name = t2.name.toLowerCase();
- if (!Util.extensionInstalled(name)) {
+ var tagName = t2.name.toLowerCase();
+ if (dp.isInstalledExt) {
+ // update tsr[1] to span the start and end tags.
+ dp.tsr[1] = pos;
+ ret = new SelfclosingTagTk('extension', [
+ new KV('typeof', 'mw:Object/Extension'),
+ new KV('name', tagName),
+ new KV('about', "#" + pegArgs.env.newObjectId()),
+ new KV('content', dp.src)
+ ], dp);
+ } else {
// Parse ref-content, strip eof, and shift tsr
        var extContentToks = (new PegTokenizer(pegArgs.env)).tokenize(dp.extContent);
extContentToks = Util.stripEOFTkfromTokens(extContentToks);
Util.shiftTokenTSR(extContentToks, dp.extContentOffset);
-
ret = [t2].concat(extContentToks);
}
@@ -1397,14 +1444,15 @@
input = dp.origInput;
// Clear temporary state
- dp.isExt = undefined;
+ dp.extLikeTag = undefined;
+ dp.isInstalledExt = undefined;
dp.extContent = undefined;
dp.extContentOffset = undefined;
dp.origInput = undefined;
}
// console.warn("RET: " + JSON.stringify(ret));
- extParseInfo.currTag = null;
+ currExtTag = null;
return ret;
}
@@ -1629,7 +1677,7 @@
selfclose:"/"?
">" {
var lcName = name.toLowerCase();
- if (block_names[lcName]) {
+ if (block_names[lcName] === true) {
     return [buildXMLTag(name, lcName, attribs, end, selfclose, [pos0, pos])];
} else {
// abort match if tag is not block-level
--
To view, visit https://gerrit.wikimedia.org/r/52561
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: Iecac76c27518a4690242e584c8c95ef363c749af
Gerrit-PatchSet: 4
Gerrit-Project: mediawiki/extensions/Parsoid
Gerrit-Branch: master
Gerrit-Owner: Subramanya Sastry <[email protected]>
Gerrit-Reviewer: GWicke <[email protected]>
Gerrit-Reviewer: jenkins-bot
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits