jenkins-bot has submitted this change and it was merged.
Change subject: (Bug 52296) Delete empty autoinserted tags in the last DOM pass
......................................................................
(Bug 52296) Delete empty autoinserted tags in the last DOM pass
* Normally td-tags should have a leading pipe-char in start-of-line
position. However, with buggy wikitext, table-row tag ("|-")
is followed by text on a new line without the leading pipe.
In order to accommodate and support such buggy wikitext, the tokenizer
attempts to recognize "implicit" td-tags that follow a tr-tag.
However, the tokenizer doesn't have sufficient information to
always do this correctly and occasionally introduces an unnecessary
td-tag (which turns out to be empty).
We also handle this scenario in a DOM pass by recognizing empty
DOM elements that were auto-inserted by the tokenizer (or the
tree-builder when fixing mis-nested buggy wikitext -- more often
than you would think).
However, the dom pass that deleted empty DOM elements was run
too early (when the "empty" elements had marker meta tags used
for other DOM passes). So, it wouldn't get deleted and would
then introduce empty cells in output (examples + live diffs are
linked in the bug report). Simplified test case below:
{{Certification Table Top}}
|-
|foo||bar||baz
|}
* This patch moves the dom pass that deletes empty elements right
to the end before saving data-parsoid. This now fixes the bug,
the test snippet and the examples in the bug report.
* Cleaned up saveDataParsoid uses.
* We have a wt2wt failure introduced by the reordering. The test
case is a contrived test case where a newline is missing in wt2wt
output. This also leads to a bunch of selser failures (normal
when wt2wt fails).
The reason for wt2wt failure is because the html output is
slightly different. "<tr>\n</tr>" instead of "<tr></tr>\n".
Since the empty td is not deleted before 'migrateTrailingNLs'
pass is run, the newline in <tr> before the empty-td doesn't
migrate out which accounts for the difference in HTML output.
In turn since this <tr> is now followed by a template, the
serializer does not emit any separator before it which leads
to the newline diff in RT-ing.
This behavior should be harmless in production and might introduce
dirty syntactic diffs in some cases.
I have blacklisted these failures for now.
Change-Id: Iaea6e144aae4d7461775b898968231c9e4cb7254
---
M js/lib/mediawiki.DOMPostProcessor.js
M js/tests/parserTests-blacklist.js
2 files changed, 56 insertions(+), 37 deletions(-)
Approvals:
Cscott: Looks good to me, approved
jenkins-bot: Verified
diff --git a/js/lib/mediawiki.DOMPostProcessor.js
b/js/lib/mediawiki.DOMPostProcessor.js
index d3b8193..c9320c7 100644
--- a/js/lib/mediawiki.DOMPostProcessor.js
+++ b/js/lib/mediawiki.DOMPostProcessor.js
@@ -769,7 +769,7 @@
function migrateTrailingNLs(elt, env) {
// We can migrate a newline out of a node if one of the following is
true:
- // (1) The ends a line in wikitext (=> not a literal html tag)
+ // (1) The node ends a line in wikitext (=> not a literal html tag)
// (2) The node has an auto-closed end-tag (wikitext-generated or
literal html tag)
// (3) It is the rightmost node in the DOM subtree rooted at a node
// that ends a line in wikitext
@@ -1835,8 +1835,7 @@
// 1. Finds start-tag marker metas that dont have a corresponding start
tag
// and adds placeholder metas for the purposes of round-tripping.
// 2. Deletes any useless end-tag marker metas
- // 3. Deletes empty nodes that is entirely builder inserted (both
start/end)
- function findDeletedStartTagsAndDeleteEmptyTags(node) {
+ function findDeletedStartTags(node) {
// handle unmatched mw:StartTag meta tags
var c = node.firstChild;
while (c !== null) {
@@ -1864,10 +1863,8 @@
// other brittle DOM passes
working on the DOM.
deleteNode(c);
}
- } else if (c.childNodes.length === 0 &&
dp.autoInsertedStart && dp.autoInsertedEnd) {
- deleteNode(c);
} else {
-
findDeletedStartTagsAndDeleteEmptyTags(c);
+ findDeletedStartTags(c);
}
}
c = sibling;
@@ -1977,7 +1974,7 @@
}
findAutoInsertedTags(document.body);
- findDeletedStartTagsAndDeleteEmptyTags(document.body);
+ findDeletedStartTags(document.body);
}
/* ---------------------------------------------------------------------------
@@ -2441,14 +2438,19 @@
return [cs, e];
}
-var saveDataParsoid; // forward declaration
+function saveDataParsoid( options, node) {
+ if ( node.nodeType === node.ELEMENT_NODE && node.data ) {
+ DU.saveDataAttribs( node );
+ }
+ return true;
+}
function dumpDomWithDataAttribs( options, root ) {
function cloneData(node, clone) {
var d = node.data;
if (d && d.constructor === Object &&
(Object.keys(d.parsoid).length > 0)) {
clone.data = Util.clone(d);
- saveDataParsoid( options, clone, true );
+ saveDataParsoid( options, clone );
}
node = node.firstChild;
@@ -2940,43 +2942,48 @@
/**
* @method
*
- * Save the data-parsoid attributes on each node.
+ * Perform some final cleaup and save data-parsoid attributes on each node.
*/
-saveDataParsoid = function( options, node, debugDump ) {
+function cleanupAndSaveDataParsoid( options, node ) {
if ( node.nodeType === node.ELEMENT_NODE && node.data ) {
- if (!debugDump) {
- var dp = node.data.parsoid;
- if (dp) {
- dp.tagId = undefined;
+ var dp = node.data.parsoid;
+ if (dp) {
+ // Delete empty auto-inserted elements
+ var next = node.nextSibling;
+ if (node.childNodes.length === 0 &&
dp.autoInsertedStart && dp.autoInsertedEnd) {
+ deleteNode(node);
+ return next;
+ }
- // Remove data-parsoid.src from templates and
extensions that have
- // valid data-mw and dsr. This should reduce
data-parsoid bloat.
- //
- // Transcluded nodes will not have dp.tsr set
and dont need dp.src either
- if
(/\bmw:(Transclusion|Extension)\b/.test(node.getAttribute("typeof")) &&
- (!dp.tsr ||
- node.getAttribute("data-mw") && dp.dsr
&& dp.dsr[0] && dp.dsr[1]))
- {
- dp.src = undefined;
- }
+ dp.tagId = undefined;
- // Remove tsr
- if (dp.tsr) {
- dp.tsr = undefined;
- }
+ // Remove data-parsoid.src from templates and
extensions that have
+ // valid data-mw and dsr. This should reduce
data-parsoid bloat.
+ //
+ // Transcluded nodes will not have dp.tsr set and dont
need dp.src either
+ if
(/\bmw:(Transclusion|Extension)\b/.test(node.getAttribute("typeof")) &&
+ (!dp.tsr ||
+ node.getAttribute("data-mw") && dp.dsr &&
dp.dsr[0] && dp.dsr[1]))
+ {
+ dp.src = undefined;
+ }
- // Make dsr zero-range for fostered content
- // to prevent selser from duplicating this
content
- // outside the table from where this came.
- if (dp.fostered && dp.dsr) {
- dp.dsr[0] = dp.dsr[1];
- }
+ // Remove tsr
+ if (dp.tsr) {
+ dp.tsr = undefined;
+ }
+
+ // Make dsr zero-range for fostered content
+ // to prevent selser from duplicating this content
+ // outside the table from where this came.
+ if (dp.fostered && dp.dsr) {
+ dp.dsr[0] = dp.dsr[1];
}
}
DU.saveDataAttribs( node );
}
return true;
-};
+}
/**
* @method
@@ -3035,7 +3042,7 @@
var domVisitor2 = new DOMTraverser();
domVisitor2.addHandler( 'meta', stripMarkerMetas.bind(null,
env.conf.parsoid.editMode) );
domVisitor2.addHandler( 'li', cleanUpLIHack.bind( null, env ) );
- domVisitor2.addHandler( null, saveDataParsoid.bind(null, this.options)
);
+ domVisitor2.addHandler( null, cleanupAndSaveDataParsoid.bind(null,
this.options) );
this.processors.push(domVisitor2.traverse.bind(domVisitor2));
}
diff --git a/js/tests/parserTests-blacklist.js
b/js/tests/parserTests-blacklist.js
index 34cd0ef..5e2caee 100644
--- a/js/tests/parserTests-blacklist.js
+++ b/js/tests/parserTests-blacklist.js
@@ -689,6 +689,7 @@
add("wt2wt", "RT-ed inter-element separators should be valid separators");
add("wt2wt", "Trailing newlines in a deep dom-subtree that ends a wikitext
line should be migrated out\n(Parsoid-only since PHP parser relies on Tidy for
correct output)");
add("wt2wt", "Empty TD followed by TD with tpl-generated attribute");
+add("wt2wt", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext)");
add("wt2wt", "Empty TR followed by mixed-ws-comment line should RT correctly");
add("wt2wt", "Improperly nested inline or quotes tags with whitespace in
between");
@@ -3903,6 +3904,17 @@
add("selser", "Empty TD followed by TD with tpl-generated attribute
[[4,[[0,0,4,0],0]]]");
add("selser", "Empty TD followed by TD with tpl-generated attribute [[4,1]]");
add("selser", "Indented table with an empty td [2,[3,[[0,0,0,2],4]]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[3,2]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[0,[2,0,3]]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[4,1]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[4,[[4],0,0]]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [2]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[3,[[2],0,0]]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[3,[0,0,3]]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[0,[0,0,2]]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [1]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[0,2]]");
+add("selser", "Empty TR followed by a template-generated TR\n(Parsoid-specific
since PHP parser doesn't handle this mixed tbl-wikitext) [[0,[0,0,4]]]");
add("selser", "Empty TR followed by mixed-ws-comment line should RT correctly
[2]");
add("selser", "Empty TR followed by mixed-ws-comment line should RT correctly
[[0,[2,2,2,0]]]");
add("selser", "Empty TR followed by mixed-ws-comment line should RT correctly
[[0,1]]");
--
To view, visit https://gerrit.wikimedia.org/r/76842
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: Iaea6e144aae4d7461775b898968231c9e4cb7254
Gerrit-PatchSet: 8
Gerrit-Project: mediawiki/extensions/Parsoid
Gerrit-Branch: master
Gerrit-Owner: Subramanya Sastry <[email protected]>
Gerrit-Reviewer: Cscott <[email protected]>
Gerrit-Reviewer: jenkins-bot
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits