Subramanya Sastry has uploaded a new change for review.
https://gerrit.wikimedia.org/r/87019
Change subject: Improved handling of leading whitespace during serialization
......................................................................
Improved handling of leading whitespace during serialization
* "Normally", leading whitespace in wikitext trigger indent-pres.
However, there are some exceptions to this rule. This behavior
is captured in information about weak and strong indent-pre
suppressing tags. Weak tags only suppress indent-pres in their
immediate DOM children (caveat: while ignoring any zero-width
wikitext nodes). Strong tags suppress indent-pres in the entire
DOM subtree they generate.
Ex: Whitespace before <td> wikitext (" |foo") will not generate
pre-wraps. But, whitespace inside <td> will generate pre-wraps
("|\n foo"). So, <table>, <tbody>, <tr> are all weak indent-pre
suppressing tags.
Ex: <blockquote>, <ref> extension tag are examples of strong
indent-pre suppressing tags. Indent-pres generation is entirely
turned off in these tags.
In addition, block tags suppress indent-pres on the lines
they are present.
This patch updates wikitext.constants.js with the right set of
weak/strong indent-pre suppressing tags.
* makeSeparator in the serializer tries to account for leading
white-space in separator text. However, it needs to know whether
the separator is being generated between siblings or between
a parent and child. The leading-whitespace-stripping behavior
differs in these scenarios. This patch carries over this information
to makeSeparator (and appropriately updates it in selser mode)
to account for block tags.
* There are still gaps in our handling (there is a FIXME in the
code as well as a new failing test that I've added to the
blacklist). We dont need to address that immediately but as we
continue to update our support for serialization of arbitrary
HTML, we will have to tackel it.
* Added various tests for serialization of new content that will
exercise these different code paths.
* Test result changes:
- One html2wt failure is because of missing support for arbitrary
HTML (see above).
- One additional selser failure is because selser is more accurate
than wt2wt (which strips leading whitespace unnecessarily).
"Extra newlines: More paragraphs with indented comment [3,4,0,2,0]"
- One additional passing selser test is because of improved
indent-pre handling
- Two new html2wt tests pass with this patch (but would have
failed without it).
Change-Id I518237a2aec3c54281b3641d1a7e4a3c76331686
Change-Id: I49123699b7710737cfcb835143a2fb861c3ad1f0
---
M js/lib/mediawiki.WikitextSerializer.js
M js/lib/mediawiki.wikitext.constants.js
M js/tests/parserTests-blacklist.js
M js/tests/parserTests.txt
4 files changed, 191 insertions(+), 23 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Parsoid
refs/changes/19/87019/1
diff --git a/js/lib/mediawiki.WikitextSerializer.js
b/js/lib/mediawiki.WikitextSerializer.js
index 63613b7..62db9d0 100644
--- a/js/lib/mediawiki.WikitextSerializer.js
+++ b/js/lib/mediawiki.WikitextSerializer.js
@@ -3574,7 +3574,7 @@
* Create a separator given a (potentially empty) separator text and newline
* constraints
*/
-WSP.makeSeparator = function(sep, node, nlConstraints, state) {
+WSP.makeSeparator = function(sep, nlConstraints, state) {
var origSep = sep;
// TODO: Move to Util?
@@ -3634,21 +3634,74 @@
// sep.replace(/^([^\n<]*<!--(?:[^\-]|-(?!->))*-->)?[^\n<]+/g,
'$1');
//}
- // Strip non-nl ws from last line, but preserve comments
- // This avoids triggering indent-pres.
+ // SSS FIXME: 'nlConstraints.min > 0' only applies to nodes that have
native wikitext
+ // equivalents. For HTML nodes that will be serialized as HTML tags, we
have to check
+ // for indent-pre safety always. This is currently not handled by the
code below and
+ // will require other fixes to handle arbitrary HTML. To be done later!
See example
+ // below that fails.
//
- // 'node' has min-nl constraint, but we dont know that 'node' is
pre-safe.
- // SSS FIXME: The check for 'node.nodeName in preSafeTags' should be
possible
- // at a nested level rather than just 'node'. If 'node' is an
IEW/comment,
- // we should find the "next" (at this and and ancestor levels), the
non-sep
- // sibling and check if that node is one of these types.
- //
- // SSS FIXME: how is it that parentNode can be null?? is body
getting here?
- var parentName = node.parentNode && node.parentNode.nodeName;
- if (nlConstraints.min > 0 && !(node.nodeName in Consts.PreSafeTags)) {
- sep =
sep.replace(/[^\n>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g, '$1');
+ // Ex: "<div>foo</div>\n <span>bar</span>"
+ if (nlConstraints.min > 0 &&
sep.match(/[^\n<>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g)) {
+ // 'sep' is the separator before 'nodeB' and it has leading
spaces on a newline.
+ // We have to decide whether that leading space will trigger
indent-pres in wikitext.
+ // The decision depends on where this separator will be emitted
relative
+ // to 'nodeA' and 'nodeB'.
+
+ var isIndentPreSafe = false,
+ constraintInfo = nlConstraints.constraintInfo || {};
+
+ // Example of sibling sepType scenario:
+ // <p>foo</p>
+ // <span>bar</span>
+ // The span will be wrapped in an indent-pre if the leading
space
+ // is not stripped since span is not a block tag
+ //
+ // Example of child-parent sepType scenario:
+ // <span>foo
+ // </span>bar
+ // The " </span>bar" will be wrapped in an indent-pre if the
+ // leading space is not stripped since span is not a block tag
+ if ((constraintInfo.sepType === 'sibling' ||
+ constraintInfo.sepType === 'child-parent') &&
+ Util.isBlockTag(constraintInfo.nodeB.nodeName))
+ {
+ isIndentPreSafe = true;
+ } else if (constraintInfo.sepType === 'parent-child') {
+ // Separator between parent node and child node
+ var node = constraintInfo.nodeA;
+
+ // Walk up past zero-wikitext width ancestors
+ while (node.nodeName in Consts.ZeroWidthWikitextTags) {
+ node = node.parentNode;
+ }
+
+ // Deal with weak/strong indent-pre suppressing tags
+ if (node.nodeName in
Consts.WeakIndentPreSuppressingTags) {
+ isIndentPreSafe = true;
+ } else {
+ // Strong indent-pre suppressing tags suppress
indent-pres
+ // in entire DOM subtree rooted at that node
+ while (node.nodeName !== 'BODY') {
+ if (node.nodeName in
Consts.StrongIndentPreSuppressingTags) {
+ isIndentPreSafe = true;
+ }
+ node = node.parentNode;
+ }
+ }
+ }
+
+ if (!isIndentPreSafe) {
+ // Strip non-nl ws from last line, but preserve
comments.
+ // This avoids triggering indent-pres.
+ sep =
sep.replace(/[^\n>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g, '$1');
+ }
}
- this.trace('makeSeparator', sep, origSep, minNls, sepNlCount,
nlConstraints);
+
+ if (this.debugging) {
+ var constraints = nlConstraints;
+ constraints.constraintInfo = undefined;
+ this.trace('makeSeparator', sep, origSep, minNls, sepNlCount,
constraints);
+ }
return sep;
};
@@ -3697,24 +3750,29 @@
* }
* }
*/
-WSP.updateSeparatorConstraints = function( state, nodeA, handlerA, nodeB,
handlerB, dir) {
+WSP.updateSeparatorConstraints = function( state, nodeA, handlerA, nodeB,
handlerB) {
var nlConstraints,
sepHandlerA = handlerA && handlerA.sepnls || {},
- sepHandlerB = handlerB && handlerB.sepnls || {};
+ sepHandlerB = handlerB && handlerB.sepnls || {},
+ sepType = null;
if ( nodeA.nextSibling === nodeB ) {
// sibling separator
+ sepType = "sibling";
nlConstraints = this.getSepNlConstraints(state, nodeA,
sepHandlerA.after,
nodeB, sepHandlerB.before);
- } else if ( nodeB.parentNode === nodeA || dir === 'prev' ) {
+ } else if ( nodeB.parentNode === nodeA ) {
+ sepType = "parent-child";
// parent-child separator, nodeA parent of nodeB
nlConstraints = this.getSepNlConstraints(state, nodeA,
sepHandlerA.firstChild,
nodeB, sepHandlerB.before);
- } else if ( nodeA.parentNode === nodeB || dir === 'next') {
+ } else if ( nodeA.parentNode === nodeB ) {
+ sepType = "child-parent";
// parent-child separator, nodeB parent of nodeA
nlConstraints = this.getSepNlConstraints(state, nodeA,
sepHandlerA.after,
nodeB, sepHandlerB.lastChild);
} else {
// sibling separator
+ sepType = "sibling";
nlConstraints = this.getSepNlConstraints(state, nodeA,
sepHandlerA.after,
nodeB, sepHandlerB.before);
}
@@ -3725,6 +3783,7 @@
if (this.debugging) {
this.trace('hSep', nodeA.nodeName, nodeB.nodeName,
+ sepType,
nlConstraints,
(nodeA.outerHTML || nodeA.nodeValue ||
'').substr(0,40),
(nodeB.outerHTML || nodeB.nodeValue ||
'').substr(0,40)
@@ -3741,6 +3800,13 @@
state.sep.constraints = nlConstraints;
//state.sep.lastSourceNode = state.sep.lastSourceNode || nodeA;
}
+
+ state.sep.constraints.constraintInfo = {
+ sepType: sepType,
+ nodeA: nodeA,
+ nodeB: nodeB
+ };
+
//console.log('nlConstraints', state.sep.constraints);
};
@@ -3902,7 +3968,6 @@
// TODO: set modified flag if start or end node (but
not both) are
// modified / new so that the selser can use the
separator
sep = this.makeSeparator(state.sep.src || '',
- origNode,
state.sep.constraints ||
{a:{},b:{}, max:0},
state);
} else {
@@ -4013,6 +4078,23 @@
state.currNodeUnmodified = true;
+ // If this HTML node will disappear in
wikitext because of zero width,
+ // then the separator constraints will
carry over to the node's children.
+ //
+ // Since we dont recurse into 'node' in
selser mode, we update the
+ // separator constraintInfo to apply to
'node' and its first child.
+ //
+ // We could clear constraintInfo
altogether which would be correct (but
+ // could normalize separators and
introduce dirty diffs unnecessarily).
+ if (node.nodeName in
Consts.ZeroWidthWikitextTags &&
+ node.childNodes.length > 0 &&
+
state.sep.constraints.constraintInfo.sepType === 'sibling')
+ {
+
state.sep.constraints.constraintInfo.sepType = 'parent-child';
+
state.sep.constraints.constraintInfo.nodeA = node;
+
state.sep.constraints.constraintInfo.nodeB = node.firstChild;
+ }
+
// console.warn("USED ORIG");
this.trace("ORIG-src:", src, '; out:',
out);
cb(out, node);
diff --git a/js/lib/mediawiki.wikitext.constants.js
b/js/lib/mediawiki.wikitext.constants.js
index d96c659..8b93146 100644
--- a/js/lib/mediawiki.wikitext.constants.js
+++ b/js/lib/mediawiki.wikitext.constants.js
@@ -90,8 +90,21 @@
// These wikitext tags are composed with quote-chars
WTQuoteTags: JSUtils.arrayToHash(['I', 'B']),
- // Whitespace in these elements does not lead to indent-pre
- PreSafeTags: JSUtils.arrayToHash(['BR', 'TABLE', 'TBODY', 'CAPTION',
'TR', 'TD', 'TH']),
+ // Leading whitespace on new lines in these elements does not lead to
indent-pre
+ // This only applies to immediate children (while skipping past
zero-wikitext tags)
+ // (Ex: content in table-cells induce indent pres)
+ WeakIndentPreSuppressingTags: JSUtils.arrayToHash([
+ 'TABLE', 'TBODY', 'TR'
+ ]),
+
+ // Leading whitespace on new lines in these elements does not lead to
indent-pre
+ // This applies to all nested content in these tags
+ // (Ex: content in table-cells nested in blocktags do not induce indent
pres)
+ // FIXME: <ref> extension tag is another such -- is it possible to fold
them
+ // into this setup so we can get rid of the 'noPre' hack in token
transformers?
+ StrongIndentPreSuppressingTags: JSUtils.arrayToHash([
+ 'BLOCKQUOTE'
+ ]),
// In the PHP parser, these block tags open block-tag scope
// See doBlockLevels in the PHP parser (includes/parser/Parser.php)
@@ -213,7 +226,12 @@
"i" : [2,2],
"br" : [0,0],
"figure": [2,2]
- }
+ },
+
+ // Derived information from 'WT_TagWidths' and should be kept
consistent with it
+ ZeroWidthWikitextTags: JSUtils.arrayToHash([
+ 'P', 'META', 'TBODY', 'OL', 'UL', 'DL', 'BR'
+ ])
};
// Fill in reverse map of prefix options.
diff --git a/js/tests/parserTests-blacklist.js
b/js/tests/parserTests-blacklist.js
index 79dbcb0..492ad02 100644
--- a/js/tests/parserTests-blacklist.js
+++ b/js/tests/parserTests-blacklist.js
@@ -2168,6 +2168,7 @@
add("html2wt", "Empty TR followed by mixed-ws-comment line should RT
correctly");
add("html2wt", "Multi-line image caption generated by templates with/without
trailing newlines");
add("html2wt", "Improperly nested inline or quotes tags with whitespace in
between");
+add("html2wt", "Strip leading whitespace when handling indent-pre inducing
tags");
// Blacklist for selser
@@ -2184,12 +2185,12 @@
add("selser", "Paragraphs with newline spacing with non-empty white-space
lines in between [0,2,3,0,[3],0,0,0,4,0,[0,4],4,4]");
add("selser", "Paragraphs with newline spacing with non-empty white-space
lines in between [3,2,4,4,0,0,0,4,[4],0,2,3,2]");
add("selser", "Paragraphs with newline spacing with non-empty mixed comment
and white-space lines in between
[3,2,[0,0,3],0,0,0,1,0,3,0,2,0,0,0,3,0,3,4,2,3,[2],0,0,4,2,0,2,3,0]");
-add("selser", "Paragraphs with newline spacing with non-empty mixed comment
and white-space lines in between
[4,3,[4,2,4],3,3,3,[4,0,0,4],2,0,3,1,0,0,0,3,0,[2],0,0,3,3,0,0,0,3,3,[0,4],0,3]");
add("selser", "Paragraphs with newline spacing with non-empty mixed comment
and white-space lines in between
[0,3,4,3,0,0,[4,0,0,3],0,2,0,0,0,3,3,4,2,4,4,2,3,4,0,0,2,0,0,0,0,2]");
add("selser", "Paragraphs with newline spacing with non-empty mixed comment
and white-space lines in between
[0,0,[2,0,3],4,0,4,2,3,0,3,[2],0,0,3,0,0,0,3,0,3,0,0,3,0,3,0,2,3,0]");
add("selser", "Paragraphs with newline spacing with non-empty mixed comment
and white-space lines in between
[0,0,2,0,0,3,4,0,0,0,1,3,0,0,3,3,[4],0,2,2,0,2,2,0,0,4,2,3,4]");
add("selser", "Extra newlines: More paragraphs with indented comment
[1,0,4,0,2]");
add("selser", "Extra newlines: More paragraphs with indented comment
[[2],3,4,2,2]");
+add("selser", "Extra newlines: More paragraphs with indented comment
[3,4,0,2,0]");
add("selser", "Extra newlines: More paragraphs with indented comment
[0,3,0,4,2]");
add("selser", "Italics and possessives (1) [1]");
add("selser", "Italics and possessives (1) [2]");
diff --git a/js/tests/parserTests.txt b/js/tests/parserTests.txt
index b2919dc..7691bee 100644
--- a/js/tests/parserTests.txt
+++ b/js/tests/parserTests.txt
@@ -1390,11 +1390,21 @@
!! input
<blockquote>
Blah
+{|
+|
+ indented cell (no pre-wrapping!)
+|}
</blockquote>
!! result
<blockquote>
<p> Blah
</p>
+<table>
+<tr>
+<td>
+<p> indented cell (no pre-wrapping!)
+</p>
+</td></tr></table>
</blockquote>
!! end
@@ -17722,6 +17732,63 @@
</ul>
!! end
+!! test
+Dont strip leading whitespace when handling indent-pre suppressing tags
+!! options
+parsoid=html2wt
+!! input
+{|
+ | indented row
+|}
+<blockquote>
+ '''This is very bold of you!'''
+
+{|
+|
+ indented cell (no pre-wrapping!)
+|}
+</blockquote>
+foo
+ <div>bar</div>
+!! result
+<table>
+ <tr><td> indented row</td></tr>
+</table>
+<blockquote><p>
+ <b>This is very bold of you!</b>
+</p>
+<table><tr><td>
+ indented cell (no pre-wrapping!)
+</td></tr></table>
+</blockquote>
+<p>foo</p>
+ <div>bar</div>
+!! end
+
+!! test
+Strip leading whitespace when handling indent-pre inducing tags
+!! options
+parsoid=html2wt
+!! input
+foo
+<span>bar</span>
+
+<span>foo2
+</span>bar2
+
+<div>foo</div>
+<span>bar</span>
+!! result
+<p>foo</p>
+ <span>bar</span>
+
+<span>foo2
+ </span>bar2
+
+<div>foo</div>
+ <span>bar</span>
+!! end
+
# Wacky -- the leading newline in input is required because
# that is what the serializer emits. To be fixed. Not fixing
# the test because this test is required to test serialization of
--
To view, visit https://gerrit.wikimedia.org/r/87019
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: I49123699b7710737cfcb835143a2fb861c3ad1f0
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/Parsoid
Gerrit-Branch: master
Gerrit-Owner: Subramanya Sastry <[email protected]>
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits