jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/403881 )
Change subject: Summary: update regex for finding parentheticals
......................................................................
Summary: update regex for finding parentheticals
For the Latin character case (behavior unchanged):
Parentheticals that contain at least one space are removed.
The unnecessary escaping of parentheses inside the character classes of
the Latin regex was dropped.
For the non-Latin character case:
Parentheticals that contain at least one space, colon, or comma are
removed.
The expected results for a couple of Asian test cases were updated so
that parentheticals are removed less aggressively.
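The two rules described above can be sketched roughly as follows. This is
an illustration only, not the code from summarize.js: the helper name
stripParentheticals is hypothetical, and the non-Latin pattern assumes that
"non-latin" refers to the fullwidth parentheses, colon, and comma
(U+FF08, U+FF09, U+FF1A, U+FF0C).

```javascript
// Latin rule: ASCII parentheses that contain at least one space.
const LATIN = /\([^)]+ [^)]+\)/g;

// Non-Latin rule (assumption: fullwidth forms): parentheses that contain
// a space, a fullwidth colon (U+FF1A), or a fullwidth comma (U+FF0C).
const NON_LATIN = /\uFF08[^\uFF09]+[ \uFF1A\uFF0C][^\uFF09]+\uFF09/g;

// Hypothetical helper, named here for illustration only.
function stripParentheticals(text) {
    return text
        .replace(LATIN, ' ')
        .replace(NON_LATIN, ' ')
        // collapse the runs of spaces left behind by the replacements
        .replace(/ +/g, ' ');
}

console.log(stripParentheticals('Foo (bar baz) qux')); // Foo qux
console.log(stripParentheticals('Foo (bar) qux'));     // Foo (bar) qux
```

A parenthetical without any of the trigger characters is left alone, which
is what makes the non-Latin rule less aggressive than matching everything
between the first and the last parenthesis.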
Change-Id: I39a62342456a214a341f2694ed32edc01eed6597
---
M lib/transformations/summarize.js
M test/lib/transformations/summarize.js
2 files changed, 11 insertions(+), 5 deletions(-)
Approvals:
jenkins-bot: Verified
Mholloway: Looks good to me, approved
diff --git a/lib/transformations/summarize.js b/lib/transformations/summarize.js
index c5775f4..2b70ece 100644
--- a/lib/transformations/summarize.js
+++ b/lib/transformations/summarize.js
@@ -119,20 +119,26 @@
html = doc.body.innerHTML;
html = removeNestedParentheticals(html);
// 1. Replace any parentheticals which have at least one space inside
- html = html.replace(/\([^\)]+ [^\)]+\)/g, ' '); // eslint-disable-line no-useless-escape
+ html = html.replace(/\([^)]+ [^)]+\)/g, ' ');
// 2. Remove any empty parentheticals due to transformations
html = html.replace(/\(\)/g, ' ');
// 3. Remove content inside any other non-latin parentheticals. The behaviour is
- // the same as 1 but for languages that are not latin based
- html = html.replace(/(.+ .+)/g, ' ');
+ // the same as 1 but for languages that are not latin based. The other difference
+ // to #1 is that in addition to a space the non-latin colon or comma could also
+ // trigger the removal of parentheticals.
+ html = html.replace(/([^)]+[ :,][^)]+)/g, ' ');
// 4. remove all double spaces created by the above
html = html.replace(/ +/g, ' ');
// 5. Replace any leading whitespace before commas
+ // (which could be the result of earlier transformations)
html = html.replace(/ , /g, ', ');
+ // 6. Same as 5 but for non-latin comma and no space afterwards
+ html = html.replace(/ ,/g, ',');
+
doc.body.innerHTML = html;
return {
extract: doc.body.textContent,
diff --git a/test/lib/transformations/summarize.js b/test/lib/transformations/summarize.js
index abe6348..fe90123 100644
--- a/test/lib/transformations/summarize.js
+++ b/test/lib/transformations/summarize.js
@@ -115,7 +115,7 @@
// Content inside Chinese parentheticals are also stripped
[
'<p><b>台北101</b>(<b>TAIPEI 101</b>)是位於的,樓高509.2米(1,671英尺),樓層共有101層、另有5層,總樓地板面積37萬4千,由設計,團隊、韩国等承造,於1999年動工,2004年12月31日完工啟用;最初名稱為<b>台北國際金融中心</b>(<span lang="en">Taipei World Financial Center</span>),2003年改為現名,亦俗稱為<b>101大樓</b>。興建與經營機構為。其為,曾於2004年12月31日至2010年1月4日間擁有的紀錄,目前為以及環最高,完工以來即成為重要之一。此外,大樓內擁有全球第二大的(僅次)、全球唯二開放遊客觀賞的巨型阻尼器(另一個為上海中心之「上海慧眼」),以及全球起降速度第四快的,僅次於、與。</p>',
- '<p><b>台北101</b> ,以及全球起降速度第四快的,僅次於、與。</p>'
+ '<p><b>台北101</b> 是位於的,樓高509.2米(1,671英尺),樓層共有101層、另有5層,總樓地板面積37萬4千,由設計,團隊、韩国等承造,於1999年動工,2004年12月31日完工啟用;最初名稱為<b>台北國際金融中心</b>,2003年改為現名,亦俗稱為<b>101大樓</b>。興建與經營機構為。其為,曾於2004年12月31日至2010年1月4日間擁有的紀錄,目前為以及環最高,完工以來即成為重要之一。此外,大樓內擁有全球第二大的(僅次)、全球唯二開放遊客觀賞的巨型阻尼器(另一個為上海中心之「上海慧眼」),以及全球起降速度第四快的,僅次於、與。</p>',
],
// Content inside Japanese parentheticals are also stripped
[
@@ -135,7 +135,7 @@
// Content inside parentheticals written in `gan` language variant are also stripped
[
'<p><b>亞細亞洲</b>(古希臘文:Ασία),又簡稱<b>亞洲</b>,絕大部分都位到北半球,係全世界上最大,最多人嗰一隻<a class="mw-redirect">洲</a>。佢東頭一徑到白令海峽嗰傑日尼奧夫角(西經169度40分,北緯60度5分),南頭一徑到努沙登加拉群島(東經103度30分,南緯11度7分),西頭一徑到巴巴角(東經26度3分,北緯39度27分),北頭一徑到切柳斯金角(東經104度18分,北緯77度43分),最高嗰山係<a>珠穆朗瑪峰</a>。亞洲東西嗰時差係11小時。佢西首連到<a>歐洲</a>,箇就係世界上最大嗰大陸-<a class="new">歐亞大陸</a>。</p>',
- '<p><b>亞細亞洲</b> ,最高嗰山係<span>珠穆朗瑪峰</span>。亞洲東西嗰時差係11小時。佢西首連到<span>歐洲</span>,箇就係世界上最大嗰大陸-<span class="new">歐亞大陸</span>。</p>'
+ '<p><b>亞細亞洲</b>,又簡稱<b>亞洲</b>,絕大部分都位到北半球,係全世界上最大,最多人嗰一隻<span class="mw-redirect">洲</span>。佢東頭一徑到白令海峽嗰傑日尼奧夫角,南頭一徑到努沙登加拉群島,西頭一徑到巴巴角,北頭一徑到切柳斯金角,最高嗰山係<span>珠穆朗瑪峰</span>。亞洲東西嗰時差係11小時。佢西首連到<span>歐洲</span>,箇就係世界上最大嗰大陸-<span class="new">歐亞大陸</span>。</p>'
],
// Content inside parentheticals is not stripped if it doesn't include any spaces
[
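The whitespace cleanup in steps 4 to 6 of the summarize.js hunk above can be
tried in isolation with a small sketch like the one below. The helper name
tidyWhitespace is hypothetical, and the "non-latin comma" is assumed to be
the fullwidth comma U+FF0C.

```javascript
// Hypothetical helper mirroring cleanup steps 4-6; not the project's code.
function tidyWhitespace(text) {
    return text
        // 4. collapse runs of spaces left behind by removed parentheticals
        .replace(/ +/g, ' ')
        // 5. drop the space before an ASCII comma that is followed by a space
        .replace(/ , /g, ', ')
        // 6. same for the fullwidth comma (assumed U+FF0C), which has no
        //    trailing space
        .replace(/ \uFF0C/g, '\uFF0C');
}

console.log(tidyWhitespace('Foo  , bar')); // Foo, bar
```

Step 6 is what lets the updated expectations read `</b>,` instead of
`</b> ,` once a parenthetical directly before a comma has been replaced by
a space.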
--
To view, visit https://gerrit.wikimedia.org/r/403881
Gerrit-MessageType: merged
Gerrit-Change-Id: I39a62342456a214a341f2694ed32edc01eed6597
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/services/mobileapps
Gerrit-Branch: master
Gerrit-Owner: BearND <[email protected]>
Gerrit-Reviewer: Fjalapeno <[email protected]>
Gerrit-Reviewer: Jdlrobson <[email protected]>
Gerrit-Reviewer: Mholloway <[email protected]>
Gerrit-Reviewer: Mhurd <[email protected]>
Gerrit-Reviewer: Ppchelko <[email protected]>
Gerrit-Reviewer: jenkins-bot <>