jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/403881 )

Change subject: Summary: update regex for finding parentheticals
......................................................................


Summary: update regex for finding parentheticals

For the Latin character case (behavior unchanged):
Parentheticals that contain at least one space are removed.
I removed the unnecessary escaping of the closing parenthesis inside
character classes in the Latin variant.

For the non-Latin character case:
Parentheticals that contain at least one space, colon, or comma are
removed.
Updated the expected results for a couple of Asian-language test cases
so that parentheticals are removed less aggressively.

Change-Id: I39a62342456a214a341f2694ed32edc01eed6597
---
M lib/transformations/summarize.js
M test/lib/transformations/summarize.js
2 files changed, 11 insertions(+), 5 deletions(-)

Approvals:
  jenkins-bot: Verified
  Mholloway: Looks good to me, approved
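
The two removal rules described in the commit message, plus the whitespace cleanup that follows them, can be sketched roughly as below. This is an illustration only, not the code in summarize.js: the function names are hypothetical, and the sketch assumes the non-Latin pattern targets fullwidth parentheses, colon, and comma (（ ） ： ，).

```javascript
// Hypothetical helper names; only the regex shapes follow the change.

// Latin case: remove (...) only when at least one space appears inside.
function stripLatinParentheticals(text) {
    return text.replace(/\([^)]+ [^)]+\)/g, ' ');
}

// Non-Latin case. ASSUMPTION: fullwidth delimiters are intended here.
// A space, fullwidth colon, or fullwidth comma inside triggers removal.
function stripFullwidthParentheticals(text) {
    return text.replace(/（[^）]+[ ：，][^）]+）/g, ' ');
}

// Cleanup: collapse runs of spaces, then drop the space left in front of
// a Latin comma or a fullwidth comma by the replacements above.
function tidyWhitespace(text) {
    return text
        .replace(/ +/g, ' ')
        .replace(/ , /g, ', ')
        .replace(/ ，/g, '，');
}

// Example usage
const latin = tidyWhitespace(stripLatinParentheticals('Foo (bar baz) qux'));
const gan = tidyWhitespace(
    stripFullwidthParentheticals('亞細亞洲（古希臘文：Ασία），又簡稱亞洲')
);
console.log(latin);
console.log(gan);
```

Parentheticals without a space, colon, or comma inside, such as `(bar)` or `（僅次）`, are left in place by both patterns, which is what makes the non-Latin rule less aggressive than a blanket removal.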



diff --git a/lib/transformations/summarize.js b/lib/transformations/summarize.js
index c5775f4..2b70ece 100644
--- a/lib/transformations/summarize.js
+++ b/lib/transformations/summarize.js
@@ -119,20 +119,26 @@
     html = doc.body.innerHTML;
     html = removeNestedParentheticals(html);
     // 1. Replace any parentheticals which have at least one space inside
-    html = html.replace(/\([^\)]+ [^\)]+\)/g, ' '); // eslint-disable-line no-useless-escape
+    html = html.replace(/\([^)]+ [^)]+\)/g, ' ');
     // 2. Remove any empty parentheticals due to transformations
     html = html.replace(/\(\)/g, ' ');
 
     // 3. Remove content inside any other non-latin parentheticals. The behaviour is
-    // the same as 1 but for languages that are not latin based
-    html = html.replace(/（.+ .+）/g, ' ');
+    // the same as 1 but for languages that are not latin based. The other difference
+    // to #1 is that in addition to a space the non-latin colon or comma could also
+    // trigger the removal of parentheticals.
+    html = html.replace(/（[^）]+[ ：，][^）]+）/g, ' ');
 
     // 4. remove all double spaces created by the above
     html = html.replace(/ +/g, ' ');
 
     // 5. Replace any leading whitespace before commas
+    // (which could be the result of earlier transformations)
     html = html.replace(/ , /g, ', ');
 
+    // 6. Same as 5 but for non-latin comma and no space afterwards
+    html = html.replace(/ ，/g, '，');
+
     doc.body.innerHTML = html;
     return {
         extract: doc.body.textContent,
diff --git a/test/lib/transformations/summarize.js b/test/lib/transformations/summarize.js
index abe6348..fe90123 100644
--- a/test/lib/transformations/summarize.js
+++ b/test/lib/transformations/summarize.js
@@ -115,7 +115,7 @@
             // Content inside Chinese parentheticals are also stripped
             [
                 '<p><b>台北101</b>(<b>TAIPEI 
101</b>)是位於的,樓高509.2米(1,671英尺),樓層共有101層、另有5層,總樓地板面積37萬4千,由設計,團隊、韩国等承造,於1999年動工,2004年12月31日完工啟用;最初名稱為<b>台北國際金融中心</b>(<span
 lang="en">Taipei World Financial 
Center</span>),2003年改為現名,亦俗稱為<b>101大樓</b>。興建與經營機構為。其為,曾於2004年12月31日至2010年1月4日間擁有的紀錄,目前為以及環最高,完工以來即成為重要之一。此外,大樓內擁有全球第二大的(僅次)、全球唯二開放遊客觀賞的巨型阻尼器(另一個為上海中心之「上海慧眼」),以及全球起降速度第四快的,僅次於、與。</p>',
-                '<p><b>台北101</b> ,以及全球起降速度第四快的,僅次於、與。</p>'
+                '<p><b>台北101</b> 
是位於的,樓高509.2米(1,671英尺),樓層共有101層、另有5層,總樓地板面積37萬4千,由設計,團隊、韩国等承造,於1999年動工,2004年12月31日完工啟用;最初名稱為<b>台北國際金融中心</b>,2003年改為現名,亦俗稱為<b>101大樓</b>。興建與經營機構為。其為,曾於2004年12月31日至2010年1月4日間擁有的紀錄,目前為以及環最高,完工以來即成為重要之一。此外,大樓內擁有全球第二大的(僅次)、全球唯二開放遊客觀賞的巨型阻尼器(另一個為上海中心之「上海慧眼」),以及全球起降速度第四快的,僅次於、與。</p>',
             ],
             // Content inside Japanese parentheticals are also stripped
             [
@@ -135,7 +135,7 @@
             // Content inside parentheticals written in `gan` language variant are also stripped
             [
                 
'<p><b>亞細亞洲</b>(古希臘文:Ασία),又簡稱<b>亞洲</b>,絕大部分都位到北半球,係全世界上最大,最多人嗰一隻<a 
class="mw-redirect">洲</a>。佢東頭一徑到白令海峽嗰傑日尼奧夫角(西經169度40分,北緯60度5分),南頭一徑到努沙登加拉群島(東經103度30分,南緯11度7分),西頭一徑到巴巴角(東經26度3分,北緯39度27分),北頭一徑到切柳斯金角(東經104度18分,北緯77度43分),最高嗰山係<a>珠穆朗瑪峰</a>。亞洲東西嗰時差係11小時。佢西首連到<a>歐洲</a>,箇就係世界上最大嗰大陸-<a
 class="new">歐亞大陸</a>。</p>',
-                '<p><b>亞細亞洲</b> 
,最高嗰山係<span>珠穆朗瑪峰</span>。亞洲東西嗰時差係11小時。佢西首連到<span>歐洲</span>,箇就係世界上最大嗰大陸-<span 
class="new">歐亞大陸</span>。</p>'
+                '<p><b>亞細亞洲</b>,又簡稱<b>亞洲</b>,絕大部分都位到北半球,係全世界上最大,最多人嗰一隻<span 
class="mw-redirect">洲</span>。佢東頭一徑到白令海峽嗰傑日尼奧夫角,南頭一徑到努沙登加拉群島,西頭一徑到巴巴角,北頭一徑到切柳斯金角,最高嗰山係<span>珠穆朗瑪峰</span>。亞洲東西嗰時差係11小時。佢西首連到<span>歐洲</span>,箇就係世界上最大嗰大陸-<span
 class="new">歐亞大陸</span>。</p>'
             ],
             // Content inside parentheticals is not stripped if it doesn't include any spaces
             [

-- 
To view, visit https://gerrit.wikimedia.org/r/403881

Gerrit-MessageType: merged
Gerrit-Change-Id: I39a62342456a214a341f2694ed32edc01eed6597
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/services/mobileapps
Gerrit-Branch: master
Gerrit-Owner: BearND <[email protected]>
Gerrit-Reviewer: Fjalapeno <[email protected]>
Gerrit-Reviewer: Jdlrobson <[email protected]>
Gerrit-Reviewer: Mholloway <[email protected]>
Gerrit-Reviewer: Mhurd <[email protected]>
Gerrit-Reviewer: Ppchelko <[email protected]>
Gerrit-Reviewer: jenkins-bot <>
