jenkins-bot has submitted this change and it was merged.

Change subject: Strip citation links (like [1], [2], etc) from HTML
......................................................................


Strip citation links (like [1], [2], etc) from HTML

These are most annoying in section headers, but are
generally unuseful text to index. Also trim() headers
while we're at it for consistency.

Bug: 62539
Change-Id: I6dd3a96074a6258f2640826261abf937d887c58d
---
M includes/BuildDocument/PageDataBuilder.php
M includes/BuildDocument/PageTextBuilder.php
M tests/browser/features/highlighting.feature
A tests/browser/features/support/articles/references_highlight_test.txt
M tests/browser/features/support/hooks.rb
M tests/jenkins/Jenkins.php
6 files changed, 47 insertions(+), 2 deletions(-)

Approvals:
  Chad: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/includes/BuildDocument/PageDataBuilder.php b/includes/BuildDocument/PageDataBuilder.php
index bbf44d4..eafc9a1 100644
--- a/includes/BuildDocument/PageDataBuilder.php
+++ b/includes/BuildDocument/PageDataBuilder.php
@@ -106,8 +106,18 @@
                $ignoredHeadings = $this->getIgnoredHeadings();
                foreach ( $this->parserOutput->getSections() as $heading ) {
                        $heading = $heading[ 'line' ];
+                       // First strip out things that look like references.  We can't use HTML filtering because
+                       // the references come back as <sup> tags without a class.  To keep from breaking stuff like
+                       //  ==Applicability of the strict mass–energy equivalence formula, ''E'' = ''mc''<sup>2</sup>==
+                       // we don't remove the whole <sup> tag.  We also don't want to strip the <sup> tag and remove
+                       // everything that looks like [2] because, I dunno, maybe there is a band named Word [2] Foo
+                       // or something.  Whatever.  So we only strip things that look like <sup> tags wrapping a
+                       // reference.  And we do it with regexes because HtmlFormatter doesn't support css selectors.
+                       $heading = preg_replace( '/<sup>\s*\[\d+\]\s*<\/sup>/', '', $heading );
+
                        // Strip tags from the heading or else we'll display them (escaped) in search results
-                       $heading = Sanitizer::stripAllTags( $heading );
+                       $heading = trim( Sanitizer::stripAllTags( $heading ) );
+
                        // Note that we don't take the level of the heading into account - all headings are equal.
                        // Except the ones we ignore.
                        if ( !in_array( $heading, $ignoredHeadings ) ) {
diff --git a/includes/BuildDocument/PageTextBuilder.php b/includes/BuildDocument/PageTextBuilder.php
index 8218b33..745c358 100644
--- a/includes/BuildDocument/PageTextBuilder.php
+++ b/includes/BuildDocument/PageTextBuilder.php
@@ -60,7 +60,10 @@
        private function formatWikitext( ParserOutput $parserOutput ) {
                $parserOutput->setEditSectionTokens( false );
                $formatter = new HtmlFormatter( $parserOutput->getText() );
-               $formatter->remove( array( 'audio', 'video', '#toc', '.thumbcaption' ) );
+               $formatter->remove( array( 'audio', 'video', '#toc', '.thumbcaption',
+                       'sup.reference',        // The [1] for references
+                       '.mw-cite-backlink',    // The ↑ next to references in the references section
+               ) );
                $formatter->filterContent();
                return $formatter->getText();
        }
diff --git a/tests/browser/features/highlighting.feature b/tests/browser/features/highlighting.feature
index 1c5acae..68353bf 100644
--- a/tests/browser/features/highlighting.feature
+++ b/tests/browser/features/highlighting.feature
@@ -90,6 +90,21 @@
     When I search for user_talk:test
     Then User talk:*Test* is the highlighted title of the first search result
 
+  @highlighting @references
+  Scenario: References don't appear in highlighted section titles
+    When I search for Reference Section Highlight Test
+    And *Reference* *Section* is the highlighted alttitle of the first search result
+
+  @highlighting @references
+  Scenario: References don't appear in highlighted text
+    When I search for Reference Text Highlight Test
+    And *Reference* Section *Reference* *Text* *References*  foo   baz   bar is the highlighted text of the first search result
+
+  @highlighting @references
+  Scenario: References are highlighted if you search for them
+    When I search for Reference foo bar baz Highlight Test
+    And *Reference* Section *Reference* Text *References*  *foo*   *baz*   *bar* is the highlighted text of the first search result
+
   @programmer_friendly @highlighting
   Scenario: camelCase is highlighted correctly
     When I search for namespace aliases
diff --git a/tests/browser/features/support/articles/references_highlight_test.txt b/tests/browser/features/support/articles/references_highlight_test.txt
new file mode 100644
index 0000000..89558db
--- /dev/null
+++ b/tests/browser/features/support/articles/references_highlight_test.txt
@@ -0,0 +1,6 @@
+== Reference Section<ref>foo</ref> ==
+
+Reference<ref>baz</ref> Text<ref>bar</ref>
+
+== References ==
+<references/>
diff --git a/tests/browser/features/support/hooks.rb b/tests/browser/features/support/hooks.rb
index 49816fe..b15df09 100644
--- a/tests/browser/features/support/hooks.rb
+++ b/tests/browser/features/support/hooks.rb
@@ -158,6 +158,15 @@
   $highlighting = true
 end
 
+Before("@highlighting", "@references") do
+  if !$highlighting
+    steps %Q{
+      Given a page named References Highlight Test exists with contents @references_highlight_test.txt
+    }
+  end
+  $highlighting = true
+end
+
 Before("@setup_more_like_this") do
   if !$setup_more_like_this
     # The MoreLikeMe term must appear in "a bunch" of pages for it to be used in morelike: searches
diff --git a/tests/jenkins/Jenkins.php b/tests/jenkins/Jenkins.php
index 6b24b96..7a839fc 100644
--- a/tests/jenkins/Jenkins.php
+++ b/tests/jenkins/Jenkins.php
@@ -41,6 +41,7 @@
 require_once( "$IP/extensions/MwEmbedSupport/MwEmbedSupport.php" );
 require_once( "$IP/extensions/TimedMediaHandler/TimedMediaHandler.php" );
 require_once( "$IP/extensions/PdfHandler/PdfHandler.php" );
+require_once( "$IP/extensions/Cite/Cite.php" );
 
 // Configuration
 $wgSearchType = 'CirrusSearch';
@@ -67,6 +68,7 @@
                'password' => $wgRedisPassword,
        ),
 );
+$wgCiteEnablePopups = true;
 
 // Running a ton of jobs every request helps to make sure all the pages that are created
 // are indexed as fast as possible.
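
For readers who want to try the heading cleanup from PageDataBuilder.php outside MediaWiki, here is a minimal standalone sketch. It reuses the regex from the patch verbatim. The function name stripHeadingReferences is hypothetical and not part of CirrusSearch, and PHP's built-in strip_tags() is only a rough stand-in for Sanitizer::stripAllTags(), which needs MediaWiki loaded and may handle entities and whitespace differently.

```php
<?php
// Hypothetical helper for illustration only; this name does not exist in CirrusSearch.
function stripHeadingReferences( $heading ) {
	// Same regex as the patch: remove only <sup> tags that wrap a bracketed number.
	$heading = preg_replace( '/<sup>\s*\[\d+\]\s*<\/sup>/', '', $heading );
	// Assumption: strip_tags() stands in for Sanitizer::stripAllTags() here.
	return trim( strip_tags( $heading ) );
}

$examples = array(
	'Reference Section<sup>[1]</sup>',   // citation marker is removed
	'<i>E</i> = <i>mc</i><sup>2</sup>',  // a real superscript keeps its text
	'Word [2] Foo',                      // a bare [2] outside <sup> is left alone
);
foreach ( $examples as $heading ) {
	echo stripHeadingReferences( $heading ), "\n";
}
```

The three sample headings mirror the cases the code comment in the patch worries about: a citation marker, a genuine superscript, and bracketed text that is not a citation.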

-- 
To view, visit https://gerrit.wikimedia.org/r/123878
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I6dd3a96074a6258f2640826261abf937d887c58d
Gerrit-PatchSet: 3
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Chad <[email protected]>
Gerrit-Reviewer: Chad <[email protected]>
Gerrit-Reviewer: Manybubbles <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
