jenkins-bot has submitted this change and it was merged.
Change subject: Strip citation links (like [1], [2], etc) from HTML
......................................................................
Strip citation links (like [1], [2], etc) from HTML
These are most annoying in section headers, but are
generally not useful text to index. Also trim() headers
while we're at it for consistency.
Bug: 62539
Change-Id: I6dd3a96074a6258f2640826261abf937d887c58d
---
M includes/BuildDocument/PageDataBuilder.php
M includes/BuildDocument/PageTextBuilder.php
M tests/browser/features/highlighting.feature
A tests/browser/features/support/articles/references_highlight_test.txt
M tests/browser/features/support/hooks.rb
M tests/jenkins/Jenkins.php
6 files changed, 47 insertions(+), 2 deletions(-)
Approvals:
Chad: Looks good to me, approved
jenkins-bot: Verified
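As a rough illustration of the heading cleanup this change makes in PageDataBuilder.php, the sketch below applies the same regex in Ruby. It is only an approximation: the real code uses PHP's preg_replace and Sanitizer::stripAllTags, while `clean_heading` and its crude tag-stripping regex are hypothetical stand-ins introduced here.

```ruby
# Hypothetical helper mirroring the heading cleanup in PageDataBuilder.php.
# Same pattern as the PHP change: a <sup> tag wrapping a bracketed number.
CITATION_SUP = /<sup>\s*\[\d+\]\s*<\/sup>/

def clean_heading(heading)
  # Remove only <sup> tags that wrap a reference marker, e.g. <sup>[1]</sup>.
  without_refs = heading.gsub(CITATION_SUP, '')
  # Crude stand-in (assumption) for Sanitizer::stripAllTags, followed by trim().
  without_refs.gsub(/<[^>]*>/, '').strip
end

puts clean_heading('Reference Section<sup>[1]</sup> ')
puts clean_heading('<i>E</i> = <i>mc</i><sup>2</sup>')
puts clean_heading('Word [2] Foo')
```

The second and third calls show the two cases the code comment worries about: a <sup> that is not a citation keeps its contents, and a bare [2] outside a <sup> tag is left alone.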
diff --git a/includes/BuildDocument/PageDataBuilder.php b/includes/BuildDocument/PageDataBuilder.php
index bbf44d4..eafc9a1 100644
--- a/includes/BuildDocument/PageDataBuilder.php
+++ b/includes/BuildDocument/PageDataBuilder.php
@@ -106,8 +106,18 @@
$ignoredHeadings = $this->getIgnoredHeadings();
foreach ( $this->parserOutput->getSections() as $heading ) {
$heading = $heading[ 'line' ];
+ // First strip out things that look like references. We can't use HTML filtering because
+ // the references come back as <sup> tags without a class. To keep from breaking stuff like
+ // ==Applicability of the strict mass–energy equivalence formula, ''E'' = ''mc''<sup>2</sup>==
+ // we don't remove the whole <sup> tag. We also don't want to strip the <sup> tag and remove
+ // everything that looks like [2] because, I dunno, maybe there is a band named Word [2] Foo
+ // or something. Whatever. So we only strip things that look like <sup> tags wrapping a
+ // reference. And we do it with regexes because HtmlFormatter doesn't support css selectors.
+ $heading = preg_replace( '/<sup>\s*\[\d+\]\s*<\/sup>/', '', $heading );
+
// Strip tags from the heading or else we'll display them (escaped) in search results
- $heading = Sanitizer::stripAllTags( $heading );
+ $heading = trim( Sanitizer::stripAllTags( $heading ) );
+
// Note that we don't take the level of the heading into account - all headings are equal.
// Except the ones we ignore.
if ( !in_array( $heading, $ignoredHeadings ) ) {
diff --git a/includes/BuildDocument/PageTextBuilder.php b/includes/BuildDocument/PageTextBuilder.php
index 8218b33..745c358 100644
--- a/includes/BuildDocument/PageTextBuilder.php
+++ b/includes/BuildDocument/PageTextBuilder.php
@@ -60,7 +60,10 @@
private function formatWikitext( ParserOutput $parserOutput ) {
$parserOutput->setEditSectionTokens( false );
$formatter = new HtmlFormatter( $parserOutput->getText() );
- $formatter->remove( array( 'audio', 'video', '#toc', '.thumbcaption' ) );
+ $formatter->remove( array( 'audio', 'video', '#toc', '.thumbcaption',
+ 'sup.reference', // The [1] for references
+ '.mw-cite-backlink', // The ↑ next to references in the references section
+ ) );
$formatter->filterContent();
return $formatter->getText();
}
diff --git a/tests/browser/features/highlighting.feature b/tests/browser/features/highlighting.feature
index 1c5acae..68353bf 100644
--- a/tests/browser/features/highlighting.feature
+++ b/tests/browser/features/highlighting.feature
@@ -90,6 +90,21 @@
When I search for user_talk:test
Then User talk:*Test* is the highlighted title of the first search result
+ @highlighting @references
+ Scenario: References don't appear in highlighted section titles
+ When I search for Reference Section Highlight Test
+ And *Reference* *Section* is the highlighted alttitle of the first search result
+
+ @highlighting @references
+ Scenario: References don't appear in highlighted text
+ When I search for Reference Text Highlight Test
+ And *Reference* Section *Reference* *Text* *References* foo baz bar is the highlighted text of the first search result
+
+ @highlighting @references
+ Scenario: References are highlighted if you search for them
+ When I search for Reference foo bar baz Highlight Test
+ And *Reference* Section *Reference* Text *References* *foo* *baz* *bar* is the highlighted text of the first search result
+
@programmer_friendly @highlighting
Scenario: camelCase is highlighted correctly
When I search for namespace aliases
diff --git a/tests/browser/features/support/articles/references_highlight_test.txt b/tests/browser/features/support/articles/references_highlight_test.txt
new file mode 100644
index 0000000..89558db
--- /dev/null
+++ b/tests/browser/features/support/articles/references_highlight_test.txt
@@ -0,0 +1,6 @@
+== Reference Section<ref>foo</ref> ==
+
+Reference<ref>baz</ref> Text<ref>bar</ref>
+
+== References ==
+<references/>
diff --git a/tests/browser/features/support/hooks.rb b/tests/browser/features/support/hooks.rb
index 49816fe..b15df09 100644
--- a/tests/browser/features/support/hooks.rb
+++ b/tests/browser/features/support/hooks.rb
@@ -158,6 +158,15 @@
$highlighting = true
end
+Before("@highlighting", "@references") do
+ if !$highlighting
+ steps %Q{
+ Given a page named References Highlight Test exists with contents @references_highlight_test.txt
+ }
+ end
+ $highlighting = true
+end
+
Before("@setup_more_like_this") do
if !$setup_more_like_this
# The MoreLikeMe term must appear in "a bunch" of pages for it to be used in morelike: searches
diff --git a/tests/jenkins/Jenkins.php b/tests/jenkins/Jenkins.php
index 6b24b96..7a839fc 100644
--- a/tests/jenkins/Jenkins.php
+++ b/tests/jenkins/Jenkins.php
@@ -41,6 +41,7 @@
require_once( "$IP/extensions/MwEmbedSupport/MwEmbedSupport.php" );
require_once( "$IP/extensions/TimedMediaHandler/TimedMediaHandler.php" );
require_once( "$IP/extensions/PdfHandler/PdfHandler.php" );
+require_once( "$IP/extensions/Cite/Cite.php" );
// Configuration
$wgSearchType = 'CirrusSearch';
@@ -67,6 +68,7 @@
'password' => $wgRedisPassword,
),
);
+$wgCiteEnablePopups = true;
// Running a ton of jobs every request helps to make sure all the pages that are created
// are indexed as fast as possible.
--
To view, visit https://gerrit.wikimedia.org/r/123878
Gerrit-MessageType: merged
Gerrit-Change-Id: I6dd3a96074a6258f2640826261abf937d887c58d
Gerrit-PatchSet: 3
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Chad <[email protected]>
Gerrit-Reviewer: Chad <[email protected]>
Gerrit-Reviewer: Manybubbles <[email protected]>
Gerrit-Reviewer: jenkins-bot <>