Manybubbles has uploaded a new change for review.
https://gerrit.wikimedia.org/r/95071
Change subject: Trim text on the way into elasticsearch
......................................................................
Trim text on the way into elasticsearch
Most article text seems to arrive with a handful of trailing spaces.
Trim it to save a tiny bit of space and time.
Change-Id: If42b751257b9727869f5b9d7b18a5608e3ca421a
---
M includes/CirrusSearchUpdater.php
1 file changed, 1 insertion(+), 0 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/71/95071/1
diff --git a/includes/CirrusSearchUpdater.php b/includes/CirrusSearchUpdater.php
index 73e7f5b..4e811b5 100644
--- a/includes/CirrusSearchUpdater.php
+++ b/includes/CirrusSearchUpdater.php
@@ -254,6 +254,7 @@
 $parserOutput = $page->getParserOutput( new ParserOptions(), $page->getRevision()->getId() );
 $text = Sanitizer::stripAllTags( SearchEngine::create( 'CirrusSearch' )
 ->getTextFromContent( $title, $page->getContent(), $parserOutput ) );
+$text = trim( $text ); // No need to store the trailing spaces in Elasticsearch....
 $doc->add( 'text', $text );
 $doc->add( 'text_bytes', strlen( $text ) );
 $doc->add( 'text_words', str_word_count( $text ) ); // It would be better if we could let ES calculate it
--
To view, visit https://gerrit.wikimedia.org/r/95071
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: If42b751257b9727869f5b9d7b18a5608e3ca421a
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Manybubbles <[email protected]>