Manybubbles has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/95071


Change subject: Trim text on the way into elasticsearch
......................................................................

Trim text on the way into elasticsearch

Most article text seems to come up with a handful of trailing spaces.
Trim it to save a tiny bit of space and time.

Change-Id: If42b751257b9727869f5b9d7b18a5608e3ca421a
---
M includes/CirrusSearchUpdater.php
1 file changed, 1 insertion(+), 0 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/71/95071/1

diff --git a/includes/CirrusSearchUpdater.php b/includes/CirrusSearchUpdater.php
index 73e7f5b..4e811b5 100644
--- a/includes/CirrusSearchUpdater.php
+++ b/includes/CirrusSearchUpdater.php
@@ -254,6 +254,7 @@
                        $parserOutput = $page->getParserOutput( new ParserOptions(), $page->getRevision()->getId() );
                        $text = Sanitizer::stripAllTags( SearchEngine::create( 'CirrusSearch' )
                                ->getTextFromContent( $title, $page->getContent(), $parserOutput ) );
+                       $text = trim( $text ); // No need to store the trailing spaces in Elasticsearch....
                        $doc->add( 'text', $text );
                        $doc->add( 'text_bytes', strlen( $text ) );
                        $doc->add( 'text_words', str_word_count( $text ) ); // It would be better if we could let ES calculate it

-- 
To view, visit https://gerrit.wikimedia.org/r/95071
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If42b751257b9727869f5b9d7b18a5608e3ca421a
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Manybubbles <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
