EBernhardson has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/261323

Change subject: Don't sanitize non-wikitext going into search
......................................................................

Don't sanitize non-wikitext going into search

As far as I can tell this sanitization was added back when
CirrusSearch was being written for Solr Cloud. I've tested
a few different methods and, as far as I can tell, both the
experimental highlighter and the normal Elasticsearch highlighter
properly encode the highlighted output, so the sanitization is
unnecessary.

Currently this only removes the Sanitizer call from the `text`
field. After reviewing this I don't believe there are any
security concerns around removing the Sanitizer, but it could
affect search results, and highlighting output may start to
contain HTML tags that might only confuse the user.

The sanitization had the unintended side effect of stripping
non-tag content from JavaScript, causing some searches for content
of non-wikitext pages to be incorrect.
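
To illustrate the kind of damage described above, here is a minimal
standalone sketch. `naiveStripTags` is a hypothetical stand-in that
removes everything between `<` and `>`; it is not MediaWiki's Sanitizer
code, and whether `Sanitizer::stripAllTags` produces exactly the same
result for this input is an assumption.

```php
<?php
// Simplified stand-in for a tag stripper: drop anything between '<' and '>'.
// Assumption for illustration only, not MediaWiki's Sanitizer implementation.
function naiveStripTags( $text ) {
	return trim( preg_replace( '/<[^>]*>/', '', $text ) );
}

// JavaScript in which '<' and '>' are comparison operators, not tags.
$javascript = 'if ( a < b && c > d ) { run(); }';
$indexed = naiveStripTags( $javascript );

// Everything between the two operators has been removed from $indexed,
// so a search for that part of the source could not match this page.
echo $indexed, "\n";
```

Text that only looks like a tag to the stripper never reaches the
index, which matches the incorrect results described above.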

A full reindex of all non-wikitext content will be required
after merging this patch.

Change-Id: I0706ce3f178ad791bdd537fa88d6454f6ac9ae15
---
M includes/BuildDocument/PageTextBuilder.php
M maintenance/forceSearchIndex.php
2 files changed, 16 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/23/261323/1

diff --git a/includes/BuildDocument/PageTextBuilder.php b/includes/BuildDocument/PageTextBuilder.php
index 73df127..3bc4c29 100644
--- a/includes/BuildDocument/PageTextBuilder.php
+++ b/includes/BuildDocument/PageTextBuilder.php
@@ -65,14 +65,14 @@
 
        /**
         * Fetch text to index. If $content is wikitext then render and strip things from it.
-        * Otherwise delegate to the $content itself. Then trim and sanitize the result.
+        * Otherwise delegate to the $content itself.
         */
        private function buildTextToIndex() {
                switch ( $this->content->getModel() ) {
                        case CONTENT_MODEL_WIKITEXT:
                                return $this->formatWikitext( $this->parserOutput );
                        default:
-                               $text = trim( Sanitizer::stripAllTags( $this->content->getTextForSearchIndex() ) );
+                               $text = $this->content->getTextForSearchIndex();
                                return array( $text, null, array() );
                }
 
diff --git a/maintenance/forceSearchIndex.php b/maintenance/forceSearchIndex.php
index cf7cf0e..f5fc1e5 100644
--- a/maintenance/forceSearchIndex.php
+++ b/maintenance/forceSearchIndex.php
@@ -49,6 +49,7 @@
        public $maxJobs;
        public $pauseForJobs;
        public $namespace;
+       public $excludeContentTypes;
 
        public function __construct() {
                parent::__construct();
@@ -85,6 +86,7 @@
                $this->addOption( 'skipLinks', 'Skip looking for links to the page (counting and finding redirects).  Use ' .
                        'this with --indexOnSkip for the first half of the two phase index build.' );
                $this->addOption( 'namespace', 'Only index pages in this given namespace', false, true );
+               $this->addOption( 'excludeContentTypes', 'Exclude pages of the specified content types. These must be a comma separated list of strings such as "wikitext" or "json" matching the CONTENT_MODEL_* constants.', false, true, false );
        }
 
        public function execute() {
@@ -142,6 +144,10 @@
                $this->namespace = $this->hasOption( 'namespace' ) ?
                        intval( $this->getOption( 'namespace' ) ) : null;
 
+               $this->excludeContentTypes = array_filter( array_map(
+                       'trim',
+                       explode( ',', $this->getOption( 'excludeContentTypes', '' ) )
+               ) );
                if ( $this->indexUpdates ) {
                        if ( $this->queue ) {
                                $operationName = 'Queued';
@@ -315,6 +321,10 @@
                        if ( $this->namespace ) {
                                $where['page_namespace'] = $this->namespace;
                        }
+                       if ( $this->excludeContentTypes ) {
+                               $list = $dbr->makeList( $this->excludeContentTypes, LIST_COMMA );
+                               $where[] = "page_content_model NOT IN ($list)";
+                       }
 
                        // We'd like to filter out redirects here but it makes the query much slower on larger wikis....
                        $res = $dbr->select(
@@ -338,6 +348,10 @@
                        if ( $this->namespace ) {
                                $where['page_namespace'] = $this->namespace;
                        }
+                       if ( $this->excludeContentTypes ) {
+                               $list = $dbr->makeList( $this->excludeContentTypes, LIST_COMMA );
+                               $where[] = "page_content_model NOT IN ($list)";
+                       }
 
                        $res = $dbr->select(
                                array( 'page', 'revision' ),
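
The handling of a comma separated exclusion option, as in the hunks
above, can be sketched outside MediaWiki as follows. This is a minimal
standalone sketch, not the code in this change: `parseContentTypes`,
`buildExcludeCondition` and the `$quote` closure are hypothetical names,
and the closure only stands in for the value quoting that the database
layer performs. It highlights one edge case: `explode()` on an empty
string returns an array holding one empty element, so blank entries are
dropped before the condition is built.

```php
<?php
// Standalone sketch; it does not use any MediaWiki classes.

/**
 * Split a comma separated option value into a clean list of names.
 * Hypothetical helper, not part of CirrusSearch.
 */
function parseContentTypes( $raw ) {
	$types = array_map( 'trim', explode( ',', (string)$raw ) );
	// explode() on '' yields array( '' ), so remove empty entries.
	return array_values( array_filter( $types, 'strlen' ) );
}

/**
 * Build a NOT IN condition, or return null when nothing is excluded.
 * $quote stands in for the database layer's value quoting.
 */
function buildExcludeCondition( array $types, $quote ) {
	if ( !$types ) {
		return null;
	}
	$list = implode( ',', array_map( $quote, $types ) );
	return "page_content_model NOT IN ($list)";
}

// Illustrative quoting only; real code should rely on the database abstraction.
$quote = function ( $value ) {
	return "'" . str_replace( "'", "''", $value ) . "'";
};

$types = parseContentTypes( ' wikitext, json ,' );
$condition = buildExcludeCondition( $types, $quote );
// An absent option (null) results in no condition at all.
$none = buildExcludeCondition( parseContentTypes( null ), $quote );
```

Under this scheme an absent or empty option adds no condition to the
query, rather than a `NOT IN ('')` clause.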

-- 
To view, visit https://gerrit.wikimedia.org/r/261323
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0706ce3f178ad791bdd537fa88d6454f6ac9ae15
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
