[MediaWiki-commits] [Gerrit] Set hard character limit for searchText queries - change (mediawiki...CirrusSearch)

Tjones (Code Review) Mon, 10 Aug 2015 13:28:45 -0700

Tjones has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/230646


Change subject: Set hard character limit for searchText queries
......................................................................

Set hard character limit for searchText queries

Set the maximum query length for Searcher::searchText to 300 characters.
Provide error messages in English and Spanish. Change the title
search limit from 255 bytes to 255 characters.
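
For illustration only (not part of this change): a minimal sketch of why
counting bytes and counting characters give different limits for non-ASCII
queries. It assumes the mbstring extension is available and that input is
UTF-8; isWithinCharacterLimit() is a hypothetical helper, not a CirrusSearch
function.

```php
<?php
// Illustration only; not part of this change. Assumes the mbstring
// extension is available and that the input is UTF-8.

/**
 * Hypothetical helper: true if $text has at most $limit characters.
 */
function isWithinCharacterLimit( $text, $limit ) {
	return mb_strlen( $text, 'UTF-8' ) <= $limit;
}

// 200 copies of U+00FA, which UTF-8 encodes as the two bytes C3 BA.
$title = str_repeat( "\xC3\xBA", 200 );

$byteLength = strlen( $title );             // counts bytes
$charLength = mb_strlen( $title, 'UTF-8' ); // counts characters (code points)

echo "bytes: $byteLength, characters: $charLength\n";
echo isWithinCharacterLimit( $title, 255 ) ? "within limit\n" : "too long\n";
```

Here the string is 400 bytes but 200 characters, so a 255-byte limit would
reject it while a 255-character limit accepts it.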

Bug: T107947
Change-Id: I5d15338c52a7c871fcc91fd2ec4019b34b71f7d8
---
M i18n/en.json
M i18n/es.json
M includes/Searcher.php
3 files changed, 27 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/46/230646/1

diff --git a/i18n/en.json b/i18n/en.json
index 6117831..b8e7f72 100644
--- a/i18n/en.json
+++ b/i18n/en.json
@@ -21,5 +21,6 @@
        "apihelp-cirrus-settings-dump-description": "Dump of CirrusSearch settings for this wiki.",
        "cirrussearch-give-feedback": "Give us your feedback",
        "cirrussearch-morelikethis-settings": " #<!-- leave this line exactly 
as it is --> <pre>\n# This message lets you configure the settings of the 
\"more like this\" feature.\n# Changes to this take effect immediately.\n# The 
syntax is as follows:\n#   * Everything from a \"#\" character to the end of 
the line is a comment.\n#   * Every non-blank line is the setting name followed 
by a \":\" character followed by the setting value\n# The settings are:\n#   * 
min_doc_freq (integer): Minimum number of documents (per shard) that need a 
term for it to be considered.\n#   * max_doc_freq (integer): Maximum number of 
documents (per shard) that have a term for it to be considered.\n#              
     High frequency terms are generally \"stop words\".\n#   * max_query_terms 
(integer): Maximum number of terms to be considered. This value is limited to 
$wgCirrusSearchMoreLikeThisMaxQueryTermsLimit (100).\n#   * min_term_freq 
(integer): Minimum number of times the term appears in the input to doc to be 
considered. For small fields (title) this value should be 1.\n#   * 
percent_terms_to_match (float 0 to 1): The percentage of terms to match on. 
Defaults to 0.3 (30 percent).\n#   * min_word_len (integer): Minimal length of 
a term to be considered. Defaults to 0.\n#   * max_word_len (integer): The 
maximum word length above which words will be ignored. Defaults to unbounded 
(0).\n#   * fields (comma separated list of values): These are the fields to 
use. Allowed fields are title, text, auxiliary_text, opening_text, headings and 
all.\n#   * use_fields (true|false) : Tell the \"more like this\" query to use 
only the field data. Defaults to false: the system will extract the content of 
the text field to build the query.\n# Examples of good lines:\n# 
min_doc_freq:2\n# max_doc_freq:20000\n# max_query_terms:25\n# 
min_term_freq:2\n# percent_terms_to_match:0.3\n# min_word_len:2\n# 
max_word_len:40\n# fields:text,opening_text\n# use_fields:true\n# </pre> <!-- 
leave this line exactly as it is -->",
-       "cirrussearch-didyoumean-settings": "  #<!-- leave this line exactly as 
it is --> <pre>\n# This message lets you configure the settings of the \"Did 
you mean\" suggestions.\n# See also 
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html\n#
 Changes to this take effect immediately.\n# The syntax is as follows:\n#   * 
Everything from a \"#\" character to the end of the line is a comment.\n#   * 
Every non-blank line is the setting name followed by a \":\" character followed 
by the setting value\n# The settings are :\n#   * max_errors (integer): the 
maximum number of terms that will be considered misspelled in order to be 
corrected. 1 or 2.\n#   * confidence (float): The confidence level defines a 
factor applied to the input phrases score which is used as a threshold for 
other suggestion candidates. Only candidates that score higher than the 
threshold will be included in the result. For instance a confidence level of 
1.0 will only return suggestions that score higher than the input phrase. If 
set to 0.0 the best candidate are returned.\n#   * min_doc_freq (float 0 to 1): 
The minimal threshold in number of documents a suggestion should appear in.\n#  
                 High frequency terms are generally \"stop words\".\n#   * 
max_term_freq (float 0 to 1): The maximum threshold in number of documents in 
which a term can exist in order to be included.\n#   * prefix_length (integer): 
The minimal number of prefix characters that must match a term in order to be a 
suggestion.\n#   * suggest_mode (missing, popular, always): The suggest mode 
controls the way suggestions are included.\n# Examples of good lines:\n# 
max_errors:2\n# confidence:2.0\n# max_term_freq:0.5\n# min_doc_freq:0.01\n# 
prefix_length:2\n# suggest_mode:always\n#\n# </pre> <!-- leave this line 
exactly as it is -->"
+       "cirrussearch-didyoumean-settings": "  #<!-- leave this line exactly as 
it is --> <pre>\n# This message lets you configure the settings of the \"Did 
you mean\" suggestions.\n# See also 
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html\n#
 Changes to this take effect immediately.\n# The syntax is as follows:\n#   * 
Everything from a \"#\" character to the end of the line is a comment.\n#   * 
Every non-blank line is the setting name followed by a \":\" character followed 
by the setting value\n# The settings are :\n#   * max_errors (integer): the 
maximum number of terms that will be considered misspelled in order to be 
corrected. 1 or 2.\n#   * confidence (float): The confidence level defines a 
factor applied to the input phrases score which is used as a threshold for 
other suggestion candidates. Only candidates that score higher than the 
threshold will be included in the result. For instance a confidence level of 
1.0 will only return suggestions that score higher than the input phrase. If 
set to 0.0 the best candidate are returned.\n#   * min_doc_freq (float 0 to 1): 
The minimal threshold in number of documents a suggestion should appear in.\n#  
                 High frequency terms are generally \"stop words\".\n#   * 
max_term_freq (float 0 to 1): The maximum threshold in number of documents in 
which a term can exist in order to be included.\n#   * prefix_length (integer): 
The minimal number of prefix characters that must match a term in order to be a 
suggestion.\n#   * suggest_mode (missing, popular, always): The suggest mode 
controls the way suggestions are included.\n# Examples of good lines:\n# 
max_errors:2\n# confidence:2.0\n# max_term_freq:0.5\n# min_doc_freq:0.01\n# 
prefix_length:2\n# suggest_mode:always\n#\n# </pre> <!-- leave this line 
exactly as it is -->",
+       "cirrussearch-query-too-long": "Search request is longer than the maximum allowed length. ($1 > $2)"
 }
diff --git a/i18n/es.json b/i18n/es.json
index beb6573..2e550d8 100644
--- a/i18n/es.json
+++ b/i18n/es.json
@@ -28,5 +28,6 @@
        "apihelp-cirrus-config-dump-description": "Volcado de configuración de CirrusSearch.",
        "apihelp-cirrus-mapping-dump-description": "Volcado de asignaciones de CirrusSearch para este wiki.",
        "apihelp-cirrus-settings-dump-description": "Volcado de configuración de CirrusSearch para este wiki.",
-       "cirrussearch-give-feedback": "Danos tu opinión"
+       "cirrussearch-give-feedback": "Danos tu opinión",
+       "cirrussearch-query-too-long": "La búsqueda esta más largo que la longitud máxima permitida. ($1 > $2)"
 }
diff --git a/includes/Searcher.php b/includes/Searcher.php
index b7e216e..12c0651 100644
--- a/includes/Searcher.php
+++ b/includes/Searcher.php
@@ -54,6 +54,11 @@
        const MAX_TITLE_SEARCH = 255;
 
        /**
+        * Maximum length that we'll check in text searches.
+        */
+       const MAX_TEXT_SEARCH = 300;
+
+       /**
         * Maximum offset depth allowed.  Too deep will cause very slow queries.
         * 100,000 feels plenty deep.
         */
@@ -357,6 +362,11 @@
                        $wgCirrusSearchBoostLinks,
                        $wgCirrusSearchAllFields,
                        $wgCirrusSearchAllFieldsForRescore;
+
+               $CheckLengthStatus = self::checkTextSearchRequestLength( $term );
+               if ( !$CheckLengthStatus->isOk() ) {
+                       return $CheckLengthStatus;
+               }
 
                // Transform Mediawiki specific syntax to filters and extra (pre-escaped) query string
                $searcher = $this;
@@ -1739,7 +1749,7 @@
         * @throws UsageException
         */
        private function checkTitleSearchRequestLength( $search ) {
-               $requestLength = strlen( $search );
+               $requestLength = mb_strlen( $search );
                if ( $requestLength > self::MAX_TITLE_SEARCH ) {
                        throw new UsageException( 'Prefix search request was longer than the maximum allowed length.' .
                                " ($requestLength > " . self::MAX_TITLE_SEARCH . ')', 'request_too_long', 400 );
@@ -1747,6 +1757,18 @@
        }
 
        /**
+        * @param string $search
+        * @return Status
+        */
+       private function checkTextSearchRequestLength( $search ) {
+               $requestLength = mb_strlen( $search );
+               if ( $requestLength > self::MAX_TEXT_SEARCH ) {
+                       return Status::newFatal( 'cirrussearch-query-too-long', $requestLength, self::MAX_TEXT_SEARCH );
+               }
+               return Status::newGood();
+       }
+
+       /**
         * Attempt to suck a leading namespace followed by a colon from the query string.  Reaches out to Elasticsearch to
         * perform normalized lookup against the namespaces.  Should be fast but for the network hop.
         *

-- 
To view, visit https://gerrit.wikimedia.org/r/230646
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I5d15338c52a7c871fcc91fd2ec4019b34b71f7d8
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Tjones <tjo...@wikimedia.org>

