Tjones has uploaded a new change for review. https://gerrit.wikimedia.org/r/230646
Change subject: Set hard character limit for searchText queries
......................................................................

Set hard character limit for searchText queries

Set the maximum query Searcher::searchText length to 300 characters.
Provided error messages in English and Spanish.
Modified Title Search limit from 255 bytes to 255 characters.

Bug: T107947
Change-Id: I5d15338c52a7c871fcc91fd2ec4019b34b71f7d8
---
M i18n/en.json
M i18n/es.json
M includes/Searcher.php
3 files changed, 27 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/46/230646/1

diff --git a/i18n/en.json b/i18n/en.json
index 6117831..b8e7f72 100644
--- a/i18n/en.json
+++ b/i18n/en.json
@@ -21,5 +21,6 @@
 	"apihelp-cirrus-settings-dump-description": "Dump of CirrusSearch settings for this wiki.",
 	"cirrussearch-give-feedback": "Give us your feedback",
 	"cirrussearch-morelikethis-settings": " #<!-- leave this line exactly as it is --> <pre>\n# This message lets you configure the settings of the \"more like this\" feature.\n# Changes to this take effect immediately.\n# The syntax is as follows:\n# * Everything from a \"#\" character to the end of the line is a comment.\n# * Every non-blank line is the setting name followed by a \":\" character followed by the setting value\n# The settings are:\n# * min_doc_freq (integer): Minimum number of documents (per shard) that need a term for it to be considered.\n# * max_doc_freq (integer): Maximum number of documents (per shard) that have a term for it to be considered.\n# High frequency terms are generally \"stop words\".\n# * max_query_terms (integer): Maximum number of terms to be considered. This value is limited to $wgCirrusSearchMoreLikeThisMaxQueryTermsLimit (100).\n# * min_term_freq (integer): Minimum number of times the term appears in the input to doc to be considered. For small fields (title) this value should be 1.\n# * percent_terms_to_match (float 0 to 1): The percentage of terms to match on. Defaults to 0.3 (30 percent).\n# * min_word_len (integer): Minimal length of a term to be considered. Defaults to 0.\n# * max_word_len (integer): The maximum word length above which words will be ignored. Defaults to unbounded (0).\n# * fields (comma separated list of values): These are the fields to use. Allowed fields are title, text, auxiliary_text, opening_text, headings and all.\n# * use_fields (true|false) : Tell the \"more like this\" query to use only the field data. Defaults to false: the system will extract the content of the text field to build the query.\n# Examples of good lines:\n# min_doc_freq:2\n# max_doc_freq:20000\n# max_query_terms:25\n# min_term_freq:2\n# percent_terms_to_match:0.3\n# min_word_len:2\n# max_word_len:40\n# fields:text,opening_text\n# use_fields:true\n# </pre> <!-- leave this line exactly as it is -->",
-	"cirrussearch-didyoumean-settings": " #<!-- leave this line exactly as it is --> <pre>\n# This message lets you configure the settings of the \"Did you mean\" suggestions.\n# See also https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html\n# Changes to this take effect immediately.\n# The syntax is as follows:\n# * Everything from a \"#\" character to the end of the line is a comment.\n# * Every non-blank line is the setting name followed by a \":\" character followed by the setting value\n# The settings are :\n# * max_errors (integer): the maximum number of terms that will be considered misspelled in order to be corrected. 1 or 2.\n# * confidence (float): The confidence level defines a factor applied to the input phrases score which is used as a threshold for other suggestion candidates. Only candidates that score higher than the threshold will be included in the result. For instance a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0 the best candidate are returned.\n# * min_doc_freq (float 0 to 1): The minimal threshold in number of documents a suggestion should appear in.\n# High frequency terms are generally \"stop words\".\n# * max_term_freq (float 0 to 1): The maximum threshold in number of documents in which a term can exist in order to be included.\n# * prefix_length (integer): The minimal number of prefix characters that must match a term in order to be a suggestion.\n# * suggest_mode (missing, popular, always): The suggest mode controls the way suggestions are included.\n# Examples of good lines:\n# max_errors:2\n# confidence:2.0\n# max_term_freq:0.5\n# min_doc_freq:0.01\n# prefix_length:2\n# suggest_mode:always\n#\n# </pre> <!-- leave this line exactly as it is -->"
+	"cirrussearch-didyoumean-settings": " #<!-- leave this line exactly as it is --> <pre>\n# This message lets you configure the settings of the \"Did you mean\" suggestions.\n# See also https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html\n# Changes to this take effect immediately.\n# The syntax is as follows:\n# * Everything from a \"#\" character to the end of the line is a comment.\n# * Every non-blank line is the setting name followed by a \":\" character followed by the setting value\n# The settings are :\n# * max_errors (integer): the maximum number of terms that will be considered misspelled in order to be corrected. 1 or 2.\n# * confidence (float): The confidence level defines a factor applied to the input phrases score which is used as a threshold for other suggestion candidates. Only candidates that score higher than the threshold will be included in the result. For instance a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0 the best candidate are returned.\n# * min_doc_freq (float 0 to 1): The minimal threshold in number of documents a suggestion should appear in.\n# High frequency terms are generally \"stop words\".\n# * max_term_freq (float 0 to 1): The maximum threshold in number of documents in which a term can exist in order to be included.\n# * prefix_length (integer): The minimal number of prefix characters that must match a term in order to be a suggestion.\n# * suggest_mode (missing, popular, always): The suggest mode controls the way suggestions are included.\n# Examples of good lines:\n# max_errors:2\n# confidence:2.0\n# max_term_freq:0.5\n# min_doc_freq:0.01\n# prefix_length:2\n# suggest_mode:always\n#\n# </pre> <!-- leave this line exactly as it is -->",
+	"cirrussearch-query-too-long": "Search request is longer than the maximum allowed length. ($1 > $2)"
 }
diff --git a/i18n/es.json b/i18n/es.json
index beb6573..2e550d8 100644
--- a/i18n/es.json
+++ b/i18n/es.json
@@ -28,5 +28,6 @@
 	"apihelp-cirrus-config-dump-description": "Volcado de configuración de CirrusSearch.",
 	"apihelp-cirrus-mapping-dump-description": "Volcado de asignaciones de CirrusSearch para este wiki.",
 	"apihelp-cirrus-settings-dump-description": "Volcado de configuración de CirrusSearch para este wiki.",
-	"cirrussearch-give-feedback": "Danos tu opinión"
+	"cirrussearch-give-feedback": "Danos tu opinión",
+	"cirrussearch-query-too-long": "La búsqueda esta más largo que la longitud máxima permitida. ($1 > $2)"
 }
diff --git a/includes/Searcher.php b/includes/Searcher.php
index b7e216e..12c0651 100644
--- a/includes/Searcher.php
+++ b/includes/Searcher.php
@@ -54,6 +54,11 @@
 	const MAX_TITLE_SEARCH = 255;
 
 	/**
+	 * Maximum length that we'll check in text searches.
+	 */
+	const MAX_TEXT_SEARCH = 300;
+
+	/**
 	 * Maximum offset depth allowed. Too deep will cause very slow queries.
 	 * 100,000 feels plenty deep.
 	 */
@@ -357,6 +362,11 @@
 			$wgCirrusSearchBoostLinks,
 			$wgCirrusSearchAllFields,
 			$wgCirrusSearchAllFieldsForRescore;
+
+		$CheckLengthStatus = self::checkTextSearchRequestLength( $term );
+		if ( !$CheckLengthStatus->isOk() ) {
+			return $CheckLengthStatus;
+		}
 
 		// Transform Mediawiki specific syntax to filters and extra (pre-escaped) query string
 		$searcher = $this;
@@ -1739,7 +1749,7 @@
 	 * @throws UsageException
 	 */
 	private function checkTitleSearchRequestLength( $search ) {
-		$requestLength = strlen( $search );
+		$requestLength = mb_strlen( $search );
 		if ( $requestLength > self::MAX_TITLE_SEARCH ) {
 			throw new UsageException( 'Prefix search request was longer than the maximum allowed length.' .
 				" ($requestLength > " . self::MAX_TITLE_SEARCH . ')', 'request_too_long', 400 );
@@ -1747,6 +1757,18 @@
 	}
 
 	/**
+	 * @param string $search
+	 * @return Status
+	 */
+	private function checkTextSearchRequestLength( $search ) {
+		$requestLength = mb_strlen( $search );
+		if ( $requestLength > self::MAX_TEXT_SEARCH ) {
+			return Status::newFatal( 'cirrussearch-query-too-long', $requestLength, self::MAX_TEXT_SEARCH );
+		}
+		return Status::newGood();
+	}
+
+	/**
 	 * Attempt to suck a leading namespace followed by a colon from the query string. Reaches out to Elasticsearch to
 	 * perform normalized lookup against the namespaces. Should be fast but for the network hop.
 	 *

-- 
To view, visit https://gerrit.wikimedia.org/r/230646

Gerrit-MessageType: newchange
Gerrit-Change-Id: I5d15338c52a7c871fcc91fd2ec4019b34b71f7d8
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Tjones <tjo...@wikimedia.org>
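
For reviewers, the switch from strlen to mb_strlen in this change is the difference between counting bytes and counting characters. The following is a minimal standalone sketch of that distinction, not code from the patch. It assumes the mbstring extension is loaded and that queries are UTF-8. The function checkTextSearchLength and its array return value are hypothetical stand-ins for the patch's private method and MediaWiki's Status object, which is not available outside MediaWiki.

```php
<?php
// Hypothetical stand-in for Searcher::MAX_TEXT_SEARCH from the patch.
const MAX_TEXT_SEARCH = 300;

/**
 * Hypothetical helper: returns null when the query is within the limit,
 * otherwise [ message key, actual length, limit ]. The real patch returns
 * a MediaWiki Status object instead of an array.
 */
function checkTextSearchLength( string $search ): ?array {
	// Count characters, not bytes; the encoding is passed explicitly
	// so the result does not depend on mb_internal_encoding().
	$length = mb_strlen( $search, 'UTF-8' );
	if ( $length > MAX_TEXT_SEARCH ) {
		return [ 'cirrussearch-query-too-long', $length, MAX_TEXT_SEARCH ];
	}
	return null;
}

// U+00F1 (n with tilde) takes two bytes in UTF-8.
$query = str_repeat( "\u{00F1}", 200 );
$byteLength = strlen( $query );             // bytes
$charLength = mb_strlen( $query, 'UTF-8' ); // characters

$withinLimit = checkTextSearchLength( $query );
$tooLong = checkTextSearchLength( str_repeat( "\u{00F1}", 301 ) );
```

Under these assumptions, a query of 200 two-byte characters exceeds a 255-byte limit but stays within a 255-character one, which appears to be the behaviour change the commit message describes for title searches.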