Tjones has uploaded a new change for review.
https://gerrit.wikimedia.org/r/230646
Change subject: Set hard character limit for searchText queries
......................................................................
Set hard character limit for searchText queries
Limit Searcher::searchText queries to a maximum of 300 characters.
Add error messages in English and Spanish. Change the title search
limit from 255 bytes to 255 characters.
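
The difference between a byte limit and a character limit can be illustrated
with a minimal sketch. The helper name exceedsLimit and the sample query are
hypothetical and not part of this patch; the sketch assumes UTF-8 input and
that the mbstring extension is available.

```php
<?php
/**
 * Hypothetical helper (not part of the patch): reports whether a query
 * exceeds a limit when measured in bytes and when measured in characters.
 *
 * @param string $search UTF-8 encoded query text
 * @param int $limit Maximum allowed length
 * @return array Two booleans keyed by 'bytes' and 'chars'
 */
function exceedsLimit( $search, $limit ) {
	return [
		// strlen() counts bytes, so multibyte characters count more than once.
		'bytes' => strlen( $search ) > $limit,
		// mb_strlen() counts characters in the given encoding.
		'chars' => mb_strlen( $search, 'UTF-8' ) > $limit,
	];
}

// "\xC3\xA9" is the two-byte UTF-8 encoding of "é".
// 200 copies give 400 bytes but only 200 characters.
$query = str_repeat( "\xC3\xA9", 200 );
$result = exceedsLimit( $query, 255 );

var_dump( $result['bytes'] ); // bool(true): 400 bytes exceed 255
var_dump( $result['chars'] ); // bool(false): 200 characters do not
```

Under a byte-based check, this query of 200 accented characters would be
rejected at a limit of 255, while a character-based check accepts it.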
Bug: T107947
Change-Id: I5d15338c52a7c871fcc91fd2ec4019b34b71f7d8
---
M i18n/en.json
M i18n/es.json
M includes/Searcher.php
3 files changed, 27 insertions(+), 3 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/46/230646/1
diff --git a/i18n/en.json b/i18n/en.json
index 6117831..b8e7f72 100644
--- a/i18n/en.json
+++ b/i18n/en.json
@@ -21,5 +21,6 @@
"apihelp-cirrus-settings-dump-description": "Dump of CirrusSearch
settings for this wiki.",
"cirrussearch-give-feedback": "Give us your feedback",
"cirrussearch-morelikethis-settings": " #<!-- leave this line exactly
as it is --> <pre>\n# This message lets you configure the settings of the
\"more like this\" feature.\n# Changes to this take effect immediately.\n# The
syntax is as follows:\n# * Everything from a \"#\" character to the end of
the line is a comment.\n# * Every non-blank line is the setting name followed
by a \":\" character followed by the setting value\n# The settings are:\n# *
min_doc_freq (integer): Minimum number of documents (per shard) that need a
term for it to be considered.\n# * max_doc_freq (integer): Maximum number of
documents (per shard) that have a term for it to be considered.\n#
High frequency terms are generally \"stop words\".\n# * max_query_terms
(integer): Maximum number of terms to be considered. This value is limited to
$wgCirrusSearchMoreLikeThisMaxQueryTermsLimit (100).\n# * min_term_freq
(integer): Minimum number of times the term appears in the input to doc to be
considered. For small fields (title) this value should be 1.\n# *
percent_terms_to_match (float 0 to 1): The percentage of terms to match on.
Defaults to 0.3 (30 percent).\n# * min_word_len (integer): Minimal length of
a term to be considered. Defaults to 0.\n# * max_word_len (integer): The
maximum word length above which words will be ignored. Defaults to unbounded
(0).\n# * fields (comma separated list of values): These are the fields to
use. Allowed fields are title, text, auxiliary_text, opening_text, headings and
all.\n# * use_fields (true|false) : Tell the \"more like this\" query to use
only the field data. Defaults to false: the system will extract the content of
the text field to build the query.\n# Examples of good lines:\n#
min_doc_freq:2\n# max_doc_freq:20000\n# max_query_terms:25\n#
min_term_freq:2\n# percent_terms_to_match:0.3\n# min_word_len:2\n#
max_word_len:40\n# fields:text,opening_text\n# use_fields:true\n# </pre> <!--
leave this line exactly as it is -->",
- "cirrussearch-didyoumean-settings": " #<!-- leave this line exactly as
it is --> <pre>\n# This message lets you configure the settings of the \"Did
you mean\" suggestions.\n# See also
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html\n#
Changes to this take effect immediately.\n# The syntax is as follows:\n# *
Everything from a \"#\" character to the end of the line is a comment.\n# *
Every non-blank line is the setting name followed by a \":\" character followed
by the setting value\n# The settings are :\n# * max_errors (integer): the
maximum number of terms that will be considered misspelled in order to be
corrected. 1 or 2.\n# * confidence (float): The confidence level defines a
factor applied to the input phrases score which is used as a threshold for
other suggestion candidates. Only candidates that score higher than the
threshold will be included in the result. For instance a confidence level of
1.0 will only return suggestions that score higher than the input phrase. If
set to 0.0 the best candidate are returned.\n# * min_doc_freq (float 0 to 1):
The minimal threshold in number of documents a suggestion should appear in.\n#
High frequency terms are generally \"stop words\".\n# *
max_term_freq (float 0 to 1): The maximum threshold in number of documents in
which a term can exist in order to be included.\n# * prefix_length (integer):
The minimal number of prefix characters that must match a term in order to be a
suggestion.\n# * suggest_mode (missing, popular, always): The suggest mode
controls the way suggestions are included.\n# Examples of good lines:\n#
max_errors:2\n# confidence:2.0\n# max_term_freq:0.5\n# min_doc_freq:0.01\n#
prefix_length:2\n# suggest_mode:always\n#\n# </pre> <!-- leave this line
exactly as it is -->"
+ "cirrussearch-didyoumean-settings": " #<!-- leave this line exactly as
it is --> <pre>\n# This message lets you configure the settings of the \"Did
you mean\" suggestions.\n# See also
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html\n#
Changes to this take effect immediately.\n# The syntax is as follows:\n# *
Everything from a \"#\" character to the end of the line is a comment.\n# *
Every non-blank line is the setting name followed by a \":\" character followed
by the setting value\n# The settings are :\n# * max_errors (integer): the
maximum number of terms that will be considered misspelled in order to be
corrected. 1 or 2.\n# * confidence (float): The confidence level defines a
factor applied to the input phrases score which is used as a threshold for
other suggestion candidates. Only candidates that score higher than the
threshold will be included in the result. For instance a confidence level of
1.0 will only return suggestions that score higher than the input phrase. If
set to 0.0 the best candidate are returned.\n# * min_doc_freq (float 0 to 1):
The minimal threshold in number of documents a suggestion should appear in.\n#
High frequency terms are generally \"stop words\".\n# *
max_term_freq (float 0 to 1): The maximum threshold in number of documents in
which a term can exist in order to be included.\n# * prefix_length (integer):
The minimal number of prefix characters that must match a term in order to be a
suggestion.\n# * suggest_mode (missing, popular, always): The suggest mode
controls the way suggestions are included.\n# Examples of good lines:\n#
max_errors:2\n# confidence:2.0\n# max_term_freq:0.5\n# min_doc_freq:0.01\n#
prefix_length:2\n# suggest_mode:always\n#\n# </pre> <!-- leave this line
exactly as it is -->",
+ "cirrussearch-query-too-long": "Search request is longer than the
maximum allowed length. ($1 > $2)"
}
diff --git a/i18n/es.json b/i18n/es.json
index beb6573..2e550d8 100644
--- a/i18n/es.json
+++ b/i18n/es.json
@@ -28,5 +28,6 @@
"apihelp-cirrus-config-dump-description": "Volcado de configuración de
CirrusSearch.",
"apihelp-cirrus-mapping-dump-description": "Volcado de asignaciones de
CirrusSearch para este wiki.",
"apihelp-cirrus-settings-dump-description": "Volcado de configuración
de CirrusSearch para este wiki.",
- "cirrussearch-give-feedback": "Danos tu opinión"
+ "cirrussearch-give-feedback": "Danos tu opinión",
+ "cirrussearch-query-too-long": "La búsqueda es más larga que la
longitud máxima permitida. ($1 > $2)"
}
diff --git a/includes/Searcher.php b/includes/Searcher.php
index b7e216e..12c0651 100644
--- a/includes/Searcher.php
+++ b/includes/Searcher.php
@@ -54,6 +54,11 @@
const MAX_TITLE_SEARCH = 255;
/**
+ * Maximum length, in characters, allowed for text search queries.
+ */
+ const MAX_TEXT_SEARCH = 300;
+
+ /**
* Maximum offset depth allowed. Too deep will cause very slow queries.
* 100,000 feels plenty deep.
*/
@@ -357,6 +362,11 @@
$wgCirrusSearchBoostLinks,
$wgCirrusSearchAllFields,
$wgCirrusSearchAllFieldsForRescore;
+
+ $checkLengthStatus = $this->checkTextSearchRequestLength( $term
);
+ if ( !$checkLengthStatus->isOK() ) {
+ return $checkLengthStatus;
+ }
// Transform Mediawiki specific syntax to filters and extra
(pre-escaped) query string
$searcher = $this;
@@ -1739,7 +1749,7 @@
* @throws UsageException
*/
private function checkTitleSearchRequestLength( $search ) {
- $requestLength = strlen( $search );
+ $requestLength = mb_strlen( $search );
if ( $requestLength > self::MAX_TITLE_SEARCH ) {
throw new UsageException( 'Prefix search request was
longer than the maximum allowed length.' .
" ($requestLength > " . self::MAX_TITLE_SEARCH
. ')', 'request_too_long', 400 );
@@ -1747,6 +1757,18 @@
}
/**
+ * @param string $search
+ * @return Status
+ */
+ private function checkTextSearchRequestLength( $search ) {
+ $requestLength = mb_strlen( $search );
+ if ( $requestLength > self::MAX_TEXT_SEARCH ) {
+ return Status::newFatal( 'cirrussearch-query-too-long',
$requestLength, self::MAX_TEXT_SEARCH );
+ }
+ return Status::newGood();
+ }
+
+ /**
* Attempt to suck a leading namespace followed by a colon from the
query string. Reaches out to Elasticsearch to
* perform normalized lookup against the namespaces. Should be fast
but for the network hop.
*
--
To view, visit https://gerrit.wikimedia.org/r/230646
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: I5d15338c52a7c871fcc91fd2ec4019b34b71f7d8
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>