[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034203#comment-13034203
 ] 

Robert Muir commented on SOLR-2519:
-----------------------------------

As someone frustrated by this (but who would ultimately like to move past it 
and try to help with solr's intl), I just wanted to say +1 to Hoss Man's 
proposal.

My only suggestion on what he said is that I would greatly prefer text_en over 
text_western or whatever for these reasons:
1. the stemming and stopwords and crap here are english.
2. for other western languages, even if you swap these out to be say, french or 
italian (which is the seemingly obvious way to cut over), the whole 
WDF+autophrase is still a huge trap (see 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance 
for an example). in this case use of ElisionFilter can be taken to avoid it.

> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to