[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034120#comment-13034120
 ] 

Yonik Seeley commented on SOLR-2519:
------------------------------------

I think maybe there's a misconception that the fieldType named "text" was meant 
to be generic for all languages.  As I said in the thread, if I had to do it 
over again, I would have named it "text_en" because that's what its purpose 
was.  But at this point, it seems like the best way forward is to leave "text" 
as an English fieldType and simply add other fieldTypes that can support other 
languages.

Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):
- The current WordDelimiterFilter options on the fieldType feel like a trap for 
non-whitespace-delimited languages.  WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed).
- You left the English stemmer on the "text" fieldType... but if it's supposed 
to be generic, couldn't this be bad for some other Western languages where it 
could cause stemming collisions of words not related to each other?
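
To make the catenation concern concrete, here is a rough Python sketch.  It is 
a toy model, not Lucene's WordDelimiterFilter: the function names are made up 
for illustration, it only splits on non-alphanumeric characters (the real 
filter also splits on case changes and letter/number transitions), and the 
token order does not match what Lucene emits.  It assumes whitespace 
tokenization followed by "index the parts and also the catenation of the 
parts".

```python
import re


def whitespace_tokenize(text):
    # Stand-in for WhitespaceTokenizer: split on whitespace only.
    return text.split()


def split_and_catenate(token):
    # Toy stand-in for WDF with word parts and catenation both enabled:
    # split on runs of non-alphanumeric characters, keep the parts, and
    # also emit all the parts glued back together.
    parts = [p for p in re.split(r"[\W_]+", token) if p]
    out = list(parts)
    if len(parts) > 1:
        out.append("".join(parts))
    return out


def analyze(text):
    tokens = []
    for tok in whitespace_tokenize(text):
        tokens.extend(split_and_catenate(tok))
    return tokens


if __name__ == "__main__":
    # English: the catenation is a short, useful token.
    print(analyze("wi-fi router"))
    # Text without whitespace between words: the whole sentence arrives as
    # one token, so the catenation is the sentence minus its punctuation.
    print(analyze("我喜欢搜索,也喜欢索引。"))
```

For the English input the catenated token is just "wifi".  For the second 
input there is no whitespace, so the catenated token spans the entire 
sentence, and it keeps growing with the length of the text.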

Taking into account all the existing users (and all the existing documentation, 
examples, tutorial, etc), I favor a more conservative approach of adding new 
fieldTypes rather than radically changing the behavior of existing ones.

Random question: what are the implications of changing from WhitespaceTokenizer 
to StandardTokenizer, esp w.r.t. WDF?

> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase
