[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

JIRA Wed, 18 May 2011 15:23:32 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035796#comment-13035796
 ]


Jan Høydahl commented on SOLR-2519:
-----------------------------------

Largely agree with @Hoss' suggestion. But I think it would be wise to emphasize 
that the example schema is just that - an *example* - encouraging people to 
create new fieldTypes instead of editing the example ones. It's not a problem 
for "int", "date" etc, but for text I always encourage our customers and 
students to stay away from the FieldTypes in the example and make their own 
versions instead.

One way to further encourage this best practice is naming all text FieldTypes 
clearly as examples, e.g. 

{code}
<fieldType name="text_example_en" ..>
<fieldType name="text_example_generic" ..>
{code}

We must realize that a lot of non-American users out there are already 
customizing their schemas with the naming pattern "text_<lang>", which means 
you'll find "text_en", "text_it", "text_no" in a lot of installations. 
Therefore it would be unwise to introduce new FieldTypes in version 3.2 whose 
names clash with those out of the box; hence the suggestion to include 
"_example" in the type name.

When upgrading, I always leave all the example field types intact, and add my 
custom ones separately, clearly marked by comments for easy copy/paste. I 
believe this to be a fairly common practice, and a desirable one, and it would 
produce no clashes with the example names suggested above.
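
A minimal sketch of that layout, assuming a 3.x-style schema.xml. The custom 
type "text_no" and its analyzer chain are made up for illustration and are not 
taken from any real installation:

{code}
<types>
  <!-- Shipped example fieldTypes (e.g. "text_example_en") stay here,
       unmodified, so they can be replaced wholesale on upgrade -->

  <!-- ===== BEGIN custom field types (copy/paste this block on upgrade) ===== -->
  <!-- Hypothetical custom type; name and filters are illustrative only -->
  <fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <!-- ===== END custom field types ===== -->
</types>
{code}

Since no shipped type would be named "text_no", the custom block can be pasted 
into a new release's schema without colliding with anything.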

With this example naming practice, we can be pretty sure that if people talk 
about the fieldType "text_example_en" on the lists, they mean the default 
example type, but if they talk about "text_en", it's something they've 
customized themselves (even if only by renaming the example). People will feel 
more mental resistance to modifying something with "_example" in its name 
without also changing the name.

> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase
