[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

Michael McCandless (JIRA) Mon, 16 May 2011 11:00:33 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034154#comment-13034154
 ]


Michael McCandless commented on SOLR-2519:
------------------------------------------

bq. I think maybe there's a misconception that the fieldType named "text" was 
meant to be generic for all languages.

Regardless of what the original intention was, "text" today has become
the generic text fieldType new users use on starting with Solr.  I
mean, it has the perfect name for that :)

bq. As I said in the thread, if I had to do it over again, I would have named 
it "text_en" because that's what its purpose was.

Hindsight is 20/20... but, we can still fix this today.  We shouldn't
lock ourselves into poor defaults.

Especially, as things improve and we get better analyzers, etc., we
should be free to improve the defaults in schema.xml to take advantage
of these improvements.

bq. But at this point, it seems like the best way forward is to leave "text" as 
an English fieldType and simply add other fieldTypes that can support other 
languages.

I think this is a dangerous approach -- the name (ie, missing _en if
in fact it has such English-specific configuration) is misleading and
traps new users.

Ideally, in the future, we wouldn't even have a "text" fieldType, only
text_XX per-language examples and then maybe something like
text_general, which you use if you cannot find your language.

{quote}
Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):

The current WordDelimiterFilter options on the fieldType feel like a trap for 
non-whitespace-delimited languages. WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed.)
{quote}
Ahh good point.  I think we should remove WDF altogether from the
generic "text" fieldType.
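
To make the concern concrete, here is a toy Python sketch of what
"split and also catenate" does to a whitespace-delimited token.  It is
not WordDelimiterFilter's real logic: whitespace_tokenize and
split_and_catenate are hypothetical helpers, and splitting on non-word
characters is a simplifying assumption.

```python
import re


def whitespace_tokenize(text):
    # Stand-in for a whitespace tokenizer: split on runs of whitespace only.
    return text.split()


def split_and_catenate(token):
    # Hypothetical, simplified imitation of "split and also catenate":
    # break the token on non-word characters, then also emit the parts joined.
    parts = [p for p in re.split(r"\W+", token) if p]
    tokens = list(parts)
    if len(parts) > 1:
        tokens.append("".join(parts))
    return tokens


english = whitespace_tokenize("wi-fi router")
chinese = whitespace_tokenize("今天天气很好，我们去公园散步。")

print([split_and_catenate(t) for t in english])
print([split_and_catenate(t) for t in chinese])
```

For the English input the catenated form is just the short token
"wifi".  The Chinese sentence has no spaces, so it arrives as a single
token, and the catenated form is nearly the whole sentence.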

{quote}
You left the English stemmer on the "text" fieldType... but if it's supposed to 
be generic, couldn't this be bad for some other western languages where it 
could cause stemming collisions of words not related to each other?
{quote}

+1, we should remove the stemming too from "text".

bq. Taking into account all the existing users (and all the existing 
documentation, examples, tutorial, etc), I favor a more conservative approach 
of adding new fieldTypes rather than radically changing the behavior of 
existing ones.

Can you point to specific examples (docs, examples, tutorial)?  I'd
like to understand how much work it is to fix these...

My feeling is we should simply do the work here (I'll sign up to it)
and fix any places that actually rely on the specifics of "text"
fieldType, eg autophrase.

We shouldn't avoid fixing things well because it's gonna be more work
today, especially if someone (me) is signing up to do it.

Also: existing users would be unaffected by this?  They've already
copied over / edited their own schema.xml?  This is mainly about new
users?


> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase
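
The effect of auto-phrase on a query with no whitespace can be sketched
with a toy parser in Python.  This is only an illustration of the
behavior as described in LUCENE-2458, not Lucene's query parser:
toy_analyze and parse are hypothetical names, and emitting one token per
non-ASCII character is a simplifying assumption about the analyzer.

```python
def toy_analyze(chunk):
    # Hypothetical analyzer: an ASCII chunk stays one lowercased token,
    # anything else is broken into one token per character.
    if chunk.isascii():
        return [chunk.lower()]
    return list(chunk)


def parse(query, auto_phrase):
    clauses = []
    # The parser first splits the query on whitespace, then analyzes each chunk.
    for chunk in query.split():
        tokens = toy_analyze(chunk)
        if auto_phrase and len(tokens) > 1:
            # One chunk produced several tokens: treat them as a phrase,
            # i.e. they must appear adjacent and in this order.
            clauses.append(("PHRASE", tokens))
        else:
            # Otherwise each token is an independent term clause.
            clauses.extend(("TERM", [t]) for t in tokens)
    return clauses


print(parse("搜索引擎", auto_phrase=True))
print(parse("搜索引擎", auto_phrase=False))
```

With auto-phrase on, the unsegmented query collapses into a single
phrase clause, so only documents containing that exact character
sequence match.  With it off, each token is an independent term.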

