[
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159511#comment-14159511
]
Jack Krupansky commented on LUCENE-5989:
----------------------------------------
bq. rename StringField to KeywordField, making it more obvious that this field
isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors.
Both Lucene and Solr are suffering from a conflation of the two concepts of
treating an input stream as a single token ("a keyword") and as a sequence of
tokens ("sequence of keywords"). We have the "KeywordTokenizer" that does NOT
tokenize the input stream into "a sequence of keywords". The term "keyword
search" is commonly used to describe the ability of search engines to find
"individual keywords" in extended streams of "text" - a clear reference to
"keyword" in a tokenized stream.
So, I don't understand how it can be claimed that renaming StringField to
KeywordField makes anything "obvious" - it seems to me to be adding to the
existing confusion rather than clarifying anything. I mean, the term "keyword"
should be treated more as a synonym for "token" or "term", NOT as a synonym for
"string" or "raw character sequence".
I agree that we need a term for "raw, uninterpreted character sequence", but it
seems to me that "string" is a more "obvious" candidate than "keyword".
There has been some grumbling at the Solr level that KeywordTokenizer should be
renamed to... something, anything, but just not KeywordTokenizer, which
"obviously" implies that the input stream will be tokenized into a sequence of
keywords, which it does not.
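To make the two concepts concrete, here is a minimal plain-Java sketch. It is
NOT Lucene code: the class and method names are hypothetical stand-ins, and the
whitespace splitting is only an assumed example of "tokenizing". It contrasts
treating the input as a single token with treating it as a sequence of tokens.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene or Solr.
class KeywordVsTokens {

    // Concept 1: the whole input is one token (the behavior KeywordTokenizer
    // is described as having in this comment).
    static List<String> asSingleToken(String input) {
        return Collections.singletonList(input);
    }

    // Concept 2: the input is split into a sequence of tokens. Splitting on
    // whitespace is an assumption made for this sketch.
    static List<String> asTokenSequence(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String input = "New York City";
        System.out.println(asSingleToken(input));   // prints [New York City]
        System.out.println(asTokenSequence(input)); // prints [New, York, City]
    }
}
```

A "keyword search" in the everyday sense would match "York" against the second
result, not the first, which is why the name reads as backwards to me.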
In an effort to resolve this ongoing confusion, can somebody provide some
historical background as to how KeywordTokenizer got its name, and why a
subset of people continue to refer to an uninterpreted sequence of characters
as a "keyword" rather than a string? I checked the Javadoc, Jira, and even the
source code, but came up empty.
In short, it is a real eye-opener to see a claim that the term "keyword" in any
way makes it "obvious" that input is not tokenized!!
Maybe we could fix this for 5.0 to have a cleaner set of terminology going
forward. At a minimum, we should have some clarifying language in the Javadoc.
And hopefully we can refrain from making the confusion/conflation worse by
renaming StringField to KeywordField.
bq. Then a KeywordsField can take a String
Is that simply a typo or is the intent to have both a KeywordField (singular)
and a KeywordsField (plural)? I presume it is a typo, but... maybe it's a
Freudian slip and highlights this semantic difficulty that persists in the
Lucene terminology (and hence infects Solr terminology as well).
> Add BinaryField, to index a single binary token
> -----------------------------------------------
>
> Key: LUCENE-5989
> URL: https://issues.apache.org/jira/browse/LUCENE-5989
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-5989.patch
>
>
> 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the
> lowest levels of Lucene (the codec APIs) yet today, actually adding an
> arbitrary byte[] binary term during indexing is far from simple: you
> must make a custom Field with a custom TokenStream and a custom
> TermToBytesRefAttribute, as far as I know.
> This is supremely expert, I wonder if anyone out there has succeeded
> in doing so?
> I think we should make indexing a single byte[] as simple as indexing
> a single String.
> This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6
> address as byte[16]) and LUCENE-5879 (encoding native numeric values
> in their simple binary form).