[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

Jack Krupansky (JIRA) Sun, 05 Oct 2014 05:21:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159511#comment-14159511
 ]


Jack Krupansky commented on LUCENE-5989:
----------------------------------------

bq. rename StringField to KeywordField, making it more obvious that this field 
isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors.

Both Lucene and Solr are suffering from a conflation of the two concepts of 
treating an input stream as a single token ("a keyword") and as a sequence of 
tokens ("sequence of keywords"). We have the "KeywordTokenizer" that does NOT 
tokenize the input stream into "a sequence of keywords". The term "keyword 
search" is commonly used to describe the ability of search engines to find 
"individual keywords" in extended streams of "text" - a clear reference to 
"keyword" in a tokenized stream.

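To make the distinction concrete, here is a minimal plain-Java sketch. It is
not Lucene code: the class KeywordVsTokens and both helper methods are
hypothetical names made up for illustration, and whitespace splitting is just
an assumed stand-in for whatever a real tokenizer does. It only shows the two
behaviors that the word "keyword" gets used for.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: hypothetical helpers, not Lucene classes or APIs.
class KeywordVsTokens {

    // "Single token" behavior: the whole input becomes one term, unchanged.
    static List<String> asSingleToken(String input) {
        return Collections.singletonList(input);
    }

    // "Sequence of tokens" behavior: split the input on runs of whitespace.
    static List<String> asTokenSequence(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "New York City";
        System.out.println(asSingleToken(input));   // one term
        System.out.println(asTokenSequence(input)); // three terms
    }
}
```

Running main prints [New York City] followed by [New, York, City]: one term in
the first case, three in the second. The naming question is which of these two
behaviors a reader expects when a class has "Keyword" in its name.
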
So, I don't understand how it can be claimed that renaming StringField to 
KeywordField makes anything "obvious" - it seems to me to add to the 
existing confusion rather than clarify anything. I mean, the term "keyword" 
should be treated more as a synonym for "token" or "term", NOT as a synonym for 
"string" or "raw character sequence".

I agree that we need a term for "raw, uninterpreted character sequence", but it 
seems to me that "string" is a more "obvious" candidate than "keyword".

There has been some grumbling at the Solr level that KeywordTokenizer should be 
renamed to... something, anything, but just not KeywordTokenizer, a name that 
"obviously" implies that the input stream will be tokenized into a sequence of 
keywords, which it is not.

In an effort to resolve this ongoing confusion, can somebody provide some 
historical background as to how KeywordTokenizer got its name, and why a 
subset of people continue to refer to an uninterpreted sequence of characters 
as a "keyword" rather than a string? I checked the Javadoc, Jira, and even the 
source code, but came up empty.

In short, it is a real eye-opener to see a claim that the term "keyword" in any 
way makes it "obvious" that input is not tokenized!!

Maybe we could fix this for 5.0 to have a cleaner set of terminology going 
forward. At a minimum, we should have some clarifying language in the Javadoc. 
And hopefully we can refrain from making the confusion/conflation worse by 
renaming StringField to KeywordField.

bq.  Then a KeywordsField can take a String

Is that simply a typo or is the intent to have both a KeywordField (singular) 
and a KeywordsField (plural)? I presume it is a typo, but... maybe it's a 
Freudian slip and highlights this semantic difficulty that persists in the 
Lucene terminology (and hence infects Solr terminology as well.)


> Add BinaryField, to index a single binary token
> -----------------------------------------------
>
>                 Key: LUCENE-5989
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5989
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5989.patch
>
>
> 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the
> lowest levels of Lucene (the codec APIs) yet today, actually adding an
> arbitrary byte[] binary term during indexing is far from simple: you
> must make a custom Field with a custom TokenStream and a custom
> TermToBytesRefAttribute, as far as I know.
> This is supremely expert, I wonder if anyone out there has succeeded
> in doing so?
> I think we should make indexing a single byte[] as simple as indexing
> a single String.
> This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6
> address as byte[16]) and LUCENE-5879 (encoding native numeric values
> in their simple binary form).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

