[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

Michael McCandless (JIRA) Mon, 06 Oct 2014 01:39:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160101#comment-14160101
 ]


Michael McCandless commented on LUCENE-5989:
--------------------------------------------

I have also never understood the origin of "keyword" meaning the
entire string is treated as one token.  I don't think it's obvious.
It is *consistent* with the existing KeywordAnalyzer/Tokenizer, but I
don't think that's a good justification to further propagate non-obvious
naming.  I would rather rename KeywordTokenizer/Analyzer to
something else...

I guess net/net I would prefer here that we *not* add BinaryField and
instead keep the name StringField, just giving it another ctor to take
byte[]/BytesRef.  Added classes have an API cost higher than just an
added ctor, and the "purpose" of these two is exactly the same...

bq. I don't like the violation that clear() is a no-op in BytesTermAttribute. 
In a correct world, this should null the bytesref and the TokenStream should 
set the BytesRef after clearAttributes.

Thanks Uwe, I'll add a nocommit to somehow fix it ... seems like
BytesTermAttributeImpl.clear must null out its copy of the bytes, and
then BinaryTokenStream.reset must re-instate the next one (pulling it
via the previous setValue call?).  I guess I must add
BinaryTokenStream.bytes too?  Our analysis APIs are ... challenging.
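
To make the clear()/re-instate idea concrete, here is a rough stand-alone
sketch. It is only a model: BytesTermAttributeModel and BinaryTokenStreamModel
are hypothetical stand-ins, not the classes in the patch and not Lucene's
analysis API. It assumes the stream keeps its own copy of the value (the
"BinaryTokenStream.bytes" idea) and sets it back on the attribute after
clearing, as Uwe suggested.

```java
// Stand-alone model of the contract discussed above.
// These are hypothetical stand-ins, NOT Lucene classes.
final class BytesTermAttributeModel {
    private byte[] bytes; // null means "no term currently set"

    void setBytes(byte[] bytes) { this.bytes = bytes; }

    byte[] getBytes() { return bytes; }

    // Honors the attribute contract: clear() drops the reference
    // instead of being a no-op.
    void clear() { bytes = null; }
}

final class BinaryTokenStreamModel {
    private final BytesTermAttributeModel termAtt = new BytesTermAttributeModel();
    private byte[] value;        // the stream's own copy; survives clear()
    private boolean used = true; // no token is available until reset()

    void setValue(byte[] value) { this.value = value; }

    void reset() { used = false; }

    // Emits the single binary token at most once per reset().
    boolean incrementToken() {
        if (used || value == null) {
            return false;
        }
        termAtt.clear();         // stands in for clearAttributes()
        termAtt.setBytes(value); // re-instate the term after clearing
        used = true;
        return true;
    }

    BytesTermAttributeModel termAttribute() { return termAtt; }
}
```

With this shape, a consumer that calls clear() on the attribute between
documents never sees stale bytes, and the next reset()/incrementToken() pair
puts the value back.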

bq. So the solution is to proceed and make matters worse by requiring the user 
to also deal with the .document API?

But if you can't even figure out how to get your IPv6 byte[]
(LUCENE-5596) or your numeric value encoded as byte[4] or byte[8]
(LUCENE-5879) into Lucene's IndexWriter in the first place, how will
you even have any hope of querying it?
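
For reference, producing such byte[] values needs nothing beyond the JDK.
The sketch below is only an illustration: BinaryTermEncodingSketch and its
method names are made up here, and the sign-bit flip is my assumption (it
makes unsigned byte-wise order agree with numeric order), not necessarily
the encoding those issues will settle on.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene or of the attached patch.
final class BinaryTermEncodingSketch {

    // int -> byte[4], big-endian (ByteBuffer's default order).
    // Assumption: flip the sign bit so that comparing the bytes as
    // unsigned values orders terms the same way as the signed ints.
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
    }

    // long -> byte[8], same scheme as encodeInt.
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ 0x8000000000000000L).array();
    }

    // IPv6 literal such as "2001:db8::1" -> byte[16].
    // Pass a literal address; a host name would trigger a DNS lookup.
    static byte[] encodeIPv6(String literal) {
        try {
            byte[] raw = InetAddress.getByName(literal).getAddress();
            if (raw.length != 16) {
                throw new IllegalArgumentException("not an IPv6 address: " + literal);
            }
            return raw;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("cannot parse: " + literal, e);
        }
    }
}
```

Getting the bytes is the easy half; handing them to IndexWriter as a single
term is the part this issue is about.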


> Add BinaryField, to index a single binary token
> -----------------------------------------------
>
>                 Key: LUCENE-5989
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5989
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5989.patch
>
>
> 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the
> lowest levels of Lucene (the codec APIs) yet today, actually adding an
> arbitrary byte[] binary term during indexing is far from simple: you
> must make a custom Field with a custom TokenStream and a custom
> TermToBytesRefAttribute, as far as I know.
> This is supremely expert, I wonder if anyone out there has succeeded
> in doing so?
> I think we should make indexing a single byte[] as simple as indexing
> a single String.
> This is a precursor for issues like LUCENE-5596 (encoding IPv6
> address as byte[16]) and LUCENE-5879 (encoding native numeric values
> in their simple binary form).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

