[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490852#comment-14490852
 ] 

ASF subversion and git services commented on LUCENE-5989:
-

Commit 1672843 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1672843 ]

LUCENE-5989: fix CHANGES entry

> Add BinaryField, to index a single binary token
> ---
>
> Key: LUCENE-5989
> URL: https://issues.apache.org/jira/browse/LUCENE-5989
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-5989.patch, LUCENE-5989.patch
>
>
> 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the
> lowest levels of Lucene (the codec APIs) yet today, actually adding an
> arbitrary byte[] binary term during indexing is far from simple: you
> must make a custom Field with a custom TokenStream and a custom
> TermToBytesRefAttribute, as far as I know.
> This is supremely expert, I wonder if anyone out there has succeeded
> in doing so?
> I think we should make indexing a single byte[] as simple as indexing
> a single String.
> This is a precursor for issues like LUCENE-5596 (encoding IPv6
> address as byte[16]) and LUCENE-5879 (encoding native numeric values
> in their simple binary form).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490851#comment-14490851
 ] 

ASF subversion and git services commented on LUCENE-5989:
-

Commit 1672842 from [~mikemccand] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1672842 ]

LUCENE-5989: allow passing BytesRef to StringField to make it easier to index 
arbitrary binary tokens




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490469#comment-14490469
 ] 

ASF subversion and git services commented on LUCENE-5989:
-

Commit 1672781 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1672781 ]

LUCENE-5989: allow passing BytesRef to StringField to make it easier to index 
arbitrary binary tokens




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483065#comment-14483065
 ] 

Uwe Schindler commented on LUCENE-5989:
---

bq. I tried to fix BinaryTokenStream's attr to be "proper" as Uwe Schindler 
suggested but ran into problems because this BytesRef is pre-shared up front to 
consumers, so we can't null it in clear...

I don't think this is a problem here, because the TokenStream is only used 
internally and is never visible to the outside (is it?). Another thing is 
that the Attribute's copyTo() does not deep clone, but this is also not an issue 
(because nobody has the chance to copy this TokenStream anywhere else). 
[~shaie] and I fixed TokenStreams in another issue, where payloads were not 
cloned (see the changelog; I don't have the issue number).

In general we should fix the TermToBytesRefAttribute and remove the horrible 
fillBytesRef, which was needed in Lucene 4.x because of some early Lucene 3 
compatibility. But it makes it hard to use, so we should get rid of it. 
TermToBytesRefAttribute should only have a single method: getBytesRef() that 
returns the BytesRef.

Generally I am fine. The issues Robert mentioned should be done in a separate 
issue.




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483021#comment-14483021
 ] 

Robert Muir commented on LUCENE-5989:
-

{quote}
Maybe we could baby step here, and just change StoredFieldVisitor.stringField 
to take byte[]? I know this doesn't help all the stupid work we do during 
default merge to decode/encode but at least it's a start ...
{quote}

Thanks for looking into it. Maybe we can remove the smooshing in a separate 
issue. I think it's really bad that our default merge impl creates so many 
strings.




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482902#comment-14482902
 ] 

Michael McCandless commented on LUCENE-5989:


bq. If we fix this .document api to allow a StringField to have a binary value, 
maybe it could help with merge code.

This would be very nice ... I struggled some with it, but got stuck with 
StorableField.stringValue() returning String.  I think we need to keep that 
because that's also the API apps use to retrieve their stored fields.

But the default merging operates on StorableDocument/StorableField, so I'm not 
sure how to separate the two.  Really there are two concepts: the "schema" for 
this doc (did it store a binary or string value for this field), and what's 
used to represent a string value (byte[] vs String), and both concepts are 
being smooshed together into this API.

Maybe we could baby step here, and just change StoredFieldVisitor.stringField 
to take byte[]?  I know this doesn't help all the stupid work we do during 
default merge to decode/encode but at least it's a start ...




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481268#comment-14481268
 ] 

Robert Muir commented on LUCENE-5989:
-

If we fix this .document api to allow a StringField to have a binary value, 
maybe it could help with merge code.

Currently the StoredFieldVisitor returns strings as java.lang.String, which is 
wasteful for the default merge implementation (it must decode/re-encode). If we 
could remove this and let the visitor deal with it, default merge could avoid 
this decode/re-encode and we might be able to even nuke some specialized bulk 
merge logic that we have solely for reasons like this (at the least we will 
speed up the worst case). I tried to look at this recently and the .document 
api stopped me. 

Not something we have to fix here, but just something related to think about 
when looking at how to change it.
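
For illustration only, a minimal sketch of the decode/re-encode round trip 
described above, assuming the stored value is held as UTF-8 bytes. It uses no 
Lucene API; StoredStringCopy and its methods are hypothetical names, not code 
from any patch on this issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the two ways a merge could copy a stored string value.
final class StoredStringCopy {

    // Path 1: what a String-returning visitor forces on a merge:
    // decode the UTF-8 bytes into a String, then encode them again.
    static byte[] copyViaString(byte[] storedUtf8) {
        String decoded = new String(storedUtf8, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Path 2: hand the bytes straight through, with no String allocated.
    static byte[] copyRaw(byte[] storedUtf8) {
        return Arrays.copyOf(storedUtf8, storedUtf8.length);
    }

    public static void main(String[] args) {
        byte[] stored = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        // For valid UTF-8 input both paths produce the same bytes.
        System.out.println(Arrays.equals(copyViaString(stored), copyRaw(stored)));
    }
}
```

For valid UTF-8 input both paths yield identical bytes, so the String in the 
first path is pure overhead from the merge's point of view.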




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481265#comment-14481265
 ] 

Michael McCandless commented on LUCENE-5989:


I'll switch to just adding a StringField ctor that takes a BytesRef ... less 
API.




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2015-04-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481102#comment-14481102
 ] 

Michael McCandless commented on LUCENE-5989:


I think now that LUCENE-5879 is in, as a dark/unusable feature, we should add 
BinaryField (this issue) to at least shed a bit of light on it.

With BinaryField, apps can efficiently prefix and range search anything they 
can convert to/from byte[], e.g. BigInteger/Decimal, InetAddress (LUCENE-5596), 
int/long/float/double, etc.  On the LUCENE-6005 branch there is also a 
half-precision float (2 bytes).

I don't think Lucene's lack of schema is a justifiable reason to block progress 
here.
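
To make the "convert to/from byte[]" point concrete, here is a minimal sketch 
(an illustration under my own assumptions, not code from the patch). Flipping 
the sign bit and writing the int big-endian makes unsigned byte-wise order agree 
with signed numeric order, which is the property a binary term needs for range 
search. SortableIntBytes and its methods are hypothetical names.

```java
// Hypothetical helper: int <-> 4-byte term whose unsigned byte order matches int order.
final class SortableIntBytes {

    // Flip the sign bit, then write big-endian, so negatives sort before positives.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Inverse of encode(): reassemble the big-endian int and flip the sign bit back.
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xFF) << 24)
                    | ((bytes[1] & 0xFF) << 16)
                    | ((bytes[2] & 0xFF) << 8)
                    | (bytes[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareTerms(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {Integer.MIN_VALUE, -5, 0, 7, Integer.MAX_VALUE};
        for (int i = 1; i < values.length; i++) {
            // Each encoded value should compare greater than the previous one.
            System.out.println(compareTerms(encode(values[i - 1]), encode(values[i])) < 0);
        }
    }
}
```

The same idea extends to long, and to float/double after mapping them to a 
sortable integer form first.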




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161141#comment-14161141
 ] 

Michael McCandless commented on LUCENE-5989:


bq. Why jump to the conclusion that a user would have a byte[] already for an 
IP address? That's a horrible representation. Why wouldn't they pass 
java.net.Inet6Address?

I agree: taking Inet6Address would be great, but under the hood it should be 
indexed as a byte[] (which this issue is trying to enable), right?  I'll go 
reopen LUCENE-5596...

bq. I'm just saying that if they then go do the following in queryparser, why 
can't it please work? (ranges too)

+1, but that's really out of scope here?  I mean I don't think we can solve all 
of Lucene's "schema" issues here.  Seems like LUCENE-5596 should make it easy 
to do IP address range querying...
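
As a rough sketch of "take an InetAddress, index a byte[] under the hood" (my 
own illustration, not what LUCENE-5596 implements): the address is reduced to a 
fixed 16-byte form, and a range check becomes an unsigned byte comparison. 
IpTermEncoder and its methods are hypothetical names; mapping IPv4 into the 
::ffff:a.b.c.d form is an assumption made here so every term has the same width.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: IP address -> fixed-width 16-byte term.
final class IpTermEncoder {

    // IPv6 addresses are already 16 bytes; IPv4 is placed in the IPv4-mapped form.
    static byte[] encode(InetAddress address) {
        byte[] raw = address.getAddress();
        if (raw.length == 16) {
            return raw;
        }
        byte[] mapped = new byte[16];
        mapped[10] = (byte) 0xFF;
        mapped[11] = (byte) 0xFF;
        System.arraycopy(raw, 0, mapped, 12, 4);
        return mapped;
    }

    // Convenience for literal addresses; a literal is only parsed, not resolved via DNS.
    static byte[] encodeLiteral(String literal) {
        try {
            return encode(InetAddress.getByName(literal));
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a valid IP literal: " + literal, e);
        }
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Inclusive range check on encoded terms.
    static boolean inRange(byte[] term, byte[] lower, byte[] upper) {
        return compareUnsigned(term, lower) >= 0 && compareUnsigned(term, upper) <= 0;
    }

    public static void main(String[] args) {
        byte[] lower = encodeLiteral("2001:db8::0");
        byte[] upper = encodeLiteral("2001:db8::ffff");
        System.out.println(inRange(encodeLiteral("2001:db8::42"), lower, upper));
    }
}
```

The user-facing API can still accept InetAddress; only the encoded term needs 
to be binary.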






[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160259#comment-14160259
 ] 

Robert Muir commented on LUCENE-5989:
-

Why jump to the conclusion that a user would have a byte[] already for an IP 
address? That's a horrible representation. Why wouldn't they pass 
java.net.Inet6Address?

I'm just saying that if they then go do the following in queryparser, why can't 
it please work? (ranges too)

{code}
... AND address:"1.2::3.4" 
{code}

Otherwise, if we don't want to make binary/numeric/etc fields "first class", 
and only treat them as bastardizations of text fields, then please, do this 
consistently everywhere, parse them as text everywhere, so that they will work 
correctly everywhere.





[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160101#comment-14160101
 ] 

Michael McCandless commented on LUCENE-5989:


I have also never understood the origin of "keyword" meaning the
entire string is treated as one token.  I don't think it's obvious.
It is *consistent* with the existing KeywordAnalyzer/Tokenizer, but I
don't think that's a good justification to further propagate non-obvious
naming.  I would rather rename KeywordTokenizer/Analyzer to
something else...

I guess net/net I would prefer here that we *not* add BinaryField and
instead keep the name StringField, just giving it another ctor to take
byte[]/BytesRef.  Added classes have an API cost higher than just an
added ctor, and the "purpose" of these two is exactly the same...

bq. I don't like the violation that clear() is a no-op in BytesTermAttribute. 
In a correct world, this should null the bytesref and the TokenStream should 
set the BytesRef after clearAttributes.

Thanks Uwe, I'll add a nocommit to somehow fix it ... seems like
BytesTermAttributeImpl.clear must null out its copy of the bytes, and
then BinaryTokenStream.reset must reinstate the next one (pulling it
via the previous setValue call?).  I guess I must add
BinaryTokenStream.bytes too?  Our analysis APIs are ... challenging.
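
A minimal sketch of the lifecycle being discussed, using stand-in classes 
rather than the real analysis API (BytesAttributeSketch and 
SingleTokenStreamSketch are hypothetical names, and the exact shape is an 
assumption): clear() really nulls the value, and the stream keeps its own copy 
so it can set the attribute again after clearing.

```java
// Hypothetical stand-in for an attribute holding one binary value.
final class BytesAttributeSketch {
    private byte[] bytes;

    void setBytes(byte[] value) {
        this.bytes = value;
    }

    byte[] getBytes() {
        return bytes;
    }

    // Not a no-op: the attribute forgets its value.
    void clear() {
        bytes = null;
    }
}

// Hypothetical stand-in for a stream that emits exactly one binary token.
final class SingleTokenStreamSketch {
    private final BytesAttributeSketch attribute = new BytesAttributeSketch();
    private byte[] value;            // the stream's own copy of the next token
    private boolean exhausted = true;

    BytesAttributeSketch attribute() {
        return attribute;
    }

    void setValue(byte[] value) {
        this.value = value;
    }

    // Re-arm the stream so the stored value is emitted once more.
    void reset() {
        exhausted = false;
    }

    // Clear first, then set the attribute from the stream's own copy.
    boolean incrementToken() {
        attribute.clear();
        if (exhausted) {
            return false;
        }
        attribute.setBytes(value);
        exhausted = true;
        return true;
    }

    public static void main(String[] args) {
        SingleTokenStreamSketch stream = new SingleTokenStreamSketch();
        stream.setValue(new byte[] {1, 2, 3});
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(stream.attribute().getBytes().length + " bytes");
        }
    }
}
```

The cost is the extra field on the stream; the benefit is that a consumer never 
sees a stale value after the stream is exhausted.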

bq. So the solution is to proceed and make matters worse by requiring the user 
to also deal with the .document API?

But if you can't even figure out how to get your IPv6 byte[]
(LUCENE-5596) or your numeric value encoded as byte[4] or byte[8]
(LUCENE-5879) into Lucene's IndexWriter in the first place, how will
you even have any hope of querying it?





[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-05 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159825#comment-14159825
 ] 

David Smiley commented on LUCENE-5989:
--

bq. This is supremely expert, I wonder if anyone out there has succeeded in 
doing so?

{{org.apache.lucene.spatial.prefix.CellTokenStream}} :-) Though this doesn't 
count since it's in Lucene.

+1 to make this easier via a BinaryField.  With BinaryField and auto-prefixing, 
CellTokenStream won't be needed for indexing a point.  But it's needed for 
other shapes and to support heat-map style faceting.

Jack's opinion about the "Keyword" name being far from obvious really resonated 
with me.  Despite Shai's reasonable explanation, it doesn't seem to me that 
changing the status-quo to anything non-obvious is helpful.  And it wouldn't 
seem like the text equivalent of BinaryField -- for that the current name is 
perfect, I think.  But I do like the idea of simply having StringField taking a 
byte[] too such that there is no BinaryField.  Either way.




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-05 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159528#comment-14159528
 ] 

Shai Erera commented on LUCENE-5989:


bq. Is that simply a typo

Yes, fixed :).

The term 'keyword' is of course overloaded here. When I propose KeywordField, I 
am following the existing Keyword* classes that we have: KeywordTokenizer, 
KeywordAnalyzer, KeywordAttribute. And from what I remember, when users ask how 
to parse 'keywords' they indexed as StringFields, we often tell them to use 
PerFieldAnalyzerWrapper with a KeywordAnalyzer for that field. That's why I 
feel that KeywordField fits better with the overall Keyword* tokenstream API.




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159515#comment-14159515
 ] 

Robert Muir commented on LUCENE-5989:
-

{quote}
This is supremely expert, I wonder if anyone out there has succeeded
in doing so?
{quote}

So the solution is to proceed and make matters worse by requiring the user to 
*also* deal with the .document API? 

If the user wants their field to work with various query-time features 
(queryparser, morelikethis, whatever), then they must deal with the tokenstream 
side anyway, so adding *Field doesn't help anything. It just adds yet another 
place they must plug in "schema" information (as opposed to only being once in 
Analyzer). Sure, it's easier to get past IndexWriter maybe, but you win the 
battle and lose the war.

I'm not going to try to block the change, just please, please, please think 
about it.




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-05 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159511#comment-14159511
 ] 

Jack Krupansky commented on LUCENE-5989:


bq. rename StringField to KeywordField, making it more obvious that this field 
isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors.

Both Lucene and Solr are suffering from a conflation of the two concepts of 
treating an input stream as a single token ("a keyword") and as a sequence of 
tokens ("sequence of keywords"). We have the "KeywordTokenizer" that does NOT 
tokenize the input stream into "a sequence of keywords". The term "keyword 
search" is commonly used to describe the ability of search engines to find 
"individual keywords" in extended streams of "text" - a clear reference to 
"keyword" in a tokenized stream.
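
To make the two senses of "keyword" concrete, here is a minimal sketch in plain Java. It does not use Lucene at all; the class and method names (KeywordVsTokens, asSingleToken, asTokenSequence) are hypothetical and only illustrate the distinction, and the whitespace split is an assumed stand-in for real tokenization.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: these are hypothetical helpers, not Lucene classes.
final class KeywordVsTokens {

    // Sense 1: the whole input is kept as exactly one token, unmodified.
    // This is what KeywordTokenizer and StringField do conceptually.
    static List<String> asSingleToken(String input) {
        return Collections.singletonList(input);
    }

    // Sense 2: the input is split into a sequence of tokens, here on
    // whitespace. This is the sense behind the phrase "keyword search".
    static List<String> asTokenSequence(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "New York City";
        System.out.println(asSingleToken(input));   // [New York City]
        System.out.println(asTokenSequence(input)); // [New, York, City]
    }
}
```

The same input yields one token in the first sense and three in the second, which is the conflation at issue.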

So, I don't understand how it is claimed that renaming StringField to 
KeywordField makes anything "obvious" - it seems to me to add to the 
existing confusion rather than clarify anything. I mean, the term "keyword" 
should be treated more as a synonym for "token" or "term", NOT as a synonym for 
"string" or "raw character sequence".

I agree that we need a term for "raw, uninterpreted character sequence", but it 
seems to me that "string" is a more "obvious" candidate than "keyword".

There has been some grumbling at the Solr level that KeywordTokenizer should be 
renamed to... something, anything, but just not KeywordTokenizer, which 
"obviously" implies that the input stream will be tokenized into a sequence of 
keywords, which it is not.

In an effort to resolve this ongoing confusion, can somebody provide some 
historical background on how KeywordTokenizer got its name, and on why a 
subset of people continue to refer to an uninterpreted sequence of characters 
as a "keyword" rather than a string? I checked the Javadoc, Jira, and even the 
source code, but came up empty.

In short, it is a real eye-opener to see a claim that the term "keyword" in any 
way makes it "obvious" that input is not tokenized!!

Maybe we could fix this for 5.0 to have a cleaner set of terminology going 
forward. At a minimum, we should have some clarifying language in the Javadoc. 
And hopefully we can refrain from making the confusion/conflation worse by 
renaming StringField to KeywordField.

bq.  Then a KeywordsField can take a String

Is that simply a typo, or is the intent to have both a KeywordField (singular) 
and a KeywordsField (plural)? I presume it is a typo, but... maybe it's a 
Freudian slip that highlights the semantic difficulty that persists in the 
Lucene terminology (and hence infects Solr terminology as well).





[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159507#comment-14159507
 ] 

Uwe Schindler commented on LUCENE-5989:
---

bq. We could also rename StringField to KeywordField, making it more obvious 
that this field isn't tokenized. Then a KeywordsField can take a String or 
BytesRef in ctors.

+1

bq. Patch, adding BinaryField

I don't like the violation that clear() is a no-op in BytesTermAttribute. In a 
correct world, clear() should null the BytesRef and the TokenStream should set 
the BytesRef after clearAttributes().

This is not urgent here, but it violates the contract. I know 
NumericTermAttribute does similar things... :(
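
A rough sketch of the contract described above, in plain Java. It uses hypothetical stand-in classes (BytesAttr, SingleBinaryTokenStream), not Lucene's own attribute or TokenStream classes and not the code in the patch; it only shows the ordering "clear first, then set the value for the current token".

```java
import java.util.Arrays;

// Illustration only: stand-ins that model the contract, not Lucene classes.
final class ClearContractSketch {

    // Stand-in for an attribute that carries one binary term.
    static final class BytesAttr {
        private byte[] bytes;

        void setBytes(byte[] bytes) { this.bytes = bytes; }
        byte[] getBytes() { return bytes; }

        // Honors the contract: clear() really resets state, it is not a no-op.
        void clear() { bytes = null; }
    }

    // Stand-in for a stream that emits exactly one binary token.
    static final class SingleBinaryTokenStream {
        private final BytesAttr attr = new BytesAttr();
        private final byte[] value;
        private boolean used = false;

        SingleBinaryTokenStream(byte[] value) { this.value = value; }

        BytesAttr attribute() { return attr; }

        boolean incrementToken() {
            if (used) {
                return false;      // stream exhausted
            }
            attr.clear();          // plays the role of clearAttributes()
            attr.setBytes(value);  // set the term only after clearing
            used = true;
            return true;
        }

        void reset() { used = false; }
    }

    public static void main(String[] args) {
        SingleBinaryTokenStream ts =
            new SingleBinaryTokenStream(new byte[] {1, 2, 3});
        while (ts.incrementToken()) {
            System.out.println(Arrays.toString(ts.attribute().getBytes()));
        }
    }
}
```

Because the stream re-sets the value on every token, clear() can safely null the reference without losing anything.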




[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token

2014-10-05 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159500#comment-14159500
 ] 

Shai Erera commented on LUCENE-5989:


bq. we could maybe instead just add a ctor for StringField taking BytesRef...

We could also rename StringField to KeywordField, making it more obvious that 
this field isn't tokenized. Then a KeywordsField can take a String or BytesRef 
in ctors.
