[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490852#comment-14490852 ] ASF subversion and git services commented on LUCENE-5989: - Commit 1672843 from [~mikemccand] in branch 'dev/trunk' [ https://svn.apache.org/r1672843 ] LUCENE-5989: fix CHANGES entry > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch, LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490851#comment-14490851 ] ASF subversion and git services commented on LUCENE-5989: - Commit 1672842 from [~mikemccand] in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1672842 ] LUCENE-5989: allow passing BytesRef to StringField to make it easier to index arbitrary binary tokens > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch, LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490469#comment-14490469 ] ASF subversion and git services commented on LUCENE-5989: - Commit 1672781 from [~mikemccand] in branch 'dev/trunk' [ https://svn.apache.org/r1672781 ] LUCENE-5989: allow passing BytesRef to StringField to make it easier to index arbitrary binary tokens > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch, LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483065#comment-14483065 ] Uwe Schindler commented on LUCENE-5989: --- bq. I tried to fix BinaryTokenStreams attr to be "proper" as Uwe Schindler but ran into problems because this BytesRef is pre-shared up front to consumers, so we can't null it in clear... I don't think this is a problem here, because the TokenStream is only used internally and is never visible to the outside (isn't it?). Another thing is that Attribute's copyTo() does not deep clone, but this is also not an issue (because nobody has the chance to copy this tokenstream anywhere else). [~shaie] and I fixed TokensStreams in another issue, where payloads were not cloned (see changelog, don't have issue number). In general we should fix the TermToBytesRefAttribute and remove the horrible fillBytesRef, which was needed in Lucene 4.x because of some early Lucene 3 compatibility. But it makes it hard to use, so we should get rid of it. TermToBytesRefAttribute should only have a single method: getBytesRef() that returns the BytesRef. Generally I am fine. The issues Robert mentioned should be done in a separate issue. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch, LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483021#comment-14483021 ] Robert Muir commented on LUCENE-5989: - {quote} Maybe we could baby step here, and just change StoredFieldVisitor.stringField to take byte[]? I know this doesn't help all the stupid work we do during default merge to decode/encode but at least it's a start ... {quote} Thanks for looking into it. Maybe we can remove the smooshing in a separate issue. I think its really bad that our default merge impl creates so many strings. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch, LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482902#comment-14482902 ] Michael McCandless commented on LUCENE-5989: bq. If we fix this .document api to allow a StringField to have a binary value, maybe it could help with merge code. This would be very nice ... I struggled some with it, but got stuck with StorableField.stringValue() returning String. I think we need to keep that because that's also the API apps use to retrieve their stored fields. But the default merging operates on StorableDocument/StorableField, so I'm not sure how to separate the two. Really there are two concepts: the "schema" for this doc (did it store a binary or string value for this field), and what's used to represent a string value (byte[] vs String), and both concepts are being smooshed together into this API. Maybe we could baby step here, and just change StoredFieldVisitor.stringField to take byte[]? I know this doesn't help all the stupid work we do during default merge to decode/encode but at least it's a start ... > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481268#comment-14481268 ] Robert Muir commented on LUCENE-5989: - If we fix this .document api to allow a StringField to have a binary value, maybe it could help with merge code. Currently the StoredFieldsVisitor returns strings as java.lang.String, which is wasteful for the default merge implementation (it must decode/re-encode). If we could remove this and let the visitor deal with it, default merge could avoid this decode/re-encode and we might be able to even nuke some specialized bulk merge logic that we have solely for reasons like this (at the least we will speed up the worst case). I tried to look at this recently and the .document api stopped me. Not something we have to fix here, but just something related to think about when looking at how to change it. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481265#comment-14481265 ] Michael McCandless commented on LUCENE-5989: I'll switch to just adding a StringField ctor that takes a BytesRef ... less API. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481102#comment-14481102 ] Michael McCandless commented on LUCENE-5989: I think now that LUCENE-5879 is in, as a dark/unusable feature, we should add BinaryField (this issue) to at least shed a bit of light on it. With BinaryField, apps can efficiently prefix and range search anything they can convert to/from byte[], e.g. BigInteger/Decimal, InetAddress (LUCENE-5596), int/long/float/double, etc. On the LUCENE-6005 branch there is also a half-precision float (2 bytes). I don't think Lucene's lack of schema is a justifiable reason to block progress here. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161141#comment-14161141 ] Michael McCandless commented on LUCENE-5989: bq. Why jump to the conclusion that a user would have a byte[] already for an IP address? Thats a horrible representation Why wouldn't they pass java.net.Inet6Address? I agree: taking Inet6Address would be great, but under the hood it should be indexed as a byte[] (which this issue is trying to enable), right? I'll go reopen LUCENE-5596... bq. I'm just saying that if they then go do the following in queryparser, why can't it please work? (ranges too) +1, but that's really out of scope here? I mean I don't think we can solve all of Lucene's "schema" issues here. Seems like LUCENE-5596 should make it easy to do IP address range querying... > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160259#comment-14160259 ] Robert Muir commented on LUCENE-5989: - Why jump to the conclusion that a user would have a byte[] already for an IP address? Thats a horrible representation Why wouldn't they pass java.net.Inet6Address? I'm just saying that if they then go do the following in queryparser, why can't it please work? (ranges too) {code} ... AND address:"1.2::3.4" {code} Otherwise, if we don't want to make binary/numeric/etc fields "first class", and only treat them as bastardizations of text fields, then please, do this consistently everywhere, parse them as text everywhere, so that they will work correctly everywhere. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160101#comment-14160101 ] Michael McCandless commented on LUCENE-5989: I have also never understood the origin of "keyword" meaning the entire string is treated as one token. I don't think it's obvious. It is *consistent* with the existing KeywordAnalyzer/Tokenizer, but I don't think that's a good justification to further propagate non-obvious naming. I would rather rename KeywordTokenizer/Analyzer to something else... I guess net/net I would prefer here that we *not* add BinaryField and instead keep the name StringField, just giving it another ctor to take byte[]/BytesRef. Added classes have an API cost higher than just an added ctor, and the "purpose" of these two is exactly the same... bq. I don't like the violation that clear() is a no-op in BytesTermAttribute. In a correct world, this should null the bytesref and the TokenStream should set the BytesRef after clearAttributes. Thanks Uwe, I'll add a nocommit to somehow fix it ... seems like ByteTermAttributeImpl.clear must null out its copy of the bytes, and then BinaryTokenStream.reset must re-instate the next one (pulling it via the previous setValue call?). I guess I must add BinaryTokenStream.bytes too? Our analysis APIs are ... challenging. bq. So the solution is to proceed and make matters worse by requiring the user to also deal with the .document API? But if you can't even figure out how to get your IPv6 byte[] (LUCENE-5596) or your numeric value encoded as byte\[4] or byte\[8] (LUCENE-5879) into Lucene's IndexWriter in the first place, how will you even have any hope of querying it? > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159825#comment-14159825 ] David Smiley commented on LUCENE-5989: -- bq. This is supremely expert, I wonder if anyone out there has succeeded in doing so? {{org.apache.lucene.spatial.prefix.CellTokenStream}} :-)Though this doesn't count since it's in Lucene. +1 to make this easier via a BinaryField. With BinaryField and auto-prefixing, CellTokenStream won't be needed for indexing a point. But it's needed for other shapes and to support heat-map style faceting. Jack's opinion about the "Keyword" name being far from obvious really resonated with me. Despite Shai's reasonable explanation, it doesn't seem to me that changing the status-quo to anything non-obvious is helpful. And it wouldn't seem like the text equivalent of BinaryField -- for that the current name is perfect, I think. But I do like the idea of simply having StringField taking a byte[] too such that there is no BinaryField. Either way. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159528#comment-14159528 ] Shai Erera commented on LUCENE-5989: bq. Is that simply a typo Yes, fixed :). The term 'keyword' is of course overloaded here. When I propose KeywordField, I am following the existing Keyword* classes that we have: KeywordTokenizer, KeywordAnalyzer, KeywordAttribute. And from what I remember, when users ask how to parse 'keywords' they indexed as StringFields, we often tell them to use PerFieldAnalyzerWrapper with a KeywordAnalyzer for that field. That's why I feel that KeywordField fits better with the overall Keyword* tokenstream API. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159515#comment-14159515 ] Robert Muir commented on LUCENE-5989: - {quote} This is supremely expert, I wonder if anyone out there has succeeded in doing so? {quote} So the solution is to proceed and make matters worse by requiring the user to *also* deal with the .document API? If the user wants their field to work with various query-time features (queryparser, morelikethis, whatever), then they must deal with the tokenstream side anyway, so adding *Field doesn't help anything. It just adds yet another place they must plug in "schema" information (as opposed to only being once in Analyzer). Sure, its easier to get past indexwriter maybe, but you win the battle and lose the war. I'm not going to try to block the change, just please, please, please think about it. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159511#comment-14159511 ] Jack Krupansky commented on LUCENE-5989: bq. rename StringField to KeywordField, making it more obvious that this field isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors. Both Lucene and Solr are suffering from a conflation of the two concepts of treating an input stream as a single token ("a keyword") and as a sequence of tokens ("sequence of keywords"). We have the "KeywordTokenizer" that does NOT tokenize the input stream into "a sequence of keywords". The term "keyword search" is commonly used to describe the ability of search engines to find "individual keywords" in extended streams of "text" - a clear reference to "keyword" in a tokenized stream. So, I don't understand how it is claimed that naming StringField to KeywordField is making anything "obvious" - it seems to me to be adding to the existing confusion rather than clarifying anything. I mean, the term "keyword" should be treated more as a synonym for "token" or "term", NOT as synonym for "string" or "raw character sequence". I agree that we need a term for "raw, uninterpreted character sequence", but it seems to me that "string" is a more "obvious" candidate than "keyword". There has been some grumbling at the Solr level that KeywordTokenizer should be renamed to... something, anything, but just not KeywordTokenizer, which "obviously" implied that the input stream will be tokenized into a sequence of keywords, which it does not. In an effort to try to resolve this ongoing confusion, can somebody provide from historical background as to how KeywordTokenizer got its name, and how a subset of people continue to refer to an uninterpreted sequence of characters as a "keyword" rather than a string. I checked the Javadoc, Jira, and even the source code, but came up empty. In short, it is a real eye-opener to see a claim that the term "keyword" in any way makes it "obvious" that input is not tokenized!! Maybe we could fix this for 5.0 to have a cleaner set of terminology going forward. At a minimum, we should have some clarifying language in the Javadoc. And hopefully we can refrain from making the confusion/conflation worse by renaming StringField to KeywordField. bq. Then a KeywordsField can take a String Is that simply a typo or is the intent to have both a KeywordField (singular) and a KeywordsField (plural)? I presume it is a typo, but... maybe it's a Freudian slip and highlights this semantic difficulty that persists in the Lucene terminology (and hence infects Solr terminology as well.) > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159507#comment-14159507 ] Uwe Schindler commented on LUCENE-5989: --- bq. We could also rename StringField to KeywordField, making it more obvious that this field isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors. +1 bq. Patch, adding BinaryField I don't like the violation that clear() is a no-op in BytesTermAttribute. In a correct world, this should null the bytesref and the TokenStream should set the BytesRef after clearAttributes. This is not urgent here, but it violates the contract. I know NumericTermAttribute does similar things... :( > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159500#comment-14159500 ] Shai Erera commented on LUCENE-5989: bq. we could maybe instead just add a ctor for StringField taking BytesRef... We could also rename StringField to KeywordField, making it more obvious that this field isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors. > Add BinaryField, to index a single binary token > --- > > Key: LUCENE-5989 > URL: https://issues.apache.org/jira/browse/LUCENE-5989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 5.0, Trunk > > Attachments: LUCENE-5989.patch > > > 5 years ago (LUCENE-1458) we "enabled" fully binary terms in the > lowest levels of Lucene (the codec APIs) yet today, actually adding an > arbitrary byte[] binary term during indexing is far from simple: you > must make a custom Field with a custom TokenStream and a custom > TermToBytesRefAttribute, as far as I know. > This is supremely expert, I wonder if anyone out there has succeeded > in doing so? > I think we should make indexing a single byte[] as simple as indexing > a single String. > This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 > address as byte[16]) and LUCENE-5879 (encoding native numeric values > in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org