btw, does anyone have a guess at how expensive this ByteBuffer/CharBuffer.wrap() is?
Looking at the collation support, we could maybe improve IndexableBinaryStringTools by adding methods that take char[]/byte[] with an offset and length. The existing ByteBuffer/CharBuffer methods could stay: they are consistent with the Charset API and are not wrong imo, but they would defer to the new char[]/byte[] ones. The current buffer-based ones require the buffer to have a backing array anyway, or they throw an exception.

On Wed, Nov 18, 2009 at 2:12 PM, Earwin Burrfoot (JIRA) <j...@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779602#action_12779602 ]
>
> Earwin Burrfoot commented on LUCENE-1799:
> -----------------------------------------
>
> bq. as far as the encoding itself, BOCU-1 is available in the ICU library
> ICU's API requires to use ByteBuffer and CharBuffer for input/output. And
> even if I missed some nice method, encoder/decoder operates internally on
> said buffers. Thus, a wrap/unwrap for each String is inevitable.
>
> > Unicode compression
> > -------------------
> >
> >                 Key: LUCENE-1799
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Store
> >    Affects Versions: 2.4.1
> >            Reporter: DM Smith
> >            Priority: Minor
> >
> > In LUCENE-1793, there is the off-topic suggestion to provide compression
> > of Unicode data. The motivation was a custom encoding in a Russian analyzer.
> > The original supposition was that it provided a more compact index.
> > This led to the comment that a different or compressed encoding would be
> > a generally useful feature.
> > BOCU-1 was suggested as a possibility. This is a patented algorithm by
> > IBM with an implementation in ICU. If Lucene provides its own implementation,
> > a freely available, royalty-free license would need to be obtained.
> > SCSU is another Unicode compression algorithm that could be used.
> > An advantage of these methods is that they work on the whole of Unicode.
> > If that is not needed, an encoding such as iso8859-1 (or whatever covers the
> > input) could be used.

-- 
Robert Muir
rcm...@gmail.com
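
P.S. To make the delegation idea concrete, here is a rough sketch of what it could look like. Everything in it is made up for illustration: the class name, the method signatures, and the trivial stand-in "encoding" (each byte widened to one char) are assumptions, not the actual IndexableBinaryStringTools code. It only shows the buffer-based method checking for backing arrays and then deferring to the array-based one.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

/** Hypothetical sketch; not the real IndexableBinaryStringTools API. */
final class BufferDelegationSketch {

    /**
     * Array-based core method. The "encoding" here is a placeholder:
     * each input byte is widened to one char. Returns chars written.
     */
    static int encode(byte[] in, int inOff, int inLen, char[] out, int outOff) {
        for (int i = 0; i < inLen; i++) {
            out[outOff + i] = (char) (in[inOff + i] & 0xFF);
        }
        return inLen;
    }

    /**
     * Buffer-based method kept for consistency with the Charset API.
     * It requires backing arrays and defers to the array-based method.
     */
    static void encode(ByteBuffer in, CharBuffer out) {
        if (!in.hasArray() || !out.hasArray()) {
            throw new IllegalArgumentException("both buffers must have backing arrays");
        }
        if (out.remaining() < in.remaining()) {
            throw new BufferOverflowException();
        }
        int written = encode(
                in.array(), in.arrayOffset() + in.position(), in.remaining(),
                out.array(), out.arrayOffset() + out.position());
        // Advance both buffers as a Charset-style encoder would.
        in.position(in.limit());
        out.position(out.position() + written);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, (byte) 0xFF};
        char[] chars = new char[3];
        // wrap() creates a view over the existing array; the array is not copied.
        encode(ByteBuffer.wrap(bytes), CharBuffer.wrap(chars));
        System.out.println((int) chars[2]); // prints 255
    }
}
```

With this shape, callers that already hold a char[]/byte[] can skip the wrap entirely, and callers that hold buffers pay only for the hasArray() checks and the offset arithmetic.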