[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779621#action_12779621
 ] 

Robert Muir commented on LUCENE-1799:
-------------------------------------

bq. ICU's API requires to use ByteBuffer and CharBuffer for input/output. And 
even if I missed some nice method, encoder/decoder operates internally on said 
buffers. Thus, a wrap/unwrap for each String is inevitable.

Earwin, at least in ICU trunk you have the following (public class) in 
com.ibm.icu.impl.BOCU: 

{code}
public static int compress(String source, byte buffer[], int offset)
public static int getCompressionLength(String source) 
...
{code}

But I think this class only supports encoding, not decoding (it is only used 
by the Collation API for so-called BOCSU).
For decoding, we might have to use the registered charset and ByteBuffer... 
unless there's another way.
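
A minimal sketch of that fallback route, using only the standard java.nio.charset API. The class name CharsetRoundTrip and its helper methods are hypothetical, not part of ICU or Lucene. It assumes that a charset provider on the classpath (for example ICU4J's charset module) registers the name "BOCU-1"; that registration is an assumption here, so the demo defaults to "UTF-8", which every JVM supports.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Hypothetical helper: round-trips a String through a registered charset.
public class CharsetRoundTrip {

    // Encode with a registered charset and copy out exactly the encoded bytes.
    public static byte[] encode(String source, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        ByteBuffer bb = cs.encode(CharBuffer.wrap(source));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    // Decode a slice of a byte array. ByteBuffer.wrap does not copy the
    // array, but the decoder still allocates a CharBuffer and a String.
    public static String decode(byte[] buffer, int offset, int length,
                                String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return cs.decode(ByteBuffer.wrap(buffer, offset, length)).toString();
    }

    public static void main(String[] args) {
        // Pass "BOCU-1" as the first argument only if a provider registers it
        // (assumption); otherwise Charset.forName throws
        // UnsupportedCharsetException.
        String name = args.length > 0 ? args[0] : "UTF-8";
        String text = "Unicode compression";
        byte[] bytes = encode(text, name);
        System.out.println(bytes.length + " bytes, round trip ok: "
                + text.equals(decode(bytes, 0, bytes.length, name)));
    }
}
```

This still wraps and unwraps a buffer per String, which is the overhead the quoted comment was worried about; it only shows that no ICU-internal class is needed for the decoding side.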

> Unicode compression
> -------------------
>
>                 Key: LUCENE-1799
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.4.1
>            Reporter: DM Smith
>            Priority: Minor
>
> In LUCENE-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.    

