[jira] Updated: (LUCENE-1799) Unicode compression

Uwe Schindler (JIRA) Wed, 21 Jul 2010 02:04:22 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uwe Schindler updated LUCENE-1799:
----------------------------------

    Attachment: LUCENE-1799.patch

A new patch that completely separates the BOCU factory from the implementation 
(which moves to common/miscellaneous). This has the following advantages:

- You can use any Charset to encode your terms. The javadocs only need to note 
that the byte[] order must be correct for range queries to work.
- Theoretically, you could remove the BOCU classes altogether: anyone who wants 
to use BOCU-1 can simply get the Charset from ICU's factory and pass it to the 
AttributeFactory. The convenience class is still useful, especially if we can 
later implement the encoding natively without NIO (when the patent issues are 
solved...).
- The test for the CustomCharsetTermAttributeFactory uses UTF-8 as the charset 
and verifies that the created BytesRefs have the same format as a BytesRef 
created using UnicodeUtil.
- The test also checks that encoding errors are bubbled up as RuntimeExceptions.
TODO:

- docs
- make handling of encoding errors configurable (replace with replacement char?)
- If you want your complete index in e.g. ISO-8859-1, there should also be 
convenience methods in the factory/attribute that take CharSequences/char[] and 
quickly convert strings to BytesRefs, as UnicodeUtil does. This would make it 
possible to create TermQueries directly using e.g. ISO-8859-1 encoding.
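
One possible shape for the last two TODO items, again as a JDK-only sketch and 
not as a proposal for the actual API: the class ConfigurableTermEncoder and its 
toBytes methods are hypothetical names, and byte[] stands in for BytesRef. The 
error handling is passed in as a CodingErrorAction, so a caller can choose 
between reporting errors and substituting the charset's replacement bytes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical convenience class for illustration; not part of the patch.
final class ConfigurableTermEncoder {
    private final Charset charset;
    private final CodingErrorAction onError;

    ConfigurableTermEncoder(Charset charset, CodingErrorAction onError) {
        this.charset = charset;
        this.onError = onError;
    }

    // Convenience entry point for CharSequences (e.g. Strings).
    byte[] toBytes(CharSequence text) {
        return encode(CharBuffer.wrap(text));
    }

    // Convenience entry point for a slice of a char[].
    byte[] toBytes(char[] chars, int offset, int length) {
        return encode(CharBuffer.wrap(chars, offset, length));
    }

    private byte[] encode(CharBuffer in) {
        // A fresh encoder per call keeps the sketch simple and thread-safe.
        CharsetEncoder encoder = charset.newEncoder()
            .onMalformedInput(onError)
            .onUnmappableCharacter(onError);
        try {
            ByteBuffer buf = encoder.encode(in);
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Only reachable when onError is CodingErrorAction.REPORT.
            throw new RuntimeException("Cannot encode term with " + charset, e);
        }
    }

    public static void main(String[] args) {
        ConfigurableTermEncoder lenient = new ConfigurableTermEncoder(
            StandardCharsets.ISO_8859_1, CodingErrorAction.REPLACE);
        // One byte per char in ISO-8859-1.
        System.out.println(lenient.toBytes("caf\u00E9").length);
        // The unmappable euro sign is replaced instead of raising an error.
        System.out.println(lenient.toBytes("\u20AC").length);
    }
}
```

The resulting bytes could then be wrapped into a term for a TermQuery; how 
that wrapping would look depends on the API the patch ends up with.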

> Unicode compression
> -------------------
>
>                 Key: LUCENE-1799
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.4.1
>            Reporter: DM Smith
>            Priority: Minor
>         Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In LUCENE-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature.
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used.
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed, an encoding such as ISO-8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


