[ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893368#action_12893368 ]

Robert Muir commented on LUCENE-1799:
-------------------------------------

bq. I have only been measuring performance at this point

You haven't really been measuring performance; you have just been trying to pick 
a fight.
# any difference in encode speed has almost no effect on indexing speed; like I said, it's 100 million strings in 4.3 seconds.
# you aren't factoring I/O or RAM into the equation for the writing systems (of which there are many) where this actually cuts terms to close to half their size.
# since this is a compression algorithm (and I'm still working on it), it's vital to include those things, and not post useless benchmarks about whether it takes 2.9 or 4.3 seconds to encode 100 million strings (see the sketch after this list), which nothing in Lucene can process in any short amount of time anyway.
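
To be concrete about why raw encode time is the wrong thing to stare at, here's a rough sketch of that kind of micro-benchmark (illustrative only; this is not the attached Benchmark.java, and the term and counts are made up). It times nothing but getBytes() in a tight loop, so it says nothing about I/O, RAM, or index size:

{code:java}
import java.nio.charset.Charset;

public class EncodeSpeedSketch {
  public static void main(String[] args) {
    Charset utf8 = Charset.forName("UTF-8");
    String term = "пример";   // sample non-Latin term (made up)
    int n = 10000000;         // fewer than 100 million, just to show the shape
    long totalBytes = 0;
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      // encode only: no disk, no index, nothing Lucene would actually spend its time on here
      totalBytes += term.getBytes(utf8).length;
    }
    long elapsedMs = (System.nanoTime() - start) / 1000000;
    System.out.println(n + " encodes in " + elapsedMs + " ms (" + totalBytes + " bytes produced)");
  }
}
{code}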

I have a benchmark for UTF-8, and it's this: I have a lot of text that is twice 
as big on disk, causes twice as much I/O, and eats up twice as much RAM as it 
should.
BOCU-1 fixes that, and at the same time keeps ASCII at a single-byte encoding 
(and other Latin-script languages stay very close to that), so everyone can 
potentially win.
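
A minimal sketch of what I mean, assuming the icu4j-charset jar is on the classpath (it exposes a BOCU-1 charset through CharsetProviderICU); the sample terms are made up:

{code:java}
import java.nio.charset.Charset;

import com.ibm.icu.charset.CharsetProviderICU;

public class SizeSketch {
  public static void main(String[] args) {
    Charset utf8 = Charset.forName("UTF-8");
    Charset bocu1 = new CharsetProviderICU().charsetForName("BOCU-1");

    String ascii = "compression";   // plain ASCII term
    String russian = "сжатие";      // Cyrillic term: two bytes per char in UTF-8

    // ASCII stays at one byte per character in both encodings.
    System.out.println("ascii:   utf-8=" + ascii.getBytes(utf8).length
        + " bocu-1=" + ascii.getBytes(bocu1).length);

    // For runs of Cyrillic, BOCU-1 spends roughly one byte per character
    // where UTF-8 spends two; that's the "twice as big" factor.
    System.out.println("russian: utf-8=" + russian.getBytes(utf8).length
        + " bocu-1=" + russian.getBytes(bocu1).length);
  }
}
{code}

The absolute numbers vary with the text, but the ratio is the point.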

> Unicode compression
> -------------------
>
>                 Key: LUCENE-1799
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.4.1
>            Reporter: DM Smith
>            Priority: Minor
>         Attachments: Benchmark.java, Benchmark.java, Benchmark.java, 
> LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed, an encoding such as ISO-8859-1 (or whatever covers the 
> input) could be used.

