[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Simon Willnauer (JIRA) Thu, 07 Jan 2010 11:31:19 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797758#action_12797758 ]


Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

Hi Yonik,

bq. It looks like it was committed as part of this issue, but I can't find any 
comments here about either the need to make a copy or the need to make a 
unmodifiable set.
Let me try to reconstruct the history. 
UnmodifiableCharArraySet was introduced with LUCENE-1688, as far as I recall, 
to replace the static string array (stopwords) in StopAnalyzer. 
During the refactoring / improvements in contrib/analyzers we decided to make 
analyzers and token filters immutable and to use CharArraySet wherever we can. 
To prevent a provided set from being modified while it is in use in a filter, 
the given set is copied and wrapped in an immutable CharArraySet instance. At 
the same time (still ongoing) we are converting every set that is likely to be 
used in a TokenFilter into a CharArraySet. WordlistLoader is not done yet but 
is on the list; the plan is to change the return types from HashSet<?> to 
Set<?> and create CharArraySet instances internally. 
With LUCENE-2034 we introduced StopwordAnalyzerBase, which also uses the 
UnmodifiableCharArraySet with a copy of the given set.
Copying a CharArraySet is very fast even for large sets, and creating an 
UnmodifiableCharArraySet from a CharArraySet instance is basically just an 
object creation. The reason, again, is to prevent any modification of those 
sets while they are in use.
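
To illustrate the copy-then-wrap idea, here is a minimal sketch that uses only 
plain java.util collections as a stand-in. It is not the Lucene code: 
StopFilterSketch and its methods are hypothetical names, and a HashSet<String> 
takes the place of CharArraySet.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a filter that holds a stopword set.
public class StopFilterSketch {
    private final Set<String> stopWords;

    public StopFilterSketch(Set<String> given) {
        // Copy first, so later changes to the caller's set are not seen here.
        Set<String> copy = new HashSet<String>(given);
        // Then wrap, so the filter's own view cannot be modified either.
        this.stopWords = Collections.unmodifiableSet(copy);
    }

    public boolean isStopWord(String term) {
        return stopWords.contains(term);
    }

    // Safe to hand out: mutators throw UnsupportedOperationException.
    public Set<String> stopWords() {
        return stopWords;
    }
}
```

With this shape, adding a word to the caller's set after construction does not 
change what the filter treats as a stopword, and the wrapping step allocates 
only one small wrapper object.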

bq. This new behavior also no longer matches the javadoc for the constructor. 
I agree we should adjust the javadoc for ctors expecting stopwords to reflect 
the behavior.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
> LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, 
> LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a 
> result, String / char[] entries with Unicode 4 chars that are in the set 
> cannot be retrieved in "ignorecase" mode.
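
A small sketch of the underlying problem, assuming the issue concerns 
supplementary characters (code points above U+FFFF, stored as surrogate pairs 
in a char[]). It is not the CharArraySet code; LowerCaseSketch and its methods 
are hypothetical names, and only standard java.lang.Character calls are used. 
Lowercasing char by char leaves the surrogate halves untouched, while 
lowercasing by code point maps the character.

```java
public class LowerCaseSketch {
    // Lowercases one char at a time; surrogate halves pass through unchanged.
    public static String lowerByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Lowercases whole code points, so surrogate pairs are handled as one unit.
    public static String lowerByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For example, U+10400 (DESERET CAPITAL LETTER LONG I) has the lowercase form 
U+10428; only the code-point variant produces it.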

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
