Che Dong wrote:
> 1. custom sorting beside default score sorting: make docID alias one field you need
> output sorting
> solved by sort data before indexing (example sorted by field PostDate), so docID can
> be an alias to the sort field. if we make hitCollector
> sort with docID or 1/docID or even complex strategy (docID * score)...
>
> http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=115469
> IndexOrderSearcher: sort data before indexing and use 1/docID instead of score

That's an interesting approach. I don't recall ever seeing this message when it was originally posted. Sorry.

I had imagined instead adding this functionality to Hits.java. A separate Searcher implementation would let folks use MultiSearcher to combine results from an IndexSearcher and an IndexOrderSearcher, which would not make sense. If the functionality instead resides in Hits.java, then it could not be misused in this way.

So the way I was going to do it was to add something to Hits.java like:

  public static final int ORDER_BY_SCORE = 0;
  public static final int ORDER_BY_DOC_NUM = 1;
  public void setHitOrdering(int order);

If ORDER_BY_SCORE is specified then Hits would work as it does now. This would be the default. But when ORDER_BY_DOC_NUM is specified then Hits.java would use a HitCollector to implement this ordering.

> 2. CJK support:
> 2.1 sigram based (no word segment just use one character as a token):
> modified from StandardTokenizer.java
>
> http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=330905
> CJKTokenizer for Asia language(Chinese Japanese Korean) Word Segment
>
> http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=450266
> StandardTokenizer with sigram based CJK Support
>
> 2.2 bigram based word segment: modified from SimpleTokenizer to CJKTokenizer.java
> http://www.mail-archive.com/[email protected]/msg01220.html

I think it would be great to have some support for Asian languages built into Lucene. Which of these approaches do you think is best? I like the idea of a StandardTokenizer or SimpleTokenizer that automatically provides this via bigrams.

What do others think?

Doug
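
P.S. To make the ordering idea concrete, here is a rough, self-contained sketch of what switching between score order and document-number order might look like. It deliberately does not use any Lucene classes: `HitOrderingSketch`, its nested `Hit` class, and the `order` method are hypothetical names invented for illustration, and a real version inside Hits.java would presumably gather the hits through a HitCollector rather than sort a list after the fact. It also assumes a modern Java with lambdas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these names exist in Lucene.
final class HitOrderingSketch {
    static final int ORDER_BY_SCORE = 0;
    static final int ORDER_BY_DOC_NUM = 1;

    // Hypothetical stand-in for one search hit.
    static final class Hit {
        final int doc;      // document number assigned at index time
        final float score;  // relevance score for the query

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns a sorted copy of the hits; the input list is left untouched.
    static List<Hit> order(List<Hit> hits, int order) {
        List<Hit> sorted = new ArrayList<>(hits);
        if (order == ORDER_BY_DOC_NUM) {
            // Ascending document number, i.e. the order documents were indexed in.
            sorted.sort((a, b) -> Integer.compare(a.doc, b.doc));
        } else {
            // Descending score, ties broken by ascending document number.
            sorted.sort((a, b) -> {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
            });
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(7, 0.9f), new Hit(2, 0.4f), new Hit(5, 0.9f));
        for (Hit h : order(hits, ORDER_BY_DOC_NUM)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

If the documents were added to the index already sorted by PostDate, as Che Dong describes, then ascending document number is the same as ascending PostDate, so no date field has to be read at search time.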

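
P.P.S. For anyone unfamiliar with the bigram approach, the following is a minimal sketch of the general technique, not the code from any of the messages linked above. Each run of CJK characters is emitted as overlapping two-character tokens, while other text is split into ordinary words. The class name `CjkBigramSketch` and both methods are made up for this example, and the script-based `isCjk` test is a simplifying assumption. It is not wired into Lucene's Tokenizer API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a standalone bigram splitter, not a Lucene Tokenizer.
final class CjkBigramSketch {

    // Rough CJK test by Unicode script (an assumption of this sketch).
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript s = Character.UnicodeScript.of(codePoint);
        return s == Character.UnicodeScript.HAN
                || s == Character.UnicodeScript.HIRAGANA
                || s == Character.UnicodeScript.KATAKANA
                || s == Character.UnicodeScript.HANGUL;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isCjk(cp)) {
                // Collect the whole run of CJK characters, one string per character.
                List<String> run = new ArrayList<>();
                while (i < text.length() && isCjk(text.codePointAt(i))) {
                    cp = text.codePointAt(i);
                    run.add(new String(Character.toChars(cp)));
                    i += Character.charCount(cp);
                }
                if (run.size() == 1) {
                    tokens.add(run.get(0)); // a lone character stands as its own token
                }
                // Overlapping bigrams: c1c2, c2c3, c3c4, ...
                for (int k = 0; k + 1 < run.size(); k++) {
                    tokens.add(run.get(k) + run.get(k + 1));
                }
            } else if (Character.isLetterOrDigit(cp)) {
                // Non-CJK word: consume letters and digits up to the next break.
                int start = i;
                while (i < text.length()) {
                    cp = text.codePointAt(i);
                    if (isCjk(cp) || !Character.isLetterOrDigit(cp)) {
                        break;
                    }
                    i += Character.charCount(cp);
                }
                tokens.add(text.substring(start, i));
            } else {
                i += Character.charCount(cp); // skip whitespace and punctuation
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source stays ASCII-only.
        System.out.println(tokenize("Lucene \u4e2d\u6587\u641c\u7d22 ok"));
    }
}
```

The appeal of bigrams over single-character tokens is that a two-character query only matches documents where those characters are adjacent, without needing a dictionary to segment words. The cost is a larger index, since a run of n characters yields n-1 tokens drawn from a much larger term space.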