Che Dong wrote:
> 1. Custom sorting besides the default score sorting: make the docID an alias
> for the field you need to sort output by.
> This is solved by sorting the data before indexing (for example, sorted by the
> field PostDate), so the docID becomes an alias for the sort field. Then a
> HitCollector can sort by docID, 1/docID, or even a more complex strategy (docID * score)...
> 
> http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=115469
> IndexOrderSearcher: sort data before indexing and use 1/docID instead of score

That's an interesting approach.  I don't recall ever seeing this message 
when it was originally posted.  Sorry.

I had imagined instead adding this functionality to Hits.java.  Having 
a different Searcher implementation makes it possible for folks to use 
MultiSearcher to combine results from an IndexSearcher and an 
IndexOrderSearcher, which would not make sense.  If the functionality 
instead resides in Hits.java, then it could not be misused in this way.

So the way I was going to do it was to add something to Hits.java like:
   public static final int ORDER_BY_SCORE = 1;
   public static final int ORDER_BY_DOC_NUM = 2;
   public void setHitOrdering(int order);

If ORDER_BY_SCORE is specified then Hits would work as it does now.  This 
would be the default.  But when ORDER_BY_DOC_NUM is specified then 
Hits.java would use a HitCollector to implement this ordering.
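To make the idea concrete, here is a rough sketch of the kind of HitCollector 
Hits.java could use internally for ORDER_BY_DOC_NUM.  The class name 
DocOrderCollector and the printHits() helper are only illustrative; this is not 
the actual patch, just one way it might look:

   import org.apache.lucene.search.HitCollector;
   import java.util.Iterator;
   import java.util.Map;
   import java.util.TreeMap;

   public class DocOrderCollector extends HitCollector {
     // TreeMap keeps entries sorted by key, i.e. by document number.
     private final Map hits = new TreeMap();

     public void collect(int doc, float score) {
       hits.put(new Integer(doc), new Float(score));
     }

     // Walk the collected hits in increasing doc-number order.
     public void printHits() {
       for (Iterator i = hits.entrySet().iterator(); i.hasNext();) {
         Map.Entry e = (Map.Entry) i.next();
         System.out.println("doc=" + e.getKey() + " score=" + e.getValue());
       }
     }
   }

One would pass such a collector to Searcher.search(Query, HitCollector) and 
then read the hits back out in document order, without needing a separate 
Searcher implementation.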

> 2. CJK support:
>     2.1 unigram based (no word segmentation, just one character per token):
>     modified from StandardTokenizer.java
>     http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=330905
>     CJKTokenizer for Asian languages (Chinese, Japanese, Korean) word segmentation
>     http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=450266
>     StandardTokenizer with unigram-based CJK support
> 
>     2.2 bigram-based word segmentation: modified from SimpleTokenizer into CJKTokenizer.java
>     http://www.mail-archive.com/[email protected]/msg01220.html

I think it would be great to have some support for Asian languages built 
into Lucene.  Which of these approaches do you think is best?  I like 
the idea of a StandardTokenizer or SimpleTokenizer that automatically 
provides this via bigrams.  What do others think?
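For anyone unfamiliar with the bigram approach, the idea is simply to emit 
each overlapping pair of adjacent CJK characters as a token, so no dictionary 
or segmenter is needed.  A toy illustration (not taken from either patch):

   public class BigramDemo {
     public static void main(String[] args) {
       // A four-character Chinese run; in a real tokenizer this would be a
       // maximal run of CJK characters found in the input stream.
       String run = "\u4e2d\u6587\u5206\u8bcd";
       // Emit overlapping bigrams: chars 0-1, 1-2, 2-3.
       for (int i = 0; i + 1 < run.length(); i++) {
         System.out.println("token: " + run.substring(i, i + 2));
       }
     }
   }

Queries over the same text get tokenized the same way, so matching works 
without language-specific word segmentation, at the cost of a somewhat larger 
index and occasional false matches across word boundaries.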

Doug


