Re: indexing help

Grant Ingersoll Thu, 08 Jul 2004 05:51:26 -0700

Hi John,

The source code is available from CVS, so you can make the class non-final and do 
what you need to do.  Of course, you may have a hard time finding help later if you 
aren't using what everyone else is and your solution doesn't work...  :-)


If I understand correctly what you are trying to do, you already know all of the 
answers for indexing and just want Lucene to handle the retrieval side of the coin, 
correct?  I suppose a crazy idea might be to write a program that takes your info and 
outputs it in the Lucene file format, but that seems a bit like overkill.

-Grant

>>> [EMAIL PROTECTED] 07/07/04 07:37PM >>>
Hi Doug:
     Thanks for the response!

     The solution you proposed is still a variation on creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens, whereas only 2 are
necessary.

    Given many documents with many terms and frequencies, it would
create many extra Token instances.
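
To make the count concrete, here is a minimal, self-contained sketch. It does
not use Lucene at all: RepeatingTermStream is a hypothetical stand-in that
mimics the intended emission behavior of the proposed VectorTokenStream,
returning plain strings instead of Token objects. The drain helper is also
an assumption added for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed VectorTokenStream (no Lucene dependency).
// It emits each term as many times as its frequency, one item per call to next().
class RepeatingTermStream {
    private final String[] terms;
    private final int[] freqs;
    private int term = -1; // index of the current term; -1 before the first call
    private int freq = 0;  // copies of the current term still to emit

    RepeatingTermStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
    }

    // Returns the next term, or null once every term has been emitted freq times.
    String next() {
        while (freq == 0) { // advance to the next term, skipping zero frequencies
            term++;
            if (term >= terms.length) {
                return null;
            }
            freq = freqs[term];
        }
        freq--;
        return terms[term];
    }

    // Drains a stream and returns everything it emitted, in order.
    static List<String> drain(RepeatingTermStream stream) {
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }
}
```

Draining a stream built from {"java", "lucene"} and {5, 6} yields 5 + 6 items,
one per occurrence, even though there are only 2 distinct terms. In the real
TokenStream each of those items would be a separate Token instance.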

   The reason I was looking into deriving from the Field class is that I
could then directly manipulate the FieldInfo by setting the frequency. But
the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote:
> John Wang wrote:
> >      When Lucene tokenizes the words in a document, it counts the
> > frequencies and figures out the positions. We are trying to bypass this
> > stage: for each document, I have a set of words with a known frequency,
> > e.g. java (5), lucene (6) etc. (I don't care about the position, so it
> > can always be 0.)
> >
> >      What I can do now is to create a dummy document, e.g. "java java
> > java java java lucene lucene lucene lucene lucene lucene" and pass it to
> > Lucene.
> >
> >      This seems hacky and cumbersome. Is there a better alternative? I
> > browsed around in the source code, but couldn't find anything.
> 
> Write an analyzer that returns terms with the appropriate distribution.
> 
> For example:
> 
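> public class VectorTokenStream extends TokenStream {
>   private final String[] terms;
>   private final int[] freqs;
>   private int term = -1;  // index of the current term; -1 before the first
>   private int freq = 0;   // copies of the current term still to emit
>   public VectorTokenStream(String[] terms, int[] freqs) {
>     this.terms = terms;
>     this.freqs = freqs;
>   }
>   public Token next() {
>     while (freq == 0) {   // advance, skipping any zero-frequency terms
>       term++;
>       if (term >= terms.length)
>         return null;
>       freq = freqs[term];
>     }
>     freq--;
>     return new Token(terms[term], 0, 0);
>   }
> }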
> 
> Document doc = new Document();
> doc.add(Field.Text("content", ""));
> indexWriter.addDocument(doc, new Analyzer() {
>   public TokenStream tokenStream(String field, Reader reader) {
>     return new VectorTokenStream(new String[] {"java","lucene"},
>                                  new int[] {5,6});
>   }
> });
> 
> >       Too bad the Field class is final, otherwise I could derive from it
> > and do something along those lines...
> 
> Extending Field would not help.  That's why it's final.
> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED] 
> For additional commands, e-mail: [EMAIL PROTECTED] 
> 
>

