On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:

On Nov 26, 2006, at 8:57 AM, jm wrote:

> I tested this. I use a single static analyzer for all my documents,
> and the caching analyzer was not working properly. I had to add a
> method to clear the cache each time a new document was to be indexed,
> and then it worked as expected. I have never looked into Lucene's inner
> workings, so I am not sure whether what I did is correct.

Makes sense, I've now incorporated that as well by adding a clear()
method and extracting the functionality into a public class
AnalyzerUtil.TokenCachingAnalyzer.
Yes, same here; I could have posted my code, sorry, but I was not sure
it was even correct...
When there is a new Lucene 2.1 or whatever, I'll incorporate that
optimization into my code. Thanks.
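
For reference, this is roughly the per-document loop I have in mind once
I pick it up again. It is untested; only the AnalyzerUtil.TokenCachingAnalyzer
name and its clear() method are confirmed above, and the constructor plus the
surrounding variables (myAnalyzer, writer, documents) are just placeholders:

    // Rough sketch, untested against trunk. Only the class name and clear()
    // are confirmed above; the constructor and the variables myAnalyzer,
    // writer and documents are placeholders. Whether the cache is actually
    // hit also depends on the caveats in the javadoc quoted further down.
    AnalyzerUtil.TokenCachingAnalyzer caching =
        new AnalyzerUtil.TokenCachingAnalyzer(myAnalyzer);

    for (Iterator iter = documents.iterator(); iter.hasNext(); ) {
      Document doc = (Document) iter.next();
      caching.clear();                     // forget the previous document's tokens
      writer.addDocument(doc, caching);    // analyzes once, filling the cache
      MemoryIndex mem = createMemoryIndex(doc, caching); // replays cached tokens
      // ... run the queries against mem, as in the example further down ...
    }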


>
> I also had to comment out some code because I merged the memory stuff
> from trunk with Lucene 2.0.
>
> Performance was certainly much better (4 times faster in my very rough
> testing), but for my processing that operation is only a very small
> part, so I will keep the original way, without caching the tokens, just
> to be able to use unmodified Lucene 2.0. I found a data problem in my
> tests, but as I was not going to pursue that improvement for now, I did
> not look into it.

Ok.
Wolfgang.

>
> thanks,
> javier
>
> On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> Out of interest, I've checked an implementation of something like
>> this into AnalyzerUtil SVN trunk:
>>
>>    /**
>>     * Returns an analyzer wrapper that caches all tokens generated by
>>     * the underlying child analyzer's token stream, and delivers those
>>     * cached tokens on subsequent calls to
>>     * <code>tokenStream(String fieldName, Reader reader)</code>.
>>     * <p>
>>     * This can help improve performance in the presence of expensive
>>     * Analyzer / TokenFilter chains.
>>     * <p>
>>     * Caveats:
>>     * 1) Caching only works if the equals() and hashCode() methods are
>>     * properly implemented on the Reader passed to
>>     * <code>tokenStream(String fieldName, Reader reader)</code>.
>>     * 2) Caching the tokens of large Lucene documents can lead to
>>     * out-of-memory exceptions.
>>     * 3) The Token instances delivered by the underlying child analyzer
>>     * must be immutable.
>>     *
>>     * @param child
>>     *            the underlying child analyzer
>>     * @return a new analyzer
>>     */
>>    public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }
>>
>>
>> Check it out, and let me know if this is close to what you had in
>> mind.
>>
>> Wolfgang.
>>
>> On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
>>
>> > I've never tried it, but I guess you could write an Analyzer and
>> > TokenFilter that not only feed into IndexWriter on
>> > IndexWriter.addDocument(), but as a sneaky side effect also
>> > simultaneously save their tokens into a list, so that you could later
>> > turn that list into another TokenStream to be added to MemoryIndex.
>> > How much this might help depends on how expensive your analyzer
>> > chain is. For some examples of how to set up analyzers for chains
>> > of token streams, see MemoryIndex.keywordTokenStream and class
>> > AnalyzerUtil in the same package.
>> >
>> > Wolfgang.
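
A rough, untested sketch of the kind of token-capturing filter described
just above, written against the Lucene 2.0 TokenStream API; the class name
TokenCapturingFilter and the capturedTokens() helper are made up here:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Passes tokens through to the consumer while remembering them. */
    public class TokenCapturingFilter extends TokenFilter {
      private final List tokens = new ArrayList();   // captured Token instances

      public TokenCapturingFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        Token t = input.next();            // pull from the child stream
        if (t != null) tokens.add(t);      // sneaky side effect: remember it
        return t;
      }

      /** Replays the captured tokens, e.g. for MemoryIndex.addField(). */
      public TokenStream capturedTokens() {
        final Iterator iter = tokens.iterator();
        return new TokenStream() {
          public Token next() {
            return iter.hasNext() ? (Token) iter.next() : null;
          }
        };
      }
    }

As in caveat 3 of the javadoc above, this assumes the child analyzer delivers
immutable Token instances that are safe to replay later.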
>> >
>> > On Nov 22, 2006, at 4:15 AM, jm wrote:
>> >
>> >> checking one last thing, just in case...
>> >>
>> >> As I mentioned, I have previously indexed the same document in
>> >> another index (for another purpose). Since I am going to use the
>> >> same analyzer, would it be possible to avoid analyzing the doc again?
>> >>
>> >> I see IndexWriter.addDocument() returns void, so there does not seem
>> >> to be an easy way to do that, no?
>> >>
>> >> thanks
>> >>
>> >> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>> >>>
>> >>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
>> >>> > enough I will explore the other options then.
>> >>>
>> >>> To get started you can use something like this:
>> >>>
>> >>> for each document D:
>> >>>     MemoryIndex index = createMemoryIndex(D, ...)
>> >>>     for each query Q:
>> >>>         float score = index.search(Q)
>> >>>         if (score > 0.0) System.out.println("it's a match");
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>    private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
>> >>>      MemoryIndex index = new MemoryIndex();
>> >>>      Enumeration iter = doc.fields();
>> >>>      while (iter.hasMoreElements()) {
>> >>>        Field field = (Field) iter.nextElement();
>> >>>        index.addField(field.name(), field.stringValue(), analyzer);
>> >>>      }
>> >>>      return index;
>> >>>    }
>> >>>
>> >>>
>> >>>
>> >>> >
>> >>> >
>> >>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> >>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>> >>> >>
>> >>> >> > Hi,
>> >>> >> >
>> >>> >> > I have to decide between using a RAMDirectory and MemoryIndex,
>> >>> >> > but I am not sure what approach will work better...
>> >>> >> >
>> >>> >> > I have to run many items (tens of thousands) against some
>> >>> >> > queries (100 at most), but I have to do it one item at a time.
>> >>> >> > And I already have the Lucene Document associated with each
>> >>> >> > item, from a previous operation I perform.
>> >>> >> >
>> >>> >> > From what I read, MemoryIndex should be faster, but apparently
>> >>> >> > I cannot reuse the document I already have, and I have to
>> >>> >> > create a new MemoryIndex per item.
>> >>> >>
>> >>> >> A MemoryIndex object holds one document.
>> >>> >>
>> >>> >> > Using the RAMDirectory I can use only one of them, also one
>> >>> >> > IndexWriter, and create an IndexSearcher and IndexReader per
>> >>> >> > item, for searching and removing the item each time.
>> >>> >> >
>> >>> >> > Any thoughts?
>> >>> >>
>> >>> >> The MemoryIndex impl is optimized to work efficiently without
>> >>> >> reusing the MemoryIndex object for a subsequent document. See
>> >>> >> the source code. Reusing the object would not further improve
>> >>> >> performance.
>> >>> >>
>> >>> >> Wolfgang.