On 11/27/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
On Nov 26, 2006, at 8:57 AM, jm wrote:

> I tested this. I use a single static analyzer for all my documents,
> and the caching analyzer was not working properly. I had to add a
> method to clear the cache each time a new document was to be indexed,
> and then it worked as expected. I have never looked into Lucene's inner
> workings, so I am not sure if what I did is correct.

Makes sense. I've now incorporated that as well by adding a clear() method
and extracting the functionality into a public class
AnalyzerUtil.TokenCachingAnalyzer.
Yes, same here. I could have posted my code, sorry, but I was not sure if it
was even correct... When there is a new Lucene 2.1 or whatever, I'll
incorporate that optimization into my code. Thanks.
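To make the idea concrete, here is a rough self-contained sketch of such a
wrapper (illustrative only, not the actual AnalyzerUtil.TokenCachingAnalyzer
code, and written against the Lucene 2.0 Analyzer/TokenStream API): it caches
the tokens produced for each field and exposes a clear() method to call before
analyzing the next document.

    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Illustrative sketch only; the class name and details are not from the
    // actual AnalyzerUtil code checked into SVN trunk.
    public class CachingAnalyzerSketch extends Analyzer {

      private final Analyzer child;
      // field name -> list of Tokens produced for that field
      private final Map cache = new HashMap();

      public CachingAnalyzerSketch(Analyzer child) {
        this.child = child;
      }

      // Call before indexing the next document, as discussed in the thread.
      public void clear() {
        cache.clear();
      }

      public TokenStream tokenStream(String fieldName, Reader reader) {
        ArrayList tokens = (ArrayList) cache.get(fieldName);
        if (tokens == null) {
          // first time we see this field: analyze and remember the tokens
          final ArrayList collected = new ArrayList();
          cache.put(fieldName, collected);
          final TokenStream stream = child.tokenStream(fieldName, reader);
          return new TokenStream() {
            public Token next() throws IOException {
              Token token = stream.next();
              if (token != null) collected.add(token);
              return token;
            }
          };
        }
        // later calls: replay the cached tokens without re-analyzing
        final Iterator iter = tokens.iterator();
        return new TokenStream() {
          public Token next() {
            return iter.hasNext() ? (Token) iter.next() : null;
          }
        };
      }
    }

Note that the real AnalyzerUtil version caches per Reader (hence the
equals()/hashCode() caveat quoted below), whereas this simplified sketch keys
only on the field name, which is why clear() must be called between documents.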
>
> I also had to comment some code because I merged the memory stuff from
> trunk with lucene 2.0.
>
> Performance was certainly much better (4 times faster in my very gross
> testing), but for my processing that operation is only a very small part,
> so I will keep the original way, without caching the tokens, just to
> be able to use the unmodified lucene 2.0. I found a data problem in
> my tests, but as I was not going to pursue that improvement for now I
> did not look into it.

Ok.

Wolfgang.

>
> thanks,
> javier
>
> On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> Out of interest, I've checked an implementation of something like
>> this into AnalyzerUtil SVN trunk:
>>
>> /**
>>  * Returns an analyzer wrapper that caches all tokens generated by
>>  * the underlying child analyzer's token stream, and delivers those
>>  * cached tokens on subsequent calls to
>>  * <code>tokenStream(String fieldName, Reader reader)</code>.
>>  * <p>
>>  * This can help improve performance in the presence of expensive
>>  * Analyzer / TokenFilter chains.
>>  * <p>
>>  * Caveats:
>>  * 1) Caching only works if the equals() and hashCode() methods are
>>  *    properly implemented on the Reader passed to
>>  *    <code>tokenStream(String fieldName, Reader reader)</code>.
>>  * 2) Caching the tokens of large Lucene documents can lead to out
>>  *    of memory exceptions.
>>  * 3) The Token instances delivered by the underlying child
>>  *    analyzer must be immutable.
>>  *
>>  * @param child
>>  *          the underlying child analyzer
>>  * @return a new analyzer
>>  */
>> public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }
>>
>> Check it out, and let me know if this is close to what you had in mind.
>>
>> Wolfgang.
>>
>> On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
>>
>> > I've never tried it, but I guess you could write an Analyzer and
>> > TokenFilter that not only feeds into IndexWriter on
>> > IndexWriter.addDocument(), but as a sneaky side effect also
>> > simultaneously saves its tokens into a list so that you could later
>> > turn that list into another TokenStream to be added to MemoryIndex.
>> > How much this might help depends on how expensive your analyzer
>> > chain is. For some examples on how to set up analyzers for chains
>> > of token streams, see MemoryIndex.keywordTokenStream and class
>> > AnalyzerUtil in the same package.
>> >
>> > Wolfgang.
>> >
>> > On Nov 22, 2006, at 4:15 AM, jm wrote:
>> >
>> >> checking one last thing, just in case...
>> >>
>> >> as I mentioned, I have previously indexed the same document in another
>> >> index (for another purpose); as I am going to use the same analyzer,
>> >> would it be possible to avoid analyzing the doc again?
>> >>
>> >> I see IndexWriter.addDocument() returns void, so there does not seem to
>> >> be an easy way to do that, no?
>> >>
>> >> thanks
>> >>
>> >> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>> >>>
>> >>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
>> >>> > enough I will explore the other options then.
>> >>>
>> >>> To get started you can use something like this:
>> >>>
>> >>> for each document D:
>> >>>   MemoryIndex index = createMemoryIndex(D, ...)
>> >>>   for each query Q:
>> >>>     float score = index.search(Q)
>> >>>     if (score > 0.0) System.out.println("it's a match");
>> >>>
>> >>>
>> >>> private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
>> >>>   MemoryIndex index = new MemoryIndex();
>> >>>   Enumeration iter = doc.fields();
>> >>>   while (iter.hasMoreElements()) {
>> >>>     Field field = (Field) iter.nextElement();
>> >>>     index.addField(field.name(), field.stringValue(), analyzer);
>> >>>   }
>> >>>   return index;
>> >>> }
>> >>>
>> >>>
>> >>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> >>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>> >>> >>
>> >>> >> > Hi,
>> >>> >> >
>> >>> >> > I have to decide between using a RAMDirectory and MemoryIndex, but
>> >>> >> > I am not sure which approach will work better...
>> >>> >> >
>> >>> >> > I have to run many items (tens of thousands) against some
>> >>> >> > queries (100 at most), but I have to do it one item at a time.
>> >>> >> > And I already have the lucene Document associated with each item,
>> >>> >> > from a previous operation I perform.
>> >>> >> >
>> >>> >> > From what I read MemoryIndex should be faster, but apparently I
>> >>> >> > cannot reuse the document I already have, and I have to create a
>> >>> >> > new MemoryIndex per item.
>> >>> >>
>> >>> >> A MemoryIndex object holds one document.
>> >>> >>
>> >>> >> > Using the RAMDirectory I can use only one of them, also one
>> >>> >> > IndexWriter, and create an IndexSearcher and IndexReader per item,
>> >>> >> > for searching and removing the item each time.
>> >>> >> >
>> >>> >> > Any thoughts?
>> >>> >>
>> >>> >> The MemoryIndex impl is optimized to work efficiently without reusing
>> >>> >> the MemoryIndex object for a subsequent document. See the source
>> >>> >> code. Reusing the object would not further improve performance.
>> >>> >>
>> >>> >> Wolfgang.
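For reference, the pseudocode and helper quoted above can be assembled into
something roughly like the following sketch (written against the Lucene
2.0-era contrib MemoryIndex API; the class and method names here are just
illustrative, not from the actual thread code):

    import java.util.Enumeration;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;

    public class MemoryIndexMatcher {

      // Builds a fresh single-document MemoryIndex from an existing Document,
      // as in the createMemoryIndex helper quoted above.
      private static MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
        MemoryIndex index = new MemoryIndex();
        Enumeration iter = doc.fields();
        while (iter.hasMoreElements()) {
          Field field = (Field) iter.nextElement();
          // stringValue() can be null for non-text fields; skip those
          if (field.stringValue() != null) {
            index.addField(field.name(), field.stringValue(), analyzer);
          }
        }
        return index;
      }

      // For each document, report which of the given queries match it:
      // one MemoryIndex per document, all queries run against it.
      public static void matchAll(List documents, List queries, Analyzer analyzer) {
        for (int d = 0; d < documents.size(); d++) {
          Document doc = (Document) documents.get(d);
          MemoryIndex index = createMemoryIndex(doc, analyzer);
          for (int q = 0; q < queries.size(); q++) {
            Query query = (Query) queries.get(q);
            float score = index.search(query);
            if (score > 0.0f) {
              System.out.println("doc " + d + " matches query " + q
                  + " (score=" + score + ")");
            }
          }
        }
      }
    }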
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]