Re: Existing Parsers
Hi Chris, I do not have stats, but I think the performance is reasonable. I use xpdf for PDF and wvWare for DOC. The size of my index is ~2GB (this is not limited to only PDF/DOC). To avoid memory problems, I have set an upper bound on the size of the documents that can be indexed. For example, in my case I do not index documents if the size is more than 4MB. You could try something like that. Thanks Regards, George --- Chris Fraschetti [EMAIL PROTECTED] wrote: Some of the tools listed use cmd line execs to output a doc of some sort to text, and then I grab the text and add it to a Lucene doc, etc etc... Any stats on the scalability of that? In large scale applications, I'm assuming this will cause some serious issues... anyone have any input on this? -Chris Fraschetti On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer [EMAIL PROTECTED] wrote: Honey George wrote: Hi, I know some of them. 1. PDF + http://www.pdfbox.org/ + http://www.foolabs.com/xpdf/download.html - I am using this and found it good. It even supports various languages. My dated experience from 2 years ago was that (the evil, native code) foolabs pdf parser was the best, but obviously things could have changed. http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html 2. Word + http://sourceforge.net/projects/wvware 3. Excel + http://www.jguru.com/faq/view.jsp?EID=1074230 -George --- [EMAIL PROTECTED] wrote: Anyone know of any reliable parsers out there for PDF, Word, Excel or PowerPoint? For PowerPoint it's not easy. I've been using this and it has worked fine until recently, but it now seems to sometimes go into an infinite loop on some recent PPTs. Native code and a package that seems to be dormant, but to some extent it does the job. The file ppthtml does the work. http://chicago.sourceforge.net/xlhtml
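A minimal sketch of the command-line extraction approach George and Chris describe, assuming xpdf's pdftotext is on the PATH and writes its output to stdout when given "-" as the output file; the field names and the 4MB cap are illustrative, not taken from the original posts:

import java.io.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PdfIndexer {
    private static final long MAX_SIZE = 4 * 1024 * 1024; // skip anything over 4 MB

    public static void indexPdf(IndexWriter writer, File pdf) throws IOException {
        if (pdf.length() > MAX_SIZE) {
            System.err.println("skipping oversized file: " + pdf);
            return;
        }
        // run the external converter and capture its stdout
        Process p = Runtime.getRuntime().exec(new String[] { "pdftotext", pdf.getPath(), "-" });
        BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
        StringBuffer text = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            text.append(line).append('\n');
        }
        in.close();
        Document doc = new Document();
        doc.add(Field.Keyword("path", pdf.getPath()));          // stored, not analyzed
        doc.add(Field.UnStored("contents", text.toString()));   // indexed, not stored
        writer.addDocument(doc);
    }
}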
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi Doug, you are absolutely right about the older version of the JDK: it is 1.3.1 (IBM). Unfortunately we cannot upgrade since we are bound to the IBM Portalserver 4 environment. Results: I patched Lucene 1.4.1; it has not improved much: after indexing 1897 objects the number of SegmentTermEnums is up to 17936. To be realistic: this is even a deterioration :((( My next check will be with JDK 1.4.2 for the test environment, but this can only be a reference run for now. Thanks, Daniel

Doug Cutting wrote: It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug

Daniel Taurat wrote: Okay, that (1.4rc3) worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel

Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel

Rupinder Singh Mazara schrieb: Hi all, I had a similar problem. I have a database of documents with 24 fields and an average content of 7K; with 16M+ records I had to split the job into slabs of 1M each and merge the resulting indexes. Submissions to our job queue looked like: java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and I still had an OutOfMemory exception. The solution I came up with was, after every 200K documents, to create a temp directory and merge them together. This was done for the first production run; updates are now being handled incrementally.

Exception in thread "main" java.lang.OutOfMemoryError
at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

-Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have 4Gb of physical memory on the system. But then: we have also experienced that the gc of the IBM JDK 1.3.1 that we use sometimes behaves strangely with too large a heap space anyway (the limit seems to be 1.2 Gb). I can say that gc is not collecting these objects, since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): no effect. regards, Daniel

Pete Lewis wrote: Hi all, Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously, therefore, if this is the case, by reducing the heap space you can improve performance and get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: Daniel Taurat [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure
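A rough sketch of the slab-and-merge strategy Rupinder describes: build several smaller on-disk indexes in batches and then merge them into one final index with IndexWriter.addIndexes(). The directory paths and class name are assumptions for illustration only:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SlabMerger {
    public static void merge(String finalPath, String[] slabPaths) throws IOException {
        Directory[] slabs = new Directory[slabPaths.length];
        for (int i = 0; i < slabPaths.length; i++) {
            slabs[i] = FSDirectory.getDirectory(slabPaths[i], false); // open existing slab index
        }
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(finalPath, true),
                                             new StandardAnalyzer(), true);
        writer.addIndexes(slabs); // merges every slab index into the new index
        writer.optimize();
        writer.close();
    }
}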
ANT +BUILD + LUCENE
Hi Guys, Apologies... The task for me is to build the index folder using Lucene and a simple build.xml for Ant. The problem: the same build.xml should be used for different OSes [Win / Linux]. The glitch is that the respective jar files, such as lucene-1.4.jar and the other jars, are not in the same directory on each OS. Also the input/output indexer paths for source/target may vary. Please, somebody help me :( with regards Karthik WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
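A minimal sketch of one way Ant itself can handle this: detect the platform with a condition, load a per-OS properties file that points at the local jar locations and index paths, and pass those to the indexer. The property names, file names and the lucene.Indexer class are placeholders, not Karthik's actual setup:

<project name="build-index" default="index">
  <!-- pick a per-OS properties file: build.windows.properties or build.unix.properties -->
  <condition property="platform" value="windows">
    <os family="windows"/>
  </condition>
  <property name="platform" value="unix"/>
  <property file="build.${platform}.properties"/>

  <!-- fallbacks if the properties file does not define them -->
  <property name="lucene.jar" value="lib/lucene-1.4.jar"/>
  <property name="src.dir"    value="docs"/>
  <property name="index.dir"  value="index"/>

  <path id="cp">
    <pathelement location="${lucene.jar}"/>
  </path>

  <target name="index">
    <java classname="lucene.Indexer" classpathref="cp" fork="true">
      <arg value="${src.dir}"/>
      <arg value="${index.dir}"/>
    </java>
  </target>
</project>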
RE: question on Hits.doc
Hi, I recently had the same kind of problem, but it was due to the way I was dealing with Hits. Obtaining a Hits object from a Query is very fast, but I was then looping over ALL the hits to retrieve information on the documents before displaying the result to the user. That was not necessary, because in my case the display of search results is paginated. Now I extract documents from Hits on demand (i.e. only the few I need to display a page of results). It's much better. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Saturday, 11 September 2004 00:20 To: [EMAIL PROTECTED] Subject: question on Hits.doc Hey guys, We were noticing some speed problems on our searches and, after adding some debug statements to the Lucene source code, we have determined that Hits.doc(x) is the problem. (BTW, we are using Lucene 1.2, with plans to upgrade.) It seems that retrieving the actual Document from the search is very slow. We think it might be our Message field, which stores a huge amount of text. We are currently running a test in which we won't store the Message field; however, I was wondering if any of you would know whether that could be the reason we're having the performance problems? If so, could anyone also please explain it? It seemed that we weren't having these performance problems before. Has anyone else experienced this? Our environment is NT 4, JDK 1.4.2, and PIIIs. I know that storing large text fields is not good practice; however, it held certain conveniences for us that I hope not to have to give up. Roy.
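A minimal sketch of the on-demand pagination described above: the Hits object is obtained once, and only the documents for the requested page are pulled out of it, instead of looping over every hit. The field name and page size are assumptions:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class HitsPager {
    public static void printPage(Hits hits, int page, int pageSize) throws IOException {
        int start = page * pageSize;
        int end = Math.min(start + pageSize, hits.length());
        for (int i = start; i < end; i++) {
            Document doc = hits.doc(i); // the stored fields are only loaded here
            System.out.println(hits.score(i) + "\t" + doc.get("title"));
        }
    }
}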
Re: OutOfMemory example
You should reuse your old index (as e.g. an application variable) unless it has changed - use getCurrentVersion to check the index for updates. This has come up before. John

Ji Kuhn wrote: Hi, I think I can reproduce a memory leak problem while reopening an index. The Lucene version tested is 1.4.1; version 1.4 final works OK. My JVM is:

$ java -version
java version "1.4.2_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)

The code you can test is below; there are only 3 iterations for me if I use -Xmx5m, the 4th fails. Jiri.

package test;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

/** Run this test with Lucene 1.4.1 and -Xmx5m */
public class ReopenTest {
    private static long mem_last = 0;

    public static void main(String[] args) throws IOException {
        Directory directory = create_index();
        for (int i = 1; i < 100; i++) {
            System.err.println("loop " + i);
            search_index(directory);
        }
    }

    private static void search_index(Directory directory) throws IOException {
        IndexReader reader = IndexReader.open(directory);
        Searcher searcher = new IndexSearcher(reader);
        print_mem("search 1");
        SortField[] fields = new SortField[2];
        fields[0] = new SortField("date", SortField.STRING, true);
        fields[1] = new SortField("id", SortField.STRING, false);
        Sort sort = new Sort(fields);
        TermQuery query = new TermQuery(new Term("text", "\"text 5\""));
        print_mem("search 2");
        Hits hits = searcher.search(query, sort);
        print_mem("search 3");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println("doc " + i + ": " + doc.toString());
        }
        print_mem("search 4");
        searcher.close();
        reader.close();
    }

    private static void print_mem(String log) {
        long mem_free = Runtime.getRuntime().freeMemory();
        long mem_total = Runtime.getRuntime().totalMemory();
        long mem_max = Runtime.getRuntime().maxMemory();
        long delta = (mem_last - mem_free) * -1;
        System.out.println(log + " = delta: " + delta + ", free: " + mem_free
            + ", used: " + (mem_total - mem_free) + ", total: " + mem_total + ", max: " + mem_max);
        mem_last = mem_free;
    }

    private static Directory create_index() throws IOException {
        print_mem("create 1");
        Directory directory = new RAMDirectory();
        Calendar c = Calendar.getInstance();
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        for (int i = 0; i < 365 * 30; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("date", df.format(new Date(c.getTimeInMillis()))));
            doc.add(Field.Keyword("id", "AB" + String.valueOf(i)));
            doc.add(Field.Text("text", "Tohle je text " + i));
            writer.addDocument(doc);
            c.add(Calendar.DAY_OF_YEAR, 1);
        }
        writer.optimize();
        System.err.println("index size: " + writer.docCount());
        writer.close();
        print_mem("create 2");
        return directory;
    }
}
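A minimal sketch of the reuse pattern John suggests: keep one IndexSearcher open (for example as an application-scoped variable) and reopen it only when IndexReader.getCurrentVersion() reports that the index has changed. The class and field names are illustrative assumptions:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class SearcherCache {
    private final Directory directory;
    private IndexSearcher searcher;
    private long version = -1;

    public SearcherCache(Directory directory) {
        this.directory = directory;
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
        long current = IndexReader.getCurrentVersion(directory);
        if (searcher == null || current != version) {
            if (searcher != null) {
                searcher.close(); // release the old reader before reopening
            }
            searcher = new IndexSearcher(IndexReader.open(directory));
            version = current;
        }
        return searcher;
    }
}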
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Okay, the reference test is done: on JDK 1.4.2, Lucene 1.4.1 really seems to run fine: just a moderate number of SegmentTermEnums that is controlled by gc (about 500 for the 1900 test objects). Daniel Taurat wrote: Hi Doug, you are absolutely right about the older version of the JDK: it is 1.3.1 (IBM). Unfortunately we cannot upgrade since we are bound to the IBM Portalserver 4 environment. Results: I patched Lucene 1.4.1; it has not improved much: after indexing 1897 objects the number of SegmentTermEnums is up to 17936. To be realistic: this is even a deterioration :((( My next check will be with JDK 1.4.2 for the test environment, but this can only be a reference run for now. Thanks, Daniel Doug Cutting wrote: It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug Daniel Taurat wrote: Okay, that (1.4rc3) worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel
RE: OutOfMemory example
I disagree, or I don't understand. I can change the code as shown below. Now I must reopen the index to see the changes, but the memory problem remains. I really don't know what I'm doing wrong, the code is so simple. Jiri.

...
public static void main(String[] args) throws IOException {
    Directory directory = create_index();
    for (int i = 1; i < 100; i++) {
        System.err.println("loop " + i + ", index version: " + IndexReader.getCurrentVersion(directory));
        search_index(directory);
        add_to_index(directory, i);
    }
}

private static void add_to_index(Directory directory, int i) throws IOException {
    IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
    Document doc = new Document();
    doc.add(Field.Keyword("date", df.format(new Date(System.currentTimeMillis()))));
    doc.add(Field.Keyword("id", "CD" + String.valueOf(i)));
    doc.add(Field.Text("text", "Tohle neni text " + i));
    writer.addDocument(doc);
    System.err.println("index size: " + writer.docCount());
    writer.close();
}
...

-Original Message- From: John Moylan [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 3:25 PM To: Lucene Users List Subject: Re: OutOfMemory example You should reuse your old index (as e.g. an application variable) unless it has changed - use getCurrentVersion to check the index for updates. This has come up before. John
Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628 You can close the index, but the garbage collector still needs to reclaim the memory, and it may be taking longer than your loop to do so. John Ji Kuhn wrote: I disagree, or I don't understand. I can change the code as shown below. Now I must reopen the index to see the changes, but the memory problem remains. I really don't know what I'm doing wrong, the code is so simple. Jiri.
Re: OutOfMemory example
I have a few comments regarding your code...
1. Why do you use RAMDirectory and not the hard disk?
2. As John said, you should reuse the index instead of creating it each time in the main function:
if (!indexExists(indexFile))
    writer = new IndexWriter(directory, new StandardAnalyzer(), true);
else
    writer = new IndexWriter(directory, new StandardAnalyzer(), false);
(in some cases indexExists can be as simple as verifying whether the file exists on the hard disk)
3. You iterate in a loop over 10,000 times and you create a lot of objects:
for (int i = 0; i < 365 * 30; i++) {
    Document doc = new Document();
    doc.add(Field.Keyword("date", df.format(new Date(c.getTimeInMillis()))));
    doc.add(Field.Keyword("id", "AB" + String.valueOf(i)));
    doc.add(Field.Text("text", "Tohle je text " + i));
    writer.addDocument(doc);
    c.add(Calendar.DAY_OF_YEAR, 1);
}
All the underlined lines of code create new objects, and all of them are kept in memory. This is a lot of memory allocated by this loop alone; I think you create more than 100,000 objects in it. What do you think? And none of them can be released (collected by gc) until you close the index writer. No one says that your code is complicated, but all programmers should understand that this is a poor design... And more than that, your information is kept in a RAMDirectory; when you close the writer you will still keep the information in memory... Sorry if I was too aggressive with my comments, but I cannot see what you were thinking when you wrote that code... If you are trying to make a test, then I suggest you replace the hard-coded 365 value with a variable, iterate over it, and test the power of your machine (PC + JVM) :)) I wish you luck, Sergiu Ji Kuhn wrote: I disagree, or I don't understand. I can change the code as shown below. Now I must reopen the index to see the changes, but the memory problem remains. I really don't know what I'm doing wrong, the code is so simple. Jiri.
RE: OutOfMemory example
Thanks for the bug id, it seems like my problem, and I have stand-alone code with main(). What about the slow garbage collector? That looks to me like a wrong suggestion. Let's change the code once again:

...
public static void main(String[] args) throws IOException, InterruptedException {
    Directory directory = create_index();
    for (int i = 1; i < 100; i++) {
        System.err.println("loop " + i + ", index version: " + IndexReader.getCurrentVersion(directory));
        search_index(directory);
        add_to_index(directory, i);
        System.gc();
        Thread.sleep(1000); // whatever value you want
    }
}
...

and in the 4th iteration java.lang.OutOfMemoryError appears again. Jiri. -Original Message- From: John Moylan [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 4:53 PM To: Lucene Users List Subject: Re: OutOfMemory example http://issues.apache.org/bugzilla/show_bug.cgi?id=30628 You can close the index, but the garbage collector still needs to reclaim the memory, and it may be taking longer than your loop to do so. John
RE: OutOfMemory example
You don't see the point of my post. I sent an application that anyone can run with only the Lucene jar and that produces an OutOfMemoryError in a deterministic way. That's all. Jiri. -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 5:16 PM To: Lucene Users List Subject: Re: OutOfMemory example I have a few comments regarding your code ...
force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote: Thanks for the bug id, it seems like my problem, and I have stand-alone code with main(). What about the slow garbage collector? That looks to me like a wrong suggestion. I've seen this written up before (JavaWorld?) as a way to more reliably force GC than a single System.gc() call. I think the 2nd gc() call is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep(100);
System.runFinalization();
Thread.sleep(100);
System.gc();

Let's change the code once again... and in the 4th iteration java.lang.OutOfMemoryError appears again. Jiri.
RE: force gc idiom - Re: OutOfMemory example
This doesn't work either! Let's concentrate on the first version of my code. I believe that the code should run endlessly (I have said it before: in version 1.4 final it does). Jiri. -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 5:34 PM To: Lucene Users List Subject: force gc idiom - Re: OutOfMemory example I've seen this written up before (JavaWorld?) as a way to more reliably force GC than a single System.gc() call. I think the 2nd gc() call is supposed to clean up junk from the runFinalization() call... System.gc(); Thread.sleep(100); System.runFinalization(); Thread.sleep(100); System.gc();
Re: OutOfMemory example
Then it's probably my mistake... I haven't read all the emails in the thread. So... your goal is to produce errors... I try to avoid them :)) All the best, Sergiu Ji Kuhn wrote: You don't see the point of my post. I sent an application that anyone can run with only the Lucene jar and that produces an OutOfMemoryError in a deterministic way. That's all. Jiri.
OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote: This doesn't work either! You're right. I'm running under JDK 1.5 and trying larger values for -Xmx, and it still fails. Running under (Borland's) OptimizeIt shows the number of Terms and TermInfos (both in org.apache.lucene.index) increasing every time through the loop, by several hundred instances each. I can trace through some Term instances on the reference graph of OptimizeIt, but it's unclear to me what's right. One *guess* is that maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the problem. Let's concentrate on the first version of my code. I believe that the code should run endlessly (I have said it before: in version 1.4 final it does). Jiri.
FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example
Just noticed something else suspicious. FieldSortedHitQueue has a field called Comparators, and it seems like things are never removed from it. Ji Kuhn wrote: This doesn't work either! Let's concentrate on the first version of my code. I believe that the code should run endlessly (I have said it before: in version 1.4 final it does). Jiri.
Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example
David Spencer wrote: Just noticed something else suspicious. FieldSortedHitQueue has a field called Comparators, and it seems like things are never removed from it. Replying to my own post... this could be the problem. If I put a print statement here in FieldSortedHitQueue, recompile, and run with the new jar, then I see Comparators.size() go up after every iteration through ReopenTest's loop, and the size() never goes down...

static Object store(IndexReader reader, String field, int type, Object factory, Object value) {
    FieldCacheImpl.Entry entry = (factory != null)
        ? new FieldCacheImpl.Entry(field, factory)
        : new FieldCacheImpl.Entry(field, type);
    synchronized (Comparators) {
        HashMap readerCache = (HashMap) Comparators.get(reader);
        if (readerCache == null) {
            readerCache = new HashMap();
            Comparators.put(reader, readerCache);
            System.out.println("*\t* NOW: " + Comparators.size());
        }
        return readerCache.put(entry, value);
    }
}

Ji Kuhn wrote: This doesn't work either! Let's concentrate on the first version of my code. I believe that the code should run endlessly (I have said it before: in version 1.4 final it does). Jiri.
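An illustration (not Lucene's actual code) of the leak pattern David is pointing at: a static cache keyed by IndexReader. With a plain HashMap, the map holds a strong reference to every reader ever cached, so closed readers and their per-reader entries can never be collected; keying the cache with a WeakHashMap lets entries drop out once the reader itself becomes unreachable:

import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

public class ReaderKeyedCache {
    // leak-prone: strong references keep every reader key alive forever
    private static final Map strongCache = new HashMap();

    // friendlier: an entry disappears once its reader key is unreachable
    private static final Map weakCache = new WeakHashMap();

    static Object store(Map cache, Object reader, Object key, Object value) {
        synchronized (cache) {
            Map readerCache = (Map) cache.get(reader);
            if (readerCache == null) {
                readerCache = new HashMap();
                cache.put(reader, readerCache);
            }
            return readerCache.put(key, value);
        }
    }
}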
SegmentReader - Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example
Another clue: the SegmentReaders are piling up too, which may be why the Comparators map is increasing in size, because SegmentReaders are the keys to Comparators... though again, I don't know enough about the Lucene internals to know which refs to SegmentReaders are valid and which ones may be causing this leak. David Spencer wrote: David Spencer wrote: Just noticed something else suspicious. FieldSortedHitQueue has a field called Comparators, and it seems like things are never removed from it. Replying to my own post... this could be the problem. If I put a print statement here in FieldSortedHitQueue, recompile, and run with the new jar, then I see Comparators.size() go up after every iteration through ReopenTest's loop, and the size() never goes down...
Re: OutOfMemory example
On Monday 13 September 2004 15:06, Ji Kuhn wrote: I think I can reproduce a memory leak problem while reopening an index. The Lucene version tested is 1.4.1; version 1.4 final works OK. My JVM is: Could you try with the latest Lucene version from CVS? I cannot reproduce your problem with that version (Sun's Java 1.4.2_03, Linux). Regards Daniel -- http://www.danielnaber.de
Re: OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example
David Spencer wrote: Ji Kuhn wrote: This doesn't work either! You're right. I'm running under JDK 1.5 and trying larger values for -Xmx, and it still fails. Running under (Borland's) OptimizeIt shows the number of Terms and TermInfos (both in org.apache.lucene.index) increasing every time through the loop, by several hundred instances each. Yes... I'm running into a similar situation on JDK 1.4.2 with Lucene 1.3... I used the JMP debugger and all my memory is taken by Terms and TermInfos... I can trace through some Term instances on the reference graph of OptimizeIt, but it's unclear to me what's right. One *guess* is that maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the problem. Kevin
Re: OutOfMemory example
Ji Kuhn wrote: Hi, I think I can reproduce a memory leak problem while reopening an index. The Lucene version tested is 1.4.1; version 1.4 final works OK. My JVM is: $ java -version java version "1.4.2_05" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04) Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode) The code you can test is below; there are only 3 iterations for me if I use -Xmx5m, the 4th fails. At least this test seems tied to the Sort API... I removed the sort under Lucene 1.3 and it worked fine... Kevin
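A small sketch of the isolation test Kevin describes: run the same query once with the Sort and once without it, and watch whether memory only grows on the sorted path. The field names follow the ReopenTest example earlier in the thread; everything else is an assumption:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class SortLeakCheck {
    public static void run(Directory dir, boolean useSort) throws IOException {
        IndexSearcher searcher = new IndexSearcher(dir);
        TermQuery query = new TermQuery(new Term("text", "text"));
        Hits hits = useSort
                ? searcher.search(query, new Sort(new SortField("date", SortField.STRING, true)))
                : searcher.search(query); // unsorted path for comparison
        System.out.println((useSort ? "sorted: " : "unsorted: ") + hits.length() + " hits");
        searcher.close();
    }
}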
Re: Addition to contributions page
On Friday 10 September 2004 15:48, Chas Emerick wrote: PDFTextStream should be added to the 'Document Converters' section, with this URL http://snowtide.com , and perhaps this heading: 'PDFTextStream -- PDF text and metadata extraction'. The 'Author' field should probably be left blank, since there's no single creator. I just added it. Regards Daniel -- http://www.danielnaber.de
Re: OutOfMemory example
Daniel Naber wrote: On Monday 13 September 2004 15:06, Ji Kuhn wrote: I think I can reproduce a memory leak problem while reopening an index. The Lucene version tested is 1.4.1; version 1.4 final works OK. My JVM is: Could you try with the latest Lucene version from CVS? I cannot reproduce your problem with that version (Sun's Java 1.4.2_03, Linux). I verified it with the latest Lucene code from CVS under Windows XP. Regards Daniel
Similarity score computation documentation
Hi, I was looking through the score computation when running a search, and I think there may be a discrepancy between what is _documented_ in the org.apache.lucene.search.Similarity class overview Javadocs and what actually occurs in the code. I believe the problem is only with the documentation. I'm pretty sure that there should be an idf^2 in the sum. Look at org.apache.lucene.search.TermQuery, the inner class TermWeight. You can see that first sumOfSquaredWeights() is called, followed by normalize(), during search. Further, the resulting value stored in the field "value" is set as the weightValue on the TermScorer. If we look at what happens in TermWeight, sumOfSquaredWeights() sets queryWeight to idf * boost. During normalize(), queryWeight is multiplied by the query norm, and value is set to queryWeight * idf == idf * boost * query norm * idf == idf^2 * boost * query norm. This becomes the weightValue in the TermScorer that is then multiplied with the appropriate tf, etc., values. The remaining terms in the Similarity description are properly appended. I also see that the queryNorm effectively cancels out one of the idfs (dimensionally, since it is 1 / the square root of a sum of squares of idfs), so the formula still ends up being roughly a TF-IDF formula. But the idf^2 should still be there, along with the expansion of queryNorm. Am I mistaken, or is the documentation off? Thanks for your help, -Ken
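In compressed form, the computation Ken walks through (paraphrasing the TermWeight steps as described in the post, not quoting the source):

sumOfSquaredWeights():  queryWeight = idf * boost; returns queryWeight^2
normalize(queryNorm):   queryWeight *= queryNorm
                        value = queryWeight * idf
                              = (idf * boost * queryNorm) * idf
                              = idf^2 * boost * queryNorm   <-- the weightValue handed to TermScorer

with queryNorm = 1 / sqrt(sum over query terms of (idf * boost)^2), so one factor of idf is effectively cancelled by the norm, but an idf^2 term (and the expansion of queryNorm) would still need to appear in the documented per-term formula.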