Re: Existing Parsers

2004-09-13 Thread Honey George
Hi Chris,
   I do not have stats, but I think the performance
is reasonable. I use xpdf for PDF & wvWare for DOC.
The size of my index is ~2GB (this is not limited to
only pdf & doc). To avoid memory problems, I have
set an upper bound on the size of the documents that
can be indexed. For example, in my case I do not index
documents if the size is more than 4MB. You could try
something like that.
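
A minimal sketch of such a size check (a hypothetical helper, not George's actual code; the 4MB limit is just the example value from above):

class SizeFilter {
    // Hypothetical guard: skip documents above a size limit before converting/indexing.
    static final long MAX_DOC_SIZE = 4L * 1024 * 1024; // e.g. 4MB

    static boolean shouldIndex(java.io.File f) {
        if (f.length() > MAX_DOC_SIZE) {
            System.err.println("Skipping oversized file: " + f.getPath());
            return false;
        }
        return true;
    }
}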

Thanks & Regards,
   George

 --- Chris Fraschetti [EMAIL PROTECTED] wrote: 
 Some of the tools listed use cmd-line execs to output a doc of some
 sort to text, and then I grab the text and add it to a Lucene doc,
 etc etc...
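 
 A minimal sketch of that exec-and-index approach, assuming xpdf's pdftotext is on the PATH; the field names here are just examples, not anyone's actual code:
 
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import java.io.*;
 
 public class PdfToDoc {
     static Document toDocument(File pdf) throws IOException, InterruptedException {
         // xpdf's pdftotext writes the extracted text to stdout when "-" is given as output
         Process p = Runtime.getRuntime().exec(
             new String[] { "pdftotext", pdf.getPath(), "-" });
         BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
         StringBuffer text = new StringBuffer();
         String line;
         while ((line = in.readLine()) != null) {
             text.append(line).append('\n');
         }
         p.waitFor();
         Document doc = new Document();
         doc.add(Field.Keyword("path", pdf.getPath()));   // stored, not tokenized
         doc.add(Field.Text("contents", text.toString())); // tokenized for search
         return doc;
     }
 }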
 
 Any stats on the scalability of that? In large scale
 applications, I'm
 assuming this will cause some serious issues...
 anyone have any input
 on this?
 
 -Chris Fraschetti
 
 
 On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer
 [EMAIL PROTECTED] wrote:
  Honey George wrote:
  
   Hi,
 I know some of them.
   1. PDF
+ http://www.pdfbox.org/
+ http://www.foolabs.com/xpdf/download.html
  - I am using this and found it good. It even
 supports
  
  My dated experience from 2 years ago was that (the
 evil, native code)
  foolabs pdf parser was the best, but obviously
 things could have changed.
  
 

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html
  
various languages.
   2. word
 + http://sourceforge.net/projects/wvware
   3. excel
 +
 http://www.jguru.com/faq/view.jsp?EID=1074230
  
   -George
--- [EMAIL PROTECTED] wrote:
  
  Anyone know of any reliable parsers out there
  for pdf, word, excel or powerpoint?
  
  For powerpoint it's not easy. I've been using this and it has worked
  fine until recently, and it seems to sometimes go into an infinite loop
  now on some recent PPTs. Native code and a package that seems to be
  dormant, but to some extent it does the job. The file ppthtml does the
  work.
  
  http://chicago.sourceforge.net/xlhtml



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-13 Thread Daniel Taurat
Hi Doug,
you are absolutely right about the older version of the JDK: it is 1.3.1 
(ibm).
Unfortunately we cannot upgrade since we are bound to IBM Portalserver 4 
environment.
Results:
I patched Lucene 1.4.1:
it has not improved much: after indexing 1897 objects, the number of
SegmentTermEnum instances is up to 17936.
To be realistic: This is even a deterioration :(((
My next check will be with a JDK1.4.2 for the test environment, but this 
can only be a reference run for now.

Thanks,
Daniel
Doug Cutting wrote:
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
is that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it 
works for you.

Doug
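
Roughly, the pattern under discussion looks like the following sketch (illustrative only, not the actual TermInfosReader source): a per-reader ThreadLocal caching one object per thread, which pre-1.4.2 JVMs were slow to reclaim.

class CachingReader {
    // If the JVM does not release the ThreadLocal entries when the owner is
    // collected (the pre-1.4.2 bug mentioned above), the cached objects pile up.
    private final ThreadLocal enumerators = new ThreadLocal();

    Object getEnumerator() {
        Object e = enumerators.get();
        if (e == null) {
            e = new Object();   // stands in for a cloned SegmentTermEnum
            enumerators.set(e);
        }
        return e;
    }
}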
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before 
FieldCache was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all

 I had a similar problem: I have a database of documents with 24
fields and an average content of 7K, with 16M+ records.

 I had to split the jobs into slabs of 1M each and merge the
resulting indexes; submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22

and I still had an OutOfMemory exception. The solution I created
was, after every 200K documents, to create a temp directory and
merge them together; this was done for the first production run,
and updates are now being handled incrementally. (A sketch of the
merge step follows the stack trace below.)

 

Exception in thread "main" java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)
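
A minimal sketch of the slab-and-merge idea described above (the directory layout and class name are assumptions, not Rupinder's actual code): build each slab into its own index directory, then merge them with IndexWriter.addIndexes.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SlabMerger {
    public static void main(String[] args) throws Exception {
        // Assume slab indexes were already built under /tmp/slab0 ... /tmp/slab(N-1)
        int slabs = Integer.parseInt(args[0]);
        Directory[] dirs = new Directory[slabs];
        for (int i = 0; i < slabs; i++) {
            dirs[i] = FSDirectory.getDirectory("/tmp/slab" + i, false);
        }
        IndexWriter writer = new IndexWriter("/tmp/merged", new StandardAnalyzer(), true);
        writer.addIndexes(dirs);   // merges the per-slab indexes into one index
        writer.optimize();
        writer.close();
    }
}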

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems 
to be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced 
gc runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all

Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.

If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore, if this is the case, by reducing the heap space you
can improve performance and get rid of the out of memory errors.

Cheers
Pete Lewis

- Original Message - From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Daniel Aber schrieb:

On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files
not being deleted after 1.4.1. Not sure

ANT +BUILD + LUCENE

2004-09-13 Thread Karthik N S
Hi Guys,

Apologies..

The task for me is to build the index folder using Lucene & a simple
Build.xml for ANT.

The problem: the same 'Build.xml' should be used for different OSs
[ Win / Linux ].

The glitch is that the respective jar files, such as Lucene-1.4.jar & other
jars, are not in the same dir on each OS.
Also the I/p, O/p Indexer paths for source/target may also vary.

Please, somebody help me. :(



with regards
Karthik




  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







RE: question on Hits.doc

2004-09-13 Thread Cocula Remi
Hi,

I recently had the same kind of problem, but it was due to the way I was dealing with
Hits.
Obtaining a Hits object from a Query is very fast, but then I was looping over ALL the
hits to retrieve information on the documents before displaying the result to the
user.
That was not necessary because in my case the display of search results is paginated.
Now I extract documents from Hits on demand (i.e. only the few I need to display
a page of results). It's much better.
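
A minimal sketch of that on-demand extraction (the "id" field and the pagination helper are just examples): Hits only fetches the stored document when doc(i) is called, so touching just the current page avoids loading everything.

static void printPage(org.apache.lucene.search.Hits hits, int page, int pageSize)
        throws java.io.IOException {
    int start = page * pageSize;
    int end = Math.min(start + pageSize, hits.length());
    for (int i = start; i < end; i++) {
        org.apache.lucene.document.Document doc = hits.doc(i); // fetched lazily, only for this page
        System.out.println(hits.score(i) + "\t" + doc.get("id"));
    }
}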


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Saturday, September 11, 2004 00:20
To: [EMAIL PROTECTED]
Subject: question on Hits.doc


Hey guys,

We were noticing some speed problems on our searches and after adding some
debug statements to the lucene source code, we have determined that the
Hits.doc(x) is the problem.  (BTW, we are using Lucene 1.2 [with plans to
upgrade]).  It seems that retrieving the actual Document from the search is
very slow.

We think it might be our Message field which stores a huge amount of text. 
We are currently running a test in which we won't store the Message field,
however, I was wondering if any of you guys would know if that would be the
reason why we're having the performance problems?  If so, could anyone also
please explain it?  It seemed that we weren't having these performance
problems before.  Has anyone else experienced this?  Our environment is NT 4,
JDK 1.4.2, and PIIIs.

I know that for large text fields, storing the field is not a good practice,
however, it held certain conveniences for us that I hope to not get rid of.

Roy.




Re: OutOfMemory example

2004-09-13 Thread John Moylan
You should reuse your old index (e.g. as an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John
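
A minimal sketch of that reuse pattern (the class and field names are assumptions), reopening only when IndexReader.getCurrentVersion reports a change:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

import java.io.IOException;

class SearcherCache {
    private IndexSearcher searcher;
    private long version = -1;

    synchronized IndexSearcher get(Directory dir) throws IOException {
        long current = IndexReader.getCurrentVersion(dir);
        if (searcher == null || current != version) {
            if (searcher != null) searcher.close();   // old version no longer needed
            searcher = new IndexSearcher(dir);        // closes its own reader on close()
            version = current;
        }
        return searcher;
    }
}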
Ji Kuhn wrote:
Hi,
I think I can reproduce a memory leak problem while reopening an index.
The Lucene version tested is 1.4.1; version 1.4 final works OK. My JVM is:
$ java -version
java version 1.4.2_05
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
The code you can test is below, there are only 3 iterations for me if I use 
-Xmx5m, the 4th fails.
Jiri.
package test;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

/**
 * Run this test with Lucene 1.4.1 and -Xmx5m
 */
public class ReopenTest
{
    private static long mem_last = 0;

    public static void main(String[] args) throws IOException
    {
        Directory directory = create_index();
        for (int i = 1; i < 100; i++) {
            System.err.println("loop " + i);
            search_index(directory);
        }
    }

    private static void search_index(Directory directory) throws IOException
    {
        IndexReader reader = IndexReader.open(directory);
        Searcher searcher = new IndexSearcher(reader);
        print_mem("search 1");
        SortField[] fields = new SortField[2];
        fields[0] = new SortField("date", SortField.STRING, true);
        fields[1] = new SortField("id", SortField.STRING, false);
        Sort sort = new Sort(fields);
        TermQuery query = new TermQuery(new Term("text", "\"text 5\""));
        print_mem("search 2");
        Hits hits = searcher.search(query, sort);
        print_mem("search 3");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println("doc " + i + ": " + doc.toString());
        }
        print_mem("search 4");
        searcher.close();
        reader.close();
    }

    private static void print_mem(String log)
    {
        long mem_free = Runtime.getRuntime().freeMemory();
        long mem_total = Runtime.getRuntime().totalMemory();
        long mem_max = Runtime.getRuntime().maxMemory();
        long delta = (mem_last - mem_free) * -1;
        System.out.println(log + " = delta: " + delta + ", free: " + mem_free + ", used: " +
            (mem_total - mem_free) + ", total: " + mem_total + ", max: " + mem_max);
        mem_last = mem_free;
    }

    private static Directory create_index() throws IOException
    {
        print_mem("create 1");
        Directory directory = new RAMDirectory();
        Calendar c = Calendar.getInstance();
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        for (int i = 0; i < 365 * 30; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("date", df.format(new Date(c.getTimeInMillis()))));
            doc.add(Field.Keyword("id", "AB" + String.valueOf(i)));
            doc.add(Field.Text("text", "Tohle je text " + i));
            writer.addDocument(doc);
            c.add(Calendar.DAY_OF_YEAR, 1);
        }
        writer.optimize();
        System.err.println("index size: " + writer.docCount());
        writer.close();
        print_mem("create 2");
        return directory;
    }
}


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-13 Thread Daniel Taurat
Okay, the reference test is done:
on JDK 1.4.2, Lucene 1.4.1 really seems to run fine: just a moderate 
number of SegmentTermEnums that is controlled by gc (about 500 for the 
1900 test objects).

Daniel Taurat wrote:
Hi Doug,
you are absolutely right about the older version of the JDK: it is 
1.3.1 (ibm).
Unfortunately we cannot upgrade since we are bound to IBM Portalserver 
4 environment.
Results:
I patched Lucene 1.4.1:
it has not improved much: after indexing 1897 objects, the number of 
SegmentTermEnum instances is up to 17936.
To be realistic: This is even a deterioration :(((
My next check will be with a JDK1.4.2 for the test environment, but 
this can only be a reference run for now.

Thanks,
Daniel
Doug Cutting wrote:
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
is that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it 
works for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before 
FieldCache was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i 
created was to after every 200K, documents create a temp 
directory, and merge them together, this was done to do the first 
production run, updates are now being handled incrementally

 

Exception in thread main java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems 
to be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced 
gc runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all

Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.

If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore, if this is the case, by reducing the heap space you
can improve performance and get rid of the out of memory errors.

Cheers
Pete Lewis

- Original Message - From: Daniel Taurat [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Daniel Aber schrieb:

On Thursday 09 September 2004 19:47, Daniel Taurat 

RE: OutOfMemory example

2004-09-13 Thread Ji Kuhn
I disagree or I don't understand. 

I can change the code as it is shown below. Now I must reopen the index to see the 
changes, but the memory problem remains. I really don't know what I'm doing wrong, the 
code is so simple.

Jiri.

...

    public static void main(String[] args) throws IOException
    {
        Directory directory = create_index();

        for (int i = 1; i < 100; i++) {
            System.err.println("loop " + i + ", index version: " +
                IndexReader.getCurrentVersion(directory));
            search_index(directory);
            add_to_index(directory, i);
        }
    }

    private static void add_to_index(Directory directory, int i) throws IOException
    {
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);

        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        Document doc = new Document();

        doc.add(Field.Keyword("date", df.format(new Date(System.currentTimeMillis()))));
        doc.add(Field.Keyword("id", "CD" + String.valueOf(i)));
        doc.add(Field.Text("text", "Tohle neni text " + i));
        writer.addDocument(doc);

        System.err.println("index size: " + writer.docCount());
        writer.close();
    }

...

-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 3:25 PM
To: Lucene Users List
Subject: Re: OutOfMemory example


You should reuse your old index (as eg an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John





Re: OutOfMemory example

2004-09-13 Thread John Moylan
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John
Ji Kuhn wrote:
I disagree or I don't understand. 

I can change the code as it is shown below. Now I must reopen the index to see the 
changes, but the memory problem remains. I realy don't know what I'm doing wrong, the 
code is so simple.
Jiri.
...
public static void main(String[] args) throws IOException
{
Directory directory = create_index();
for (int i = 1; i  100; i++) {
System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
search_index(directory);
add_to_index(directory, i);
}
}
private static void add_to_index(Directory directory, int i) throws IOException
{
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
SimpleDateFormat df = new SimpleDateFormat(-MM-dd);
Document doc = new Document();
doc.add(Field.Keyword(date, df.format(new 
Date(System.currentTimeMillis();
doc.add(Field.Keyword(id, CD + String.valueOf(i)));
doc.add(Field.Text(text, Tohle neni text  + i));
writer.addDocument(doc);
System.err.println(index size:  + writer.docCount());
writer.close();
}
...
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 3:25 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
You should reuse your old index (as eg an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John


Re: OutOfMemory example

2004-09-13 Thread sergiu gordea
I have a few comments regarding your code ...

1. Why do you use RAMDirectory and not the hard disk?

2. As John said, you should reuse the index instead of creating it each 
time in the main function:

    IndexWriter writer;
    if (!indexExists(indexFile))
        writer = new IndexWriter(directory, new StandardAnalyzer(), true);
    else
        writer = new IndexWriter(directory, new StandardAnalyzer(), false);

   (in some cases indexExists can be as simple as verifying if the file 
exists on the hard disk)

3. You iterate in a loop over 10.000 times and you create a lot of objects:

    for (int i = 0; i < 365 * 30; i++) {
        Document doc = new Document();
        doc.add(Field.Keyword("date", df.format(new Date(c.getTimeInMillis()))));
        doc.add(Field.Keyword("id", "AB" + String.valueOf(i)));
        doc.add(Field.Text("text", "Tohle je text " + i));
        writer.addDocument(doc);

        c.add(Calendar.DAY_OF_YEAR, 1);
    }

All the underlined lines of code create new objects, and all of them are 
kept in memory. This is a lot of memory allocated only by this loop. I 
think that you create more than 100.000 objects in this loop ...
What do you think?
And none of them can be released (collected by gc) until you close 
the index writer.

No one says that your code is complicated, but all programmers should 
understand that this is a poor design... And ... more than that, your 
information is kept in a RAMDirectory, so when you close the writer you 
will still keep the information in memory ...

Sorry if I was too aggressive with my comments, but ... I cannot see 
what you were thinking when you wrote that code ...

If you are trying to make a test, then I suggest you replace the 
hard-coded 365 value with a variable, iterate over it, and test the 
power of your machine
(PC + JVM) :))

I wish you luck,
Sergiu


Ji Kuhn wrote:
I disagree or I don't understand. 

I can change the code as it is shown below. Now I must reopen the index to see the 
changes, but the memory problem remains. I realy don't know what I'm doing wrong, the 
code is so simple.
Jiri.
...
   public static void main(String[] args) throws IOException
   {
   Directory directory = create_index();
   for (int i = 1; i  100; i++) {
   System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   }
   }
   private static void add_to_index(Directory directory, int i) throws IOException
   {
   IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
   SimpleDateFormat df = new SimpleDateFormat(-MM-dd);
   Document doc = new Document();
   doc.add(Field.Keyword(date, df.format(new Date(System.currentTimeMillis();
   doc.add(Field.Keyword(id, CD + String.valueOf(i)));
   doc.add(Field.Text(text, Tohle neni text  + i));
   writer.addDocument(doc);
   System.err.println(index size:  + writer.docCount());
   writer.close();
   }
...
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 3:25 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
You should reuse your old index (as eg an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John


RE: OutOfMemory example

2004-09-13 Thread Ji Kuhn
Thanks for the bug id, it seems like my problem, and I have stand-alone code with 
main().

What about the slow garbage collector? That looks like a wrong suggestion to me.

Let change the code once again:

...
    public static void main(String[] args) throws IOException, InterruptedException
    {
        Directory directory = create_index();

        for (int i = 1; i < 100; i++) {
            System.err.println("loop " + i + ", index version: " +
                IndexReader.getCurrentVersion(directory));
            search_index(directory);
            add_to_index(directory, i);
            System.gc();
            Thread.sleep(1000);    // whatever value you want
        }
    }
...

and in the 4th iteration java.lang.OutOfMemoryError appears again.

Jiri.


-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example


http://issues.apache.org/bugzilla/show_bug.cgi?id=30628

you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John



RE: OutOfMemory example

2004-09-13 Thread Ji Kuhn
You don't see the point of my post. I sent an application which everyone can run with 
only the Lucene jar and which produces an OutOfMemoryError in a deterministic way.

That's all.

Jiri.


-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:16 PM
To: Lucene Users List
Subject: Re: OutOfMemory example


I have a few comments regarding your code ...
1. Why do you use RamDirectory and not the hard disk?
2. as John said, you should reuse the index instead of creating it each 
time in the main function
if(!indexExists(File indexFile))
 IndexWriter writer = new IndexWriter(directory, new 
StandardAnalyzer(), true);
else
 IndexWriter writer = new IndexWriter(directory, new 
StandardAnalyzer(), false);
(in some cases indexExists can be as simple as verifying if the file 
exits on the hard disk)

3. you iterate in a loop over 10.000 times and you create a lot of objects
   

for (int i = 0; i  365 * 30; i++) {
Document doc = new Document();

doc.add(Field.Keyword(date, df.format(new 
Date(c.getTimeInMillis();
doc.add(Field.Keyword(id, AB + String.valueOf(i)));
doc.add(Field.Text(text, Tohle je text  + i));
writer.addDocument(doc);

c.add(Calendar.DAY_OF_YEAR, 1);
}
all the underlined lines of code create new  ojects, and all of them are 
kept in memory.
This is a lot of memory allocated only by this loop. I think that you 
create more than 100.000 object in this loop ...
What do you think?
And none of them cannot be realeased (collected by gc) untill you close 
the index writer.

None says that your code is complicated, but all programmers should 
understand that this is a poor design...
And ... more then that your information is kept in a RamDirectory 
when you will close the writer you will still keep the information 
in memory ...

Sory if I was too agressive with my comments  but ... I cannot see 
what were you thinking when you wrote that code ...

If you are trying to make a test  then I sugest you to replace the 
hard codded 365 value ... with a variable, to iterate over it and to 
test the power of your machine
(PC + JVM) :))

I wish you luck,

 Sergiu






Ji Kuhn wrote:

I disagree or I don't understand. 

I can change the code as it is shown below. Now I must reopen the index to see the 
changes, but the memory problem remains. I realy don't know what I'm doing wrong, the 
code is so simple.

Jiri.

   ...

public static void main(String[] args) throws IOException
{
Directory directory = create_index();

for (int i = 1; i  100; i++) {
System.err.println(loop  + i + , index version:  + 
 IndexReader.getCurrentVersion(directory));
search_index(directory);
add_to_index(directory, i);
}
}

private static void add_to_index(Directory directory, int i) throws IOException
{
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), 
 false);

SimpleDateFormat df = new SimpleDateFormat(-MM-dd);
Document doc = new Document();

doc.add(Field.Keyword(date, df.format(new 
 Date(System.currentTimeMillis();
doc.add(Field.Keyword(id, CD + String.valueOf(i)));
doc.add(Field.Text(text, Tohle neni text  + i));
writer.addDocument(doc);

System.err.println(index size:  + writer.docCount());
writer.close();
}

   ...

-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 3:25 PM
To: Lucene Users List
Subject: Re: OutOfMemory example


You should reuse your old index (as eg an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John





force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Ji Kuhn wrote:
Thanks for the bug's id, it seems like my problem and I have a stand-alone code with 
main().
What about slow garbage collector? This looks for me as wrong suggestion.

I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();
Let change the code once again:
...
public static void main(String[] args) throws IOException, InterruptedException
{
Directory directory = create_index();
for (int i = 1; i  100; i++) {
System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
search_index(directory);
add_to_index(directory, i);
System.gc();
Thread.sleep(1000);// whatever value you want
}
}
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John



RE: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread Ji Kuhn
This doesn't work either!

Let's concentrate on the first version of my code. I believe that the code should run 
endlessly (I have said it before: in version 1.4 final it does).

Jiri.

-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example


Ji Kuhn wrote:

 Thanks for the bug's id, it seems like my problem and I have a stand-alone code with 
 main().
 
 What about slow garbage collector? This looks for me as wrong suggestion.


I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

 
 Let change the code once again:
 
 ...
 public static void main(String[] args) throws IOException, InterruptedException
 {
 Directory directory = create_index();
 
 for (int i = 1; i  100; i++) {
 System.err.println(loop  + i + , index version:  + 
 IndexReader.getCurrentVersion(directory));
 search_index(directory);
 add_to_index(directory, i);
 System.gc();
 Thread.sleep(1000);// whatever value you want
 }
 }
 ...
 
 and in the 4th iteration java.lang.OutOfMemoryError appears again.
 
 Jiri.
 
 
 -Original Message-
 From: John Moylan [mailto:[EMAIL PROTECTED]
 Sent: Monday, September 13, 2004 4:53 PM
 To: Lucene Users List
 Subject: Re: OutOfMemory example
 
 
 http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
 
 you can close the index, but the Garbage Collector still needs to 
 reclaim the memory and it may be taking longer than your loop to do so.
 
 John
 





Re: OutOfMemory example

2004-09-13 Thread sergiu gordea
then it is probably my mistake ... I haven't read all the emails in the thread.
So ... your goal is to produce errors ... I try to avoid them :))
  All the best,
 Sergiu
 

Ji Kuhn wrote:
You don't see the point of my post. I sent an application which can everyone run only 
with lucene jar and in deterministic way produce OutOfMemoryError.
That's all.
Jiri.
-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:16 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
I have a few comments regarding your code ...
1. Why do you use RamDirectory and not the hard disk?
2. as John said, you should reuse the index instead of creating it each 
time in the main function
   if(!indexExists(File indexFile))
IndexWriter writer = new IndexWriter(directory, new 
StandardAnalyzer(), true);
   else
IndexWriter writer = new IndexWriter(directory, new 
StandardAnalyzer(), false);
   (in some cases indexExists can be as simple as verifying if the file 
exits on the hard disk)

3. you iterate in a loop over 10.000 times and you create a lot of objects
  

for (int i = 0; i  365 * 30; i++) {
   Document doc = new Document();
   doc.add(Field.Keyword(date, df.format(new 
Date(c.getTimeInMillis();
   doc.add(Field.Keyword(id, AB + String.valueOf(i)));
   doc.add(Field.Text(text, Tohle je text  + i));
   writer.addDocument(doc);

   c.add(Calendar.DAY_OF_YEAR, 1);
   }
all the underlined lines of code create new  ojects, and all of them are 
kept in memory.
This is a lot of memory allocated only by this loop. I think that you 
create more than 100.000 object in this loop ...
What do you think?
And none of them cannot be realeased (collected by gc) untill you close 
the index writer.

None says that your code is complicated, but all programmers should 
understand that this is a poor design...
And ... more then that your information is kept in a RamDirectory 
when you will close the writer you will still keep the information 
in memory ...

Sory if I was too agressive with my comments  but ... I cannot see 
what were you thinking when you wrote that code ...

If you are trying to make a test  then I sugest you to replace the 
hard codded 365 value ... with a variable, to iterate over it and to 
test the power of your machine
(PC + JVM) :))

I wish you luck,
Sergiu


Ji Kuhn wrote:
 

I disagree or I don't understand. 

I can change the code as it is shown below. Now I must reopen the index to see the 
changes, but the memory problem remains. I realy don't know what I'm doing wrong, the 
code is so simple.
Jiri.
...
  public static void main(String[] args) throws IOException
  {
  Directory directory = create_index();
  for (int i = 1; i  100; i++) {
  System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
  search_index(directory);
  add_to_index(directory, i);
  }
  }
  private static void add_to_index(Directory directory, int i) throws IOException
  {
  IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
  SimpleDateFormat df = new SimpleDateFormat(-MM-dd);
  Document doc = new Document();
  doc.add(Field.Keyword(date, df.format(new Date(System.currentTimeMillis();
  doc.add(Field.Keyword(id, CD + String.valueOf(i)));
  doc.add(Field.Text(text, Tohle neni text  + i));
  writer.addDocument(doc);
  System.err.println(index size:  + writer.docCount());
  writer.close();
  }
...
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 3:25 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
You should reuse your old index (as eg an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John


OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Ji Kuhn wrote:
This doesn't work either!
You're right.
I'm running under JDK1.5 and trying larger values for -Xmx and it still 
fails.

Running under (Borland's) OptimizeIt shows the number of Terms and 
TermInfos (both in org.apache.lucene.index) increase every time thru the 
loop, by several hundred instances each.

I can trace thru some Term instances on the reference graph of 
OptimizeIt but it's unclear to me what's right. One *guess* is that 
maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the 
problem.



Lets concentrate on the first version of my code. I believe that the code should 
run endlesly (I have said it before: in version 1.4 final it does).
Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a stand-alone code with 
main().
What about slow garbage collector? This looks for me as wrong suggestion.

I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let change the code once again:
...
   public static void main(String[] args) throws IOException, InterruptedException
   {
   Directory directory = create_index();
   for (int i = 1; i  100; i++) {
   System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John



FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it

Ji Kuhn wrote:
This doesn't work either!
Lets concentrate on the first version of my code. I believe that the code should run 
endlesly (I have said it before: in version 1.4 final it does).
Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a stand-alone code with 
main().
What about slow garbage collector? This looks for me as wrong suggestion.

I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let change the code once again:
...
   public static void main(String[] args) throws IOException, InterruptedException
   {
   Directory directory = create_index();
   for (int i = 1; i  100; i++) {
   System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John



Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
David Spencer wrote:
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it
Replying to my own post ... this could be the problem.
If I put in a print statement here in FieldSortedHitQueue, recompile, 
and run w/ the new jar then I see Comparators.size() go up after every 
iteration thru ReopenTest's loop and the size() never goes down...

  static Object store (IndexReader reader, String field, int type,
      Object factory, Object value) {
    FieldCacheImpl.Entry entry = (factory != null)
      ? new FieldCacheImpl.Entry (field, factory)
      : new FieldCacheImpl.Entry (field, type);
    synchronized (Comparators) {
      HashMap readerCache = (HashMap)Comparators.get(reader);
      if (readerCache == null) {
        readerCache = new HashMap();
        Comparators.put(reader, readerCache);
        System.out.println("*\t* NOW: " + Comparators.size());
      }
      return readerCache.put (entry, value);
    }
  }

Ji Kuhn wrote:
This doesn't work either!
Lets concentrate on the first version of my code. I believe that the 
code should run endlesly (I have said it before: in version 1.4 final 
it does).

Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a 
stand-alone code with main().

What about slow garbage collector? This looks for me as wrong 
suggestion.


I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() 
call is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let change the code once again:
...
   public static void main(String[] args) throws IOException, 
InterruptedException
   {
   Directory directory = create_index();

   for (int i = 1; i  100; i++) {
   System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...

and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John



SegmentReader - Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Another clue: the SegmentReaders are piling up too, which may be why the 
Comparators map is increasing in size, because SegmentReaders are the 
keys to Comparators... though again, I don't know enough about the Lucene 
internals to know which refs to SegmentReaders are valid and which ones 
may be causing this leak.

David Spencer wrote:
David Spencer wrote:
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it

Replying to my own postthis could be the problem.
If I put in a print statement here in FieldSortedHitQueue, recompile, 
and run w/ the new jar then I see Comparators.size() go up after every 
iteration thru ReopenTest's loop and the size() never goes down...

 static Object store (IndexReader reader, String field, int type, Object 
factory, Object value) {
FieldCacheImpl.Entry entry = (factory != null)
  ? new FieldCacheImpl.Entry (field, factory)
  : new FieldCacheImpl.Entry (field, type);
synchronized (Comparators) {
  HashMap readerCache = (HashMap)Comparators.get(reader);
  if (readerCache == null) {
readerCache = new HashMap();
Comparators.put(reader,readerCache);
System.out.println( *\t* NOW: + Comparators.size());
  }
  return readerCache.put (entry, value);
}
  }


Ji Kuhn wrote:
This doesn't work either!
Lets concentrate on the first version of my code. I believe that the 
code should run endlesly (I have said it before: in version 1.4 final 
it does).

Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Ji Kuhn wrote:

Thanks for the bug's id, it seems like my problem and I have a 
stand-alone code with main().

What about slow garbage collector? This looks for me as wrong 
suggestion.


I've seen this written up before (javaworld?) as a way to probably 
force GC instead of just a System.gc() call. I think the 2nd gc() 
call is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let change the code once again:
...
   public static void main(String[] args) throws IOException, 
InterruptedException
   {
   Directory directory = create_index();

   for (int i = 1; i  100; i++) {
   System.err.println(loop  + i + , index version:  + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...

and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John



Re: OutOfMemory example

2004-09-13 Thread Daniel Naber
On Monday 13 September 2004 15:06, Ji Kuhn wrote:

 I think I can reproduce memory leaking problem while reopening
 an index. Lucene version tested is 1.4.1, version 1.4 final works OK. My
 JVM is:

Could you try with the latest Lucene version from CVS? I cannot reproduce 
your problem with that version (Sun's Java 1.4.2_03, Linux).

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread Kevin A. Burton
David Spencer wrote:
Ji Kuhn wrote:
This doesn't work either!

You're right.
I'm running under JDK1.5 and trying larger values for -Xmx and it 
still fails.

Running under (Borlands) OptimzeIt shows the number of Terms and 
Terminfos (both in org.apache.lucene.index) increase every time thru 
the loop, by several hundred instances each.
Yes... I'm running into a similar situation on JDK 1.4.2 with Lucene 
1.3... I used the JMP debugger and all my memory is taken by Terms and 
TermInfo...

I can trace thru some Term instances on the reference graph of 
OptimizeIt but it's unclear to me what's right. One *guess* is that 
maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the 
problem.
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: OutOfMemory example

2004-09-13 Thread Kevin A. Burton
Ji Kuhn wrote:
Hi,
I think I can reproduce memory leaking problem while reopening an index. 
Lucene version tested is 1.4.1, version 1.4 final works OK. My JVM is:
$ java -version
java version 1.4.2_05
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
The code you can test is below, there are only 3 iterations for me if I use 
-Xmx5m, the 4th fails.
 

At least this test seems tied to the Sort API... I removed the sort 
under Lucene 1.3 and it worked fine...

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: Addition to contributions page

2004-09-13 Thread Daniel Naber
On Friday 10 September 2004 15:48, Chas Emerick wrote:

 PDFTextStream should be added to the 'Document Converters' section,
 with this URL  http://snowtide.com , and perhaps this heading:
 'PDFTextStream -- PDF text and metadata extraction'. The 'Author'
 field should probably be left blank, since there's no single creator.

I just added it.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: OutOfMemory example

2004-09-13 Thread David Spencer
Daniel Naber wrote:
On Monday 13 September 2004 15:06, Ji Kuhn wrote:

   I think I can reproduce memory leaking problem while reopening
an index. Lucene version tested is 1.4.1, version 1.4 final works OK. My
JVM is:

Could you try with the latest Lucene version from CVS? I cannot reproduce 
your problem with that version (Sun's Java 1.4.2_03, Linux).
I verified it w/ the latest lucene code from CVS under win xp.
Regards
 Daniel



Similarity score computation documentation

2004-09-13 Thread Ken McCracken
Hi,

I was looking through the score computation when running search, and
think there may be a discrepancy between what is _documented_ in the
org.apache.lucene.search.Similarity class overview Javadocs, and what
actually occurs in the code.

I believe the problem is only with the documentation.

I'm pretty sure that there should be an idf^2 in the sum.  Look at
org.apache.lucene.search.TermQuery, the inner class TermWeight.  You
can see that first sumOfSquaredWeights() is called, followed by
normalize(), during search.  Further, the resulting value stored in
the field "value" is set as the weightValue on the TermScorer.

If we look at what happens to TermWeight, sumOfSquaredWeights() sets
queryWeight to idf * boost.  During normalize(), queryWeight is
multiplied by the query norm, and "value" is set to queryWeight * idf
== idf * boost * queryNorm * idf == idf^2 * boost * queryNorm.  This
becomes the weightValue in the TermScorer that is then used to
multiply with the appropriate tf, etc., values.
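
A tiny numeric check of the chain described above (plain arithmetic with made-up values, not the Lucene source):

public class IdfSquaredCheck {
    public static void main(String[] args) {
        float idf = 2.0f, boost = 1.5f, queryNorm = 0.25f;
        float queryWeight = idf * boost;          // sumOfSquaredWeights()
        queryWeight *= queryNorm;                 // normalize()
        float weightValue = queryWeight * idf;    // what TermScorer receives
        System.out.println(weightValue == idf * idf * boost * queryNorm);  // prints true
    }
}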

The remaining terms in the Similarity description are properly
appended.  I also see that the queryNorm effectively cancels out
(dimensionally, since it is a 1/ square root of a sum of squares of
idfs) one of the idfs, so the formula still ends up being roughly a
TF-IDF formula.  But the idf^2 should still be there, along with the
expansion of queryNorm.

Am I mistaken, or is the documentation off?

Thanks for your help,
-Ken
