IndexWriter can't add the 10,000th document to the index

2007-01-28 Thread maureen tanuwidjaja
I finally reran the program and it stops at exactly the same place. This time
the exception came out. The writer can't add the 10,000th document to the index...
  
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491886.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491887.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491891.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491893.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491896.xml
  java.lang.Exception: cannot add document to index
  
  
  And this is the code...
  
  public static void addDocToIndex(Document doc) throws Exception
  {
      try
      {
          writer.addDocument(doc);
          counter++;
      }
      catch (Exception e)
      {
          // Note: this discards the original exception and its stack trace
          throw new Exception("cannot add document to index");
      }
  }
  
  I already put -Xmx512m in the Java VM arguments, since previously it failed
with: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  
  
  Maureen

Chris Hostetter [EMAIL PROTECTED] wrote:  
did you try triggering a thread dump to see what it was doing at that
point?

depending on your merge factors and other IndexWriter settings it could
just be doing a really big merge.
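
(For example, one way to grab a dump from inside the JVM itself if you can't
get at the console - a minimal sketch, Java 5+ only, class name made up; it
relies on the standard Thread.getAllStackTraces():

    import java.util.Map;

    public class StackDumper {
        // Print every live thread's stack to stderr; call this from a
        // watchdog thread when indexing appears to hang.
        public static void dumpAllStacks() {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                System.err.println("Thread: " + e.getKey().getName());
                for (StackTraceElement frame : e.getValue()) {
                    System.err.println("    at " + frame);
                }
            }
        }
    }

Ctrl-Break on Windows, or kill -QUIT on the process id on Unix, produces the
same kind of dump with no code at all.)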

: Date: Sat, 27 Jan 2007 09:40:47 -0800 (PST)
: From: maureen tanuwidjaja 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: My program stops indexing after the 10,000th document is indexed
:
: Hi all,
:
:   Is there any limitation on the number of files that Lucene can handle?
:   I indexed a total of 30,000 XML Documents, however it stops at the 10,000th
document.
:   No warning, no error, no exception as well.
:
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491876.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491886.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491887.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491891.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491893.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491896.xml --10,000th doc
:   --it idles here--
:
:  At first I thought that the 10,000th document was so big that it took quite
a long time to put into the index. Then I found out that the 10,000th document
has a size of only 6 KB. The indexing process stalled for about 1 hour, so I
decided to terminate it.
:
:   Is it anything to do with something like setUseCompoundFile etc.? Because I
don't call any of these in my program...
:
:   Any suggestions, please?
:
:   Thanks and Best Regards,
:
:   Maureen
:
:



-Hoss






search on colon : ending words

2007-01-28 Thread Felix Litman
Is there a simple way to turn off field-search syntax in the Lucene parser, and
have Lucene recognize words ending in a colon ":" as search terms instead?

Such words are very common occurrences for our documents (or any plain text), 
but Lucene does not seem to find them. :-(

Thank you,
Felix


Re: search on colon : ending words

2007-01-28 Thread Erick Erickson

I've got to ask why you'd want to search on colons. Why not just index the
words without colons and search without them too? Let's say you index the
word "work:". Do you really want to have a search on "work" fail?

By and large, you're better off indexing and searching without
punctuation...

Best
Erick

On 1/28/07, Felix Litman [EMAIL PROTECTED] wrote:


Is there a simple way to turn off field-search syntax in the Lucene
parser, and have Lucene recognize words ending in a colon ":" as search
terms instead?

Such words are very common occurrences for our documents (or any plain
text), but Lucene does not seem to find them. :-(

Thank you,
Felix




Re: My program stops indexing after the 10,000th document is indexed

2007-01-28 Thread Erick Erickson

Maureen:

I lost the e-mail where you re-throw the exception. But you'd get a *lot*
more information if you'd print the stack trace via

catch (Exception e) {
    e.printStackTrace();
    throw e;
}

And that would allow the folks who understand Lucene to give you a LOT more
help <G>...

Best
Erick

On 1/27/07, Chris Hostetter [EMAIL PROTECTED] wrote:



did you try triggering a thread dump to see what it was doing at that
point?

depending on your merge factors and other IndexWriter settings it could
just be doing a really big merge.

: Date: Sat, 27 Jan 2007 09:40:47 -0800 (PST)
: From: maureen tanuwidjaja [EMAIL PROTECTED]
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: My program stops indexing after the 10,000th document is indexed
:
: Hi all,
:
:   Is there any limitation on the number of files that Lucene can handle?
:   I indexed a total of 30,000 XML Documents, however it stops at the 10,000th
document.
:   No warning, no error, no exception as well.
:
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491876.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491886.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491887.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491891.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491893.xml
:   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491896.xml --10,000th doc
:   --it idles here--
:
:  At first I thought that the 10,000th document was so big that it took
quite a long time to put into the index. Then I found out that the 10,000th
document has a size of only 6 KB. The indexing process stalled for about 1
hour, so I decided to terminate it.
:
:   Is it anything to do with something like setUseCompoundFile etc.? Because
I don't call any of these in my program...
:
:   Any suggestions, please?
:
:   Thanks and Best Regards,
:
:   Maureen
:
:



-Hoss






Sorry, it is the 190,000th document

2007-01-28 Thread maureen tanuwidjaja
Hi...
  
  I'm sorry, I just found out and realized that it is NOT the 10,000th document
that raises the exception when IndexWriter.addDocument(Document) is called...
but it is the 180,000 + 10,000th document, so the 190,000th document.
  
  Now I am running the program again, with code added to print the stack trace
if an exception happens. (Thanks for the advice, Erick.)
  
  OK. Basically, what I am going to index is a set of XML documents in 22
folders, where each folder contains 30,000 XML documents; hence the total is
660,000 XML documents. I was reading the Lucene book and spotted the
mergeFactor. I would like to know whether the mergeFactor plays an important
part in indexing these files, and perhaps whether it has a strong correlation
with the exception. I ran my program using the default value of mergeFactor,
which is 10.
  
  In case it is needed, the PC used has the following spec: Intel Pentium 4,
2.40 GHz CPU, 512 MB of RAM.
  
  
  Is there any suggestion about the mergeFactor and maxMergeDocs values that I
should use for my case?
  
  
  
  Thanks and Regards,
  Maureen
  
  
Erick Erickson [EMAIL PROTECTED] wrote:  Maureen:

I lost the e-mail where you re-throw the exception. But you'd get a *lot*
more information if you'd print the stack trace via

catch (Exception e) {
    e.printStackTrace();
    throw e;
}

And that would allow the folks who understand Lucene to give you a LOT more
help ...

Best
Erick

On 1/27/07, Chris Hostetter  wrote:


 did you try triggering a thread dump to see what it was doing at that
 point?

 depending on your merge factors and other IndexWriter settings it could
 just be doing a really big merge.

 : Date: Sat, 27 Jan 2007 09:40:47 -0800 (PST)
 : From: maureen tanuwidjaja 
 : Reply-To: java-user@lucene.apache.org
 : To: java-user@lucene.apache.org
 : Subject: My program stops indexing after the 10,000th document is indexed
 :
 : Hi all,
 :
 :   Is there any limitation on the number of files that Lucene can handle?
 :   I indexed a total of 30,000 XML Documents, however it stops at the 10,000th
 document.
 :   No warning, no error, no exception as well.
 :
 :   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491876.xml
 :   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491886.xml
 :   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491887.xml
 :   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491891.xml
 :   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491893.xml
 :   Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491896.xml --10,000th doc
 :   --it idles here--
 :
 :  At first I thought that the 10,000th document was so big that it took
 quite a long time to put into the index. Then I found out that the 10,000th
 document has a size of only 6 KB. The indexing process stalled for about 1
 hour, so I decided to terminate it.
 :
 :   Is it anything to do with something like setUseCompoundFile etc.? Because
 I don't call any of these in my program...
 :
 :   Any suggestions, please?
 :
 :   Thanks and Best Regards,
 :
 :   Maureen
 :
 :



 -Hoss







Re: search on colon : ending words

2007-01-28 Thread Felix Litman
Yes, thank you. That would be a good solution.  But we are using Lucene's
StandardAnalyzer.  It seems to index words with colons ":" and other
punctuation by default.  Is there a simple way to have the Analyzer not
index colons specifically, and punctuation in general?

Erick Erickson [EMAIL PROTECTED] wrote: I've got to ask why you'd want to
search on colons. Why not just index the
words without colons and search without them too? Let's say you index the
word "work:". Do you really want to have a search on "work" fail?

By and large, you're better off indexing and searching without
punctuation...

Best
Erick

On 1/28/07, Felix Litman  wrote:

 Is there a simple way to turn off field-search syntax in the Lucene
 parser, and have Lucene recognize words ending in a colon ":" as search
 terms instead?

 Such words are very common occurrences for our documents (or any plain
 text), but Lucene does not seem to find them. :-(

 Thank you,
 Felix





Re: IndexWriter.docCount

2007-01-28 Thread karl wettin


On 28 Jan 2007, at 05:54, Doron Cohen wrote:


karl wettin [EMAIL PROTECTED] wrote on 27/01/2007 13:49:24:


In essence, should I return
    index.getDocumentsByNumber().size() -
    index.getDeletedDocuments().size() +
    unflushedDocuments.size();
or
    index.getDocumentsByNumber().size() +
    unflushedDocuments.size();
?



I guess it is the 2nd one - without subtracting the number of deleted
docs.


That is enough for me to settle. Thanks again.
(I linked to this thread from a comment)


(but I don't know what is getDocumentsByNumber() - nothing like
this in the trunk, nor in current patch for 550.)


If you still really want to find it, perhaps you were looking in the  
IndexWriter in the core rather than the  InstantiatedIndexWriter of  
contrib/instantiated?


--
karl




Re: search on colon : ending words

2007-01-28 Thread Mark Miller
StandardAnalyzer should not be indexing punctuation, in my experience...
instead, something like "old:fart" would be indexed as "old" and "fart".
QueryParser will then generate a query of "old" within 1 of "fart" for the
query "old:fart". This is the case for all punctuation I have run into.
Things like "f.b.i" are handled differently though: it's indexed as "fbi",
i.e. the dots are removed; that's part of the acronym handling. There are a
couple of other special handlers as well, but in general punctuation is
ignored, except that QueryParser will look for the words broken by the
punctuation next to each other.


-Mark

Felix Litman wrote:

Yes, thank you. That would be a good solution.  But we are using Lucene's
StandardAnalyzer.  It seems to index words with colons ":" and other
punctuation by default.  Is there a simple way to have the Analyzer not index
colons specifically, and punctuation in general?

Erick Erickson [EMAIL PROTECTED] wrote: I've got to ask why you'd want to
search on colons. Why not just index the
words without colons and search without them too? Let's say you index the
word "work:". Do you really want to have a search on "work" fail?

By and large, you're better off indexing and searching without
punctuation...

Best
Erick

On 1/28/07, Felix Litman  wrote:

Is there a simple way to turn off field-search syntax in the Lucene
parser, and have Lucene recognize words ending in a colon ":" as search
terms instead?

Such words are very common occurrences for our documents (or any plain
text), but Lucene does not seem to find them. :-(

Thank you,
Felix











Re: Multiword Highlighting

2007-01-28 Thread markharw00d
For what it's worth, Mark (Miller), there *is* a need for "just highlight
the query terms without trying to get excerpts" functionality - something a
la Google cache (different colours... mmm, nice).

FWIW, the existing highlighter doesn't *have* to fragment - just pass a
NullFragmenter to the highlighter.
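
For the record, that looks something like this (a minimal sketch against the
contrib highlighter API as of Lucene 2.x; the class name, field name, and
analyzer are just placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.NullFragmenter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class WholeDocHighlighter {
        // Mark up every query term in the full text, with no fragmenting.
        public static String highlightAll(Query query, String text)
                throws Exception {
            Highlighter h = new Highlighter(new QueryScorer(query));
            h.setTextFragmenter(new NullFragmenter());
            // The default formatter wraps each hit in <B>...</B>
            return h.getBestFragment(new StandardAnalyzer(), "contents", text);
        }
    }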
Ideally we'd have one implementation that tackles phrase support and 
preserves (optional) support for selecting fragments. I can see that to 
achieve this the existing highlighter design would need to change. 
Currently the highlighter identifies fragments first (typically using an 
implementation which arbitrarily chops text after 'n' words) and then 
selects which of these fragments have the highest density of 
high-scoring query terms. This logic would need to change to:

1) Use QuerySpansExtractor to identify all the *spans* in the document
2) Use a sliding window to select fragments, taking care to select 
fragments that wholly contain spans, rather than selecting only part of 
a span.

3) Mark up the hits.
Clearly, for people uninterested in selecting fragments, step 2 can be 
skipped.


Cheers
Mark










Re: Announcement: Lucene powering Monster job search index (Beta)

2007-01-28 Thread Peter Keegan

Correction:
We only do the euclidean computation during sorting. For filtering, a simple
bounding box is computed to approximate the radius, and 2 range comparisons
are made to exclude documents. Because these comparisons are done outside of
Lucene as integer comparisons, it is pretty fast. With 13,000 results, the
search time with distance sort is about 200 msec (compared to 30 ms for a
simple non-radius, date-sorted keyword search).
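
(For the curious, the pre-filter amounts to something like this - a hedged
sketch with made-up names, using doubles where the real code uses scaled
integers:

    public class BoundingBoxFilter {
        private final double minLat, maxLat, minLon, maxLon;

        public BoundingBoxFilter(double centerLat, double centerLon,
                                 double radiusMiles) {
            double latDelta = radiusMiles / 69.0;  // ~69 miles per degree of latitude
            double lonDelta = radiusMiles
                    / (69.0 * Math.cos(Math.toRadians(centerLat)));
            minLat = centerLat - latDelta;  maxLat = centerLat + latDelta;
            minLon = centerLon - lonDelta;  maxLon = centerLon + lonDelta;
        }

        // Two cheap range tests per axis; only survivors need the exact
        // euclidean distance computed for sorting.
        public boolean mightBeInRadius(double lat, double lon) {
            return lat >= minLat && lat <= maxLat
                && lon >= minLon && lon <= maxLon;
        }
    }

The box admits a few corner documents outside the true radius, which is fine
for a filter that only has to exclude the clear misses.)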

Peter

On 1/27/07, no spam [EMAIL PROTECTED] wrote:


Isn't it extremely inefficient to do the euclidean distance twice?
Perhaps not a huge deal with a small search result set.  I at times have
13,000 results that match my search terms in an index with 1.2 million
docs.

Can't you do some simple radian math first to ensure it's way out of
bounds, then do the euclidean distance for the subset within bounds?  I'm
currently only doing the distance calc once (post hit collector). I don't
have any performance numbers for the double vs. single distance calc.

I'm still working out the sort by radius myself.

Mark

On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote:

 Daniel,
 Yes, this is correct if you happen to be doing a radius search and
sorting
 by mileage.
 Peter






Re: search on colon : ending words

2007-01-28 Thread Felix Litman
We want to be able to return a result regardless of whether users use a colon
or not in the query.  So a 'work:' query and a 'work' query should still
return the same result.

With the current parser, if a user enters 'work:' with a ':', Lucene does not
return anything :-(.  It seems to me this is a Lucene parser issue... we are
wondering if there is any simple way to make the Lucene parser ignore the ':'
in the query?

Any thoughts?

Erick Erickson [EMAIL PROTECTED] wrote: I've got to ask why you'd want to
search on colons. Why not just index the
words without colons and search without them too? Let's say you index the
word "work:". Do you really want to have a search on "work" fail?

By and large, you're better off indexing and searching without
punctuation...

Best
Erick

On 1/28/07, Felix Litman  wrote:

 Is there a simple way to turn off field-search syntax in the Lucene
 parser, and have Lucene recognize words ending in a colon ":" as search
 terms instead?

 Such words are very common occurrences for our documents (or any plain
 text), but Lucene does not seem to find them. :-(

 Thank you,
 Felix





Re: search on colon : ending words

2007-01-28 Thread Erik Hatcher


On Jan 28, 2007, at 3:47 PM, Felix Litman wrote:
We want to be able to return a result regardless of whether users use a
colon or not in the query.  So a 'work:' query and a 'work' query should
still return the same result.

With the current parser, if a user enters 'work:' with a ':', Lucene does
not return anything :-(.  It seems to me this is a Lucene parser issue...
we are wondering if there is any simple way to make the Lucene parser
ignore the ':' in the query?

Any thoughts?


What about preprocessing the query string and replacing colons with a
space?  Or perhaps escaping colons with a backslash (I believe that
works, but I haven't confirmed it lately).

Would users ever need to use fielded selectors, or QueryParser syntax
in general?  If not, then bypass QueryParser altogether, analyze the
string yourself, and build the query clauses up into a BooleanQuery.
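
That last option is less work than it sounds (a minimal sketch against the
Lucene 2.x API; the class name is made up and the field name "contents" is
just an example):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class PlainTermQueryBuilder {
        // Analyze the raw user input and OR the resulting terms together.
        // StandardAnalyzer strips the colon, so "work:" becomes the term "work".
        public static Query build(Analyzer analyzer, String field, String input)
                throws Exception {
            BooleanQuery query = new BooleanQuery();
            TokenStream ts = analyzer.tokenStream(field, new StringReader(input));
            for (Token t = ts.next(); t != null; t = ts.next()) {
                query.add(new TermQuery(new Term(field, t.termText())),
                          BooleanClause.Occur.SHOULD);
            }
            return query;
        }
    }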


Erik





Re: search on colon : ending words

2007-01-28 Thread Michael D. Curtin

Felix Litman wrote:

We want to be able to return a result regardless of whether users use a colon
or not in the query.  So a 'work:' query and a 'work' query should still
return the same result.

With the current parser, if a user enters 'work:' with a ':', Lucene does not
return anything :-(.  It seems to me this is a Lucene parser issue... we are
wondering if there is any simple way to make the Lucene parser ignore the ':'
in the query?


The StandardAnalyzer already strips out the colons from the indexed
text, so all you need to do is get rid of them in the query.  Would

    String newquery = query.replace(":", " ");

work?  It uses a space as the new text so that two query words that
happened to be separated by the colon would still be separate words ...


--MDC




Re: search on colon : ending words

2007-01-28 Thread Felix Litman

Great suggestion, and Erik's earlier one too.  Thank you.
Felix

Michael D. Curtin [EMAIL PROTECTED] wrote: Felix Litman wrote:
 We want to be able to return a result regardless of whether users use a colon
 or not in the query.  So a 'work:' query and a 'work' query should still
 return the same result.

 With the current parser, if a user enters 'work:' with a ':', Lucene does
 not return anything :-(.  It seems to me this is a Lucene parser issue...
 we are wondering if there is any simple way to make the Lucene parser
 ignore the ':' in the query?

The StandardAnalyzer already strips out the colons from the indexed
text, so all you need to do is get rid of them in the query.  Would

    String newquery = query.replace(":", " ");

work?  It uses a space as the new text so that two query words that
happened to be separated by the colon would still be separate words ...

--MDC





Re: Multiword Highlighting

2007-01-28 Thread Mark Miller
I do use the NullFragmenter now. I have no interest in the fragments at
the moment, just in showing hits on the source document. It would be
great if I could show just the real hits though. The span approach seems
to work fine for me. I have even tested the highlighting using my
sentence and paragraph proximity search queries from my query parser.
These use a modified NotSpan (I call it WithinSpan) within an unbound
NearSpan. I did a few queries that combine that structure with wildcard
and boolean queries... everything appeared to work grand -- I got all the
correct highlights. I just have to combine the highlights (spans) and
refine my code (and that color comment Otis made is something I am
interested in as well -- it would be great to have the words found in a
single SpanQuery be the same color, or a similar shade).


- Mark

markharw00d wrote:
For what it's worth, Mark (Miller), there *is* a need for "just highlight
the query terms without trying to get excerpts" functionality - something
a la Google cache (different colours... mmm, nice).

FWIW, the existing highlighter doesn't *have* to fragment - just pass
a NullFragmenter to the highlighter.
Ideally we'd have one implementation that tackles phrase support and 
preserves (optional) support for selecting fragments. I can see that 
to achieve this the existing highlighter design would need to change. 
Currently the highlighter identifies fragments first (typically using 
an implementation which arbitrarily chops text after 'n' words) and 
then selects which of these fragments have the highest density of 
high-scoring query terms. This logic would need to change to:

1) Use QuerySpansExtractor to identify all the *spans* in the document
2) Use a sliding window to select fragments, taking care to select 
fragments that wholly contain spans, rather than selecting only part 
of a span.

3) Mark up the hits.
Clearly, for people uninterested in selecting fragments, step 2 can be 
skipped.


Cheers
Mark









printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread maureen tanuwidjaja
OK, this is the printout of the stack trace while failing to index the
190,000th document:
  
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491886.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491887.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491891.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491893.xml
  Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491896.xml
  java.io.IOException: There is not enough space on the disk
  at java.io.RandomAccessFile.writeBytes(Native Method)
  at java.io.RandomAccessFile.write(Unknown Source)
  at org.apache.lucene.store.FSIndexOutput.flushBuffer(FSDirectory.java:583)
  at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
  at org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:75)
  at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:212)
  at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:169)
  at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:153)
  at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1447)
  at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:1286)
  at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1232)
  at org.apache.lucene.index.IndexWriter.maybeFlushRamSegments(IndexWriter.java:1224)
  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:652)
  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:631)
  at edu.ntu.ce.maureen.index.DocumentIndexer.addDocToIndex(DocumentIndexer.java:39)
  at edu.ntu.ce.maureen.index.DOMTraversal.fileTraverse(DOMTraversal.java:123)
  at edu.ntu.ce.maureen.index.DOMTraversal.fileTraverse(DOMTraversal.java:106)
  at edu.ntu.ce.maureen.index.DOMTraversal.main(DOMTraversal.java:133)
  java.io.IOException: There is not enough space on the disk
  
  Can anyone help?
  
  Thanks and Regards,
  Maureen
  

Re: How many documents in the biggest Lucene index to date?

2007-01-28 Thread Erik Hatcher


On Jan 26, 2007, at 2:30 PM, Otis Gospodnetic wrote:

It really all depends... right, Erik?


Ha!  Looks like I've earned a tag line around here, eh?!  :)

On the hardware you are using, complexity of queries, query
concurrency, query latency you are willing to live with, the size
of the index, etc.  A few million sounds small even for average/
cheap hw.  I have several multi-million document indices that are
constantly hammered over on Simpy.com, and we use Lucene at
Technorati to index the blogosphere, so you can imagine those
numbers.  To handle that much data, things need to be heavily
distributed, of course.


Admittedly I've not run indexes anywhere close to the numbers folks
have already mentioned on this thread.  I'm about to build my largest
index to date, at ~3.7M documents.


Erik




Otis

- Original Message 
From: Bill Taylor [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Friday, January 26, 2007 12:45:43 AM
Subject: How many documents in the biggest Lucene index to date?

I have used Lucene to index a small collection - only a few hundred
documents.  I have a potential client who wants to index a collection
which will start at about a million documents and could easily grow
to two million.

Has anyone used Lucene with an index that large?

Thank you very much.

Bill Taylor






Re: printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread Erik Hatcher


On Jan 28, 2007, at 9:15 PM, maureen tanuwidjaja wrote:
OK, this is the printout of the stack trace while failing to
index the 190,000th document


  java.io.IOException: There is not enough space on the disk

  Can anyone help?


Ummm... get more disk space?!

Erik





Re: printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread maureen tanuwidjaja
I think so... By the way, may I ask your opinion: would it be useful to
optimize, let's say, every 50,000-60,000 documents? I have a total of 660,000
docs...

Erik Hatcher [EMAIL PROTECTED] wrote:  
On Jan 28, 2007, at 9:15 PM, maureen tanuwidjaja wrote:
 OK, this is the printout of the stack trace while failing to
 index the 190,000th document

 java.io.IOException: There is not enough space on the disk

 Can anyone help?

Ummm... get more disk space?!

Erik






Re: Is the new version of the Lucene book available in any form?

2007-01-28 Thread Erik Hatcher


On Jan 26, 2007, at 1:56 PM, Bill Taylor wrote:
I notice that the Lucene book offered by Amazon was published in  
2004.  I saw some mail on the subject of a new edition.


Is the new edition available in any form?

I promise to buy the new edition as soon as it comes out even if I  
get some of the material early.  I wrote a book which was published  
by the MIT Press; I know how long it takes to get a book out.


This is a thread more suited to the Manning forum for LIA:
http://www.manning-sandbox.com/thread.jspa?forumID=152&threadID=17520


In short, LIA2 will live, that much is for sure.


Failing that, how should I learn more about the internals of Lucene?


Ask here.  Delve into the source code.  Study the unit tests.

My client has a large code base in C++.  The system has its own
index which is not all that fast.  One way to improve performance
would be to convert to the C version of Lucene.


Is HTTP communication viable for your situation?  If so, give Solr a
shot.  C++ -> HTTP -> Solr -> Lucene and back won't be "not all that
fast".  In fact, it'll be very fast.


Erik







Re: Is the new version of the Lucene book available in any form?

2007-01-28 Thread Erik Hatcher


On Jan 26, 2007, at 5:28 PM, Chris Hostetter wrote:



: LIA2 will happen, but Lucene is undergoing a lot of changes, so Erik and
: I are going to wait a little more for development to calm down
: (utopia?).

you're waiting for Lucene development to calm down? ... that could be a
long wait.


We're not exactly waiting.  I'm working night and day on Solr + Ruby  
(solrb and Flare) for various projects.


A book project, especially a 2nd edition, is an incredible
undertaking and commitment.  It is an undertaking Otis and I plan on
carving out time for in the near future, but exactly when that will
be is not worth speculating about.  Rest assured that this list will
be kept well informed of LIA2's progress.  java-user is the audience
to which we most cater.


Erik





Re: printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread Erik Hatcher


On Jan 28, 2007, at 11:23 PM, maureen tanuwidjaja wrote:
I think so... By the way, may I ask your opinion: would it be useful to
optimize, let's say, every 50,000-60,000 documents? I have a total of
660,000 docs...


Lucene automatically merges segments periodically during large
indexing runs.  Look at the parameters available on IndexWriter, and
research the best practices mentioned about those settings in this
forum's archives, the Lucene wiki, and other resources (such as
articles and Lucene in Action).  With sufficient disk space you'll
be able to tune those settings to keep the index as unsegmented as
you like while you index, and then optimize once the batch is
completed, for good measure.
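
For reference, the knobs in question look something like this (a minimal
sketch against the Lucene 2.x API; the class name, path, and values are
illustrative, not recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class TunedIndexing {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index",
                    new StandardAnalyzer(), true);  // true = create a new index
            writer.setMergeFactor(10);        // segments merged at once (default)
            writer.setMaxBufferedDocs(1000);  // docs buffered in RAM per segment
            writer.setUseCompoundFile(true);  // fewer files, slightly slower
            // ... addDocument() calls for the whole batch go here ...
            writer.optimize();  // collapse to one segment once the batch is done
            writer.close();
        }
    }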


Erik




Erik Hatcher [EMAIL PROTECTED] wrote:
On Jan 28, 2007, at 9:15 PM, maureen tanuwidjaja wrote:

OK, this is the printout of the stack trace while failing to
index the 190,000th document

java.io.IOException: There is not enough space on the disk

Can anyone help?


Ummm... get more disk space?!

Erik







