IndexWriter can't add the 10,000th document to the index

2007-01-28 Thread maureen tanuwidjaja
I finally reran the program and it stops at exactly the same place. This time the exception came out. The writer can't add the 10,000th document to the index... Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491886.xml Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491887.xml

search on colon : ending words

2007-01-28 Thread Felix Litman
Is there a simple way to turn off the field-search syntax in the Lucene parser, and have Lucene recognize words ending in a colon (:) as search terms instead? Such words are very common in our documents (or any plain text), but Lucene does not seem to find them. :-( Thank you, Felix

Re: search on colon : ending words

2007-01-28 Thread Erick Erickson
I've got to ask why you'd want to search on colons. Why not just index the words without colons and search without them too? Let's say you index the word 'work:'. Do you really want a search on 'work' to fail? By and large, you're better off indexing and searching without punctuation. Best

Re: My program stops indexing after the 10,000th document is indexed

2007-01-28 Thread Erick Erickson
Maureen: I lost the e-mail where you re-throw the exception. But you'd get a *lot* more information if you printed the stack trace via catch (Exception e) { e.printStackTrace(); throw e; } And that would allow the folks who understand Lucene to give you a LOT more help G... Best Erick On
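A minimal sketch of Erick's suggestion (the wrapping helper and the writer and doc names are hypothetical, not taken from Maureen's code):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    class IndexingHelper {
        // Print the full stack trace before re-throwing, so the underlying
        // cause (for example an IOException) shows up in the console output.
        static void addWithTrace(IndexWriter writer, Document doc) throws Exception {
            try {
                writer.addDocument(doc);
            } catch (Exception e) {
                e.printStackTrace();
                throw e;
            }
        }
    }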

Sorry, it is the 190,000th document

2007-01-28 Thread maureen tanuwidjaja
Hi... I'm sorry, I just found out and realized that it is NOT the 10,000th document that raises the exception when IndexWriter.add(Document) is called, but the 180,000 + 10,000th document, so the 190,000th document. Now I am running the program again and put the code to print the

Re: search on colon : ending words

2007-01-28 Thread Felix Litman
Yes, thank you. That would be a good solution. But we are using Lucene's StandardAnalyzer. It seems to index words with colons (:) and other punctuation by default. Is there a simple way to have the Analyzer not index colons specifically, and punctuation in general? Erick Erickson [EMAIL

Re: IndexWriter.docCount

2007-01-28 Thread karl wettin
On 28 Jan 2007, at 05:54, Doron Cohen wrote: karl wettin [EMAIL PROTECTED] wrote on 27/01/2007 13:49:24: In essence, should I return index.getDocumentsByNumber().size() - index.getDeletedDocuments().size() + unflushedDocuments.size(); or index.getDocumentsByNumber().size() +

Re: search on colon : ending words

2007-01-28 Thread Mark Miller
StandardAnalyzer should not be indexing punctuation, in my experience... instead, something like old:fart would be indexed as old and fart. QueryParser will then generate a query of old within 1 of fart for the query old:fart. This is the case for all punctuation I have run into. Things like
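A quick way to check what StandardAnalyzer actually keeps is to dump its tokens. The sketch below uses the current Lucene analysis API (tokenStream/CharTermAttribute), which differs from the 2007 API discussed in this thread; the field name "body" is arbitrary:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            try (Analyzer analyzer = new StandardAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "old:fart work:")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // expected output: old, fart, work (the colons are dropped)
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }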

Re: Multiword Highlighting

2007-01-28 Thread markharw00d
For what it's worth, Mark (Miller), there *is* a need for "just highlight the query terms without trying to get excerpts" functionality - something a la Google cache (different colours... mmm, nice). FWIW, the existing highlighter doesn't *have* to fragment - just pass a NullFragmenter to the
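A minimal sketch of the whole-document highlighting markharw00d describes, using the current org.apache.lucene.search.highlight package names (the field name, tags, and sample text are made up):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.NullFragmenter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

    public class WholeDocHighlight {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            Query query = new QueryParser("body", analyzer).parse("lucene highlighter");
            Highlighter highlighter = new Highlighter(
                    new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
            // NullFragmenter: do not break the text into excerpts; mark up the whole document.
            highlighter.setTextFragmenter(new NullFragmenter());
            String text = "The Lucene highlighter can mark every query term in the source text.";
            System.out.println(highlighter.getBestFragment(analyzer, "body", text));
        }
    }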

Re: Announcement: Lucene powering Monster job search index (Beta)

2007-01-28 Thread Peter Keegan
Correction: We only do the euclidean computation during sorting. For filtering, a simple bounding box is computed to approximate the radius, and 2 range comparisons are made to exclude documents. Because these comparisons are done outside of Lucene as integer comparisons, it is pretty fast. With
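The general shape of that filter-then-sort approach, as a sketch rather than Peter's actual code (coordinates and radius are assumed to be pre-scaled to integer units):

    final class GeoFilterSketch {
        // Cheap rejection: one integer range check per axis, no floating point.
        static boolean inBoundingBox(int x, int y, int centerX, int centerY, int radius) {
            return x >= centerX - radius && x <= centerX + radius
                && y >= centerY - radius && y <= centerY + radius;
        }

        // Exact distance, computed only for documents that survive the box test
        // and only while sorting the results.
        static double euclideanDistance(int x, int y, int centerX, int centerY) {
            long dx = (long) x - centerX;
            long dy = (long) y - centerY;
            return Math.sqrt((double) (dx * dx + dy * dy));
        }
    }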

Re: search on colon : ending words

2007-01-28 Thread Felix Litman
We want to be able to return a result regardless of whether users use a colon in the query or not. So a 'work:' query and a 'work' query should still return the same result. With the current parser, if a user enters 'work:' with a colon, Lucene does not return anything :-(. It seems to me the Lucene parser issue

Re: search on colon : ending words

2007-01-28 Thread Erik Hatcher
On Jan 28, 2007, at 3:47 PM, Felix Litman wrote: We want to be able to return a result regardless of whether users use a colon in the query or not. So a 'work:' query and a 'work' query should still return the same result. With the current parser, if a user enters 'work:' with a colon, Lucene does not return

Re: search on colon : ending words

2007-01-28 Thread Michael D. Curtin
Felix Litman wrote: We want to be able to return a result regardless of whether users use a colon in the query or not. So a 'work:' query and a 'work' query should still return the same result. With the current parser, if a user enters 'work:' with a colon, Lucene does not return anything :-(. It seems to me the
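Michael's reply is truncated above, so the following is only one common way to make 'work:' and 'work' behave the same, not necessarily his suggestion: escape the query-parser syntax characters (the colon among them) before parsing, for example with the static QueryParser.escape helper (modern package names shown):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class EscapeColonDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("body", new StandardAnalyzer());
            String userInput = "work:";
            // Escaping turns the colon into literal text instead of field syntax;
            // the analyzer then drops it, so 'work:' and 'work' parse identically.
            Query query = parser.parse(QueryParser.escape(userInput));
            System.out.println(query); // body:work
        }
    }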

Re: search on colon : ending words

2007-01-28 Thread Felix Litman
Great suggestion, and Eric's also earlier. Thank you. Felix Michael D. Curtin [EMAIL PROTECTED] wrote: Felix Litman wrote: We want to be able to return a result regardless of whether users use a colon in the query or not. So a 'work:' query and a 'work' query should still return the same result. With the

Re: Multiword Highlighting

2007-01-28 Thread Mark Miller
I do use the NullFragmenter now. I have no interest in the fragments at the moment, just in showing hits on the source document. It would be great if I could just show the real hits though. The span approach seems to work fine for me. I have even tested the highlighting using my sentence and

printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread maureen tanuwidjaja
OK, this is the printout of the stack trace while failing to index the 190,000th document Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491886.xml Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491887.xml Indexing C:\sweetpea\wikipedia_xmlfiles\part-18\491891.xml

Re: How many documents in the biggest Lucene index to date?

2007-01-28 Thread Erik Hatcher
On Jan 26, 2007, at 2:30 PM, Otis Gospodnetic wrote: It really all depends... right, Erik? Ha! Looks like I've earned a tag line around here, eh?! :) On the hardware you are using, complexity of queries, query concurrency, query latency you are willing to live with, the size of the

Re: printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread Erik Hatcher
On Jan 28, 2007, at 9:15 PM, maureen tanuwidjaja wrote: OK, this is the printout of the stack trace while failing to index the 190,000th document java.io.IOException: There is not enough space on the disk Can anyone help? Ummm get more disk space?! Erik

Re: printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread maureen tanuwidjaja
I think so... btw, may I ask your opinion: will it be useful to optimize, let's say, every 50,000-60,000 documents? I have a total of 660,000 docs... Erik Hatcher [EMAIL PROTECTED] wrote: On Jan 28, 2007, at 9:15 PM, maureen tanuwidjaja wrote: OK, this is the printout of the stack trace while failing

Re: Is the new version of the Lucene book available in any form?

2007-01-28 Thread Erik Hatcher
On Jan 26, 2007, at 1:56 PM, Bill Taylor wrote: I notice that the Lucene book offered by Amazon was published in 2004. I saw some mail on the subject of a new edition. Is the new edition available in any form? I promise to buy the new edition as soon as it comes out even if I get some of

Re: Is the new version of the Lucene book available in any form?

2007-01-28 Thread Erik Hatcher
On Jan 26, 2007, at 5:28 PM, Chris Hostetter wrote: : LIA2 will happen, but Lucene is undergoing a lot of changes, so Erik and : I are going to wait a little more for development to calm down : (utopia?). you're waiting for Lucene development to calm down? ... that could be a long

Re: printout of the stack trace while failing to index the 190,000th document

2007-01-28 Thread Erik Hatcher
On Jan 28, 2007, at 11:23 PM, maureen tanuwidjaja wrote: I think so... btw, may I ask your opinion: will it be useful to optimize, let's say, every 50,000-60,000 documents? I have a total of 660,000 docs... Lucene automatically merges segments periodically during large indexing runs. Look at
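Erik's reply is cut off, but the knobs he is most likely pointing at are the writer's flush and merge settings. A sketch with the current IndexWriterConfig API, which differs from the 2007 setters discussed in the thread (the 64 MB buffer and merge factor of 10 are illustrative values, not recommendations from the thread):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogDocMergePolicy;
    import org.apache.lucene.store.FSDirectory;

    public class IndexingSetup {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            // Buffer more documents in RAM before each flush creates a new on-disk segment.
            config.setRAMBufferSizeMB(64.0);
            // Control how aggressively on-disk segments are merged as the index grows.
            LogDocMergePolicy mergePolicy = new LogDocMergePolicy();
            mergePolicy.setMergeFactor(10);
            config.setMergePolicy(mergePolicy);
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index")), config)) {
                // addDocument() calls go here; segment merges happen automatically.
            }
        }
    }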