Sorting, RuleBasedCollator, and synchronization bottleneck

2007-02-14 Thread Paul Smith
Hi ho peoples. We have an application that is internationalized, and stores data from many languages (each project has its own index, mostly aligned with a single language, maybe 2). Anyway, I've noticed during some thread dumps diagnosing some performance issues, that there appears to

RE: hi sample code

2007-02-14 Thread yeohwm
Hi, Hope this helps. Regards, Wooi Meng

Re: encoding question.

2007-02-14 Thread Chris Hostetter
Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java modified UTF-8 format, regardless of what your file.encoding system property may be. Typically when people have encoding problems in their
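To make "modified UTF-8" concrete, here is a minimal sketch using only the JDK. It relies on java.io.DataOutputStream.writeUTF, which writes Java's modified UTF-8 after a 2-byte length prefix; it does not call any Lucene code, and the class and method names (ModifiedUtf8Demo, modifiedUtf8, standardUtf8) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    /** Bytes produced by DataOutputStream.writeUTF, without its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Bytes produced by the standard UTF-8 charset. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // An accented word, the NUL character, and a supplementary character
        // (written as a surrogate pair).
        String[] samples = { "caf\u00e9", "\0", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(modifiedUtf8(s).length + " bytes modified, "
                    + standardUtf8(s).length + " bytes standard");
        }
    }
}
```

The two encodings agree for ordinary text and differ only for NUL (two bytes instead of one) and for supplementary characters (two 3-byte surrogates instead of one 4-byte sequence). Neither depends on file.encoding, which is the point made above.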

Re: hi sample code

2007-02-14 Thread Saroja Kanta Maharana
Can you come online now? On 2/14/07, ashwin kumar [EMAIL PROTECTED] wrote: Hi, thanks for your kind reply. I am just trying to index some text files using lucene-2.0.0. If you can share any sample programs for text file indexing in lucene-2.0.0, it will be very helpful for me to

Help me in Thesaurus implementation using lucene

2007-02-14 Thread Saroja Kanta Maharana
Hi All, I'm a new user of Lucene, and I would like to use it to create a Thesaurus. Do you have any ideas on how to do this? Thanks! Regards, Saroj

Re: Multiple field search

2007-02-14 Thread Mohammad Norouzi
Hi, I think it makes sense that it returns zero records, because you are using BooleanClause.Occur.SHOULD for each field; it means the term open should occur in all fields. But when you specify the field name in your query, you limit searching to that field. As stated in Lucene

Caching

2007-02-14 Thread Kainth, Sachin
Hi all, I have read that Lucene performs caching of search results so that if you perform the same search in succession the second result should be returned faster. What I wanted to ask is whether this caching is any good or whether it's a good idea to add some sort of caching layer on top of

RE: encoding question.

2007-02-14 Thread Benson Margulies
The usual source of this problem is HTML forms. If you want to get UTF-8 back from a form, you have to send \the form itself/ to the browser in UTF-8. -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 14, 2007 3:50 AM To:

Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
I'm indexing books, with a significant amount of overhead in each document and a LOT of OCR data. I'm indexing over 20,000 books and the index size is 8G. So I decided to play around with not storing some of the termvector information and I'm shocked at how much smaller the index is. By storing

Extending Query, Weight, Scorer

2007-02-14 Thread poeta simbolista
Hi, I have created a Query that works for numerical max-min ranges, and that may work for any Field specified. I have done that by extending Query, and creating my own Weight and Scorer subclasses as well. So it works ... but I have problems when setting the min or max boundary to 0: In this case, those

Re: Omitting TermVector info and index size

2007-02-14 Thread Erik Hatcher
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote: My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. 1 Am I going off a cliff here? I suppose this is really answered by 2 what is the

Re: Caching

2007-02-14 Thread Erick Erickson
This is really an unanswerable question, since, to steal a phrase, It depends G... Do you have any reason to believe that the current performance is inadequate for your application? Caching is notoriously difficult to get right, so I wouldn't go there unless there is a *demonstrated* need. By

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
You've made me a happy man G. Thanks again. [EMAIL PROTECTED] G. On 2/14/07, Erik Hatcher [EMAIL PROTECTED] wrote: On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote: My reasoning was that I do need position information since I need to do Span queries, but character information

Re: Caching

2007-02-14 Thread karl wettin
14 feb 2007 kl. 14.57 skrev Kainth, Sachin: I have read that Lucene performs caching of search results so that if you perform the same search in succession the second result should be returned faster. What I wanted to ask is whether this caching is any good or whether it's a good idea to add

Re: Omitting TermVector info and index size

2007-02-14 Thread karl wettin
14 feb 2007 kl. 15.03 skrev Erick Erickson: My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. So I thought I'd make a small test to see if this was worth pursuing. If omitting offsets

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
Erik Hatcher sez no. Erick On 2/14/07, karl wettin [EMAIL PROTECTED] wrote: 14 feb 2007 kl. 15.03 skrev Erick Erickson: My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. So I thought

RE: Caching

2007-02-14 Thread Kainth, Sachin
Well, I have an index with 5.2 million records (each record containing 3 fields) and it sometimes takes about a minute and a half for results to come back. I have noticed however, that when I run the same query the second time the result comes back faster. I just thought that this was a bit too

Re: Omitting TermVector info and index size

2007-02-14 Thread Grant Ingersoll
As Erik stated, you don't need term vectors to do spans, but I thought I would add a bit on the difference between positions and offsets. Positions are what is stored in Lucene internally (see Token.getPositionIncrement() and TermPositions) and are usually just consecutive integers
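The difference between positions and offsets can be sketched in plain Java. This is only an illustration under assumptions: it is a toy whitespace tokenizer, not Lucene's analysis API, and the names PositionsVsOffsets, Tok, and tokenize are hypothetical. A position is the token's ordinal in the stream; offsets are character indexes into the original text.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionsVsOffsets {

    /** A token with its ordinal position and its character offsets in the source text. */
    static final class Tok {
        final String text;
        final int position;     // 0, 1, 2, ... one step per token
        final int startOffset;  // index of the first character
        final int endOffset;    // index one past the last character

        Tok(String text, int position, int startOffset, int endOffset) {
            this.text = text;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on whitespace, recording both kinds of location for every token. */
    static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<>();
        int pos = 0;
        int i = 0;
        int n = s.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;   // skip gaps
            int start = i;
            while (i < n && !Character.isWhitespace(s.charAt(i))) i++;  // consume a token
            if (i > start) {
                out.add(new Tok(s.substring(start, i), pos++, start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("quick  brown fox")) {
            System.out.println(t.text + " position=" + t.position
                    + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

Phrase and span matching only need the position column; the offset columns matter when you want to map a hit back to characters in the original text, for example when highlighting.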

Re: Omitting TermVector info and index size

2007-02-14 Thread Mark Miller
As Erick said, Term positions are kept regardless of whether you store term vectors. The positional information is needed for phrase queries, span queries, etc. You certainly don't lose the ability to use phrase queries if you do not store term vectors. If you check out the Posting class in

Re: Caching

2007-02-14 Thread Yonik Seeley
On 2/14/07, Kainth, Sachin [EMAIL PROTECTED] wrote: I have an index with 5.2 million records (each record containing 3 fields) and it sometimes takes about a minute and a half for results to come back. Due to sort fields (and other factors), the first query can be slow. Solr has built-in

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
Thanks for that addition, it may well be important to me (as well as pointing up a weakness in my unit tests. Honest, I've been thinking about explicitly testing this. Really. I'll get around to it real soon now. Truly). We store multiple entries in the same field, think of it as storing a

Re: Caching

2007-02-14 Thread Mark Miller
Not to get off topic, but I was curious Yonik, what does Solr do if many updates come in at a time, opening and closing a writer each update...does the first update kick off a warm operation, then before that warm is done the second update kicks off a warm operation, and then before that warm

Re: Omitting TermVector info and index size

2007-02-14 Thread Mark Miller
My apologies to Erik...and Erick...I am horrible with names. If I am reading Grant's email correctly, he also said you don't need to store the Term Vectors...just that if you did store them, you can use them with the highlighter so that you do not need to reanalyze the text...why exactly this

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
It's always embarrassing when the correct unit test takes, say, 3 minutes to write and I've engaged in all this angst that I could have dispelled all by myself (although it is nice to have confirmation from folks in the know). The answer is that omitting term vectors has no influence on the

Re: Caching

2007-02-14 Thread Yonik Seeley
On 2/14/07, Mark Miller [EMAIL PROTECTED] wrote: Not to get off topic, but I was curious Yonik, what does Solr do if many updates come in at a time, opening and closing a writer each update...does the first update kick off a warm operation, then before that warm is done the second update kicks

possible to disable internal caching?

2007-02-14 Thread jm
Hi, That last thread about caching reminded me of something. My need is actually the opposite... I use Lucene to search in hundreds/thousands of indexes. Doing a Lucene query on a set of the indexes is only one of the steps involved in my 'queries', and some of the other steps take longer than

Re: possible to disable internal caching?

2007-02-14 Thread karl wettin
14 feb 2007 kl. 17.12 skrev jm: So my question, is it possible to disable some of the caching lucene does so the memory consumption will be smaller (I am a bit concerned on the memory usage side)? Or the memory savings would not pay off? You could try to create a new Searcher for each query,

FieldCacheImpl mistake?

2007-02-14 Thread poeta simbolista
Hi guys, I have been diving into the FieldCacheImpl code. I have seen something in the current version (Revision 488908, modified Wed Dec 20 03:47:09 2006 UTC, 8 weeks ago, by yonik, file length 13425 bytes) that I wonder if it's not totally right, or if

Re: FieldCacheImpl mistake?

2007-02-14 Thread Otis Gospodnetic
I'm not looking at the code now, but I believe this is because those Strings are interned, and I believe they are interned precisely so that this (faster) comparison can be done. Otis -- Simpy -- http://www.simpy.com/
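For readers unfamiliar with interning, a small JDK-only sketch of why it makes reference comparison safe. Nothing here is taken from FieldCacheImpl; the class name InternDemo and the helper sameReference are hypothetical.

```java
public class InternDemo {

    /** Reference comparison: true only if both variables point at the same object. */
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "title";             // string literals are interned by the JVM
        String built = new String("title");   // a distinct object with equal contents

        System.out.println(built.equals(literal));                  // contents match
        System.out.println(sameReference(built, literal));          // different objects
        System.out.println(sameReference(built.intern(), literal)); // canonical copy
    }
}
```

Once both sides have gone through intern(), equal strings are guaranteed to be the same object, so == gives the same answer as equals() without scanning characters. The trade-off is the cost of the intern() call itself, which is what the later messages in this thread discuss.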

Re: FieldCacheImpl mistake?

2007-02-14 Thread poeta simbolista
Cool. Thanks! BTW, I have another issue here. The array of floats for the Float cache is not initialised. Which means that it will return '0.0' (not initialised) as the value for those documents that have a '0' as the value, as well as for those ones that do not have the field. In my actual
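The ambiguity described here follows from Java's array defaults: a new float[] is filled with 0.0f. The sketch below shows the problem and one possible workaround, a sentinel fill value. It is an assumption for illustration only, not how FieldCacheImpl is written; FloatCacheDemo, buildCache, and the choice of Float.NaN as the sentinel are all hypothetical.

```java
import java.util.Arrays;

public class FloatCacheDemo {

    /**
     * Builds a per-document float array. Documents listed in docIds get the
     * matching entry from values; every other slot holds the 'missing' marker.
     */
    static float[] buildCache(int maxDoc, int[] docIds, float[] values, float missing) {
        float[] cache = new float[maxDoc];   // Java fills this with 0.0f
        Arrays.fill(cache, missing);         // overwrite the default with the marker
        for (int i = 0; i < docIds.length; i++) {
            cache[docIds[i]] = values[i];
        }
        return cache;
    }

    public static void main(String[] args) {
        // Doc 0 really has the value 0; docs 1 and 2 have no value for the field.
        float[] ambiguous = buildCache(3, new int[] {0}, new float[] {0f}, 0f);
        float[] marked = buildCache(3, new int[] {0}, new float[] {0f}, Float.NaN);

        System.out.println(ambiguous[0] == ambiguous[1]);   // cannot tell them apart
        System.out.println(Float.isNaN(marked[0]));         // a real zero
        System.out.println(Float.isNaN(marked[1]));         // no value
    }
}
```

A sentinel only works if the marker can never be a legitimate field value; a parallel boolean array or bit set is the alternative when every float is possible.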

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
OK, final note. I wish I knew what kind of drugs I was on when I first thought that the sizes were so much smaller. Because they weren't. I got to thinking that gee, it's kind of weird that if you don't specify anything for TermVector when creating a field, you get all this advanced stuff. If it

Re: possible to disable internal caching?

2007-02-14 Thread Daniel Naber
On Wednesday 14 February 2007 17:12, jm wrote: So my question, is it possible to disable some of the caching lucene does so the memory consumption will be smaller (I am a bit concerned on the memory usage side)? Or the memory savings would not pay off? You could set

Re: FieldCacheImpl mistake?

2007-02-14 Thread Mark Miller
There is some code in contrib with comments claiming this interning is actually slower. I think it was the MemoryIndex? Has this ever been discussed? - Mark Otis Gospodnetic wrote: I'm not looking at the code now, but I believe this is because those Strings are interned, and I believe they

Re: FieldCacheImpl mistake?

2007-02-14 Thread karl wettin
14 feb 2007 kl. 20.49 skrev Mark Miller: There is some code in contrib with comments claiming this interning is actually slower. I think it was the MemoryIndex? Has this ever been discussed? There is of course a cost of RAM and CPU involved with flyweighting instances. In order to win

Re: FieldCacheImpl mistake?

2007-02-14 Thread Mark Miller
Here is the comment: /* * Note that this method signature avoids having a user call new * o.a.l.d.Field(...) which would be much too expensive due to the * String.intern() usage of that class. * * More often than not, String.intern() leads to serious performance *

Too many open files?!

2007-02-14 Thread Michael Prichard
I am getting this exception: Exception in thread "main" java.io.FileNotFoundException: /index/_gna.f13 (Too many open files) This is happening on a SLES10 (64-bit) box when trying to index 18k items. I can run it on a much lesser SLES9 box without any issues. Any ideas?! Thanks, Michael

RE: Too many open files?!

2007-02-14 Thread Steven Parkes
See the wiki: http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82 -Original Message- From: Michael Prichard [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 14, 2007 5:02 PM To: java-user@lucene.apache.org Subject: Too many open files?! I

Re: Too many open files?!

2007-02-14 Thread Michael Prichard
That helped! Thanks! I just added some .close() calls to a few places where I kept file handles open and it worked quite nicely. Good lesson, make sure you all clean up after yourselves! Thanks, Michael On Feb 14, 2007, at 8:04 PM, Steven Parkes wrote: See the wiki:
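A minimal sketch of the "clean up after yourself" lesson, assuming the leaked handles were ordinary file readers opened while feeding documents to the indexer (the message does not say exactly which handles were left open). It uses try-with-resources, which postdates this thread; a try/finally with an explicit close() does the same job on older JDKs. The names CloseHandlesDemo, writeTemp, and countLines are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class CloseHandlesDemo {

    /** Writes the given lines to a fresh temporary file and returns its path. */
    static Path writeTemp(String... lines) {
        try {
            Path file = Files.createTempFile("doc", ".txt");
            Files.write(file, Arrays.asList(lines), StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a file; the reader is closed on every exit path, including exceptions. */
    static long countLines(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            long n = 0;
            while (reader.readLine() != null) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Processing many files in a loop stays within the descriptor limit
        // because each handle is released before the next one is opened.
        for (int i = 0; i < 1000; i++) {
            Path file = writeTemp("first line", "second line");
            countLines(file);
            file.toFile().delete();
        }
        System.out.println("done");
    }
}
```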