Hi ho peoples.
We have an application that is internationalized and stores data
from many languages (each project has its own index, mostly aligned
with a single language, maybe two).
Anyway, while looking at some thread dumps to diagnose performance
issues, I've noticed that there appears to
Hi,
Hope this helps.
Regards,
Wooi Meng
Internally Lucene deals with pure Java Strings; when writing those strings
to and reading those strings back from disk, Lucene always uses the stock
Java modified UTF-8 format, regardless of what your file.encoding
system property may be.
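Nothing Lucene-specific is needed to see that guarantee; the stock JDK exposes the same modified UTF-8 codec through DataOutputStream/DataInputStream. A minimal sketch (the class name is mine):

```java
import java.io.*;

// writeUTF always emits Java's modified UTF-8, regardless of the
// file.encoding system property -- the same guarantee Lucene relies
// on when it writes strings to the index.
public class ModifiedUtf8Demo {
    public static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        // Non-ASCII text survives the round trip byte-for-byte.
        String original = "caf\u00e9 \u65e5\u672c\u8a9e";
        System.out.println(roundTrip(original).equals(original));
    }
}
```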
Typically when people have encoding problems in their
Can you come online now?
On 2/14/07, ashwin kumar [EMAIL PROTECTED] wrote:
hi, thanks for your kind reply.
I am just trying to index some text files using lucene-2.0.0.
If you could share any sample programs for text file indexing in
lucene-2.0.0, it would be very helpful for me to
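A minimal sketch of what a text-file indexer against the Lucene 2.0.0 API might look like. The directory paths and the field names ("path", "contents") are placeholders of my own choosing, not anything Lucene requires:

```java
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TextFileIndexer {
    public static void main(String[] args) throws Exception {
        // true = create a new index in "indexDir"
        IndexWriter writer =
                new IndexWriter("indexDir", new StandardAnalyzer(), true);
        File[] files = new File("dataDir").listFiles();
        for (int i = 0; i < files.length; i++) {
            if (!files[i].getName().endsWith(".txt")) continue;
            Document doc = new Document();
            // Store the path so hits can be mapped back to files.
            doc.add(new Field("path", files[i].getPath(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Field(String, Reader) tokenizes the text but does not store it.
            doc.add(new Field("contents", new FileReader(files[i])));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}
```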
Hi All,
I'm a new user of Lucene, and I would like to use it to create a
Thesaurus.
Do you have any idea how to do this? Thanks!
Regards,
Saroj
Hi,
I think it makes sense that it returns zero records. Because you are using
BooleanClause.Occur.SHOULD for each field, the term "open" may match in any
of those fields; but when you specify the field name in your query, you
limit the search to that one field.
as stated in Lucene
Hi all,
I have read that Lucene performs caching of search results so that if
you perform the same search in succession the second result should be
returned faster. What I wanted to ask is whether this caching is any
good or whether it's a good idea to add some sort of caching layer on
top of
The usual source of this problem is HTML forms. If you want to get UTF-8
back from a form, you have to send *the form itself* to the browser in
UTF-8.
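A quick plain-JDK way to see why the page's charset matters: the same character encodes to different bytes under UTF-8 and ISO-8859-1, and the browser posts back whichever encoding the form page was served in. A sketch (class and method names are mine):

```java
import java.io.UnsupportedEncodingException;

// "\u00e9" (e-acute) is two bytes in UTF-8 but one byte in ISO-8859-1.
// If the form page went out as Latin-1, the POST body comes back as
// Latin-1 bytes, and decoding it as UTF-8 produces mojibake.
public class FormEncodingDemo {
    public static byte[] encode(String s, String charset)
            throws UnsupportedEncodingException {
        return s.getBytes(charset);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encode("\u00e9", "UTF-8").length);      // 2 bytes
        System.out.println(encode("\u00e9", "ISO-8859-1").length); // 1 byte
    }
}
```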
-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 14, 2007 3:50 AM
To:
I'm indexing books, with a significant amount of overhead in each document
and a LOT of OCR data. I'm indexing over 20,000 books and the index size is
8G. So I decided to play around with not storing some of the termvector
information and I'm shocked at how much smaller the index is. By storing
Hi,
I have created a Query that works for numerical max-min ranges, that may
work for any Field specified.
I have done that by extending Query, and creating own Weight and Scorer
subclasses as well.
So it works ... but I have problems when setting min or max boundary to 0:
In this case, those
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
My reasoning was that I do need position information since I need
to do Span
queries, but character information (WITH_OFFSETS) isn't necessary
here/now.
Am I going off a cliff here? I suppose this is really answered by
what is the
This is really an unanswerable question since, to steal a phrase, "It
depends" <g>...
Do you have any reason to believe that the current performance is inadequate
for your application? Caching is notoriously difficult to get right, so I
wouldn't go there unless there is a *demonstrated* need. By
You've made me a happy man <g>.
Thanks again.
[EMAIL PROTECTED] G.
On 2/14/07, Erik Hatcher [EMAIL PROTECTED] wrote:
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
My reasoning was that I do need position information since I need
to do Span
queries, but character information
On 14 Feb 2007, at 14:57, Kainth, Sachin wrote:
I have read that Lucene performs caching of search results so that if
you perform the same search in succession the second result should be
returned faster. What I wanted to ask is whether this caching is any
good or whether it's a good idea to add
On 14 Feb 2007, at 15:03, Erick Erickson wrote:
My reasoning was that I do need position information since I need
to do Span
queries, but character information (WITH_OFFSETS) isn't necessary
here/now.
So I thought I'd make a small test to see if this was worth
pursuing. If
omitting offsets
Erik Hatcher sez no.
Erick
On 2/14/07, karl wettin [EMAIL PROTECTED] wrote:
On 14 Feb 2007, at 15:03, Erick Erickson wrote:
My reasoning was that I do need position information since I need
to do Span
queries, but character information (WITH_OFFSETS) isn't necessary
here/now.
So I thought
Well,
I have an index with 5.2 million records (each record containing 3
fields) and it sometimes takes about a minute and a half for results to
come back. I have noticed however, that when I run the same query the
second time the result comes back faster. I just thought that this was
a bit too
As Erik stated, you don't need term vectors to do spans, but I
thought I would add a bit on the difference between positions and
offsets.
Positions are what is stored in Lucene internally (see
Token.getPositionIncrement() and TermPositions) and are usually just
consecutive integers
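The positions-vs-offsets distinction can be illustrated without Lucene at all. Here is a toy whitespace tokenizer (plain JDK, all names mine) that reports both: the position is the token's ordinal slot in the stream, while the offsets are character indexes back into the original text.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionsVsOffsets {
    // Returns "term@position[start,end)" entries for a whitespace split.
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++; // skip spaces
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++; // consume term
            if (i > start) {
                out.add(text.substring(start, i)
                        + "@" + position + "[" + start + "," + i + ")");
                position++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Positions are 0,1,... ; offsets point into the source string.
        System.out.println(tokenize("foo bar"));
    }
}
```

Span and phrase queries only need the position; offsets are only useful when you need to map a match back to characters in the original text (e.g. for highlighting).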
As Erick said, Term positions are kept regardless of whether you store
term vectors. The positional information is needed for phrase queries,
span queries, etc. You certainly don't lose the ability to use phrase
queries if you do not store term vectors. If you check out the Posting
class in
On 2/14/07, Kainth, Sachin [EMAIL PROTECTED] wrote:
I have an index with 5.2 million records (each record containing 3
fields) and it sometimes takes about a minute and a half for results to
come back.
Due to sort fields (and other factors), the first query can be slow.
Solr has built-in
Thanks for that addition, it may well be important to me (as well as
pointing up a weakness in my unit tests. Honest, I've been thinking about
explicitly testing this. Really. I'll get around to it real soon now.
Truly). We store multiple entries in the same field, think of it as
storing a
Not to get off topic, but I was curious Yonik, what does solr do if many
updates come in at a time opening and closing a writer each
update...does the first update kick off a warm operation, then before
that warm is done the second update kicks off a warm operation, and
then before that warm
My apologies to Erik...and Erick...I am horrible with names.
If I am reading Grant's email correctly, he also said you don't need to
store the Term Vectors...just that if you did store them, you can use
them with the highlighter so that you do not need to reanalyze the
text...why exactly this
It's always embarrassing when the correct unit test takes, say, 3 minutes to
write and I've engaged in all this angst that I could have dispelled all by
myself (although it is nice to have confirmation from folks in the know).
The answer is that omitting term vectors has no influence on the
On 2/14/07, Mark Miller [EMAIL PROTECTED] wrote:
Not to get off topic, but I was curious Yonik, what does solr do if many
updates come in at a time opening and closing a writer each
update...does the first update kick off a warm operation, then before
that warm is done the second update kicks
Hi,
That last thread about caching reminded me of something. My need is
actually the opposite...
I use lucene to search in hundreds/thousands of indexes. Doing a
lucene query on a set of the indexes is only one of the steps involved
in my 'queries', and some of the other steps take longer than
On 14 Feb 2007, at 17:12, jm wrote:
So my question, is it possible to disable some of the caching lucene
does so the memory consumption will be smaller (I am a bit concerned
on the memory usage side)? Or the memory savings would not pay off?
You could try to create a new Searcher for each query,
Hi guys,
I have been diving into the FieldCacheImpl code.
I have seen something in the current version:
Revision 488908, modified Wed Dec 20 03:47:09 2006 UTC by yonik
File length: 13425 byte(s)
that I wonder if it's not totally right, or if
I'm not looking at the code now, but I believe this is because those Strings
are interned, and I believe they are interned precisely so that this (faster)
comparison can be done.
Otis
Cool.
Thanks!
BTW, I have another issue here.
The array of floats for the Float cache is not initialised, which means
it will return '0.0' (the uninitialised value) both for documents that
have a '0' as the value and for documents that do not have the field at
all.
In my actual
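The ambiguity is easy to reproduce with a plain float[] (no Lucene needed): a freshly allocated array is all 0.0f, so a zero slot cannot say whether the document had the value 0 or no value at all. A sketch, with all names mine:

```java
public class FloatCacheDemo {
    // Builds a per-document value array the way a field cache might:
    // only documents that actually have the field get a slot written.
    public static float[] buildCache(int numDocs,
                                     int[] docsWithValue, float[] values) {
        float[] cache = new float[numDocs]; // every slot starts at 0.0f
        for (int i = 0; i < docsWithValue.length; i++) {
            cache[docsWithValue[i]] = values[i];
        }
        return cache;
    }

    public static void main(String[] args) {
        // Doc 1 explicitly has the value 0.0; docs 0 and 2 have no field.
        float[] cache = buildCache(3, new int[]{1}, new float[]{0.0f});
        // All three slots read 0.0f -- the cases are indistinguishable.
        System.out.println(cache[0] + " " + cache[1] + " " + cache[2]);
    }
}
```

Distinguishing the two cases needs extra state alongside the array, such as a bit set of documents that actually have the field.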
OK, final note. I wish I knew what kind of drugs I was on when I first
thought that the sizes were so much smaller. Because they weren't. I got to
thinking that gee, it's kind of weird that if you don't specify anything
for TermVector when creating a field, you get all this advanced stuff. If it
On Wednesday 14 February 2007 17:12, jm wrote:
So my question, is it possible to disable some of the caching lucene
does so the memory consumption will be smaller (I am a bit concerned
on the memory usage side)? Or the memory savings would not pay off?
You could set
There is some code in contrib with comments claiming this interning is
actually slower. I think it was the MemoryIndex? Has this ever been
discussed?
- Mark
Otis Gospodnetic wrote:
I'm not looking at the code now, but I believe this is because those Strings
are interned, and I believe they
On 14 Feb 2007, at 20:49, Mark Miller wrote:
There is some code in contrib with comments claiming this interning
is actually slower. I think it was the MemoryIndex? Has this ever
been discussed?
There is of course a cost of RAM and CPU involved with flyweighting
instances. In order to win
Here is the comment:
/*
* Note that this method signature avoids having a user call new
* o.a.l.d.Field(...) which would be much too expensive due to the
* String.intern() usage of that class.
*
* More often than not, String.intern() leads to serious performance
*
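For what the trade-off looks like in plain Java (no Lucene involved): interning lets strings be compared with ==, a single reference check, at the cost of the intern() call itself. A minimal sketch with names of my own:

```java
public class InternDemo {
    // Two equal strings, once interned, are the same object, so ==
    // (a pointer comparison) replaces the character-by-character equals().
    public static boolean sameInterned(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        // new String(...) guarantees a distinct object; interning
        // collapses it back to the canonical instance.
        System.out.println(sameInterned(new String("contents"), "contents"));
    }
}
```

Whether the faster comparisons later pay for the up-front intern() cost depends on how often each string is compared, which is presumably the disagreement between the two comments.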
I am getting this exception:
Exception in thread "main" java.io.FileNotFoundException: /index/_gna.f13 (Too many open files)
This is happening on a SLES10 (64-bit) box when trying to index 18k items.
I can run it on a much lesser SLES9 box without any issues.
Any ideas?!
Thanks,
Michael
See the wiki:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82
-Original Message-
From: Michael Prichard [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 14, 2007 5:02 PM
To: java-user@lucene.apache.org
Subject: Too many open files?!
I
That helped! Thanks!
I just added some .close() calls to a few places where I kept file
handles open and it worked quite nicely. Good lesson, make sure you
all clean up after yourselves!
Thanks,
Michael
On Feb 14, 2007, at 8:04 PM, Steven Parkes wrote:
See the wiki:
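For anyone hitting the same thing, a sketch of the close-in-finally pattern (plain JDK, assuming nothing about Michael's actual code): the handle is released even when the read throws.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class SafeRead {
    // Counts lines in a file, guaranteeing the underlying file handle
    // is closed even if readLine() throws. Leaked handles are the
    // usual cause of "Too many open files".
    public static int countLines(File f) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(f));
        try {
            int n = 0;
            while (reader.readLine() != null) n++;
            return n;
        } finally {
            reader.close(); // always release the file handle
        }
    }
}
```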