Re: how to estimate how much memory is required to support the large index search

2008-11-18 Thread Michael McCandless


BTW, upcoming changes in Lucene for flexible indexing should improve  
the RAM usage of the terms index substantially:


https://issues.apache.org/jira/browse/LUCENE-1458

In the current (first) iteration on that patch, TermInfo is no longer  
used at all when loading the index.  I think for a typical index this  
will likely cut in half the RAM used by the terms index.


But... this won't be available for some time (it's still a work in  
progress).


Mike

Chris Lu wrote:

So looks like you are not really doing much sorting? This index divisor
affects reader.terms(), but not too much with sorting.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!


On Mon, Nov 17, 2008 at 6:21 PM, Zhibin Mai [EMAIL PROTECTED] wrote:

It is a cache tuning setting in IndexReader. It can be set via the method
setTermInfosIndexDivisor(int).
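
A minimal sketch of how that setting is applied (the index path is just a placeholder; the divisor needs to be set before the terms index is first used):

IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/path/to/index"));
// A divisor of 4 loads only every 4th indexed term into RAM, cutting the
// terms-index cache roughly 4x at the cost of slightly slower term lookups.
reader.setTermInfosIndexDivisor(4);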

Thanks,

Zhibin





From: Chris Lu [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, November 17, 2008 7:07:21 PM
Subject: Re: how to estimate how much memory is required to support the large index search

Calculation looks right. But what's the Index divisor that you  
mentioned?


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:

http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Mon, Nov 17, 2008 at 5:00 PM, Zhibin Mai [EMAIL PROTECTED] wrote:


Aleksander,

I figured out that most of the heap was consumed by the Term cache. In our
case, the index has 233 million terms and 6.4 million of them were loaded
into the cache when we did the search. I did a rough calculation of how much
memory each term needs; it is about
16 bytes for the Term object + 32 bytes for the TermInfo object + 24 bytes
for the String object for the term text + 2 * length(char[]) for the term text.

In our case, the average length of the term text is 25 characters, which means
each term needs at least 122 bytes. The cache for 6.4 million terms needs
6.4M * 122 bytes = ~780MB. Plus 200MB for caching norms, the RAM for the cache
is larger than 980MB. We work around the cache issue for Terms by setting the
index divisor of the IndexReader to a higher value. Actually, the performance
of search is good even using an index divisor of 4.
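
As a quick back-of-the-envelope check of those numbers (using the rough per-object estimates from this thread, not measured values):

int bytesPerTerm = 16 /* Term */ + 32 /* TermInfo */ + 24 /* String */ + 2 * 25 /* char[] text */;  // = 122
long termsCacheBytes = 6400000L * bytesPerTerm;           // ~780 MB
long totalBytes = termsCacheBytes + 200L * 1024 * 1024;   // + ~200 MB of norms, roughly 980 MB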

Thanks,

Zhibin





From: Aleksander M. Stensby [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, November 17, 2008 2:31:04 AM
Subject: Re: how to estimate how much memory is required to support the large index search

One major factor that may result in heap space problems is if you are doing
any form of sorting when searching. Do you have any form of default sort in
your application? Also, the type of field used for sorting is important with
regard to memory consumption.

This issue has been discussed before on the list. (You can search the
archive for sorting and memory consumption.)

- Aleksander

On Sun, 16 Nov 2008 14:36:36 +0100, Zhibin Mai [EMAIL PROTECTED]  
wrote:



Hello,

I am a beginner at using Lucene. We developed an application to create and
search an index using Lucene 2.3.1. We would like to know how to estimate
how much memory is required to support searching a given index.

Recently, the size of the index has reached about 200GB with 197M documents
and 223M terms. Our application has started having intermittent
OutOfMemoryError: Java heap space errors when we use it to search the index.
We used JProfiler to get the following memory allocation when we do one
keyword search:


char[]: 332MB
org.apache.lucene.index.TermInfo: 194MB
java.lang.String: 146MB
org.apache.lucene.index.Term: 99,823KB
org.apache.lucene.index.Term: 24,956KB
org.apache.lucene.index.TermInfo[]: 24,956KB

byte[]: 188MB
long[]: 49,912KB

The memory allocation for the first 6 types of objects does not change when
we change the search criteria. Could you please give me some advice on what
major factors affect the memory allocation, and precisely how those factors
affect the memory usage during search? Is it possible to reduce the memory
usage on search?



Thank you,


Zhibin







--Aleksander M. Stensby
Senior software 

Re: Lucene 2.4 Token Stream error

2008-11-18 Thread Michael McCandless


Can you post the code fragment in AccentFilter.java that's setting the  
Token?


In 2.4 we added that check (for IllegalArgumentException) to ensure  
you don't setTermLength to something longer than the current term  
buffer.  You should call resizeTermBuffer() first, then fill in the  
char[] for the token, then call setTermLength.
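
A minimal sketch of that pattern for a 2.4-style filter; the class name and the fold() helper here are made up for illustration, not the real AccentFilter:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class MyAccentFilter extends TokenFilter {

  public MyAccentFilter(TokenStream input) {
    super(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token == null) return null;

    char[] mapped = fold(token.termBuffer(), token.termLength());
    // Grow the term buffer *before* setting a longer length; in 2.4
    // setTermLength() throws IllegalArgumentException otherwise.
    char[] buffer = token.resizeTermBuffer(mapped.length);
    System.arraycopy(mapped, 0, buffer, 0, mapped.length);
    token.setTermLength(mapped.length);
    return token;
  }

  // Stand-in for whatever accent mapping the real filter does.
  private char[] fold(char[] buf, int len) {
    return new String(buf, 0, len).toCharArray();
  }
}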


Mike

bhupesh bansal wrote:



Hey folks,

I saw this error in my code base after upgrading lucene-2.4 from  
lucene 2.3.

have folks seen this before and any idea ?? is it related to fix of
https://issues.apache.org/jira/browse/LUCENE-1333

java.lang.IllegalArgumentException: length 11 exceeds the size of the termBuffer (10)
   at org.apache.lucene.analysis.Token.setTermLength(Token.java:526)
   at com.linkedin.search.pub.stemming.impl.filter.AccentFilter.next(AccentFilter.java:42)
   at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:34)
   at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
   at com.linkedin.search.pub.stemming.impl.filter.PushbackFilter.next(PushbackFilter.java:52)
   at com.linkedin.search.pub.stemming.impl.filter.rewrite.RewriteFilter.next(RewriteFilter.java:58)
   at com.linkedin.search.pub.stemming.impl.filter.rewrite.RewriteFilter.next(RewriteFilter.java:70)
   at com.linkedin.search.pub.stemming.impl.filter.rewrite.RewriteFilter.next(RewriteFilter.java:39)
   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
   at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)

--
View this message in context: 
http://www.nabble.com/Lucene-2.4-Token-Stream-error-tp20550488p20550488.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reopen IndexReader

2008-11-18 Thread Michael McCandless


Did you create your IndexSearcher using a String or File (not  
Directory)?


If so, it sounds like you are hitting this issue (just fixed this  
morning, on 2.9-dev (trunk)):


https://issues.apache.org/jira/browse/LUCENE-1453

The workaround is to use the Directory ctor of IndexSearcher.
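
For example, a minimal sketch of the workaround (the index path is a placeholder):

Directory dir = FSDirectory.getDirectory("/path/to/index");
IndexSearcher searcher = new IndexSearcher(dir);

IndexReader reader = searcher.getIndexReader();
IndexReader newReader = reader.reopen();
if (newReader != reader) {
  reader.close();
  searcher = new IndexSearcher(newReader);
}
boolean isCurrent = newReader.isCurrent();  // no AlreadyClosedException here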

Mike

Ganesh wrote:


Hello all,

I am using version 2.4. The following code throws  
AlreadyClosedException


  IndexReader reader = searcher.getIndexReader();
  IndexReader newReader = reader.reopen();
  if (reader != newReader) {
      reader.close();
      boolean isCurrent = newReader.isCurrent(); // throws exception
  }

Full list of exception:

org.apache.lucene.store.AlreadyClosedException: this Directory is closed
  at org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
  at org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
  at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
  at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
  at MailIndexer.IndexSearcherEx.reOpenDB(IndexSearcherEx.java:102)


Please correct me, if i am wrong.

Regards
Ganesh

Send instant messages to your online friends http://in.messenger.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Sascha Fahl

Hi,
what is the best way to transform the German umlauts ö,ä,ü,ß into oe, ae,
ue, ss during analysis?


Thanks,


Sascha Fahl
Softwareentwicklung

evenity GmbH
Zu den Mühlen 19
D-35390 Gießen

Mail: [EMAIL PROTECTED]









-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Uwe Goetzke
Use ISOLatin1AccentFilter, although it is not perfect...
So I made ISOLatin2AccentFilter for me and changed this method.
We use our own analysers, so you would use something like this:

result = new 
org.apache.lucene.analysis.WhitespaceTokenizer(reader);
result = new ISOLatin2AccentFilter(result);
result = new org.apache.lucene.analysis.LowerCaseFilter(result);


/**
 * To replace accented characters in a String by unaccented equivalents.
 */
public final static String removeAccents(String input) {
    final StringBuffer output = new StringBuffer();
    for (int i = 0; i < input.length(); i++) {
        switch (input.charAt(i)) {
            case '\u00C0' : // À
            case '\u00C1' : // Á
            case '\u00C2' : // Â
            case '\u00C3' : // Ã
            case '\u00C5' : // Å
                output.append("A");
                break;
            case '\u00C4' : // Ä
            case '\u00C6' : // Æ
                output.append("AE");
                break;
            case '\u00C7' : // Ç
                output.append("C");
                break;
            case '\u00C8' : // È
            case '\u00C9' : // É
            case '\u00CA' : // Ê
            case '\u00CB' : // Ë
                output.append("E");
                break;
            case '\u00CC' : // Ì
            case '\u00CD' : // Í
            case '\u00CE' : // Î
            case '\u00CF' : // Ï
                output.append("I");
                break;
            case '\u00D0' : // Ð
                output.append("D");
                break;
            case '\u00D1' : // Ñ
                output.append("N");
                break;
            case '\u00D2' : // Ò
            case '\u00D3' : // Ó
            case '\u00D4' : // Ô
            case '\u00D5' : // Õ
            case '\u00D8' : // Ø
                output.append("O");
                break;
            case '\u00D6' : // Ö
            case '\u0152' : // Œ
                output.append("OE");
                break;
            case '\u00DE' : // Þ
                output.append("TH");
                break;
            case '\u00D9' : // Ù
            case '\u00DA' : // Ú
            case '\u00DB' : // Û
                output.append("U");
                break;
            case '\u00DC' : // Ü
                output.append("UE");
                break;
            case '\u00DD' : // Ý
            case '\u0178' : // Ÿ
                output.append("Y");
                break;
            case '\u00E0' : // à
            case '\u00E1' : // á
            case '\u00E2' : // â
            case '\u00E3' : // ã
            case '\u00E5' : // å
                output.append("a");
                break;
            case '\u00E4' : // ä
            case '\u00E6' : // æ
                output.append("ae");
                break;
            case '\u00E7' : // ç
                output.append("c");
                break;
            case '\u00E8' : // è
            case '\u00E9' : // é
            case '\u00EA' : // ê
            case '\u00EB' : // ë
                output.append("e");
                break;
            case '\u00EC' : // ì
            case '\u00ED' : // í

Re: Reopen IndexReader

2008-11-18 Thread Michael McCandless


Well... we certainly do our best to have each release be stable, but  
we do make mistakes, so you'll have to use your own judgement on when  
to upgrade.


However, it's only through users like yourself upgrading that we then
find and fix any uncaught issues in each new release.


Mike

Ganesh wrote:

I am creating the IndexSearcher using a String; this works fine with
version 2.3.2.
I tried replacing that with the Directory ctor of IndexSearcher and it is
working fine with v2.4.


I have recently upgraded from v2.3.2 to 2.4. Is v2.4 stable, and can I
move forward with it, or shall I revert back to 2.3.2?


Regards
Ganesh


- Original Message - From: Michael McCandless [EMAIL PROTECTED] 


To: java-user@lucene.apache.org
Sent: Tuesday, November 18, 2008 4:59 PM
Subject: Re: Reopen IndexReader




Did you create your IndexSearcher using a String or File (not   
Directory)?


If so, it sounds like you are hitting this issue (just fixed this  
morning, on 2.9-dev (trunk)):


   https://issues.apache.org/jira/browse/LUCENE-1453

The workaround is to use the Directory ctor of IndexSearcher.

Mike

Ganesh wrote:


Hello all,

I am using version 2.4. The following code throws   
AlreadyClosedException


 IndexReader reader = searcher.getIndexReader();
 IndexReader newReader =  reader.reopen();
 if (reader != newReader) {
 reader.close();
 boolean isCurrent = newReader.isCurrent(); //throws   
exception

 }

Full list of exception:

org.apache.lucene.store.AlreadyClosedException: this Directory is closed
 at org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
 at org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
 at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
 at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
 at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
 at MailIndexer.IndexSearcherEx.reOpenDB(IndexSearcherEx.java:102)


Please correct me, if i am wrong.

Regards
Ganesh

Send instant messages to your online friends http://in.messenger.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Send instant messages to your online friends http://in.messenger.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Koji Sekiguchi

Uwe Goetzke wrote:
 Use ISOLatin1AccentFilter, although it is not perfect...
 So I made ISOLatin2AccentFilter for me and changed this method.

Or use CharFilter library. It is for Solr as of now, though.

See:
https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
https://issues.apache.org/jira/browse/SOLR-822

Koji


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Sascha Fahl

Where do I get the CharFilter library? I'm using Lucene, not Solr.

Thanks,
Sascha

Am 18.11.2008 um 14:11 schrieb Koji Sekiguchi:


Uwe Goetzke wrote:
 Use ISOLatin1AccentFilter, although it is not perfect...
 So I made ISOLatin2AccentFilter for me and changed this method.

Or use CharFilter library. It is for Solr as of now, though.

See:
https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
https://issues.apache.org/jira/browse/SOLR-822

Koji


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Sascha Fahl
Softwareentwicklung

evenity GmbH
Zu den Mühlen 19
D-35390 Gießen

Mail: [EMAIL PROTECTED]









-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to estimate how much memory is required to support the large index search

2008-11-18 Thread Zhibin Mai
You are right.

Cheers,

Zhibin





From: Chris Lu [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, November 17, 2008 11:13:44 PM
Subject: Re: how to estimate how much memory is required to support the large 
index search

So looks like you are not really doing much sorting? This index divisor
affects reader.terms(), but not too much with sorting.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!


On Mon, Nov 17, 2008 at 6:21 PM, Zhibin Mai [EMAIL PROTECTED] wrote:

 It is a cache tunning setting in IndexReader. It can be set via method
 setTermInfosIndexDivisor(int).

 Thanks,

 Zhibin




 
 From: Chris Lu [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Monday, November 17, 2008 7:07:21 PM
 Subject: Re: how to estimate how much memory is required to support the
 large index search

 Calculation looks right. But what's the Index divisor that you mentioned?

 --
 Chris Lu
 -
 Instant Scalable Full-Text Search On Any Database/Application
 site: http://www.dbsight.net
 demo: http://search.dbsight.com
 Lucene Database Search in 3 minutes:

 http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
 DBSight customer, a shopping comparison site, (anonymous per request) got
 2.6 Million Euro funding!

 On Mon, Nov 17, 2008 at 5:00 PM, Zhibin Mai [EMAIL PROTECTED] wrote:

  Aleksander,
 
  I figured it out that most of heap was consumed by the Term cache. In our
  case, the index has 233 millions of terms and 6.4 millions of them were
  loaded into the cache when we did the search. I roughly did a calculation
  that each term will need how much memory, it is about
  16 bytes for Term Object + 32 bytes for TermInfo Object + 24 bytes for
  String Object for term text + 2 * length(Char[]) for term text.
 
  In our case, the average length of term text is 25 characters, that means
  each term needs at least 122 bytes. The cache for 6.4 millions of terms
  needs 6.4 * 122 = 780MB. Plus 200MB for caching norm, the RAM for cache
 is
  larger than 980MB. We work around the cache issue for Terms by setting
 index
  divisor of the IndexReader to a higher value. Actually, the performance
 of
  search is good even using index divisor as 4.
 
  Thanks,
 
  Zhibin
 
 
 
 
  
  From: Aleksander M. Stensby [EMAIL PROTECTED]
  To: java-user@lucene.apache.org
  Sent: Monday, November 17, 2008 2:31:04 AM
  Subject: Re: how to estimate how much memory is required to support the
  large index search
 
  One major factor that may result in heap space problems is if you are
 doing
  any form of sorting when searching. Do you have any form of default sort
 in
  your application? Also, the type of field used for sorting is important
 with
  regard to memory consumption.
 
  This issue has been discussed before on the list. (You can search the
  archive for sorting and memory consumption.)
 
  - Aleksander
 
  On Sun, 16 Nov 2008 14:36:36 +0100, Zhibin Mai [EMAIL PROTECTED] wrote:
 
   Hello,
  
   I
   am a beginner on using lucene. We developed an application to
   create and search index using lucene 2.3.1. We would like to know how
   to estimate how much memory is required to support
   the index search given an index.
  
   Recently,
   the size of the index has reached to about 200GB with 197M of documents
   and 223M of terms. Our application starts having intermittent
   OutOfMemoryError: Java heap space when we use
   it to search the index. We use JProfiler to get the following memory
  allocation when we do one keyword search:
  
   char[]: 332MB
   org.apache.lucene.index.TermInfo: 194MB
   java.lang.String: 146MB
   org.apache.lucene.index.Term: 99,823KB
   org.apache.lucene.index.Term: 24,956KB
   org.apache.lucene.index.TermInfo[]: 24,956KB
  
   byte[]: 188MB
   long[]: 49,912KB
  
   The memory allocation for the first 6 types of objects does not change
  when we change the search criteria. Could you please give me some advice
  what major factors will affect the memory allocation
   and how those factors will affect the memory usage precisely on search?
  Is it possible to reduce the memory usage on search?
  
  
   Thank you,
  
  
   Zhibin
  
  
  
 
 
 
  --Aleksander M. Stensby
  Senior software developer
  Integrasco A/S
  www.integrasco.no
 
  

Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Koji Sekiguchi

Sascha Fahl wrote:

Where do I get the CharFilter library? I'm using Lucene, not Solr.

Thanks,
Sascha

CharFilter is included in recent Solr nightly builds.
It is not an OOTB solution for Lucene right now, sorry.
If I have time, I will make it work for Lucene this weekend.

Koji



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Teruhiko Kurosaka
Naming this class to include Latin2 may be misleading:
Latin2 refers to the ISO-8859-2 character set.

http://en.wikipedia.org/wiki/ISO_8859-2


 From: Uwe Goetzke [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, November 18, 2008 7:26 AM
 To: java-user@lucene.apache.org
 Cc: [EMAIL PROTECTED]
 Subject: AW: Transforming german umlaute like ö,ä,ü,ß into 
 oe, ae, ue, ss
 
 Use ISOLatin1AccentFilter, although it is not perfect...
 So I made ISOLatin2AccentFilter for me and changed this method.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Special characters prevent entity being indexed

2008-11-18 Thread Pekka Nykyri

Hi!

I'm having problems with entities including special characters (Spanish 
language) not getting indexed.


I haven't been able to find the reason why some entities get indexed
while some don't.


I have 3 fields that (currently) hold the same value. The value for the
fields is, for example, "¡Fantástico!- blaaba". Then when I change ONE of the
three values to "¡Fantástico! - blaaba", the entity gets indexed. So
changing only one field makes it index.


But the bigger problem is that I have an almost identical entity (the other
fields are almost the same and I don't think they cause the problem), with
exactly the same three "¡Fantástico!- blaaba" fields, and it gets indexed
normally. Even though the critical fields are exactly the same.


And also, entities where all three fields start with an upside-down ?-mark
don't get indexed.


I'm really confused by this problem because I can't seem to find any logic
to which entities don't get indexed, even though they are similar to others
that do. And changing only one value of the three makes it index.


Sorry for a really messy message but I just can't explain it more clearly 
now.


Thanks in advance,
pn
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Special characters prevent entity being indexed

2008-11-18 Thread Erick Erickson
What analyzer are you using at index and search time? Typical problems
include:
- using an analyzer that doesn't understand accented chars (StandardAnalyzer,
for instance)
- using a different analyzer during search than during indexing.

Search the user list for "accent" and you'll find this kind of problem
discussed, and if that doesn't help we need to know what analyzers you are
using and what behavior you really want. Typically, for instance, *requiring*
a user to type the upside-down exclamation point to get a match on this field
would be considered incorrect.

Also, you'd be helped a lot by getting a copy of Luke and examining your
index to see exactly what's been indexed; it'll reveal a lot.

Best
Erick

On Tue, Nov 18, 2008 at 10:05 AM, Pekka Nykyri [EMAIL PROTECTED]wrote:

 Hi!

 I'm having problems with entities including special characters (Spanish
 language) not getting indexed.

 I haven't been able to find the reason why some entities get indexed
 while some don't.

 I have 3 fields that (currently) hold the same value. The value for the
 fields is example ¡Fantástico!- blaaba. Then when I change ONE of the
 three values to ¡Fantástico! - blaaba, the entity gets indexed. So
 changing only one field makes it index.

 But the bigger problem with this is, that I have almost (other fields are
 almost similar and I don't think they cause the problem) similar entity,
 with exactly the same three ¡Fantástico!- blaaba -fields and it gets
 indexed normally. Even though the critical fields are exactly the same.

 And also all entities where three fields start with upside down ?-mark
 doesn't get indexed.

 I'm really confused with the problem because I don't seem to be able to
 find any logic some entities not being indexed even though they are similar
 to some other. And changing only one value of the three makes it index.

 Sorry for a really messy message but I just can't explain it more clearly
 now.

 Thanks in advance,
 pn

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



compare scores across queries

2008-11-18 Thread Ng Vinny
Hi all,

I am wondering if the raw scores obtained from a HitCollector can be used to
compare the relevance of documents to different queries?

E.g. two phrase queries are issued (PQ1: "Barack Obama" and PQ2: "John
McCain"). If a document (doc1) belongs to the result sets of both queries
and has a raw score of 5 for PQ1 and 3 for PQ2, can I say that doc1 is
more relevant to "Barack Obama" than to "John McCain"?

There have been some previous discussions about this at [1,2]. On the other
hand, the javadoc of the Similarity class [3] says "queryNorm(q) is a
normalizing factor used to make scores between queries comparable. This
factor does not affect document ranking (since all ranked documents are
multiplied by the same factor), but rather just attempts to make scores from
different queries (or even different indexes) comparable."

Please advise.

Thanks.
Ng.

[1] http://thread.gmane.org/gmane.comp.jakarta.lucene.user/10760/focus=10810
[2]
http://www.gossamer-threads.com/lists/lucene/java-user/35051?search_string=compare%20score%20across%20queries;#35051
[3]
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html


can I set Boost to the term while indexing?

2008-11-18 Thread T. H. Lin
I would like to store a set of keywords in a single field of a document.

For example, I now have three keywords: "One", "Two" and "Three",
and I am going to add them to a document.

At first, is this code correct?
//
String[] keyword = new String[]{"One", "Two", "Three"};
for (int i = 0; i < keyword.length; i++) {
    Field f = new Field("field_name",
                        keyword[i],
                        Field.Store.NO,
                        Field.Index.UN_TOKENIZED,
                        Field.TermVector.YES);
    doc.add(f);
}
indexWriter.addDocument(doc);
/***/

When searching, we can set a boost for a query term.

The question is...
Can I set a boost for every keyword/term while indexing?

From the example above, I may want to give those keywords, i.e. "One", "Two" and
"Three", different weights/boosts/relevance... while indexing,
and the same term may have a different weight/boost/relevance... in
different documents.

Can I do this?

thanks. :-)


Searching repeating fields

2008-11-18 Thread Mark Ferguson
Hello,

I am designing an index in which one url corresponds to one document. Each
document also contains multiple parallel repeating fields. For example:

Document 1:
  url: http://www.cnn.com/
  page_description: cnn breaking news
  page_title: news
  page_title: cnn news
  page_title: homepage
  username: ajax
  username: paris
  username: daniel

In this contrived example, user 'ajax' has saved the URL with the page
title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has saved
it with 'homepage'.

What I need to be able to do is perform a search for a particular user and a
particular title, but they must occur together. For example, +user:ajax
+page_title:news would return this document, but +user:ajax
+page_title:homepage would not.

I am open to changing the design of the document (i.e. using repeating
fields isn't required), but I do need to have one document per url. I am
looking for suggestions for a strategy on implementing this requirement.

Thanks,

Mark Ferguson


Re: Searching repeating fields

2008-11-18 Thread Ian Lea
How about using variable field names?

 url: http://www.cnn.com/
 page_description: cnn breaking news
 page_title_ajax: news
 page_title_paris: cnn news
 page_title_daniel: homepage
 username: ajax
 username: paris
 username: daniel

and search for +user:ajax +page_title_ajax:news or maybe just
page_title_ajax:news.  Might not even need to store user.
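
A rough sketch of what that could look like with the Document/Field and TermQuery classes (field names are just from this example, and whether the titles should be analyzed is up to you):

Document doc = new Document();
doc.add(new Field("url", "http://www.cnn.com/", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("page_title_ajax", "news", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("page_title_paris", "cnn news", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("page_title_daniel", "homepage", Field.Store.NO, Field.Index.ANALYZED));

// "ajax saved this page with the title 'news'" then becomes a single-field query:
Query q = new TermQuery(new Term("page_title_ajax", "news"));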


--
Ian.


On Tue, Nov 18, 2008 at 5:48 PM, Mark Ferguson
[EMAIL PROTECTED] wrote:
 Hello,

 I am designing an index in which one url corresponds to one document. Each
 document also contains multiple parallel repeating fields. For example:

 Document 1:
  url: http://www.cnn.com/
  page_description: cnn breaking news
  page_title: news
  page_title: cnn news
  page_titel: homepage
  username: ajax
  username: paris
  username: daniel

 In this contrived example, user 'ajax' have saved the URL with the page
 title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has saved
 it with 'homepage'.

 What I need to be able to do is perform a search for a particular user and a
 particular title, but they must occur together. For example, +user:ajax
 +page_title:news would return this document, but +user:ajax
 +page_title:homepage would not.

 I am open to changing the design of the document (i.e. using repeating
 fields isn't required), but I do need to have one document per url. I am
 looking for suggestions for a strategy on implementing this requirement.

 Thanks,

 Mark Ferguson


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: constructing a mini-index with just the number of hits for a term

2008-11-18 Thread Michael McCandless


Flexible indexing (LUCENE-1458) should make this possible.

IE you could use your own codec which discards doc/freq/prox/payload  
and during indexing (for this one field) and simply stores the term  
frequency in the terms dict.  However, one problem will be deletions  
(in case it matters to your app): in order to properly update the  
terms dict counts, SegmentMerger walks through the docIDs for the term  
and skips the deleted ones.


But it will be some time before this is real, though there's an  
initial patch on LUCENE-1458.


Mike

Grant Ingersoll wrote:

Can you share what the actual problem is that you are trying to  
solve?  It might help put things in context for me.  I'm guessing  
you are doing some type of co-occurrence analysis, but...


More below.

On Nov 13, 2008, at 11:08 AM, Sven wrote:

First - I apologize for the double post on my earlier email.  The  
first time I sent it I received an error message from [EMAIL PROTECTED] 
 saying that I should instead send email to [EMAIL PROTECTED]  
so I thought it did not go through.
My question is this - is there a way to use the Lucene/Solr  
infrastructure to create a mini-index that simply contains a lookup  
table of terms and the number of times they have appeared?


This could be possible.  I think I would create documents with  
Index.ANALYZED, and Store.NO.  Then, you just need to use the  
TermEnum and TermDocs to access the information that you need.  In a  
sense, you are just creating the term dictionary.  You could also  
turn off storing of NORMS, which will save too.
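
A tiny sketch of walking the term dictionary that way (reader is an open IndexReader; whether you want docFreq or the summed per-document freq depends on whether you need document frequency or total occurrences):

TermEnum terms = reader.terms();
while (terms.next()) {
  Term t = terms.term();
  int docFreq = terms.docFreq();   // how many documents contain the term

  // If you need total occurrences rather than document frequency,
  // sum the per-document freq() via TermDocs:
  long totalFreq = 0;
  TermDocs td = reader.termDocs(t);
  while (td.next()) {
    totalFreq += td.freq();
  }
  td.close();
}
terms.close();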




I do not need to record which documents have them nor do I need to  
know where in the documents they appear.  There could be (and  
probably will be) more than 2^32 terms, however.


2^32 unique terms or 2^32 total terms?

I'm not sure if that makes a difference to the Lucene backend, but  
thought it might be relevant.
This question coincides with my earlier question about counting the  
times a given term is associated with another term.  I figure that  
this would be more easily accomplished by making the mini-index  
described above alongside the normal index when a document is  
indexed.  For example, when scanning:


Bravely bold Sir Robin, brought forth from Camelot.  He was not  
afraid to die!  Oh, brave Sir Robin!


In addition to the normal indexing function of Lucene, I would like  
to write something on the backend to also index:


bravely|bold
bravely|sir
bravely|robin
bravely|brought
bravely|forth
bold|sir
bold|robin
bold|brought
bold|forth
bold|camelot  (from being a stop word)
...and so on

I only need to keep a running total of each bravely|bold term,  
however, since the number of terms will be quite large and keeping  
track of the document/termpositions would translate to a lot of  
wasted HD space.


For this, I think you will have to hook into the Analyzer process.   
The other thing to do is just try keeping the document/term  
positions, it may not actually be as bad as you think in terms of  
space.




If such a thing is not already in place, could someone let me know  
if there are some tutorials, documentation, or presentations that  
describe the inner workings of Lucene and the theories/ 
implementation at work for the actual file formats, structures,  
data manipulations, etc?  (The javadocs don't go into this kind of  
detail.)  I'm sure I can sift through the code and eventually make  
sense of it, but if there is documentation out there, I'd prefer to  
peruse that first.  My thought being that I can simply generate my  
own kind of hash for each combined term and write it out to a  
custom file structure similar to Lucene - but the specifics of how  
to (optimally) do so are not plain to me yet.

Thanks again!
-Sven


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching repeating fields

2008-11-18 Thread Mark Ferguson
Thanks for the suggestion, but I think I will need a more robust solution,
because this will only work with pairs of fields. I should have specified
that the example I gave was somewhat contrived, but in practice there could
be more than two parallel fields. I'm trying to find a general solution that
I can apply to any number of parallel fields holding any kind of data.

I was thinking of trying something along the lines of a multi-value field.
So for example, I could have:

page_user_title: ajax|news (where | is a field separator)

The problem is I don't know how to formulate the query that would be
equivalent to +username:ajax +page_title:news, or if it's even possible. (I
should also mention that I am creating the queries programmatically, not
using the query parser, so anything goes).
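
For what it's worth, if each (user, title) pair were indexed as one untokenized token like "ajax|news" in that combined field, the equivalent query would just be an exact term lookup (a sketch only, and it only works for exact title matches):

Query q = new TermQuery(new Term("page_user_title", "ajax|news"));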

Any other ideas?

Mark Ferguson


On Tue, Nov 18, 2008 at 1:06 PM, Ian Lea [EMAIL PROTECTED] wrote:

 How about using variable field names?

  url: http://www.cnn.com/
  page_description: cnn breaking news
  page_title_ajax: news
  page_title_paris: cnn news
  page_title_daniel: homepage
  username: ajax
  username: paris
  username: daniel

 and search for +user:ajax +page_title_ajax:news or maybe just
 page_title_ajax:news.  Might not even need to store user.


 --
 Ian.


 On Tue, Nov 18, 2008 at 5:48 PM, Mark Ferguson
 [EMAIL PROTECTED] wrote:
  Hello,
 
  I am designing an index in which one url corresponds to one document.
 Each
  document also contains multiple parallel repeating fields. For example:
 
  Document 1:
   url: http://www.cnn.com/
   page_description: cnn breaking news
   page_title: news
   page_title: cnn news
   page_titel: homepage
   username: ajax
   username: paris
   username: daniel
 
  In this contrived example, user 'ajax' have saved the URL with the page
  title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has
 saved
  it with 'homepage'.
 
  What I need to be able to do is perform a search for a particular user
 and a
  particular title, but they must occur together. For example, +user:ajax
  +page_title:news would return this document, but +user:ajax
  +page_title:homepage would not.
 
  I am open to changing the design of the document (i.e. using repeating
  fields isn't required), but I do need to have one document per url. I am
  looking for suggestions for a strategy on implementing this requirement.
 
  Thanks,
 
  Mark Ferguson
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Searching repeating fields

2008-11-18 Thread Mark Ferguson
I'll provide a better example, perhaps it will help in formulating a
solution.

Suppose I am designing an index that stores invoices. One document
corresponds to one invoice, which has a unique id. Any number of employees
can make comments on the invoices, and comments have different
classifications (request_for_approval, redirection, approval,
miscellaneous). Each comment is timestamped. An invoice also contains a long
description that is indexed and is stored.

So an example document may look like this:

invoice_id: 1234
invoice_description:(some text)
employee_id: 5
employee_id: 8
employee_id: 12
comment_type: request_for_approval
comment_type: redirection
comment_type: approval
comment: please approve invoice
comment: sending invoice to sales
comment: invoice approved
ts:200811181012
ts:200811181015
ts:200811181340

I want to be able to search by any number of these fields. For example, I
may want all of employee 5's requests for approvals from today.

It may seem like it would be simpler to just have two separate indexes: a
comments index and an invoice index. But I also want to be able to search
the invoice description along with the comments. I could set the granularity
of the index to the comments level, but then I am duplicating a lot of text
in the invoice description. Also, I only care about returning the invoice,
so I will have to merge results if the granularity is set to the comments
level, which will ruin Lucene's scoring (?).

This is a made-up example, but I think it describes pretty thoroughly the
problem I'm trying to solve. In my real world problem, I'm storing the
full-text of web pages, and I really don't want to be duplicating that much
text to set the granularity lower.

Mark Ferguson


On Tue, Nov 18, 2008 at 2:29 PM, Mark Ferguson [EMAIL PROTECTED]wrote:

 Thanks for the suggestion, but I think I will need a more robust solution,
 because this will only work with pairs of fields. I should have specified
 that the example I gave was somewhat contrived, but in practice there could
 be more than two parallel fields. I'm trying to find a general solution that
 I can apply to any number of parallel fields holding any kind of data.

 I was thinking of trying something along the lines of a multi-value field.
 So for example, I could have:

 page_user_title: ajax|news (where | is a field separator)

 The problem is I don't know how to formulate the query that would be
 equivalent to +username:ajax +page_title:news, or if it's even possible. (I
 should also mention that I am creating the queries programmatically, not
 using the query parser, so anything goes).

 Any other ideas?

 Mark Ferguson



 On Tue, Nov 18, 2008 at 1:06 PM, Ian Lea [EMAIL PROTECTED] wrote:

 How about using variable field names?

  url: http://www.cnn.com/
  page_description: cnn breaking news
  page_title_ajax: news
  page_title_paris: cnn news
  page_title_daniel: homepage
  username: ajax
  username: paris
  username: daniel

 and search for +user:ajax +page_title_ajax:news or maybe just
 page_title_ajax:news.  Might not even need to store user.


 --
 Ian.


 On Tue, Nov 18, 2008 at 5:48 PM, Mark Ferguson
 [EMAIL PROTECTED] wrote:
  Hello,
 
  I am designing an index in which one url corresponds to one document.
 Each
  document also contains multiple parallel repeating fields. For example:
 
  Document 1:
   url: http://www.cnn.com/
   page_description: cnn breaking news
   page_title: news
   page_title: cnn news
   page_titel: homepage
   username: ajax
   username: paris
   username: daniel
 
  In this contrived example, user 'ajax' have saved the URL with the page
  title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has
 saved
  it with 'homepage'.
 
  What I need to be able to do is perform a search for a particular user
 and a
  particular title, but they must occur together. For example, +user:ajax
  +page_title:news would return this document, but +user:ajax
  +page_title:homepage would not.
 
  I am open to changing the design of the document (i.e. using repeating
  fields isn't required), but I do need to have one document per url. I am
  looking for suggestions for a strategy on implementing this requirement.
 
  Thanks,
 
  Mark Ferguson
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





Re: Searching repeating fields

2008-11-18 Thread Chris Hostetter

There has been discussion in the past about how PhraseQuery artificially
requires that the Terms you add to it must be in the same field ... you
could theoretically modify PhraseQuery to have a type of query that
required terms in one field to be within (slop) N positions of a term in a
parallel field ... with N=0 you would get something like what you're
describing...

http://www.nabble.com/Re%3A-One-item%2C-multiple-fields%2C-and-range-queries-p8377712.html

(that thread goes on to discuss the complexities of trying to make
something like this work if one of the query clauses you want in your
phrase is non-trivial, like a RangeQuery)




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
I've finished a query time implementation of a column stride filter, which
implements DocIdSetIterator. This just builds the filter at process start
and uses it for each subsequent query. The index itself is unchanged.
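
The code below is only a rough sketch of that idea, not Tim's actual implementation: it caches one int per document via FieldCache at startup and answers a [lo TO hi] range by scanning that array with a 2.4-era DocIdSetIterator (the "age" field name is taken from the examples that follow, and the sketch assumes getDocIdSet is called with the same reader the values were loaded from).

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;

public class IntRangeFilter extends Filter {

  private final int[] values;  // one value per docID, loaded once at process start
  private final int lo, hi;

  // Assumes every document has a single, int-parseable term in this field.
  public IntRangeFilter(IndexReader reader, String field, int lo, int hi) throws IOException {
    this.values = FieldCache.DEFAULT.getInts(reader, field);
    this.lo = lo;
    this.hi = hi;
  }

  public DocIdSet getDocIdSet(IndexReader reader) {
    return new DocIdSet() {
      public DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
          private int docId = -1;
          public int doc() { return docId; }
          public boolean next() {
            while (++docId < values.length) {
              if (values[docId] >= lo && values[docId] <= hi) return true;
            }
            return false;
          }
          public boolean skipTo(int target) {
            docId = target - 1;
            return next();
          }
        };
      }
    };
  }
}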

The results are very impressive. Here are the results on a 45M document
index:

Firstly without an age constraint as a baseline:

Query +name:tim 
startup: 0 
Hits: 15089
first query: 1004
100 queries: 132 (1.32 msec per query)

Now with a cached filter. This is ideal from a speed standpoint but there
are too many possible start/end combinations to cache all the filters.

Query +name:tim age:[18 TO 35] (ConstantScoreQuery on cached RangeFilter)
startup: 3
Hits: 11156
first query: 1830
100 queries: 287 (2.87 msec per query)

Now with an uncached filter. This is awful.

Query +name:tim age:[18 TO 35] (uncached ConstantScoreRangeQuery)
startup: 3
Hits: 11156
first query: 1665
100 queries: 51862 (yes, 518 msec per query, 200x slower)

A RangeQuery is slightly better but still bad (and has a different result
set)

Query +name:tim age:[18 TO 35] (uncached RangeQuery)
startup: 0
Hits: 10147
first query: 1517
100 queries: 27157 (271 msec is 100x slower than the filter)

Now with the prebuilt column stride filter:

Query +name:tim age:[18 TO 35] (ConstantScoreQuery on prebuilt column
stride filter)
startup: 2811
Hits: 11156
first query: 1395
100 queries: 441 (back down to 4.41msec per query)

This is less than 2x slower than the dedicated bitset and more than 50x
faster than the range boolean query.

Mike, Paul, I'm happy to contribute this (ugly but working) code if there is
interest. Let me know and I'll open a JIRA issue for it.

Tim


On 11/11/08 1:27 PM, Michael McCandless [EMAIL PROTECTED] wrote:

 
 Paul Elschot wrote:
 
 Op Tuesday 11 November 2008 21:55:45 schreef Michael McCandless:
 Also, one nice optimization we could do with the term number column-
 stride array is do bit packing (borrowing from the PFOR code)
 dynamically.
 
 Ie since we know there are X unique terms in this segment, when
 populating the array that maps docID to term number we could use
 exactly the right number of bits.  Enumerated fields with not many
 unique values (eg, country, state) would take relatively little RAM.
 With LUCENE-1231, where the fields are stored column stride on disk,
 we could do this packing during index such that loading at search
 time is very fast.
 
 Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410.
 I think what you're referring to is PDICT, which has frame exceptions
 for values that occur infrequently.
 
 Yes let's move the discussion to Jira.
 
 Actually I was referring to simple bit-packing.
 
 For encoding array of compact enum terms (eg city, state, color, zip)
 I'm guessing the exceptions logic won't buy us much and would hurt
 seeking needed for column-stride fields.  But we should certainly test
 it.
 
 Mike
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term numbering and range filtering

2008-11-18 Thread Paul Elschot
Op Wednesday 19 November 2008 00:43:56 schreef Tim Sturge:
 I've finished a query time implementation of a column stride filter,
 which implements DocIdSetIterator. This just builds the filter at
 process start and uses it for each subsequent query. The index itself
 is unchanged.

 The results are very impressive. Here are the results on a 45M
 document index:

 Firstly without an age constraint as a baseline:

 Query +name:tim
 startup: 0
 Hits: 15089
 first query: 1004
 100 queries: 132 (1.32 msec per query)

 Now with a cached filter. This is ideal from a speed standpoint but
 there are too many possible start/end combinations to cache all the
 filters.

 Query +name:tim age:[18 TO 35] (ConstantScoreQuery on cached
 RangeFilter) startup: 3
 Hits: 11156
 first query: 1830
 100 queries: 287 (2.87 msec per query)

 Now with an uncached filter. This is awful.

 Query +name:tim age:[18 TO 35] (uncached ConstantScoreRangeQuery)
 startup: 3
 Hits: 11156
 first query: 1665
 100 queries: 51862 (yes, 518 msec per query, 200x slower)

 A RangeQuery is slightly better but still bad (and has a different
 result set)

 Query +name:tim age:[18 TO 35] (uncached RangeQuery)
 startup: 0
 Hits: 10147
 first query: 1517
 100 queries: 27157 (271 msec is 100x slower than the filter)

 Now with the prebuilt column stride filter:

 Query +name:tim age:[18 TO 35] (ConstantScoreQuery on prebuilt
 column stride filter)

With Allow Filter as clause to BooleanQuery:
https://issues.apache.org/jira/browse/LUCENE-1345
one could even skip the ConstantScoreQuery with this.
Unfortunately 1345 is unfinished for now.

 startup: 2811
 Hits: 11156
 first query: 1395
 100 queries: 441 (back down to 4.41msec per query)

 This is less than 2x slower than the dedicated bitset and more than
 50x faster than the range boolean query.

 Mike, Paul, I'm happy to contribute this (ugly but working) code if
 there is interest. Let me know and I'll open a JIRA issue for it.

In case you think more performance improvements based on this
are possible...

Regards,
Paul Elschot.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge


 With Allow Filter as clause to BooleanQuery:
 https://issues.apache.org/jira/browse/LUCENE-1345
 one could even skip the ConstantScoreQuery with this.
 Unfortunately 1345 is unfinished for now.
 

That would be interesting; I'd like to see how much performance improves.

 startup: 2811
 Hits: 11156
 first query: 1395
 100 queries: 441 (back down to 4.41msec per query)
 
 This is less than 2x slower than the dedicated bitset and more than
 50x faster than the range boolean query.
 
 Mike, Paul, I'm happy to contribute this (ugly but working) code if
 there is interest. Let me know and I'll open a JIRA issue for it.
 
 In case you think more performance improvements based on this
 are possible...

I think this is generally useful for range and set queries on non-text based
fields (dates, location data, prices, general enumerations). These all have
the required property that there is only one value (term) per document.

I've opened LUCENE-1461.

Tim

 
 Regards,
 Paul Elschot.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
The actual performance depends on how much you load to the index. Can
you tell us how many documents and how large these documents are that
you have in your index?

Compared with RAMDirectory I've seen performance boosts of

up to 100x in a small index that contains (1-20) Wikipedia sized
documents, an index I used to apply user search agents on as new data
arrived to the primary index.
up to 25x when placing massive amounts of span queries on the apriori
index in LUCENE-626. This index contained tens of thousands of
documents with only a few (5-20) terms each.
up to 15x in a relatively large ngram index for classifications using
LUCENE-626. This is pure skipTo operations.

Regarding the fuzzy query, try to see how much time was spent
rewriting the query and then how much time was spent querying. I'm
almost certain you'll notice that the time spent rewriting the query
(comparing edit distance between the terms of the index and the query
term) is overwhelming compared to the time spent searching for the
rewritten query. I.e. this is probably as much a store related expense
as it is a Levenshtein calculation expense.
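
One crude way to separate those two costs (a sketch only; the field and term come from Darren's word:house~0.80 example, and searcher is an open IndexSearcher):

FuzzyQuery fq = new FuzzyQuery(new Term("word", "house"), 0.80f);

long t0 = System.currentTimeMillis();
Query rewritten = fq.rewrite(searcher.getIndexReader());  // edit-distance term enumeration happens here
long t1 = System.currentTimeMillis();
TopDocs hits = searcher.search(rewritten, 10);             // matching the already-rewritten query
long t2 = System.currentTimeMillis();

System.out.println("rewrite: " + (t1 - t0) + " ms, search: " + (t2 - t1) + " ms");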


karl

(this is my second reply, the first one seems to be lost in space?)

On Mon, Nov 17, 2008 at 1:37 AM, Darren Govoni [EMAIL PROTECTED] wrote:
 After I switched to InstantiatedIndex from RAMDirectory (but using the
 reader from my RAMDirectory to create the InstantiatedIndex), I see a
 less than 25% (.25) improvement in speed. Nowhere near the 100x (100.00)
 speed mentioned in the documentation. Probably I am doing something
 wrong.

 I am using too, a fuzzy query. e.g. word:house~0.80 but I'd expect the
 improvement to be because of physical representation (memory graph) and
 mostly unaffected by the query. no?

 Could there be some lazy loading going on in RAMDirectory that prevents
 InstantiatedIndex from building out its graph and getting the expected
 speed?

 thanks to anyone who can verify this.


 On Sun, 2008-11-16 at 12:37 -0500, Darren Govoni wrote:
 Yeah. That makes sense. Its not too hard to wrap those extra steps so I
 can end up with something simpler too. Like:

 iindex = InstantiatedIndex(path/to/my/index)

 I'm lazy so the intermediate hoops to jump through clutter my code.
 Hehe.

 :)

 Darren

 On Sun, 2008-11-16 at 11:46 -0500, Mark Miller wrote:
  Can you start with an empty index? Then how about:
 
  // Adding these
 
  iindex = InstantiatedIndex()
  ireader = iindex.indexReaderFactory()
  isearcher = IndexSearcher(ireader)
 
  If you want a copy from another IndexReader though, you have to get that 
  reader from somewhere right?
 
  - Mark
 
 
 
  Darren Govoni wrote:
   Hi Mark,
  Thanks for the tips. Here's what I will try (pseudo-code)
  
   endirectory = RAMDirectory(index/dictionary.en)
   ensearcher = IndexSearcher(endirectory)
   // Adding these
   reader = ensearcher.getIndexReader()
   iindex = InstantiatedIndex(reader)
   ireader = iindex.indexReaderFactory()
   isearcher = IndexSearcher(ireader)
  
   Kind of round about way to get an InstantiatedIndex I guess,but maybe
   there's a briefer way?
  
   Thank you.
   Darren
  
   On Sun, 2008-11-16 at 10:50 -0500, Mark Miller wrote:
  
   Check out the docs at:
   http://lucene.apache.org/java/2_4_0/api/contrib-instantiated/index.html
  
   There is a performance graph there to check  out.
  
   The code should be fairly straightforward - you can make an
   InstantiatedIndex thats empty, or seed it with an IndexReader. Then you
   can make an InstantiatedReader or Writer, which take the
   InstantiatedIndex as a constructor arg.
  
   You should be able to just wrap that InstantiatedReader in a regular
   Searcher.
  
   Darren Govoni wrote:
  
   Hi gang,
  I am trying to trace the 2.4 API to create an InstantiatedIndex, but
   its rather difficult to connect directory,reader,search,index etc just
   reading the javadocs.
  
   I have a (POI - plain old index) directory already and want to
   create a faster InstantiatedIndex and IndexSearcher to query it like
   before. What's the proper order to do this?
  
   Also, if anyone has any empirical data on the performance or 
   reliability
   of InstantiatedIndex, I'd be curious.
  
   Thanks for the tips!
   Darren
  
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
  
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, 

2.4 Performance

2008-11-18 Thread lucene
On an index of around 20 gigs I've been seeing a performance drop of
around 35% after upgrading to 2.4 (measured on ~1 requests
identical requests, executed in parallel against a threaded lucene /
apache setup, after a roughly 1 query warmup). The principal
changes I've made so far are just to switch to NIOFSDirectories and
use read-only index readers.

Our design is roughly as follows: we have some pre-query filters,
queries typically involving around 25 clauses, and some
post-processing of hits. We collect counts and filter post query using
a hit collector, which uses the (now deprecated) bits() method of
Filters.

I looked at converting us to use the new DocIdSet infrastructure (to
gain the supposed 30% speed bump), but this seems to be somewhat
problematic as there is no guarantee about what kind of set we will get
back to do binary operations on. For example, if we get back a
SortedVIntList, we're pretty much out of luck: the cardinality of the set
is large, so we can't coerce it into another type, and it doesn't have the
set operations we need to use it directly.
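
One workaround sketch (not necessarily cheap, since it walks the whole iterator): copy whatever DocIdSet the filter returns into an OpenBitSet, which does support binary operations.

OpenBitSet bits = new OpenBitSet(reader.maxDoc());
DocIdSetIterator it = filter.getDocIdSet(reader).iterator();
while (it.next()) {
  bits.set(it.doc());
}
// bits.and(...), bits.or(...), bits.andNot(...) are now available.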

Has anyone else seen this? Is there anything else
we should be changing in the upgrade to 2.4?

Thanks,

-Matt

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
On Wed, Nov 19, 2008 at 3:27 AM, karl wettin [EMAIL PROTECTED] wrote:
 rewritten query. I.e. this is probably as much a store related expense
 as it is a Levenshtein calculation expense.

this is probably *not* as much a store related.. that is.


karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reopen IndexReader

2008-11-18 Thread Cool The Breezer
I had the same kind of problem and I somehow managed to find a workaround by
initializing the IndexSearcher from the new reader.

try {
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
        // reader was reopened
        reader.close();
        reader = null;
    }
    reader = newReader;
    searcher = new IndexSearcher(newReader);
} catch (Exception e) {
    e.printStackTrace();
}


--- On Tue, 11/18/08, Michael McCandless [EMAIL PROTECTED] wrote:

 From: Michael McCandless [EMAIL PROTECTED]
 Subject: Re: Reopen IndexReader
 To: java-user@lucene.apache.org
 Date: Tuesday, November 18, 2008, 7:52 AM
 Well... we certainly do our best to have each release be
 stable, but we do make mistakes, so you'll have to use
 your own judgement on when to upgrade.
 
 However, it's only through users like yourself
 upgrading that we then find  fix any uncaught issues in
 each new release.
 
 Mike
 
 Ganesh wrote:
 
  I am creating IndexSearcher using String, this is
 working fine with version 2.3.2.
  I tried by replacing Directory ctor of IndexSearcher
 and it is working fine with v2.4.
  
  I have recently upgraded from v2.3.2 to 2.4. Is v2.4
 stable and i could more forward with this or shall i revert
 back to 2.3.2?
  
  Regards
  Ganesh
  
  
  - Original Message - From: Michael
 McCandless [EMAIL PROTECTED]
  To: java-user@lucene.apache.org
  Sent: Tuesday, November 18, 2008 4:59 PM
  Subject: Re: Reopen IndexReader
  
  
  
  Did you create your IndexSearcher using a String
 or File (not  Directory)?
  
  If so, it sounds like you are hitting this issue
 (just fixed this morning, on 2.9-dev (trunk)):
  

 https://issues.apache.org/jira/browse/LUCENE-1453
  
  The workaround is to use the Directory ctor of
 IndexSearcher.
  
  Mike
  
  Ganesh wrote:
  
  Hello all,
  
  I am using version 2.4. The following code
 throws  AlreadyClosedException
  
   IndexReader reader =
 searcher.getIndexReader();
   IndexReader newReader =  reader.reopen();
   if (reader != newReader) {
   reader.close();
   boolean isCurrent =
 newReader.isCurrent(); //throws  exception
   }
  
  Full list of exception:
  
 
 org.apache.lucene.store.AlreadyClosedException: this
 Directory is  closed
   at
 org.apache.lucene.store.Directory.ensureOpen(Directory.java:
 220)
   at
 org.apache.lucene.store.FSDirectory.list(FSDirectory.java:
 320)
   at org.apache.lucene.index.SegmentInfos
 $FindSegmentsFile.run(SegmentInfos.java:533)
   at  org .apache
 .lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
   at  org .apache .lucene
 .index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
   at
 MailIndexer.IndexSearcherEx.reOpenDB(IndexSearcherEx.java:
 102)
  
  Please correct me, if i am wrong.
  
  Regards
  Ganesh
  
  Send instant messages to your online friends
 http://in.messenger.yahoo.com
 
 -
  To unsubscribe, e-mail:
 [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
  
  
 
 -
  To unsubscribe, e-mail:
 [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
  Send instant messages to your online friends
 http://in.messenger.yahoo.com
 
 -
  To unsubscribe, e-mail:
 [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
  
 
 
 -
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]


  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reg two versions of lucene on the same machine

2008-11-18 Thread Shireesha.Katkoor

Hi,

 

I am trying to upgrade the version of Lucene from 1.2 to 2.4. Can we do
this directly?

 Is it possible to have two versions of Lucene on the same machine?

 

Shireesha 

 



This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient, please contact the sender by reply 
e-mail and destroy all copies of the original message. 
Any unauthorized review, use, disclosure, dissemination, forwarding, printing 
or copying of this email or any action taken in reliance on this e-mail is 
strictly prohibited and may be unlawful.

Re: Reg two versions of lucene on the same machine

2008-11-18 Thread Anshum
Hi Shireesha,
I'm not sure what exactly you have been using, but I'm fairly sure
you'd have to check for deprecated things as well as improved ones
while upgrading. 1.2 to 2.4 is certainly a huge jump, with the compound index
structure etc. coming into place.
You would have to try it and check if your code works the same (I doubt it
would, though).
About having 2 versions of Lucene on the same machine: of course, yes, it is
as good as having 2 (or more) Java jars.
I am presuming that you place your Lucene core jar in the project library
directory and not in the jre/lib/ext directory, in which case you would have
issues placing the 2 jars.
It would be better if you completely remove the Lucene jars from the implicitly
included library dir, and place them in a different folder (and include that
in your classpath).

Hope that solves a bit of your doubt (at least)!
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Wed, Nov 19, 2008 at 11:57 AM, [EMAIL PROTECTED] wrote:


 Hi,



 I am trying to upgrade the version of Lucene from 1.2 to 2.4. Can we do
 this directly?

  Is it possible to have two versions of Lucene on the same machine.?



 Shireesha




