Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Paul Elschot
On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
 The index is non-compound format and optimized. Yes, I did try
 MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors)
 
 Peter
 
You could also give this a try:

http://issues.apache.org/jira/browse/LUCENE-283

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Getting the document number (with IndexReader)

2006-01-26 Thread Chun Wei Ho
I am attempting to prune an index by getting each document in turn and
then checking/deleting it:

IndexReader ir = IndexReader.open(path);
for (int i = 0; i < ir.numDocs(); i++) {
    Document doc = ir.document(i);
    if (thisDocShouldBeDeleted(doc)) {
        ir.delete(docNum); // <- I need the docNum for doc.
    }
}

How do I get the docNum for IndexReader.delete() function in the above
case? Is there an API function I am missing? I am working with a merged
index over different segments so the docNum might not be in running
sequence with the counter i.

In general, is there a better way to do this sort of thing?

Thanks!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting the document number (with IndexReader)

2006-01-26 Thread Paul Elschot
On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
 I am attempting to prune an index by getting each document in turn and
 then checking/deleting it:
 
 IndexReader ir = IndexReader.open(path);
 for (int i = 0; i < ir.numDocs(); i++) {
   Document doc = ir.document(i);
   if (thisDocShouldBeDeleted(doc)) {
       ir.delete(docNum); // <- I need the docNum for doc.
   }
 }
 
 How do I get the docNum for IndexReader.delete() function in the above
 case? Is there an API function I am missing? I am working with a merged

The document number is the variable i in this case.

 index over different segments so the docNum might not be in running
 sequence with the counter i.
 
 In general, is there a better way to do this sort of thing?

This code:

Document doc = ir.document(i);

normally retrieves all the stored fields of the document and that is
quite costly. In case you know that the document(s) to be deleted
match(es) a Term, it's better to use IndexReader.delete(Term).
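
For instance, if each document carries a unique key field, deleting by term avoids loading stored fields altogether. A minimal sketch, assuming a hypothetical "id" field and key value (neither is from the original posts):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteByTerm {
    public static void main(String[] args) throws Exception {
        IndexReader ir = IndexReader.open("/path/to/index");
        // delete(Term) removes every document containing the term and
        // returns how many documents were deleted
        int deleted = ir.delete(new Term("id", "doc-42"));
        System.out.println("deleted " + deleted + " document(s)");
        ir.close(); // the deletions are flushed when the reader is closed
    }
}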

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Ray Tsang
Speaking of NioFSDirectory, I thought there was one posted a while
ago, is this something that can be used?
http://issues.apache.org/jira/browse/LUCENE-414

ray,

On 11/22/05, Doug Cutting [EMAIL PROTECTED] wrote:
 Jay Booth wrote:
  I had a similar problem with threading, the problem turned out to be that in
  the back end of the FSDirectory class I believe it was, there was a
  synchronized block on the actual RandomAccessFile resource when reading a
  block of data from it... high-concurrency situations caused threads to stack
  up in front of this synchronized block and our CPU time wound up being spent
  thrashing between blocked threads instead of doing anything useful.

 This is correct.  In Lucene, multiple streams per file are created by
 cloning, and all clones of an FSDirectory input stream share a
 RandomAccessFile and must synchronize input from it.  MmapDirectory does
 not have this limitation.  If your indexes are less than a few GB or you
 are using 64-bit hardware, then MmapDirectory should work well for you.
   Otherwise it would be simple to write an nio-based Directory that does
 not use mmap that is also unsynchronized.  Such a contribution would be
 welcome.

  Making multiple IndexSearchers and FSDirectories didn't help because in the
  back end, lucene consults a singleton HashMap of some kind (don't remember
  implementation) that maintained a single FSDirectory for any given index
  being accessed from the JVM... multiple calls to FSDirectory.getDirectory
  actually return the same FSDirectory object with synchronization at the same
  point.

 This does not make sense to me.  FSDirectory does keep a cache of
 FSDirectory instances, but i/o should not be synchronized on these.  One
 should be able to open multiple input streams on the same file from an
 FSDirectory.  But this would not be a great solution, since file handle
 limits would soon become a problem.

 Doug

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Getting the document number (with IndexReader)

2006-01-26 Thread Chun Wei Ho
Hi,

Thanks for the help, just a few more questions:

On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
 On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
  I am attempting to prune an index by getting each document in turn and
  then checking/deleting it:
 
  IndexReader ir = IndexReader.open(path);
  for (int i = 0; i < ir.numDocs(); i++) {
    Document doc = ir.document(i);
    if (thisDocShouldBeDeleted(doc)) {
        ir.delete(docNum); // <- I need the docNum for doc.
    }
  }
 
  How do I get the docNum for IndexReader.delete() function in the above
  case? Is there an API function I am missing? I am working with a merged

 The document number is the variable i in this case.
If the document number is the variable i (enumerated from numDocs()),
what's the difference between numDocs() and maxDoc() in this case? I
was previously under the impression that the internal docNum might be
different to the counter.

  index over different segments so the docNum might not be in running
  sequence with the counter i.
  In general, is there a better way to do this sort of thing?

 This code:

 Document doc = ir.document(i);

 normally retrieves all the stored fields of the document and that is
 quite costly. In case you know that the document(s) to be deleted
 match(es) a Term, it's better to use IndexReader.delete(Term).

I'm doing something akin to a rangeQuery, where I delete documents
within a certain range (in addition to other criteria). Is it better
to do a query on the range, mark all the docNums getting them with
Hits.id(), and then retrieve docs and test for deletion according to
that?

Thanks for the help

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



encoding

2006-01-26 Thread arnaudbuffet
Hello,
 
I've a problem with data I try to index with Lucene. I browse a
directory and index text from different types of files through parsers.
 
For text files, the data could be in different languages and therefore
different encodings. If the data is in Turkish, for example, the special
characters and accents are not recognized in my Lucene index. Is there a
way to resolve this problem? How do I work with the encoding?
 
Thanks for your help

A.
 


Range number queries

2006-01-26 Thread Mike Streeton
For the recent questions about this, here are a couple of methods for
encoding/decoding long values so that they sort into numeric order for a
range query:

 

public static String encodeLong(long num) {
    // map negative values onto [0, Long.MAX_VALUE] so their hex digits sort correctly
    String hex = Long.toHexString(num < 0 ? Long.MAX_VALUE - (0xFFFFFFFFFFFFFFFFL ^ num) : num);
    // pad to 16 hex digits and prefix N (negative) / P (positive);
    // 'N' < 'P', so all negatives sort before all positives
    hex = (num < 0 ? "N" : "P") + "0000000000000000".substring(0, 16 - hex.length()) + hex;
    return hex;
}

public static long decodeLong(String hex) {
    long num = Long.parseLong(hex.substring(1, 17), 16);
    return hex.charAt(0) == 'N' ? (Long.MAX_VALUE - num) ^ 0xFFFFFFFFFFFFFFFFL : num;
}
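
A quick round-trip and sort-order check of the two methods above (purely illustrative, assuming it sits in the same class as encodeLong/decodeLong):

public static void main(String[] args) {
    long[] samples = { Long.MIN_VALUE, -42L, -1L, 0L, 1L, 42L, Long.MAX_VALUE };
    String previous = null;
    for (int i = 0; i < samples.length; i++) {
        String encoded = encodeLong(samples[i]);
        if (decodeLong(encoded) != samples[i]) {
            throw new RuntimeException("round trip failed for " + samples[i]);
        }
        // lexicographic order of the encoded strings must match numeric order
        if (previous != null && previous.compareTo(encoded) >= 0) {
            throw new RuntimeException("sort order broken at " + samples[i]);
        }
        previous = encoded;
        System.out.println(samples[i] + " -> " + encoded);
    }
}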



 

 

Hope this helps

 

Mike

 

www.ardentia.com the home of NetSearch

 



Re: Highlighter

2006-01-26 Thread msftblows
Yes, that is correct...you need to rewrite the query. I was actually the main 
developer for the 1.5 .NET port, so if you come across any issues, please email 
me at my hotmail address which I check more often than this one...
 
-Joe Langley
 
-Original Message-
From: Gwyn Carwardine [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tue, 24 Jan 2006 22:43:53 -
Subject: RE: Highlighter


Yes I think you're right. On reading the Lucene in Action chapter on
highlighting I found it squirreled away in the middle of the text. I get the
feeling that whilst I have so far found the query parser to be the primary
method of building queries, this is not the primary method used by other
people. Otherwise I would have expected the first example in the book to
use the query parser. So what I'm not quite sure about is why the norm is
using direct queries.

it helped, thanks

-Gwyn

-Original Message-
From: Koji Sekiguchi [mailto:[EMAIL PROTECTED] 
Sent: 24 January 2006 22:23
To: java-user@lucene.apache.org
Subject: RE: Highlighter

I've never used the .NET port of Lucene and the highlighter,
but I believe we have to call Query.rewrite()
to expand the query expression when using
PhraseQuery, WildcardQuery, RegexQuery and FuzzyQuery,
then pass the rewritten query to the highlighter.
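
A minimal sketch of that pattern in Java (the field name, query string, and sample text are assumptions for illustration, not from the original messages):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class RewriteThenHighlight {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Query query = QueryParser.parse("appl*", "contents", analyzer);
        // rewrite() expands prefix/wildcard/fuzzy queries into plain term
        // queries, which is what the highlighter knows how to match
        Query rewritten = query.rewrite(reader);
        Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));
        String text = "apples and applesauce";
        String fragment = highlighter.getBestFragment(
                analyzer.tokenStream("contents", new StringReader(text)), text);
        System.out.println(fragment);
        reader.close();
    }
}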

hope this helps,

Koji


 -Original Message-
 From: Gwyn Carwardine [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, January 25, 2006 6:28 AM
 To: java-user@lucene.apache.org
 Subject: Highlighter
 
 
 I'm using the .net port of highlighter (1.5) and I notice it doesn't
 highlight range or prefix queries.. Is this consistent with the java
 version? Only I note my standard reference of www.lucenebook.com seems to
 support highlighting.. is this using that same highlighter 
 version (couldn't
 find any version info on the lucene apache site)
 
 TIA
 
 -Gwyn
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE : encoding

2006-01-26 Thread arnaudbuffet
Hello and thanks for your answer.

I do not find the ISOLatin1AccentFilter class in my lucene jar, but I found one 
on Google (attached to this mail); could you tell me if it is the right one?

I do not see anything in this class which can help me. This program will 
replace some accented characters, but my problem is:

if I try to index a text file encoded in Western 1252, for example with the 
Turkish text "düzenlediğimiz kampanyamıza", the Lucene index will contain 
re-encoded data such as &#0;&#17;k&#0;&#0;

Thanks & regards

A.

-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 26, 2006 03:01
To: java-user@lucene.apache.org
Subject: Re: encoding

arnaudbuffet wrote:

For text files, the data could be in different languages and therefore
different encodings. If the data is in Turkish, for example, the special
characters and accents are not recognized in my Lucene index. Is there a
way to resolve this problem? How do I work with the encoding?
  

I've been looking at a similar problem recently. There's 
org.apache.lucene.analysis.ISOLatin1AccentFilter on the svn trunk which 
may be quite close to what you want. I have a perl script here that I 
used to generate downgrading table for a C program. I can let you have 
the perl script as is, but if there's enough interest(*) I'll use it to 
generate, say, CompoundAsciiFilter since it converts compound characters 
like á, æ, ffi (ffi-ligature, in case it doesn't display) to a, ae and 
ffi. It's actually built from 
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt so it winds up 
having nearly 1200 entries. An earlier version converted all compound 
characters to their constituent parts, but this version just converts 
characters that are made up entirely of ASCII and modifiers.

jch

(*) Any interest, actually. Might be enough for me to be interested.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Paul,

I tried this but it ran out of memory trying to read the 500Mb .fdt file. I
tried various values for MAX_BBUF, but it still ran out of memory (I'm using
-Xmx1600M, which is the jvm's maximum value (v1.5))  I'll give
NioFSDirectory a try.

Thanks,
Peter


On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:

 On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
  The index is non-compound format and optimized. Yes, I did try
  MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors)
 
  Peter
 
 You could also give this a try:

 http://issues.apache.org/jira/browse/LUCENE-283

 Regards,
 Paul Elschot

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray,

The throughput is worse with NioFSDirectory than with the FSDirectory
(patched and unpatched). The bottleneck still seems to be synchronization,
this time in NioFile.getChannel (7 of the 8 threads were blocked there
during one snapshot).  I tried this with 4 and 8 channels.

The throughput with the patched FSDirectory was about the same as before the
patch.

Thanks,
Peter


On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:

 Speaking of NioFSDirectory, I thought there was one posted a while
 ago, is this something that can be used?
 http://issues.apache.org/jira/browse/LUCENE-414

 ray,

 On 11/22/05, Doug Cutting [EMAIL PROTECTED] wrote:
  Jay Booth wrote:
   I had a similar problem with threading, the problem turned out to be
 that in
   the back end of the FSDirectory class I believe it was, there was a
   synchronized block on the actual RandomAccessFile resource when
 reading a
   block of data from it... high-concurrency situations caused threads to
 stack
   up in front of this synchronized block and our CPU time wound up being
 spent
   thrashing between blocked threads instead of doing anything useful.
 
  This is correct.  In Lucene, multiple streams per file are created by
  cloning, and all clones of an FSDirectory input stream share a
  RandomAccessFile and must synchronize input from it.  MmapDirectory does
  not have this limitation.  If your indexes are less than a few GB or you
  are using 64-bit hardware, then MmapDirectory should work well for you.
Otherwise it would be simple to write an nio-based Directory that does
  not use mmap that is also unsynchronized.  Such a contribution would be
  welcome.
 
   Making multiple IndexSearchers and FSDirectories didn't help because
 in the
   back end, lucene consults a singleton HashMap of some kind (don't
 remember
   implementation) that maintained a single FSDirectory for any given
 index
   being accessed from the JVM... multiple calls to
 FSDirectory.getDirectory
   actually return the same FSDirectory object with synchronization at
 the same
   point.
 
  This does not make sense to me.  FSDirectory does keep a cache of
  FSDirectory instances, but i/o should not be synchronized on these.  One
  should be able to open multiple input streams on the same file from an
  FSDirectory.  But this would not be a great solution, since file handle
  limits would soon become a problem.
 
  Doug
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
Hmmm, can you run the 64 bit version of Windows (and hence a 64 bit JVM?)
We're running with heap sizes up to 8GB (RH Linux 64 bit, Opterons,
Sun Java 1.5)

-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
 Paul,

 I tried this but it ran out of memory trying to read the 500Mb .fdt file. I
 tried various values for MAX_BBUF, but it still ran out of memory (I'm using
 -Xmx1600M, which is the jvm's maximum value (v1.5))  I'll give
 NioFSDirectory a try.

 Thanks,
 Peter


 On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
 
  On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
   The index is non-compound format and optimized. Yes, I did try
   MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors)
  
   Peter
  
  You could also give this a try:
 
  http://issues.apache.org/jira/browse/LUCENE-283
 
  Regards,
  Paul Elschot
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: RE : encoding

2006-01-26 Thread Erik Hatcher


On Jan 26, 2006, at 7:26 PM, arnaudbuffet wrote:
I do not find the ISOLatin1AccentFilter class in my lucene jar, but  
I found one on Google (attached to this mail); could you tell me if it  
is the right one?


This used to be in contrib/analyzers but has been moved into the core  
(Subversion only for now):


	http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/


I do not see anything in this class which can help me. This program  
will replace some accented characters, but my problem is:


if I try to index a text file encoded in Western 1252, for example  
with the Turkish text "düzenlediğimiz kampanyamıza", the Lucene  
index will contain re-encoded data such as &#0;&#17;k&#0;&#0;


Reading encoded files is your application's responsibility.  You need  
to be sure to read the files in using the proper encoding.  Once read  
properly into Java, all will be well as far as Lucene indexing the  
characters goes.
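
For example, a sketch of reading a windows-1252 file explicitly before handing the text to Lucene (the charset name, file path, and field name are assumptions; use whatever matches your source files):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ReadWithEncoding {
    public static Document load(String path) throws Exception {
        // Read with an explicit charset instead of the platform default so
        // the Turkish characters arrive in Java as proper Unicode code points.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "windows-1252"));
        StringBuffer text = new StringBuffer();
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            text.append(line).append('\n');
        }
        in.close();

        Document doc = new Document();
        doc.add(Field.Text("contents", text.toString())); // indexed as Unicode
        return doc;
    }
}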


Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: encoding

2006-01-26 Thread John Haxby

arnaudbuffet wrote:


if I try to index a text file encoded in Western 1252, for example with the Turkish text 
"düzenlediğimiz kampanyamıza", the Lucene index will contain re-encoded data such as 
&#0;&#17;k&#0;&#0;
 


ISOLatin1AccentFilter.removeAccents() converts that string to
"duzenlediğimiz kampanyamıza"; the g-breve and the dotless-i are
untouched. My AsciiDecomposeFilter.decompose() converts the string to
"duzenledigimiz kampanyamiza".

However, since you're seeing those rather odd entities, it looks as
though you're not actually indexing what you think you're indexing. As
Erik says, you need to make sure that you're reading files with the
proper encoding; removing accents and adding dots won't help.

jch



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
I'd love to try this, but I'm not aware of any 64-bit jvms for Windows on
Intel. If you know of any, please let me know. Linux may be an option, too.

btw, I'm getting a sustained rate of 135 queries/sec with 4 threads, which
is pretty impressive. Another way around the concurrency limit is to run
multiple jvms. The throughput of each is less, but the aggregate throughput
is higher.

Peter


On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Hmmm, can you run the 64 bit version of Windows (and hence a 64 bit JVM?)
 We're running with heap sizes up to 8GB (RH Linux 64 bit, Opterons,
 Sun Java 1.5)

 -Yonik

 On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Paul,
 
  I tried this but it ran out of memory trying to read the 500Mb .fdt
 file. I
  tried various values for MAX_BBUF, but it still ran out of memory (I'm
 using
  -Xmx1600M, which is the jvm's maximum value (v1.5))  I'll give
  NioFSDirectory a try.
 
  Thanks,
  Peter
 
 
  On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
  
   On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
The index is non-compound format and optimized. Yes, I did try
MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term
 vectors)
   
Peter
   
   You could also give this a try:
  
   http://issues.apache.org/jira/browse/LUCENE-283
  
   Regards,
   Paul Elschot
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
BEA Jrockit supports both AMD64 and Intel's EM64T (basically renamed AMD64)
http://www.bea.com/framework.jsp?CNT=index.htm&FP=/content/products/jrockit/

and Sun's Java 1.5 for the Windows AMD64 Platform.
They advertise AMD64, presumably because that's what their servers
use, but it should work on Intel's x86_64 (EM64T) also.  The release
notes have the following:
With the release, J2SE support for Windows 64-bit has progressed from
release candidate to final release. This version runs on AMD64/EM64T
64-bit mode machines with Windows Server 2003 x64 Editions.

Of course, if the platform is up to you, I'd choose Linux :-)

-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
 I'd love to try this, but I'm not aware of any 64-bit jvms for Windows on
 Intel. If you know of any, please let me know. Linux may be an option, too.

 btw, I'm getting a sustained rate of 135 queries/sec with 4 threads, which
 is pretty impressive. Another way around the concurrency limit is to run
 multiple jvms. The throughput of each is less, but the aggregate throughput
 is higher.

 Peter


 On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 
  Hmmm, can you run the 64 bit version of Windows (and hence a 64 bit JVM?)
  We're running with heap sizes up to 8GB (RH Linux 64 bit, Opterons,
  Sun Java 1.5)
 
  -Yonik
 
  On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
   Paul,
  
   I tried this but it ran out of memory trying to read the 500Mb .fdt
  file. I
   tried various values for MAX_BBUF, but it still ran out of memory (I'm
  using
   -Xmx1600M, which is the jvm's maximum value (v1.5))  I'll give
   NioFSDirectory a try.
  
   Thanks,
   Peter
  
  
   On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
   
On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
 The index is non-compound format and optimized. Yes, I did try
 MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term
  vectors)

 Peter

You could also give this a try:
   
http://issues.apache.org/jira/browse/LUCENE-283
   
Regards,
Paul Elschot
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting the document number (with IndexReader)

2006-01-26 Thread Chris Hostetter

:  The document number is the variable i in this case.
: If the document number is the variable i (enumerated from numDocs()),
: what's the difference between numDocs() and maxDoc() in this case? I
: was previously under the impression that the internal docNum might be
: different to the counter.

Iterating between 0 and maxDoc()-1 will give you the range of all possible
doc ids, but some of those docs may have already been deleted.  I believe
that is what you want to do ... you can check whether a doc is deleted using
IndexReader.isDeleted(i).

numDocs() is implemented as maxDoc() - deletedDocs.count(), so I don't
think it ever makes sense to iterate up to numDocs().
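
A minimal sketch of that loop, reusing the original poster's own thisDocShouldBeDeleted() placeholder:

IndexReader ir = IndexReader.open(path);
for (int i = 0; i < ir.maxDoc(); i++) {
    if (ir.isDeleted(i)) {
        continue; // this slot's document was already deleted
    }
    Document doc = ir.document(i);
    if (thisDocShouldBeDeleted(doc)) {
        ir.delete(i); // here the loop counter really is the doc number
    }
}
ir.close();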

: I'm doing something akin to a rangeQuery, where I delete documents
: within a certain range (in addition to other criteria). Is it better
: to do a query on the range, mark all the docNums getting them with
: Hits.id(), and then retrieve docs and test for deletion according to
: that?

Take a look at the way RangeFilter.bits() is implemented.  If you
cut/paste that code and replace the call to bits.set(termDocs.doc()) with
reader.delete(termDocs.doc()), I think you'd have exactly what you want.

Or, since cutting/pasting code is A Bad Thing from a maintenance/bug
fixing standpoint, you could just call RangeFilter.bits(reader) yourself,
and then iterate over the set bits and call delete on each one.
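
Roughly like this (the field name and bounds are made up for illustration; RangeFilter is the one shipped with Lucene):

import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.RangeFilter;

public class DeleteByRange {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        // inclusive range over a "date" field stored as sortable strings
        RangeFilter filter = new RangeFilter("date", "20050101", "20051231", true, true);
        BitSet bits = filter.bits(reader);
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            reader.delete(i); // each set bit is the doc id of a range match
        }
        reader.close(); // flushes the deletions
    }
}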


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting the document number (with IndexReader)

2006-01-26 Thread Paul Elschot
On Thursday 26 January 2006 09:47, Chun Wei Ho wrote:
 Hi,
 
 Thanks for the help, just a few more questions:
 
 On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
  On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
   I am attempting to prune an index by getting each document in turn and
   then checking/deleting it:
  
   IndexReader ir = IndexReader.open(path);
   for (int i = 0; i < ir.numDocs(); i++) {
     Document doc = ir.document(i);
     if (thisDocShouldBeDeleted(doc)) {
         ir.delete(docNum); // <- I need the docNum for doc.
     }
   }
  
   How do I get the docNum for IndexReader.delete() function in the above
   case? Is there an API function I am missing? I am working with a merged
 
  The document number is the variable i in this case.
 If the document number is the variable i (enumerated from numDocs()),
 what's the difference between numDocs() and maxDoc() in this case? I
 was previously under the impression that the internal docNum might be
 different to the counter.

Iirc, the difference between maxDoc() and numDocs() is the number of
deleted documents. Check the javadocs to be sure.

 
   index over different segments so the docNum might not be in running
   sequence with the counter i.
   In general, is there a better way to do this sort of thing?
 
  This code:
 
  Document doc = ir.document(i);
 
  normally retrieves all the stored fields of the document and that is
  quite costly. In case you know that the document(s) to be deleted
  match(es) a Term, it's better to use IndexReader.delete(Term).
 
 I'm doing something akin to a rangeQuery, where I delete documents
 within a certain range (in addition to other criteria). Is it better
 to do a query on the range, mark all the docNums getting them with
 Hits.id(), and then retrieve docs and test for deletion according to
 that?

In that case it is faster to use the Terms generated inside the range query
and pass them to IndexReader.delete(Term).
To generate the terms, have a look at the source code of the rewrite()
method of RangeQuery here:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/
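
A hedged sketch of that idea, walking the terms of one field between two bounds with a TermEnum and deleting by term (the field name and bounds are illustrative only):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class DeleteByTermRange {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        String field = "date";
        String upper = "20051231";
        // terms(Term) positions the enumeration at the first term >= the lower bound
        TermEnum terms = reader.terms(new Term(field, "20050101"));
        try {
            do {
                Term term = terms.term();
                if (term == null || !term.field().equals(field)
                        || term.text().compareTo(upper) > 0) {
                    break; // ran past the field or past the upper bound
                }
                reader.delete(term); // delete every doc containing this term
            } while (terms.next());
        } finally {
            terms.close();
        }
        reader.close();
    }
}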

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting the document number (with IndexReader)

2006-01-26 Thread Paul Elschot
On Thursday 26 January 2006 19:44, Chris Hostetter wrote:
 
 :  The document number is the variable i in this case.
 : If the document number is the variable i (enumerated from numDocs()),
 : what's the difference between numDocs() and maxDoc() in this case? I
 : was previously under the impression that the internal docNum might be
 : different to the counter.
 
 Iterating between 0 and maxDoc()-1 will give you the range of all possible
 doc ids, but some of those docs may have already been deleted.  I believe
 that is what you want to do ... you can check whether a doc is deleted using
 IndexReader.isDeleted(i).
 
 numDocs() is implemented as maxDoc() - deletedDocs.count(), so I don't
 think it ever makes sense to iterate up to numDocs().
 
 : I'm doing something akin to a rangeQuery, where I delete documents
 : within a certain range (in addition to other criteria). Is it better
 : to do a query on the range, mark all the docNums getting them with
 : Hits.id(), and then retrieve docs and test for deletion according to
 : that?
 
 Take a look at the way RangeFilter.bits() is implemented.  If you
 cut/paste that code and replace the call to bits.set(termDocs.doc()) with
 reader.delete(termDocs.doc()), I think you'd have exactly what you want.
 
 Or, since cutting/pasting code is A Bad Thing from a maintenance/bug
 fixing standpoint, you could just call RangeFilter.bits(reader) yourself,
 and then iterate over the set bits and call delete on each one.

Perhaps an extra rewrite method with a term visitor argument?

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: encoding

2006-01-26 Thread petite_abeille

Hello,

On Jan 26, 2006, at 12:01, John Haxby wrote:

I have a perl script here that I used to generate downgrading table 
for a C program. I can let you have the perl script as is, but if 
there's enough interest(*) I'll use it to generate, say, 
CompoundAsciiFilter since it converts compound characters like á, æ, ffi 
(ffi-ligature, in case it doesn't display) to a, ae and ffi. It's 
actually built from 
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt so it winds up 
having nearly 1200 entries. An earlier version converted all compound 
characters to their constituent parts, but this version just converts 
characters that are made up entirely of ASCII and modifiers.


I would love to see this. I presently have a somewhat unwieldy 
conversion table [1] that I would love to get rid of :))


Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/

[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting

Doug Cutting wrote:
A 64-bit JVM with NioDirectory would really be optimal for this. 


Oops.  I meant MMapDirectory, not NioDirectory.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Dumb question: does the 64-bit compiler (javac) generate different code than
the 32-bit version, or is it just the jvm that matters? My reported speedups
were solely from using the 64-bit jvm with jar files from the 32-bit
compiler.

Peter


On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Nice speedup!  The extra registers in 64 bit mode may have helped a little
 too.

 -Yonik

 On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Correction: make that 285 qps :)

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
There is no difference in bytecode... the whole difference is just in
the underlying JVM.

-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
 Dumb question: does the 64-bit compiler (javac) generate different code than
 the 32-bit version, or is it just the jvm that matters? My reported speedups
 were solely from using the 64-bit jvm with jar files from the 32-bit
 compiler.

 Peter

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Ray Tsang
Peter,

Wow, the speedup is impressive! But may I ask what you did to
achieve 135 queries/sec prior to the JVM switch?

ray,

On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
 Correction: make that 285 qps :)

 On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
 
  I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
  getting 250 queries/sec and excellent cpu utilization (equal concurrency on
  all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't aware
  of it.
 
  Thanks all very much.
  Peter
 
 
  On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:
  
   Doug Cutting wrote:
A 64-bit JVM with NioDirectory would really be optimal for this.
  
   Oops.  I meant MMapDirectory, not NioDirectory.
  
   Doug
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray,

The short answer is that you can make Lucene blazingly fast by using advice
and design principles mentioned in this forum and of course reading 'Lucene
in Action'. For example, use a 'content' field for searching all fields (vs
multi-field search), put all your stored data in one field, understand the
cost of numeric search and sorting. On the platform side, go multi-CPU and
of course 64-bit if possible :)

Also, I would venture to guess that a lot of search bottlenecks have nothing
to do with Lucene, but rather in the infrastructure around it. For example,
how does your client interface to the search engine? My results use a plain
socket interface between client and server (one connection for queries,
another for results), using a simple query/results data format. Introducing
other web infrastructures invites degradation in performance, too.

I've a bit of experience with search engines, but I'm obviously still
learning thanks to this group.

Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:

 Peter,

 Wow, the speedup is impressive! But may I ask what you did to
 achieve 135 queries/sec prior to the JVM switch?

 ray,

 On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Correction: make that 285 qps :)
 
  On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
  
   I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
   getting 250 queries/sec and excellent cpu utilization (equal
 concurrency on
   all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
 aware
   of it.
  
   Thanks all very much.
   Peter
  
  
   On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:
   
Doug Cutting wrote:
 A 64-bit JVM with NioDirectory would really be optimal for this.
   
Oops.  I meant MMapDirectory, not NioDirectory.
   
Doug
   
   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
 
 



problem updating a document: no segments file?

2006-01-26 Thread John Powers
Hello,

I have a couple of instances of Lucene. I just altered one implementation and now 
it's not keeping a segments file. While indexing occurs, there is a segments 
file, but once it's done, there isn't. All the other indexes have one. 
The problem comes when I try to update a document: it says the segments file is 
not found and that stops it. This code was working fine on my development box, 
but now that I've gone to production it's not keeping that segments file. And it 
searches just fine. I can reindex over and over, and it keeps disappearing.

any ideas?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Ray Tsang
Paul,

Thanks for the advice! But for the 100+ queries/sec on a 32-bit
platform, did you end up applying other patches, or using different
FSDirectory implementations?

Thanks!

ray,

On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
 Ray,

 The short answer is that you can make Lucene blazingly fast by using advice
 and design principles mentioned in this forum and of course reading 'Lucene
 in Action'. For example, use a 'content' field for searching all fields (vs
 multi-field search), put all your stored data in one field, understand the
 cost of numeric search and sorting. On the platform side, go multi-CPU and
 of course 64-bit if possible :)

 Also, I would venture to guess that a lot of search bottlenecks have nothing
 to do with Lucene, but rather in the infrastructure around it. For example,
 how does your client interface to the search engine? My results use a plain
 socket interface between client and server (one connection for queries,
 another for results), using a simple query/results data format. Introducing
 other web infrastructures invites degradation in performance, too.

 I've a bit of experience with search engines, but I'm obviously still
 learning thanks to this group.

 Peter

 On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
 
  Peter,
 
  Wow, the speedup is impressive! But may I ask what you did to
  achieve 135 queries/sec prior to the JVM switch?
 
  ray,
 
  On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
   Correction: make that 285 qps :)
  
   On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
   
I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
getting 250 queries/sec and excellent cpu utilization (equal
  concurrency on
all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
  aware
of it.
   
Thanks all very much.
Peter
   
   
On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:

 Doug Cutting wrote:
  A 64-bit JVM with NioDirectory would really be optimal for this.

 Oops.  I meant MMapDirectory, not NioDirectory.

 Doug


  -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


   
  
  
 




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray,

The 135 qps rate was using the standard FSDirectory in 1.9.

Peter


On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:

 Paul,

 Thanks for the advice! But for the 100+ queries/sec on a 32-bit
 platform, did you end up applying other patches, or using different
 FSDirectory implementations?

 Thanks!

 ray,

 On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Ray,
 
  The short answer is that you can make Lucene blazingly fast by using
 advice
  and design principles mentioned in this forum and of course reading
 'Lucene
  in Action'. For example, use a 'content' field for searching all fields
 (vs
  multi-field search), put all your stored data in one field, understand
 the
  cost of numeric search and sorting. On the platform side, go multi-CPU
 and
  of course 64-bit if possible :)
 
  Also, I would venture to guess that a lot of search bottlenecks have
 nothing
  to do with Lucene, but rather in the infrastructure around it. For
 example,
  how does your client interface to the search engine? My results use a
 plain
  socket interface between client and server (one connection for queries,
  another for results), using a simple query/results data format.
 Introducing
  other web infrastructures invites degradation in performance, too.
 
  I've a bit of experience with search engines, but I'm obviously still
  learning thanks to this group.
 
  Peter
 
  On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
  
   Peter,
  
   Wow, the speedup is impressive! But may I ask what you did to
   achieve 135 queries/sec prior to the JVM switch?
  
   ray,
  
   On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
Correction: make that 285 qps :)
   
On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:

 I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm
 now
 getting 250 queries/sec and excellent cpu utilization (equal
   concurrency on
 all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I
 wasn't
   aware
 of it.

 Thanks all very much.
 Peter


 On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:
 
  Doug Cutting wrote:
   A 64-bit JVM with NioDirectory would really be optimal for
 this.
 
  Oops.  I meant MMapDirectory, not NioDirectory.
 
  Doug
 
 
   -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 

   
   
  
 
 



Re: Two strange things in Lucene

2006-01-26 Thread Daniel Pfeifer
 Since I didn't find anything in the log from log4j I did a kill  
 -3 on
  the process and found two very interesting things:
 
 Almost all multisearcher threads were in this state:
 
 "MultiSearcher thread #1" daemon prio=10 tid=0x01900960
 nid=0x81442c waiting for monitor entry
 [0xfd7d269ff000..0xfd7d269ffb50]
  at java.util.Vector.size(Vector.java:270)
  - waiting to lock <0xfd7f0114ea28> (a java.util.Vector)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:95)
 
 I don't know about this one, but guessing that it just happens to be  
 a normal state of the system when you killed the process.  *shrugs*
 
You probably missed the -3 parameter. This just dumps the state of the
virtual machine, it doesn't actually kill the JVM. Thus I believe that
this is not a normal state.
 
 And, additionally, I found another stacktrace in the stdout log which I
 find interesting:
 
 Exception in thread "MultiSearcher thread #1"
 org.apache.lucene.search.BooleanQuery$TooManyClauses
 
 This is a typical occurrence when using queries that expand, such as
 WildcardQuery, RangeQuery, FuzzyQuery, etc.  If users are doing
 queries like "a*" and there are over 1024 terms that start with "a",
 then you will, by default, blow up WildcardQuery's expansion into a
 BooleanQuery.  You can up that limit on BooleanQuery, or disallow
 those types of queries perhaps.
 
Ok, I'll see what I can do.
 
Thanks!
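
For reference, raising that limit is a one-line static call on BooleanQuery; the value 4096 below is just an example, not a recommendation:

import org.apache.lucene.search.BooleanQuery;

// Allow expanded wildcard/range/fuzzy queries to grow to 4096 clauses
// before BooleanQuery$TooManyClauses is thrown (the default is 1024).
BooleanQuery.setMaxClauseCount(4096);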


How does the lucene normalize the score?

2006-01-26 Thread xing jiang
Hi,

I want to know how Lucene normalizes the score. I see the Hits class has
a function to get each document's score, but I don't know how Lucene
calculates the normalized score, and in Lucene in Action it only says
"normalized score of the nth top scoring documents".
--
Regards

Jiang Xing


Re: Performance tips?

2006-01-26 Thread Chris Lamprecht
I seem to say this a lot :), but, assuming your OS has a decent
filesystem cache, try reducing your JVM heapsize, using an FSDirectory
instead of RAMDirectory, and see if your filesystem cache does ok.  If
you have 12GB, then you should have enough RAM to hold both the old
and new indexes during the switchover.
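
A minimal sketch of the disk-backed setup (the index path is an assumption): open the searcher over an FSDirectory and let the OS file system cache do the caching, instead of copying the whole index into a RAMDirectory.

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class DiskBackedSearcher {
    public static void main(String[] args) throws Exception {
        // false = open an existing index rather than creating a new one; the
        // OS keeps the hot parts of the index files cached in memory for us.
        FSDirectory dir = FSDirectory.getDirectory("/indexes/products", false);
        IndexSearcher searcher = new IndexSearcher(dir);
        System.out.println("maxDoc = " + searcher.maxDoc());
        searcher.close();
    }
}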

-chris

On 1/26/06, Daniel Pfeifer [EMAIL PROTECTED] wrote:
 Hi,



 Got more questions regarding Lucene and this time it's about performance
 ;-)



 We currently are using RAMDirectories to read our indexes. This has now
 become a problem since our index has grown to approximately 5GB of RAM, and
 the machine we are running on only has 12GB of RAM; every time we refresh
 the RAMDirectories we of course keep the old Searchables so that there
 is no service interruption.



 This means we consume 10GB of RAM from time to time. One solution is of
 course to stop using RAM and read everything from disk, but I can imagine
 that the performance will decrease significantly. Is there any
 workaround you can think of? Perhaps a hybrid between FSDirectory and
 RAMDirectory, for example one where only frequently searched documents are
 cached and the others are read from disk?



 Well, I'd appreciate any ideas at all!
 Thanks
 /Daniel




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]