Lucene 1.3 final to 1.4 final problem

2004-07-08 Thread Karthik N S
Hey

Dev Guys

Apologies 

I have a Quick Problem...

  The number of hits on a set of documents indexed using 1.3-final is not the
same as on 1.4-final.
  [ The only modification done to the source is that I upgraded my
CustomAnalyzer to be based on the StopAnalyzer available in 1.4. ]
  Does doing this affect the results?


  Somebody please explain.


with regards
Karthik




-Original Message-
From: Alex Aw Seat Kiong [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 07, 2004 9:50 AM
To: Lucene Users List
Subject: upgrade from Lucene 1.3 final to 1.4rc3 problem


Hi!

I'm currently using Lucene 1.3 final, and everything was working fine.
But after I upgraded from Lucene 1.3 final to 1.4rc3 (I simply replaced
lucene-1.3-final.jar with lucene-1.4-rc3.jar and recompiled), the code
compiles successfully, but when I try to index a document it gives the
error below:
java.lang.NullPointerException
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:146)
at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:126)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173)
What's wrong? Please help.

Thanks.

Regards,
Alex








Lucene 1.3 final to 1.4 final problem

2004-07-08 Thread Karthik N S

Hey

Dev Guys

Apologies


 Can somebody explain to me
  why, for an input word "TA", the StopAnalyzer returns [ta] instead of [TA]:

  "TA"      ==  [ta]      instead of  [TA]

  "$125.96" ==  [125.96]  instead of  [$125.96]

  Is there something I have been missing?
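
A quick way to see what an analyzer actually emits is to run it over a test
string and print the tokens. A minimal sketch against the 1.4 API (the field
name "f" is arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.*;

public class AnalyzerDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StopAnalyzer();
    // StopAnalyzer is LowerCaseTokenizer + StopFilter: it keeps only runs
    // of letters, lower-cased. So "TA" comes out as [ta], and "$125.96"
    // contains no letters at all, so it yields no tokens. An analyzer built
    // on StandardTokenizer would keep 125.96 but still drop the '$'.
    TokenStream ts = analyzer.tokenStream("f", new StringReader("TA $125.96"));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.println(t.termText());
    }
  }
}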


 with regards
Karthik







Re: boolean operators and score

2004-07-08 Thread Niraj Alok
If I do it by sorting the input before sending it to Lucene, it could become
unmanageable and could also produce unexpected results for the user.

E.g., if I type: "winston churchill and world war and germany",

I could split the string by "and" and get the sorted string as (churchill
winston) and (germany) and (war world).
This would obviously make the hits score produce unexpected results.

Isn't there any other solution which comes from Lucene itself? I am using
1.4 final.

Regards,
Niraj


Re: boolean operators and score

2004-07-08 Thread Brisbart Franck
Niraj Alok wrote:
> Hi Guys,
> Finally I have sorted out the problem of hits score, thanks to the great
> help of Franck.
> I have hit another problem with the boolean operators now.
> When I search for "Winston and churchill" I get a set of perfectly
> acceptable results.
> But when I change the order, "churchill and winston", the results are the
> same but the order of the results changes.

I don't think it is interpreted as the same request. As you may know, the
terms of a boolean query have a 'required' flag.
As I read it, your request 'winston and churchill' is interpreted as
'winston (not required)' and 'churchill (required)',
but your request 'churchill and winston' is interpreted as 'churchill
(not required)' and 'winston (required)'.

I think you'd rather search for '+winston +churchill' (which should be the
same as '+churchill +winston') to have both terms required.

Franck

> Is it possible to have the same order (hits.score) irrespective of which
> term is given before or after?
> Regards,
> Niraj

--
Franck Brisbart
R&D
http://www.kelkoo.com
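
For reference, a minimal sketch of making both terms required with the 1.4
QueryParser (the field name "contents" and the searcher are just placeholders):

Query q = QueryParser.parse("+winston +churchill", "contents",
                            new StandardAnalyzer());
Hits hits = searcher.search(q);
// Both clauses now carry the 'required' flag, so the same documents match
// regardless of the order the terms were typed in.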


Re: indexing help

2004-07-08 Thread Grant Ingersoll
Hi John,

The source code is available from CVS, make it non-final and do what you need to do.  
Of course, you may have a hard time finding help later if you aren't using something 
everyone else is and your solution doesn't work...  :-)

If I understand correctly what you are trying to do, you already know all of the 
answers for indexing, you just want Lucene to do the retrieval side of the coin, 
correct?  I suppose a crazy idea might be to write a program that took your info and 
output it in the Lucene file format, but that seems a bit like overkill.

-Grant

 [EMAIL PROTECTED] 07/07/04 07:37PM 
Hi Doug:
 Thanks for the response!

 The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 are
necessary.

Given many documents with many terms and frequencies, it would
create many extra Token instances.

   The reason I was looking at deriving from the Field class is that I
could directly manipulate the FieldInfo by setting the frequency. But
the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
   While lucene tokenizes the words in the document, it counts the
  frequency and figures out the position, we are trying to bypass this
  stage: For each document, I have a set of words with a known frequency,
  e.g. java (5), lucene (6) etc. (I don't care about the position, so it
  can always be 0.)
 
   What I can do now is to create a dummy document, e.g. java java
  java java java lucene lucene lucene lucene lucene and pass it to
  lucene.
 
   This seems hacky and cumbersome. Is there a better alternative? I
  browsed around in the source code, but couldn't find anything.
 
 Write an analyzer that returns terms with the appropriate distribution.
 
 For example:
 
 public class VectorTokenStream extends TokenStream {
   private String[] terms;
   private int[] freqs;
   private int term = -1;
   private int freq = 0;
   public VectorTokenStream(String[] terms, int[] freqs) {
     this.terms = terms;
     this.freqs = freqs;
   }
   public Token next() {
     if (freq == 0) {
       term++;
       if (term >= terms.length)
         return null;
       freq = freqs[term];
     }
     freq--;
     return new Token(terms[term], 0, 0);
   }
 }
 
 Document doc = new Document();
 doc.add(Field.Text("content", ""));
 indexWriter.addDocument(doc, new Analyzer() {
   public TokenStream tokenStream(String field, Reader reader) {
     return new VectorTokenStream(new String[] {"java", "lucene"},
                                  new int[] {5, 6});
   }
 });
 
Too bad the Field class is final, otherwise I can derive from it
  and do something on that line...
 
 Extending Field would not help.  That's why it's final.
 
 Doug
 



Re: boolean operators and score

2004-07-08 Thread Don Vaillancourt
What could actually be done is perhaps to sort the search results by document
id.  Of course your relevancy will be all shot, but at least you would have
control over the sort order.

At 09:05 AM 07/07/2004, you wrote:
Hi Guys,
Finally I have sorted out the problem of hits score, thanks to the great help
of Franck.
I have hit another problem with the boolean operators now.
When I search for "Winston and churchill" I get a set of perfectly
acceptable results.
But when I change the order, "churchill and winston", the results are the
same but the order of the results changes.
Is it possible to have the same order (hits.score) irrespective of which
term is given before or after?
Regards,
Niraj







RE: Problem with match on a non tokenized field.

2004-07-08 Thread Polina Litvak
Thanks a lot for your help.
I have one more question:

How would you handle a query consisting of two fields combined with a
Boolean operator, where one field is only indexed and stored (a Keyword)
and the other is tokenized, indexed and stored?
Is it possible to have parts of the same query analyzed with different
analyzers?


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: July 7, 2004 4:38 PM
To: [EMAIL PROTECTED]
Subject: RE: Problem with match on a non tokenized field.

Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper

Here is how I use it:

PerFieldAnalyzerWrapper analyzer = new
org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
analyzer.addAnalyzer("url", new NullAnalyzer());
try
{
query = QueryParser.parse(searchQuery,
"contents",
analyzer);
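
NullAnalyzer above appears to be a custom class rather than part of Lucene
1.4. A minimal sketch of such an analyzer, assuming the intent is to emit the
entire field value as a single unmodified token:

import java.io.Reader;
import org.apache.lucene.analysis.*;

public class NullAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Treat every character as a token character, so the whole input
    // becomes one token and values like ABC5-LB are left intact.
    return new CharTokenizer(reader) {
      protected boolean isTokenChar(char c) {
        return true;
      }
    };
  }
}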

-Original Message-
From: Polina Litvak [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 07, 2004 4:20 PM
To: [EMAIL PROTECTED]
Subject: Problem with match on a non tokenized field.


I have a Lucene Document with a field named "Code" which is stored
and indexed but not tokenized. The value of the field is "ABC5-LB".
The only way I can match the field when searching is by entering
Code:"ABC5-LB", because when I drop the quotes, every Analyzer I've tried
using breaks my
query into Code:ABC5 -Code:LB.

I need to be able to match this field by doing something like
Code:ABC5-L*, therefore always using quotes is not an option.

How would I go about writing my own analyzer that will not tokenize the
query?
 
Thanks,
Polina
 




Re: indexing help

2004-07-08 Thread John Wang
Hi Grant:
 Thanks for the options. How likely are the Lucene file formats to change?

 Are there really no more options? :(...

Thanks

-John




Re: indexing help

2004-07-08 Thread John Wang
Hi Grant:

 I have something that extracts only the important words from
a document along with their importance; furthermore, these important
words may not be physically in the document, they could be synonyms of
some of the words in the document. So the output of the process for a
document is a list of word/importance pairs.

I want to be able to query using only these words on the document.

   I don't think Lucene has such a capability.

   Can you suggest what I can do with the analyzer process to achieve
this without replicating words/tokens?

Thanks

-John

On Thu, 08 Jul 2004 11:10:07 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
 Hey John,
 
 Those are just options, I didn't say they were good ones!  :-)
 
 I guess the real question is, what is the background of what you are trying to do?
 Presumably you have some other program that is generating frequencies for you; do
 you really need that in its current form?  Can't the Lucene indexing engine act as a
 stand-in for this process, since your end result _should_ be the same?  The Lucene
 Analyzer process is quite flexible; I bet you could even find a way to hook your
 existing tools into the Analyzer process.
 
 -Grant
 

Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Peter M Cipollone
You might try merging the existing index into a new index located on a ram
disk.  Once it is done, you can move the directory from the ram disk back to
your hard disk.  I think this will work as long as the old index did not
finish merging.  You might run the strings command on the segments file to
make sure the new (merged) segment is not in there, and if there's a
"deletable" file, make sure there are no segments from the old index listed
therein.

- Original Message - 
From: Kevin A. Burton [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 2:02 PM
Subject: Way to repair an index broken during 1/2 optimize?


 So.. the other day I sent an email about building an index with 14M
 documents.

 That went well but the optimize() was taking FOREVER.  It took 7 hours
 to generate the whole index and when complete as of 10AM it was still
 optimizing (6 hours later) and I needed the box back.

 So is it possible to fix this index now?  Can I just delete the most
 recent segment that was created?  I can find this by ls -alt

 Also... what can I do to speed up this optimize?  Ideally it wouldn't
 take 6 hours.

 Kevin




problem running lucene 1.4 demo on a solaris machine (permission denied)

2004-07-08 Thread MATL (Mats Lindberg)
Hello
 
I have downloaded Lucene 1.4 to a Windows machine, and it all works
fine. When I try to move this to a Solaris machine I get the following
error:
 
/opt/tomcat/common/lib/lucene-1.4-final.jar: cannot execute
 
If I then change the permissions (777) on the above file, I get
the following error:
/opt/tomcat/common/lib/lucene-1.4-final.jar: PK^C^D: not found
/opt/tomcat/common/lib/lucene-1.4-final.jar: \304U\3410: not found
/opt/tomcat/common/lib/lucene-1.4-final.jar: syntax error at line 3: `('
unexpected

 
Any ideas how to solve this, or what causes the error?
 
I am running in the following environment:
java version 1.2.2
Solaris VM (build Solaris_JDK_1.2.2_10, native threads, sunwjit)

but I have tried on a Java version 1.4.2 (I believe it was), with the
same error.
 
When I copied the Lucene jar file to the Solaris machine from the
Windows machine I used an FTP program.
 
Any help is much appreciated.
 
Best regards,
Mats Lindberg


Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
So is it possible to fix this index now?  Can I just delete the most 
recent segment that was created?  I can find this by ls -alt
Sorry, I forgot to answer your question: this should work fine.  I don't 
think you should even have to delete that segment.

Also, to elaborate on my previous comment, a mergeFactor of 5000 not 
only delays the work until the end, but it also makes the disk workload 
more seek-dominated, which is not optimal.  So I suspect a smaller merge 
factor, together with a larger minMergeDocs, will be much faster 
overall, including the final optimize().  Please tell us how it goes.
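
For reference, a minimal sketch of the tuning described above (the path and
values are illustrative; mergeFactor and minMergeDocs are public fields on
IndexWriter in 1.4):

IndexWriter writer = new IndexWriter("/path/to/index",
                                     new StandardAnalyzer(), true);
writer.mergeFactor = 10;     // merge small segments as you go, not all at the end
writer.minMergeDocs = 1000;  // buffer more documents in RAM per on-disk segment
// ... add documents ...
writer.optimize();
writer.close();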

Doug


Re: problem running lucene 1.4 demo on a solaris machine (permission denied)

2004-07-08 Thread Doug Cutting
MATL (Mats Lindberg) wrote:
When i copied the lucene jar file to the solaris machine from the
windows machine i used a ftp program.
FTP probably mangled the file.  You need to use FTP's binary mode.
Doug
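
For anyone hitting the same thing: command-line FTP clients often default to
ASCII mode, which corrupts jar files. A sketch of a safe transfer (switch to
binary first):

ftp> binary
ftp> put lucene-1.4-final.jar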


Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Peter M Cipollone wrote:
You might try merging the existing index into a new index located on a ram
disk.  Once it is done, you can move the directory from ram disk back to
your hard disk.  I think this will work as long as the old index did not
finish merging.  You might do a strings command on the segments file to
make sure the new (merged) segment is not in there, and if there's a
deletable file, make sure there are no segments from the old index listed
therein.
 

It's a HUGE index.  It won't fit in memory ;)  Right now it's at 8G...
Thanks though! :)
Kevin


Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
Also... what can I do to speed up this optimize? Ideally it wouldn't 
take 6 hours.

Was this the index with the mergeFactor of 5000? If so, that's why 
it's so slow: you've delayed all of the work until the end. Indexing 
on a ramfs will make things faster in general, however, if you have 
enough RAM...
No... I changed the mergeFactor back to 10 as you suggested.
Kevin


Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
So is it possible to fix this index now? Can I just delete the most 
recent segment that was created? I can find this by ls -alt

Sorry, I forgot to answer your question: this should work fine. I 
don't think you should even have to delete that segment.
I'm worried about duplicate or missing content from the original index. 
I'd rather rebuild the index and waste another 6 hours (I've probably 
blown 100 hours of CPU time on this already) and have a correct index :)

During an optimize I assume Lucene starts writing to a new segment and 
leaves all others in place until everything is done and THEN deletes them?

Also, to elaborate on my previous comment, a mergeFactor of 5000 not 
only delays the work until the end, but it also makes the disk 
workload more seek-dominated, which is not optimal. 
The only settings I use are:
targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;
the resulting index has 230k files in it :-/
I assume this is contributing to all the disk seeks.
So I suspect a smaller merge factor, together with a larger 
minMergeDocs, will be much faster overall, including the final 
optimize(). Please tell us how it goes.

This is what I did for this last round but then I ended up with the 
highly fragmented index.

hm...
Thanks for all the help btw!
Kevin


Re: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-08 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:
Hi,
a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went
smoothly, but we are experiencing some problems with that new constant limit
maxClauseCount=1024
which leeds to Exceptions of type 

	org.apache.lucene.search.BooleanQuery$TooManyClauses 

when certain RangeQueries are executed (in fact, we get this Excpetion when
we execute certain Wildcard queries, too). Although we are working with a
fairly small index with about 35.000 documents, we encounter this Exception
when we search for the property modificationDate. For example
	modificationDate:[00 TO 0dwc970kw] 

 

We talked about this the other day.
http://wiki.apache.org/jakarta-lucene/IndexingDateFields
Find out what type of precision you need and use that.  If you only need
days or hours or minutes then use that.  Millis is just too small.

We're only using days, and our queries cover at most the last 7 days, so
this really works out well...
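
A minimal sketch of day-precision date indexing (the field name and format
are illustrative; the point is that a range over day-resolution terms can
only expand to a handful of clauses):

SimpleDateFormat df = new SimpleDateFormat("yyyyMMdd");
doc.add(Field.Keyword("modificationDate", df.format(modified)));
// A 7-day window then covers at most 7 terms:
//   modificationDate:[20040701 TO 20040708]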

Kevin


Re: indexing help

2004-07-08 Thread John Wang
Thanks Doug. I will do just that.

Just for my education, can you maybe elaborate on the "implement an
IndexReader that delivers a synthetic index" approach?

Thanks in advance

-John

On Thu, 08 Jul 2004 10:01:59 -0700, Doug Cutting [EMAIL PROTECTED] wrote:
 John Wang wrote:
   The solution you proposed is still a derivative of creating a
  dummy document stream. Taking the same example, java (5), lucene (6),
  VectorTokenStream would create a total of 11 Tokens whereas only 2 are
  necessary.
 
 That's easy to fix.  We just need to reuse the token:
 
 public class VectorTokenStream extends TokenStream {
   private String[] terms;
   private int[] freqs;
   private int term = -1;
   private int freq = 0;
   private Token token;
   public VectorTokenStream(String[] terms, int[] freqs) {
     this.terms = terms;
     this.freqs = freqs;
   }
   public Token next() {
     if (freq == 0) {
       term++;
       if (term >= terms.length)
         return null;
       token = new Token(terms[term], 0, 0);
       freq = freqs[term];
     }
     freq--;
     return token;
   }
 }
 
 Then only two tokens are created, as you desire.
 
 If you for some reason don't want to create a dummy document stream,
 then you could instead implement an IndexReader that delivers a
 synthetic index for a single document.  Then use
 IndexWriter.addIndexes() to turn this into a real, FSDirectory-based
 index.  However that would be a lot more work and only very marginally
 faster.  So I'd stick with the approach I've outlined above.  (Note:
 this code has not been compiled or run.  It may have bugs.)
 
 
 
 Doug
 



Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
No... I changed the mergeFactor back to 10 as you suggested.
Then I am confused about why it should take so long.
Did you by chance set the IndexWriter.infoStream to something, so that 
it logs merges?  If so, it would be interesting to see that output, 
especially the last entry.
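
For reference, a sketch of enabling that logging (infoStream is a public
PrintStream field on IndexWriter in 1.4):

writer.infoStream = System.out;  // prints a line per segment merge as it runs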

Doug


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Kevin A. Burton
Otis Gospodnetic wrote:
Hey Kevin,
Not sure if you're aware of it, but you can specify the lock dir, so in
your example, both JVMs could use the exact same lock dir, as long as
you invoke the VMs with the same params.  

Most people won't do this or won't even understand WHY they need to do 
this :-/.

You shouldn't be writing the
same index with more than 1 IndexWriter though (not sure if this was
just a bad example or a real scenario).
 

Yes... I realize that you shouldn't use more than one IndexWriter. That 
was the point. The locks are to prevent this from happening. If one were 
to accidentally do this the locks would be in different directories and 
our IndexWriter would corrupt the index.

This is why I think it makes more sense to use our own java.io.tmpdir to 
be on the safe side.
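
For reference, a sketch of pinning the lock directory so that every JVM
agrees on it; this assumes the org.apache.lucene.lockDir system property,
which 1.4's FSDirectory consults before falling back to java.io.tmpdir:

// On the command line:
//   java -Dorg.apache.lucene.lockDir=/var/lucene/locks ...
// or programmatically, before the first FSDirectory is opened:
System.setProperty("org.apache.lucene.lockDir", "/var/lucene/locks");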



Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
No... I changed the mergeFactor back to 10 as you suggested.

Then I am confused about why it should take so long.
Did you by chance set the IndexWriter.infoStream to something, so that 
it logs merges? If so, it would be interesting to see that output, 
especially the last entry.

No I didn't actually... If I run it again I'll be sure to do this.


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
This is why I think it makes more sense to use our own java.io.tmpdir to 
be on the safe side.
I think the bug is that Tomcat changes java.io.tmpdir.  I thought that 
the point of the system property java.io.tmpdir was to have a portable 
name for /tmp on unix, c:\windows\tmp on Windows, etc.  Tomcat breaks 
that.  So must Lucene have its own way of finding the platform-specific 
temporary directory that everyone can write to?  Perhaps, but it seems a 
shame, since Java already has a standard mechanism for this, which 
Tomcat abuses...

Doug


Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote:
Just for my education, can you maybe elaborate on using the
implement an IndexReader that delivers a
synthetic index approach?
IndexReader is an abstract class.  It has few data fields, and few 
non-static methods that are not implemented in terms of abstract 
methods.  So, in effect, it is an interface.

When Lucene indexes a token stream it creates a single-document index 
that is then merged with other single- and multi-document indexes to 
form an index that is searched.  You could bypass the first step of this 
(indexing a token stream) by instead directly implementing all of 
IndexReader's abstract methods to return the same thing as the 
single-document index that Lucene would create.  This would be 
marginally faster, as no Token objects would be created at all.  But, 
since IndexReader has a lot of abstract methods, it would be a lot of 
work.  I didn't really mean it as a practical suggestion.
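
A sketch of the shape of that approach (MySyntheticIndexReader is
hypothetical, standing in for an IndexReader whose abstract methods return
the same data as the single-document index Lucene would have built):

IndexReader synthetic = new MySyntheticIndexReader(terms, freqs);  // hypothetical
IndexWriter writer = new IndexWriter(fsDir, analyzer, true);
writer.addIndexes(new IndexReader[] { synthetic });  // merge into a real index
writer.close();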

Doug


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
This is why I think it makes more sense to use our own java.io.tmpdir 
to be on the safe side.

I think the bug is that Tomcat changes java.io.tmpdir. I thought that 
the point of the system property java.io.tmpdir was to have a portable 
name for /tmp on unix, c:\windows\tmp on Windows, etc. Tomcat breaks 
that. So must Lucene have its own way of finding the platform-specific 
temporary directory that everyone can write to? Perhaps, but it seems 
a shame, since Java already has a standard mechanism for this, which 
Tomcat abuses...
I've seen this done in other places as well. I think Weblogic did/does 
it. I'm wondering what some of these big EJB containers use, which is
why I brought this up. I'm not sure the problem is just with Tomcat.

Kevin


Where's the search(Query query, Sort sort) method of Searcher

2004-07-08 Thread Bill Tschumy
I'm trying to do a search and sort the results using a Sort object.  
The 1.4-final API says that Searcher has the following method.

Hits search(Query query,  Sort sort)
However, when I try to use it in the code below:
IndexSearcher is = new IndexSearcher(fsDir);
Query query = QueryParser.parse("Nuggets", "creator", new
StandardAnalyzer());
Hits hits = is.search(query, new Sort("created"));

I get the following compile error:
[javac] Compiling 18 source files to /Users/bill/Nuggets/classes
[javac] 
/Users/bill/Nuggets/src/com/otherwise/nuggets/MySearcher.java:44: 
cannot resolve symbol
[javac] symbol  : method search 
(org.apache.lucene.search.Query,org.apache.lucene.search.Sort)
[javac] location: class org.apache.lucene.search.IndexSearcher
[javac] hits = is.search(query, new Sort("created"));
[javac]  ^

If I do the same call without the Sort object it compiles just fine.
This seems to be indicating the search(Query, Sort) method is not in 
the jar file.  Either the API is in error (doubtful) or I'm doing 
something really stupid (likely).  Can someone explain which it is?
--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com



Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote:
During an optimize I assume Lucene starts writing to a new segment and 
leaves all others in place until everything is done and THEN deletes them?
That's correct.
The only settings I uses are:
targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;
the resulting index has 230k files in it :-/
Something sounds very wrong for there to be that many files.
The maximum number of files should be around:

  (7 + numIndexedFields) * (mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs))

With 14M documents, log_10(14M/1000) is 4, which gives, for you:

  (7 + numIndexedFields) * 36 = 230k
   7*36 + numIndexedFields*36 = 230k
   numIndexedFields = (230k - 7*36) / 36 =~ 6k
So you'd have to have around 6k unique field names to get 230k files. 
Or something else must be wrong.  Are you running on win32, where file 
deletion can be difficult?

With the typical handful of fields, one should never see more than 
hundreds of files.
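
As a quick transcription of that estimate (variable names are illustrative):

// Rough upper bound on the number of index files:
int levels = (int) (Math.log((double) numDocs / minMergeDocs)
                    / Math.log(mergeFactor));
int maxFiles = (7 + numIndexedFields) * (mergeFactor - 1) * levels;
// e.g. 13 fields, mergeFactor 10, 14M docs, minMergeDocs 1000:
//   (7 + 13) * 9 * 4 = 720 files -- nowhere near 230k.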

Doug


Re: Way to repair an index broken during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Something sounds very wrong for there to be that many files.
The maximum number of files should be around:
(7 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

With 14M documents, log_10(14M/1000) is 4, which gives, for you:
(7 + numIndexedFields) * 36 = 230k
7*36 + numIndexedFields*36 = 230k
numIndexedFields = (230k - 7*36) / 36 =~ 6k
So you'd have to have around 6k unique field names to get 230k files. 
Or something else must be wrong. Are you running on win32, where file 
deletion can be difficult?

With the typical handful of fields, one should never see more than 
hundreds of files.

We only have 13 fields... though to be honest I'm worried that even if I
COULD do the optimize it would run out of file handles.

This is very strange...
I'm going to increase minMergeDocs to 1 and then run the full
conversion on one box, and then try to do an optimize (of the corrupt index)
on another box, and see which one finishes first.

I assume the speed of optimize() can be increased the same way that
indexing is increased...

Kevin


Browse by Letter within a Category

2004-07-08 Thread O'Hare, Thomas
I would like to implement the following functionality:

- Search a specific field (category) and limit the search to where the
title field begins with a given letter, returning the results sorted in
alphabetical order by title. Both the category and title fields are
tokenized, indexed and stored in the index (type Field.Text). How should
I construct the search and sort? I tried the following, but the titles
are not being displayed in alphabetical order:

Searcher.search("category:\"Products\" AND title:\"A*\"", new
Sort("title"));

I want to display all results where Products is the category and whose title
begins with the letter A, sorted in alphabetical order by title. I'm
using Lucene 1.4 final release.

Thanks, 
Tom
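
One likely issue: in 1.4, sorting on a field expects the field to be indexed
but untokenized (a single term per document), so sorting on a tokenized title
gives unpredictable order. A sketch under that assumption, using a
hypothetical extra "titleSort" field:

// At index time ("titleSort" is an untokenized copy of the title):
doc.add(Field.Keyword("titleSort", title.toLowerCase()));

// At search time:
Query q = QueryParser.parse("category:\"Products\" AND title:A*",
                            "contents", new StandardAnalyzer());
Hits hits = searcher.search(q, new Sort("titleSort"));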




Re: boolean operators and score

2004-07-08 Thread Niraj Alok
Hi Don,

After months of struggling with Lucene and finally achieving the complex
relevancy desired, the client would kill me if I now let that relevancy all
get lost.

I am trying to do it the way Franck suggested, by sorting the words the
user has entered, but otherwise, isn't this a bug in Lucene?

Regards,
Niraj
