Lucene in clustered environment (Tomcat)

2005-06-07 Thread Ben
Hi

I would like to use Lucene in a clustered environment, what are the
things that I should consider and do?

I would like to use the same ordinary index storage for all the nodes
in the cluster; is that possible?

Thanks,
Ben

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-06-07 Thread sergiu gordea

Tansley, Robert wrote:


Hi all,

The DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and extracted full-text content of documents
stored in it.  Now that the system is being used globally, it needs to
support multi-language indexing.

I've looked through the mailing list archives etc. and it seems it's
easy to plug in analyzers for different languages.

What if we're trying to index multiple languages in the same site?  Is
it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?

I don't fully understand the consequences of 1/ in terms of performance,
but I can see that false hits could turn up where one word appears
in different languages (stemming could increase the chances of this).
Also some languages' analyzers are quite dramatically different (e.g.
the Chinese one which just treats every character as a separate
token/word).
 


On the other hand, if people are searching for proper nouns in metadata
(e.g. DSpace) it may be advantageous to search all languages at once.


I'm also not sure of the storage and performance consequences of 2/.

Approach 3/ seems like it might be the most complex from an
implementation/code point of view.  
 

But this will be the most robust solution. You have to differentiate
between languages anyway, and as you pointed out, you can differentiate
by adding a Keyword field for the language, or you can create different
indexes.

If you need to run complex search strings over multiple fields and
indexes, then I recommend using the QueryParser to build the query.
When you instantiate a QueryParser you will need to provide an analyzer,
which will be different for each language.
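A minimal sketch of option 2/, assuming every document carries a Keyword field named "language" (the field name is an illustrative assumption, not part of Lucene): the search is constrained by AND-ing a required term on that field, using standard Lucene query syntax.

```java
// Sketch of option 2/: one index for all languages, plus an extra
// "language" Keyword field per document. A search is restricted to one
// language by AND-ing a required term on that field.
// The field name "language" is an assumption for illustration.
public class LanguageQuery {
    public static String constrain(String userQuery, String languageCode) {
        // +(...) makes the user's clauses required as a group;
        // +language:xx additionally requires the language term.
        return "+(" + userQuery + ") +language:" + languageCode;
    }

    public static void main(String[] args) {
        System.out.println(constrain("title:dspace metadata", "en"));
    }
}
```

The resulting string would then be handed to a QueryParser constructed with the analyzer matching that language.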

I think the differences in performance won't be noticeable between the
2nd and 3rd solutions, but from a maintenance point of view, I would
choose the third solution.

Of course there are other factors that must be taken into account when
designing such an application: the number of documents to be indexed,
the number of document fields, index change frequency, server load
(number of concurrent sessions), etc.


Hope these hints help you a little,

Best,

Sergiu




Does anyone have any thoughts or recommendations on this?

Many thanks,

Robert Tansley / Digital Media Systems Programme / HP Labs
 http://www.hpl.hp.com/personal/Robert_Tansley/




Re: Finding minimum and maximum value of a field?

2005-06-07 Thread sergiu gordea



Kevin Burton wrote:

I have an index with a date field.  I want to quickly find the minimum 
and maximum values in the index.


Is there a quick way to do this?  I looked at using TermInfos and 
finding the first one but how to I find the last?


I also tried the new sort API and the performance was horrible :-/

Any ideas?


You may keep a history of the MIN and MAX values in an external file.
Let's say you write the MIN_DATE and MAX_DATE in a text file,
and keep them up to date when indexing and deleting documents.
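A plain-Java sketch of that suggestion; the file format, the property keys MIN_DATE/MAX_DATE, and the long-encoded dates are illustrative choices, not anything Lucene prescribes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Sketch of the external-file idea: MIN_DATE and MAX_DATE live in a
// small properties file next to the index and are refreshed on every
// document add. Keys and encoding are illustrative.
public class DateBounds {
    private final Path file;

    public DateBounds(Path file) {
        this.file = file;
    }

    // Call this from the indexing code each time a document is added.
    public void update(long date) throws IOException {
        Properties p = load();
        long min = Long.parseLong(p.getProperty("MIN_DATE", Long.toString(Long.MAX_VALUE)));
        long max = Long.parseLong(p.getProperty("MAX_DATE", Long.toString(Long.MIN_VALUE)));
        p.setProperty("MIN_DATE", Long.toString(Math.min(min, date)));
        p.setProperty("MAX_DATE", Long.toString(Math.max(max, date)));
        try (OutputStream out = Files.newOutputStream(file)) {
            p.store(out, "index date bounds");
        }
    }

    public long min() throws IOException {
        return Long.parseLong(load().getProperty("MIN_DATE"));
    }

    public long max() throws IOException {
        return Long.parseLong(load().getProperty("MAX_DATE"));
    }

    private Properties load() throws IOException {
        Properties p = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                p.load(in);
            }
        }
        return p;
    }
}
```

Note the catch: deleting the document that holds the current minimum or maximum cannot be repaired from this file alone; that case needs a rescan (or a history of values, as suggested above).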

Best,

Sergiu



Kevin







Re: Lucene in clustered environment (Tomcat)

2005-06-07 Thread Nader Henein

IMHO, issues that you need to consider:

   * Atomicity of updates and deletes if you are using multiple indexes
 on multiple machines (the case if your cluster spans a wide network)
   * Scheduled index-to-core-data comparison and sanitization
 (intensive)

This all depends on the volume of change on your index and
whether you'll be using a memory-resident index or an FS index.


This should start the ball rolling. We've been using Lucene successfully
on a distributed cluster for a while now, and as long as you're aware of
some basic NDS limitations/constraints you should be fine.


Hope this helps

Nader Henein


--

Nader S. Henein
Senior Applications Architect

Bayt.com








URLDirectory

2005-06-07 Thread LABATTE Jacques
Hi,

I'm looking for a URLDirectory implementation NOT based on RAMDirectory,
because my indexes are up to 500 MB in size.

Thanks.

Jacques LABATTE.



Re: Lucene in clustered environment (Tomcat)

2005-06-07 Thread Ben
 When you say your cluster is on a single machine, do you mean that you have 
 multiple webservers on the same machine all of which search a single Lucene 
 index?

Yes, this is my case.

 Do you use Lucene as your persistent store or do you have a DB back there?

I use Lucene to search for data stored in a PostgreSQL server.

 what is your current update/delete strategy because real time inserts from 
 the webservers directly to the index will not work because you can't have 
 multiple writers.

I have to do this in real time; what are the available solutions? My
application can do batch updates/deletes to a Lucene index, but I
would like to do it in real time.

One solution I am thinking of is to have each node keep its own index
and use parallel search. But that makes my application even more complex.

 I strongly recommend Quartz, it's rock solid and really versatile.

I am using Quartz; it is really great and supports clustering.

Thanks,
Ben
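Nader's dirty-flag suggestion (quoted below) can be sketched in plain Java; the class and method names here are made up for illustration. Webservers only mark rows, and one scheduled job (e.g. a Quartz trigger) drains the batch for the single designated index writer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the dirty-flag strategy: many webservers mark rows dirty,
// but only one scheduled job applies the batch to the single index
// writer, avoiding the multiple-writer problem. Names are illustrative.
public class DirtyRows {
    // LinkedHashSet: no duplicates, and rows keep their marking order.
    private final Set<Long> dirty = new LinkedHashSet<>();

    public synchronized void markDirty(long rowId) {
        dirty.add(rowId);
    }

    // Called by the one designated indexing node on a schedule.
    public synchronized List<Long> drainBatch() {
        List<Long> batch = new ArrayList<>(dirty);
        dirty.clear();
        return batch;
    }
}
```

In the real setup the flag would live in a PostgreSQL column or side table rather than in memory, so marks survive restarts; the drain-and-clear step would then be a transactional SELECT/UPDATE.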


On 6/7/05, Nader Henein [EMAIL PROTECTED] wrote:
 When you say your cluster is on a single machine, do you mean that you
 have multiple webservers on the same machine all of which search a
 single Lucene index? Because if that's the case, your solution is
 simple, as long as you persist to a single DB and then designate one of
 your servers (or even another server) to update/delete the index. Do you
 use Lucene as your persistent store or do you have a DB back there? and
 what is your current update/delete strategy because real time inserts
 from the webservers directly to the index will not work because you
 can't have multiple writers. Updating a dirty flag on rows that need to
 be indexed/deleted, or using a table for this task and then batching
 your updates would be ideal, and if you're using server specific
 scheduling, I strongly recommend Quartz, it's rock solid and really
 versatile.
 
 My two cents.
 
 Nader Henein
 
 
 Ben wrote:
 
 My cluster is on a single machine and I am using FS index.
 
 I have already integrated Lucene into my web application for use in a
 non-clustered environment. I don't know what I need to do to make it
 work in a clustered environment.
 
 Thanks,
 Ben
 
 





RE: deleting on a keyword field

2005-06-07 Thread Max Pfingsthorn
Hello!

Ehem, I have to apologize. It was my stupidity that caused this problem. I 
simply mixed up field names... I did the deletion of items in a superclass, 
which of course didn't know about the change in the uri field name. Duh! 
Everything works now, just like it should.

Sorry again! Thanks for bearing with me though!

max

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 07, 2005 03:37
To: java-user@lucene.apache.org
Subject: Re: deleting on a keyword field



On Jun 6, 2005, at 7:07 AM, Max Pfingsthorn wrote:
 Thanks for all the replies. I do know that the readers should be  
 reopened, but that is not the problem.

Could you work up a test case that shows this issue?  From all I can  
see, you're doing the right thing.  Something is amiss somewhere though.

 I try to remove some docs, and add their new versions again to  
 incrementally update the index. After updating the index with the  
 same document twice, I opened the index in luke. There I saw that  
 the file's uri was present three times in the uri field. So, I  
 concluded, it didn't delete the docs right as there are in total  
 three documents which contain this term, right? By the way,  
 Reader.delete() returned 0 as well.

 I thought I used Field.Keyword(), but actually I use

 doc.add(new Field(URI_FIELD, uri, true, true, false));

Same thing in this case.  new Field(name, value, true, true, false)  
is the same as Field.Keyword(name, value)

 to add the uri to the doc. I can see it in luke, and even find the  
 docs when searching for it (using the KeywordAnalyzer).

 Any ideas?

Nothing comes to mind from what I've seen thus far.  An easily  
runnable example demonstrating this issue would be the next step.

 Erik



 Thanks!
 max


 -Original Message-
 From: Daniel Naber [mailto:[EMAIL PROTECTED]
 Sent: Friday, June 03, 2005 20:10
 To: java-user@lucene.apache.org
 Subject: Re: deleting on a keyword field


 On Friday 03 June 2005 18:50, Max Pfingsthorn wrote:


 reader.delete(new Term(URI_FIELD, uri));

 This does not remove anything. Do I have to make the uri a normal  
 field?


 How do you know nothing was deleted? Are you aware that you need to  
 re-open
 your IndexSearcher/Reader in order to see the changes made to the  
 index?

 Regards
  Daniel

 -- 
 http://www.danielnaber.de




Re: log4j:WARN No appenders could be found for logger

2005-06-07 Thread Erik Hatcher

António,

This error is not coming from Lucene, but rather from the ELATED
library (as you can tell from the package name).  Lucene does not use
Log4j at all.  Please address this issue to either the Fedora or
ELATED groups.
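For what it's worth, the warning itself only means that no log4j appender has been configured for that logger. A minimal log4j.properties on the classpath (a standard log4j 1.x snippet, independent of Lucene) silences it:

```properties
# Route everything at INFO and above to the console.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c - %m%n
```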


Erik

On Jun 6, 2005, at 8:21 PM, [EMAIL PROTECTED] wrote:


Hi!

I'm a newbie in Java, and not a real coder.
I'm implementing a digital library (on Windows) with two open-source
packages: a server application called FEDORA (www.fedora.info) and a
JSP interface called ELATED (http://elated.sourceforge.net).


When I start the Fedora server I get:

c:\fedora-2.0\server\binfedora-start
Starting Fedora server...
Deploying API-M and API-A...
Waiting for server to start...
log4j:WARN No appenders could be found for logger  
(org.acs.elated.lucene.LuceneInterface).

log4j:WARN Please initialize the log4j system properly.
Processing file C:\fedora-2.0\server\config\deployAPI-A.wsdd
adminDone processing/Admin
Processing file C:\fedora-2.0\server\config\deploy.wsdd
adminDone processing/Admin
Initializing Fedora Server instance...
Fedora Version: 2.0
Fedora Build: 1
Server Host Name: localhost
Server Port: 8080
Debugging: false
OK
Finished.  To stop the server, use fedora-stop.
c:\fedora-2.0\server\bin

I don't understand this error:
log4j:WARN No appenders could be found for logger  
(org.acs.elated.lucene.LuceneInterface).

log4j:WARN Please initialize the log4j system properly.

Can anyone tell me what this error means?

Thanks in advance
António Fonseca




Re: Lucene in clustered environment (Tomcat)

2005-06-07 Thread Nader Henein
I realize I've already asked you this question, but do you need 100%
real time? You could batch the updates every 2 minutes. Concerning
parallel search: unless you really need it, it's overkill in this
case; a communal index will serve you well and will be much easier
to maintain. You have to weigh requirements vs. complexity/debug time.


Nader Henein




Re: Lucene in clustered environment (Tomcat)

2005-06-07 Thread Ben
How about using JavaGroups to notify the other nodes in the cluster
about changes?

Essentially, each node has the same index stored in a different
location. When one node updates/deletes a record, the other nodes get
a notification about the change and update their own indexes accordingly.
With this method I don't have to modify my Lucene code; I just need to
add code to notify the other nodes. I believe this method also scales
better.
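The notification idea can be sketched without the actual JavaGroups API (all names here are illustrative, and a real deployment would replace the in-process list with group communication across machines): each node registers a listener and replays changes published by the others.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the notify-the-other-nodes idea: each node keeps its own
// copy of the index and registers a listener; when one node changes a
// record, every other node is told to apply the same change locally.
public class IndexChangeBus {
    public interface Listener {
        void onChange(String docId, boolean deleted);
    }

    private final List<Listener> nodes = new ArrayList<>();

    public void register(Listener node) {
        nodes.add(node);
    }

    public void publish(Listener origin, String docId, boolean deleted) {
        for (Listener node : nodes) {
            if (node != origin) {
                node.onChange(docId, deleted); // skip the sender itself
            }
        }
    }
}
```

With JavaGroups the publish step would go over a multicast channel instead of a local list, which is what makes the approach scale across machines.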

Cheers,
Ben


 
 
 
 
 

Re: Documents returned by Scorer

2005-06-07 Thread Paul Elschot
On Tuesday 07 June 2005 11:42, Matt Quail wrote:
 I've been playing around with a custom Query, and I've just realized
 that my Scorer is likely to return the same document more than once.
 Before I delve a bit further, can anyone tell me if this is a
 Bad Thing?

Normally, yes. A query is expected to provide a single score for
each matching document; the Hits class depends on this.
One can suppress later 'hits' by using a BitVector.

When your scorer implements skipTo it would normally have
to return the documents in document number order.

In the development version all scorers implement skipTo.
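The BitVector suppression mentioned above amounts to the following, shown here with java.util.BitSet rather than Lucene's internal BitVector class:

```java
import java.util.BitSet;

// Suppressing repeated hits from a scorer: remember every doc number
// already scored and skip it the next time it comes around.
public class HitDeduper {
    private final BitSet seen = new BitSet();

    // Returns true the first time a doc number is offered, false afterwards;
    // the scorer only reports documents for which this returns true.
    public boolean firstTime(int doc) {
        if (seen.get(doc)) {
            return false;
        }
        seen.set(doc);
        return true;
    }
}
```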

Regards,
Paul Elschot





Cannot search on plain numbers

2005-06-07 Thread Peter T. Brown
Hello. I am using lucene 1.4.3

I am indexing a Java Long number using a Lucene Keyword field, but no matter
what I do, I cannot find any documents I know have been indexed with this
field. My logs show that the number 4 is being indexed as "4", but any
search in that field for "4" returns no hits.

Is there something special I need to do to index and search on fields that
contain ONLY numbers?


Thank You 






Re: Cannot search on plain numbers

2005-06-07 Thread Daniel Naber
On Tuesday 07 June 2005 22:19, Peter T. Brown wrote:

 I am indexing a Java Long number using a Lucene Keyword field, but no
 matter what I do, I cannot find any documents I know have been indexed
 with this field. My logs show that the number 4 is being indexed as
 "4", but any search in that field for "4" returns no hits.

Please check the FAQ:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71

-- 
http://www.danielnaber.de




RE: Cannot search on plain numbers

2005-06-07 Thread Omar Didi
This depends on the analyzer you are using. Use Luke to check that the
numbers are actually in the index; if not, use an analyzer that does
index numbers.

Omar




Re: Cannot search on plain numbers

2005-06-07 Thread Peter T. Brown
Thank you. I've re-read the FAQ and I think I now have a better understanding
of where I was confused. Presently I am using this arrangement to get my
analyzer:

public static class DefaultAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // LetterTokenizer emits only runs of letters, so digit-only
        // tokens are dropped before the filters below ever see them.
        TokenStream result = new LetterTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
        result = new PorterStemFilter(result);
        return result;
    }
}


However, for reasons I do not yet understand, it filters out searches on
plain numbers. How can I modify this to keep the benefits of the filters
currently in use but also search on plain numbers?
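The culprit is the LetterTokenizer: it keeps only runs of letters, so a digit-only token like 4 is discarded before any of the filters run. A plain-Java illustration of the difference (not the Lucene classes themselves):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates why a letters-only tokenizer loses plain numbers:
// with lettersOnly=true (LetterTokenizer-like behavior) the digit "4"
// never becomes a token; accepting digits as well keeps it.
public class TokenDemo {
    static List<String> tokenize(String text, boolean lettersOnly) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean keep = lettersOnly ? Character.isLetter(c)
                                       : Character.isLetterOrDigit(c);
            if (keep) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // token boundary
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

In Lucene terms, the fix is to swap LetterTokenizer for a tokenizer that also accepts digits (StandardTokenizer does) while keeping the lower-case, stop, and stemming filters.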


Thanks again




Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Kevin Burton

Chris Hostetter wrote:


: was computing the score.  This was a big performance gain.  About 2x and
: since its the slowest part of our app it was a nice one. :)
:
: We were using a TermQuery though.

I believe that one search on one BooleanQuery containing 20
TermQueries should be faster than 20 searches on 20 TermQueries.
 


Actually... it wasn't. :-/

It was about 4x slower.

Ugh...

Kevin

--


Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com. 
See irc.freenode.net #rojo if you want to chat.


Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

  Kevin A. Burton, Location - San Francisco, CA
 AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 






Re: Lucene search clusters

2005-06-07 Thread Daniel Stephan
I am currently writing something about text retrieval using EM clustering.
The approach represents documents as high-dimensional vectors, but it is
still not related to Lucene (yet?).
How would you add clustering to Lucene? I think it may be a very
interesting technique to improve search results, if it works. My current
experience shows that it scales rather badly for larger document collections.

I don't think I will take part in Google's SoC, as I have my own summer
of code right now. But I would surely like to take part in discussions
about that topic, or at least read along and throw two cents at it now and then.

cheers
Daniel


Lorenzo schrieb:

Some people just replied, but I forgot the most important thing...
I'm thinking of this project as part of Google's Summer of Code program,
so I'm looking for other students.
I've sent an email to Erik and he told me that we can propose this as part
of Google's SoC if we find some other people interested in it.
Lorenzo

On 6/7/05, Lorenzo [EMAIL PROTECTED] wrote:
  

I'm writing this message trying to find some people interested in creating
a 'general purpose' Lucene search-results clustering extension.
I wrote a simple implementation of clustering, and I would like to
contribute to Lucene development by releasing an open-source clustering
implementation. I know that each project may need a different
implementation, but this would be a useful basis for everyone to develop
their own project.
Is anyone interested in it?
Lorenzo




  






Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

2005-06-07 Thread Kevin Burton

Paul Elschot wrote:


For a large number of indexes, it may be necessary to do this over
multiple indexes by first getting the doc numbers for all indexes,
then sorting these per index, then retrieving them
from all indexes, and repeating the whole thing using terms determined
from the retrieved docs.
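The sorting step described above can be sketched like this; the actual fetch (IndexReader.document(n)) is deliberately left out. The point is only that doc numbers are visited in ascending order, so reads move sequentially through the index files instead of seeking back and forth.

```java
import java.util.Arrays;

// Sketch of the retrieval ordering: collect the doc numbers for one
// index first, sort them ascending, then fetch documents in that order.
public class OrderedFetch {
    public static int[] sortedDocNumbers(int[] docNums) {
        int[] sorted = docNums.clone(); // leave the caller's array intact
        Arrays.sort(sorted);
        return sorted;
    }
}
```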
 

Well, this was a BIG win.  Just benchmarking it shows a 10x-50x
performance increase.


Times in milliseconds:

Before:

duration: 1127
duration: 449
duration: 394
duration: 564

After:

duration: 182
duration: 39
duration: 12
duration: 11

The values for runs 2-4, I'm sure, are due to the filesystem buffer cache,
but I can't imagine why they'd be faster in the second round.  It might be
that Linux is deciding not to buffer the document blocks.


Kevin




Re: Lucene search clusters

2005-06-07 Thread Lorenzo
My approach uses the same technique, but I'm using mostly HAG clustering.
I did manage to add clustering support to a Lucene-based application (a
customized solution), but I'd like to try to create a 'general purpose'
library. I know it ain't easy!
I've found many scaling issues, but I saw that with optimized algorithms
you can get pretty good results. Reading carrot2 and Lucene related
messages, I figured out that I can cluster only the first n results,
avoiding any performance issues that way.
Lucene offers good support for a clustering framework, based on tf-idf
analysis (not thinking of k-means or EM 'til now).
The most interesting problem is creating the architecture for such a
system: general purpose but also very efficient.
Thanks, 
Lorenzo
