Lucene in clustered environment (Tomcat)
Hi, I would like to use Lucene in a clustered environment. What are the things that I should consider and do? I would like to use the same ordinary index storage for all the nodes in the cluster; is that possible? Thanks, Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing multiple languages
Tansley, Robert wrote:

Hi all, DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text content of documents stored in it. Now that the system is being used globally, it needs to support multi-language indexing. I've looked through the mailing list archives etc. and it seems it's easy to plug in analyzers for different languages. What if we're trying to index multiple languages in the same site? Is it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches can be constrained to a particular language
3/ separate indices for each language?

I don't fully understand the consequences in terms of performance for 1/, but I can see that false hits could turn up where one word appears in different languages (stemming could increase the chances of this). Also, some languages' analyzers are quite dramatically different (e.g. the Chinese one, which just treats every character as a separate token/word). On the other hand, if people are searching for proper nouns in metadata (e.g. DSpace) it may be advantageous to search all languages at once. I'm also not sure of the storage and performance consequences of 2/. Approach 3/ seems like it might be the most complex from an implementation/code point of view.

But this will be the most robust solution. You have to differentiate between languages anyway, and as you pointed out here, you can differentiate by adding a Keyword field for the language, or you can create different indexes. If you need to use complex search strings over multiple fields and indexes, then I recommend using the QueryParser to build the query. When you instantiate a QueryParser you will need to provide an analyzer, which will be different for different languages. I think the differences in performance won't be noticeable between the 2nd and 3rd solutions, but from a maintenance point of view, I would choose the third solution.
Of course there are other factors that must be taken into account when designing such an application: the number of documents to be indexed, the number of document fields, index change frequency, server load (number of concurrent sessions), etc. I hope these hints help you a little. Best, Sergiu

Does anyone have any thoughts or recommendations on this? Many thanks, Robert Tansley / Digital Media Systems Programme / HP Labs http://www.hpl.hp.com/personal/Robert_Tansley/
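To make Sergiu's point about per-language analyzers concrete, here is a minimal, hypothetical sketch of routing a document to an analyzer by its language code with a fallback default (option 2/3 above). The class name, language codes, and analyzer names are stand-ins of my own, not anything DSpace or Lucene defines; a real system would map to org.apache.lucene.analysis.Analyzer instances.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: language code -> analyzer, with a default
// for unknown languages. Names here are illustrative strings only.
public class AnalyzerRegistry {
    private static final Map<String, String> analyzers = new HashMap<String, String>();
    static {
        analyzers.put("en", "StandardAnalyzer");
        analyzers.put("de", "GermanAnalyzer");
        analyzers.put("zh", "ChineseAnalyzer");
    }

    // Returns the analyzer for a language code, falling back to the default.
    public static String forLanguage(String lang) {
        String name = analyzers.get(lang);
        return name != null ? name : "StandardAnalyzer";
    }
}
```

The same lookup works whether you keep one index with a language Keyword field or one index per language; only the place you apply the analyzer changes.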
Re: Finding minimum and maximum value of a field?
Kevin Burton wrote: I have an index with a date field. I want to quickly find the minimum and maximum values in the index. Is there a quick way to do this? I looked at using TermInfos and finding the first one, but how do I find the last? I also tried the new sort API and the performance was horrible :-/ Any ideas?

You may keep a history of the MIN and MAX values in an external file. Let's say you can write the MIN_DATE and MAX_DATE into a text file, and keep them up to date when indexing and deleting documents. Best, Sergiu Kevin
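Sergiu's side-file idea can be sketched as a small tracker updated on every add; the class name and file format below are my own assumptions, not anything Lucene provides.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Sketch: track MIN/MAX of a date field while indexing, and persist
// the two values to a small text file next to the index.
public class MinMaxTracker {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Call once per document indexed.
    public void onIndexed(long date) {
        if (date < min) min = date;
        if (date > max) max = date;
    }

    public long getMin() { return min; }
    public long getMax() { return max; }

    // Persist alongside the index so the values survive restarts.
    public void save(File f) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(f));
        out.println("MIN_DATE=" + min);
        out.println("MAX_DATE=" + max);
        out.close();
    }
}
```

One weakness of this approach, worth noting: deleting the document that holds the current min or max forces a rescan of the index to recompute the value.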
Re: Lucene in clustered environment (Tomcat)
IMHO, issues that you need to consider:

* Atomicity of updates and deletes if you are using multiple indexes on multiple machines (the case if your cluster is over a wide network)
* Scheduled index-to-core-data comparison and sanitization (intensive)

This all depends on what the volume of change is on your index and whether you'll be using a memory-resident index or an FS index. This should start the ball rolling; we've been using Lucene successfully on a distributed cluster for a while now, and as long as you're aware of some basic NDS limitations/constraints you should be fine. Hope this helps. Nader Henein -- Nader S. Henein Senior Applications Architect Bayt.com
URLDirectory
Hi, I'm looking for a URLDirectory implementation NOT based on RAMDirectory, because my indexes are up to 500 MB in size. Thanks. Jacques LABATTE.
Re: Lucene in clustered environment (Tomcat)
> When you say your cluster is on a single machine, do you mean that you have multiple webservers on the same machine, all of which search a single Lucene index?

Yes, this is my case.

> Do you use Lucene as your persistent store or do you have a DB back there?

I use Lucene to search for data stored in a PostgreSQL server.

> What is your current update/delete strategy? Real-time inserts from the webservers directly to the index will not work because you can't have multiple writers.

I have to do this in real time; what are the available solutions? My application has the ability to do batch updates/deletes to a Lucene index, but I would like to do this in real time. One solution I am thinking of is to have each node in the cluster keep its own index and use parallel search. This makes my application even more complex.

> I strongly recommend Quartz, it's rock solid and really versatile.

I am using Quartz; it is really great and supports clusters. Thanks, Ben

On 6/7/05, Nader Henein [EMAIL PROTECTED] wrote: When you say your cluster is on a single machine, do you mean that you have multiple webservers on the same machine all of which search a single Lucene index? Because if that's the case, your solution is simple, as long as you persist to a single DB and then designate one of your servers (or even another server) to update/delete the index. Do you use Lucene as your persistent store or do you have a DB back there? And what is your current update/delete strategy? Real-time inserts from the webservers directly to the index will not work because you can't have multiple writers. Updating a dirty flag on rows that need to be indexed/deleted, or using a table for this task and then batching your updates, would be ideal, and if you're using server-specific scheduling, I strongly recommend Quartz, it's rock solid and really versatile. My two cents. Nader Henein

Ben wrote: My cluster is on a single machine and I am using an FS index.
I have already integrated Lucene into my web application for use in a non-clustered environment. I don't know what I need to do to make it work in a clustered environment. Thanks, Ben
RE: deleting on a keyword field
Hello! Ehem, I have to apologize. It was my stupidity that caused this problem. I simply mixed up field names... I did the deletion of items in a superclass, which of course didn't know about the change in the uri field name. Duh! Everything works now, just like it should. Sorry again! Thanks for bearing with me though! max -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 07, 2005 03:37 To: java-user@lucene.apache.org Subject: Re: deleting on a keyword field On Jun 6, 2005, at 7:07 AM, Max Pfingsthorn wrote: Thanks for all the replies. I do know that the readers should be reopened, but that is not the problem. Could you work up a test case that shows this issue? From all I can see, you're doing the right thing. Something is amiss somewhere though. I try to remove some docs, and add their new versions again to incrementally update the index. After updating the index with the same document twice, I opened the index in luke. There I saw that the file's uri was present three times in the uri field. So, I concluded, it didn't delete the docs right as there are in total three documents which contain this term, right? By the way, Reader.delete() returned 0 as well. I thought I used Field.Keyword(), but actually I use doc.add(new Field(URI_FIELD, uri, true, true, false)); Same thing in this case. new Field(name, value, true, true, false) is the same as Field.Keyword(name, value) to add the uri to the doc. I can see it in luke, and even find the docs when searching for it (using the KeywordAnalyzer). Any ideas? Nothing comes to mind from what I've seen thus far. An easily runnable example demonstrating this issue would be the next step. Erik Thanks! 
max -Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Friday, June 03, 2005 20:10 To: java-user@lucene.apache.org Subject: Re: deleting on a keyword field On Friday 03 June 2005 18:50, Max Pfingsthorn wrote: reader.delete(new Term(URI_FIELD, uri)); This does not remove anything. Do I have to make the uri a normal field? How do you know nothing was deleted? Are you aware that you need to re-open your IndexSearcher/Reader in order to see the changes made to the index? Regards Daniel -- http://www.danielnaber.de
Re: log4j:WARN No appenders could be found for logger
António, this error is not coming from Lucene, but rather from the ELATED library (as you can tell from the package name). Lucene does not use Log4j at all. Please address this issue to either the Fedora or ELATED groups. Erik

On Jun 6, 2005, at 8:21 PM, [EMAIL PROTECTED] wrote: Hi! I'm a newbie in Java, and not a real coder. I'm implementing a digital library (Windows) with two open source packages: a server application called FEDORA (www.fedora.info) and a JSP interface called ELATED (http://elated.sourceforge.net). When I start the Fedora server I get:

c:\fedora-2.0\server\bin>fedora-start
Starting Fedora server...
Deploying API-M and API-A...
Waiting for server to start...
log4j:WARN No appenders could be found for logger (org.acs.elated.lucene.LuceneInterface).
log4j:WARN Please initialize the log4j system properly.
Processing file C:\fedora-2.0\server\config\deployAPI-A.wsdd
Processing file C:\fedora-2.0\server\config\deploy.wsdd
Initializing Fedora Server instance...
Fedora Version: 2.0
Fedora Build: 1
Server Host Name: localhost
Server Port: 8080
Debugging: false
OK Finished. To stop the server, use fedora-stop.

I don't understand the error: log4j:WARN No appenders could be found for logger (org.acs.elated.lucene.LuceneInterface). log4j:WARN Please initialize the log4j system properly. Can anyone tell me what this error is? Thanks in advance, António Fonseca
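For what it's worth, the warning itself only means that log4j found no configuration at all. A minimal log4j.properties on the classpath (the appender choice and pattern below are just one common setup) silences it by sending everything to the console:

```properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c - %m%n
```

Where exactly Fedora/ELATED expect this file to live is a question for their groups, as Erik says.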
Re: Lucene in clustered environment (Tomcat)
I realize I've already asked you this question, but do you need 100% real time? You could batch the updates every 2 minutes. And concerning parallel search: unless you really need it, it's overkill in this case; a communal index will serve you well and will be much easier to maintain. You have to weigh requirements against complexity/debug time. Nader Henein
Re: Lucene in clustered environment (Tomcat)
How about using JavaGroups to notify the other nodes in the cluster about changes? Essentially, each node has the same index stored in a different location. When one node updates/deletes a record, the other nodes get a notification about the change and update their index accordingly. By using this method, I don't have to modify my Lucene code, I just need to add code to notify the other nodes. I believe this method also scales better. Cheers, Ben
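A minimal in-JVM sketch of Ben's notify-on-change idea follows. The interface and class names are my own invention; in a real cluster the publish call would travel over a JavaGroups channel rather than being a direct method call, and each listener would apply the change to its local index copy.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the notification pattern: the node that modified its
// local index publishes the change; every other node's listener
// replays it against its own copy.
public class IndexChangePublisher {
    public interface Listener {
        void onChange(String docId, String op); // op: "update" or "delete"
    }

    private final List<Listener> listeners = new ArrayList<Listener>();

    public void subscribe(Listener l) {
        listeners.add(l);
    }

    // Called by the node that just updated/deleted a document.
    public void publish(String docId, String op) {
        for (Listener l : listeners) {
            l.onChange(docId, op);
        }
    }
}
```

The appeal of this design is exactly what Ben notes: the Lucene code on each node is untouched; only the notification layer is new.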
Re: Documents returned by Scorer
On Tuesday 07 June 2005 11:42, Matt Quail wrote: I've been playing around with a custom Query, and I've just realized that my Scorer is likely to return the same document more than once. Before I delve a bit further, can anyone tell me if this is a Bad Thing?

Normally, yes. A query is expected to provide a single score for each matching document. The Hits class depends on this. One can suppress later 'hits' by using a BitVector. When your scorer implements skipTo() it would normally have to return the documents in document-number order. In the development version all scorers implement skipTo(). Regards, Paul Elschot
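Paul's BitVector suggestion amounts to remembering which doc numbers have already been scored. A self-contained sketch using java.util.BitSet (the class and method names are mine; the int array stands in for the stream of doc numbers a scorer emits):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: drop later occurrences of a doc number so each matching
// document is reported exactly once, as Hits expects.
public class DuplicateFilter {
    public static List<Integer> firstHits(int[] docs) {
        BitSet seen = new BitSet();
        List<Integer> out = new ArrayList<Integer>();
        for (int doc : docs) {
            if (!seen.get(doc)) {   // suppress later 'hits'
                seen.set(doc);
                out.add(doc);
            }
        }
        return out;
    }
}
```

Note this only deduplicates; combining the duplicate scores (e.g. summing them) would need a different accumulator.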
Cannot search on plain numbers
Hello. I am using Lucene 1.4.3. I am indexing a Java Long number using a Lucene Keyword field, but no matter what I do, I cannot find any documents I know have been indexed with this field. My logs show that the number 4 is being indexed as 4, but any search in that field for 4 returns no hits. Is there something special I need to do to index and search on fields that contain ONLY numbers? Thank you.
Re: Cannot search on plain numbers
On Tuesday 07 June 2005 22:19, Peter T. Brown wrote: I am indexing a Java Long number using a Lucene Keyword field, but no matter what I do, I cannot find any documents I know have been indexed with this field. My logs show that the number 4 is being indexed as 4 but doing any searches in that field for 4 return no hits. Please check the FAQ: http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71 -- http://www.danielnaber.de
RE: Cannot search on plain numbers
This depends on the analyzer you are using. Use Luke and check that the numbers are actually in the index; if not, use an analyzer that does index numbers. Omar
Re: Cannot search on plain numbers
Thank you. I've re-read the FAQ and I think I've got a better understanding of how I am confused. Presently I am using this arrangement to get my analyzer:

public static class DefaultAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new LetterTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
        result = new PorterStemFilter(result);
        return result;
    }
}

However, for reasons I do not yet understand, it filters out plain numbers. How can I modify this to keep the benefits of the filters currently in use but also search on plain numbers? Thanks again.
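The culprit is the first line of that chain: LetterTokenizer only keeps characters for which Character.isLetter(...) is true, so a token made of digits is discarded before any of the filters run. A tiny illustration (the wrapper class is mine, just to make the check testable):

```java
// Why the analyzer above drops "4": LetterTokenizer's per-character
// test is Character.isLetter, and digits are not letters, so a
// digits-only token never reaches LowerCaseFilter and friends.
public class LetterCheck {
    public static boolean keptByLetterTokenizer(char c) {
        return Character.isLetter(c);
    }
}
```

Swapping LetterTokenizer for a tokenizer that keeps digits (StandardTokenizer, or WhitespaceTokenizer if your input is simple) while leaving the lowercase/stop/stem filters in place preserves the current behavior for words and lets numbers through.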
Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..
Chris Hostetter wrote: : was computing the score. This was a big performance gain. About 2x, and : since it's the slowest part of our app it was a nice one. :) : : We were using a TermQuery though. I believe that one search on one BooleanQuery containing 20 TermQueries should be faster than 20 searches on 20 TermQueries.

Actually... it wasn't :-/ It was about 4x slower. Ug... Kevin -- Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com. See irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
Re: Lucene search clusters
I am currently writing something about text retrieval using EM clustering. The approach represents documents as high-dimensional vectors, but it is not related to Lucene (yet?). How would you add clustering to Lucene? I think it may be a very interesting technique to improve search results, if it works. My current experience shows that it scales rather badly for larger document collections. I don't think I will take part in Google's SoC, as I have my own summer of code right now. But I would surely like to take part in discussions about that topic, or at least read them and throw my 2 cents in now and then. Cheers, Daniel

Lorenzo wrote: Some people just replied, but I forgot the most important thing... I'm thinking of this project as part of the Google Summer of Code program, so I'm looking for other students. I've sent an email to Erik and he told me that we can propose this as part of Google's SoC if we find some other people interested in it. Lorenzo

On 6/7/05, Lorenzo [EMAIL PROTECTED] wrote: I'm writing this message trying to find some people interested in creating a 'general purpose' Lucene search results clustering extension. I wrote a simple implementation of clustering, and I would like to contribute to Lucene development by releasing an open source clustering implementation. I know that maybe each project needs a different implementation, but this would be a useful basis for everyone to develop their own project. Is anyone interested in it? Lorenzo
Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..
Paul Elschot wrote: For a large number of indexes, it may be necessary to do this over multiple indexes by first getting the doc numbers for all indexes, then sorting these per index, then retrieving them from all indexes, and repeating the whole thing using terms determined from the retrieved docs.

Well, this was a BIG win. Just benchmarking it shows a 10x-50x performance increase. Times in milliseconds:

Before: 1127, 449, 394, 564
After: 182, 39, 12, 11

The times for runs 2-4 I'm sure are due to the filesystem buffer cache, but I can't imagine why they'd be faster in the second round. It might be that Linux is deciding not to buffer the document blocks. Kevin
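Paul's access pattern (group doc numbers by index, then sort each group so stored-field reads proceed in file order instead of seeking back and forth) can be sketched like this; the class name is mine, and the integer index ids stand in for references to separate IndexReaders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: plan fetches so each index's documents are read in
// ascending doc-number order.
public class FetchPlanner {
    // docRefs[i] = {indexId, docNum}
    public static Map<Integer, List<Integer>> plan(int[][] docRefs) {
        Map<Integer, List<Integer>> byIndex = new TreeMap<Integer, List<Integer>>();
        for (int[] ref : docRefs) {
            List<Integer> docs = byIndex.get(ref[0]);
            if (docs == null) {
                docs = new ArrayList<Integer>();
                byIndex.put(ref[0], docs);
            }
            docs.add(ref[1]);
        }
        for (List<Integer> docs : byIndex.values()) {
            Collections.sort(docs);   // read in file order per index
        }
        return byIndex;
    }
}
```

The speedup Kevin reports is consistent with this: sequential reads within each stored-fields file are far cheaper than random seeks.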
Re: Lucene search clusters
My approach uses the same technique, but I'm using mostly HAC (hierarchical agglomerative) clustering. I did manage to add clustering support to a Lucene-based application (a customized solution), but I'd like to try to create a 'general purpose' library. I know it ain't easy! I've found many scaling issues, but I saw that with optimized algorithms you can have pretty good results. Reading Carrot2- and Lucene-related messages, I figured out that I can cluster only the first n results, avoiding any performance issue that way. Lucene offers good support for a clustering framework based on tf-idf analysis (not thinking of k-means or EM 'til now). The most interesting problem is creating the architecture for such a system, being general purpose but also very efficient. Thanks, Lorenzo