RE: Grouping results by chosen field
Good grouping by domain is implemented in Nutch. Nutch can serve as a good example of grouping on a certain field.

-----Original Message-----
From: Java Programmer
Sent: Tuesday, March 21, 2006 3:56 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping results by chosen field
Re: Grouping results by chosen field
On 3/17/06, Java Programmer wrote:
> Hello,
> I tried searching for a solution myself, without any good result, so I
> want to ask the group.
> My problem concerns result grouping. The best example is Google search,
> where results are sorted by relevance and also grouped by domain (grouped
> results get a small indent/margin). In my project I want similar
> functionality without heavy CPU consumption.
>
> Can you share any helpful hints?
>
> Best Regards,
> Adr

Hello,

I have written some code to do the sorting for me (it's not perfect, and it may even be a very poor solution, but I'm still learning). If you have time, please take a look:

    // Needs java.util.{ArrayList, Date, HashMap, List, Map} and
    // org.apache.lucene.document.Document; hits is the Hits object from the search.
    long sort_start = new Date().getTime();
    // domain -> positions of its hits, in score order
    Map<String, ArrayList<Integer>> domains = new HashMap<String, ArrayList<Integer>>();
    // domains in the order they first appear, i.e. by their best score
    List<String> results = new ArrayList<String>();
    int i = 0;
    while (i < hits.length() && i < 500) {
        Document doc = hits.doc(i);
        String url = doc.get("domain");
        if (!domains.containsKey(url)) {
            domains.put(url, new ArrayList<Integer>());
            results.add(url);
        }
        domains.get(url).add(i);
        i++;
    }
    long sort_end = new Date().getTime();

So I'm grouping the results for each domain in Lists to preserve the score order. I put these ordered groups into a Map, and I put the keys of that Map into another List to preserve the order of the top-scoring domains, so as a result I get:

- domain A score 1.0
-- domain A score 0.6
- domain B score 0.9
etc.

I put this code into a servlet (Tomcat 5.5) and it works, but the first time I run a query the whole sorting process takes a long time, e.g. 4900 ms, while running the same query again (e.g. with paging) is very quick, e.g. 40 ms. Why does this happen? Is there some optimization in Lucene, or some kind of caching? After restarting the servlet, queries that were already asked return at once, but new queries always take a long time to process. Maybe I missed something when reading the documentation. Can someone give me an explanation?

Best regards,
Adr
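The grouping loop in the message above can be tried without a Lucene index. The following is only a sketch under stated assumptions: the class `DomainGrouping`, its methods, and the `domainOfHit` array (standing in for `hits.doc(i).get("domain")`) are hypothetical names, and a `LinkedHashMap` is swapped in for the original `HashMap` plus separate key list, since it keeps keys in first-insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DomainGrouping {

    // Groups hit positions by domain. domainOfHit[i] is the (hypothetical)
    // "domain" field value of hit i; hits are assumed to be sorted by
    // descending score already, as Lucene returns them by default.
    static Map<String, List<Integer>> group(String[] domainOfHit, int limit) {
        // LinkedHashMap iterates in first-insertion order, so domains come
        // out ordered by the position of their best-scoring hit.
        Map<String, List<Integer>> groups = new LinkedHashMap<String, List<Integer>>();
        for (int i = 0; i < domainOfHit.length && i < limit; i++) {
            List<Integer> positions = groups.get(domainOfHit[i]);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                groups.put(domainOfHit[i], positions);
            }
            positions.add(i);
        }
        return groups;
    }

    // Flattens the groups into the order in which hits would be displayed.
    static List<Integer> displayOrder(Map<String, List<Integer>> groups) {
        List<Integer> order = new ArrayList<Integer>();
        for (List<Integer> positions : groups.values()) {
            order.addAll(positions);
        }
        return order;
    }

    public static void main(String[] args) {
        String[] domainOfHit = {"a.com", "b.com", "a.com"};
        System.out.println(displayOrder(group(domainOfHit, 500)));
    }
}
```

The loop itself is a single pass over at most `limit` hits, so the map bookkeeping is cheap compared with loading each hit's stored document.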
Grouping results by chosen field
Hello,

I tried searching for a solution myself, without any good result, so I want to ask the group.

My problem concerns result grouping. The best example is Google search, where results are sorted by relevance and also grouped by domain (grouped results get a small indent/margin). In my project I want similar functionality without heavy CPU consumption.

Can you share any helpful hints?

Best Regards,
Adr
Re: Grouping results by chosen field
I believe the topic you are referring to is typically referred to as clustering ... you may want to search for that. I've never really looked at it, but carrot2 seems to be a favorite among those who do result clustering.

: Date: Fri, 17 Mar 2006 16:36:44 +0100
: From: Java Programmer
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Grouping results by chosen field
:
: Hello,
: I tried to search myself for a solution, but without any good result, so I
: want to ask the group.
: My problem concerns result grouping, the best example will be Google search
: where you have results sorted by relevance, and also grouped by domain (they
: have a little indent/margin). In my project I want to get similar
: functionality, without very huge CPU consumption.
:
: Can you share any helpful hints?
:
: Best Regards,
: Adr

-Hoss
Re: Grouping results by chosen field
17 Mar 2006 at 16:36, Java Programmer wrote:
> My problem concerns result grouping. The best example is Google search,
> where results are sorted by relevance and also grouped by domain (grouped
> results get a small indent/margin). In my project I want similar
> functionality without heavy CPU consumption.
>
> Can you share any helpful hints?

I do that. Basically, I marshal the hit documents to Java instances of Comparable. Then I just run a plain old Collections.sort() on the object representations of the documents.

Each document may contain classification weights. A weight points at a classification value, and the classification value points at a clazz.

UML class diagram:

    [Persistent]#--- {0..*} -[ClassificationWeight +compareTo() +weight:float]
    {1} -[Classification +compareTo() +value:String]---
    {1} -[Clazz +fieldName:String +compareTo()]

A clazz in this instance is the group of domain names. The classification is the actual domain name. You can skip the weight if you only use domain names; I guess all weights would be 1.

A weight compares via its classification, which in turn compares via its clazz. If two weights are equal, I use the Lucene score. You might want to do several passes, or use a nested ordering, to get the group's top score as the natural order per group cluster.

It handles 3000+ queries per minute on 12+ documents in RAM, 24/7, on a dual core at 40% load. And I even use Lucene for persistence, even though I should not.
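The simplest form of the Comparable approach described above might look like the following sketch. It is not the poster's actual code: `GroupedHit` and its fields are hypothetical names, the weight and clazz levels are left out (as suggested for the domain-only case), and the domain and score are assumed to have been copied out of each Lucene hit beforehand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical object representation of one hit: only the grouping value
// (the domain name) and the Lucene score are kept.
class GroupedHit implements Comparable<GroupedHit> {
    final String domain;
    final float score;

    GroupedHit(String domain, float score) {
        this.domain = domain;
        this.score = score;
    }

    // Orders by domain first so hits of one domain end up adjacent;
    // within a domain, the higher score comes first.
    public int compareTo(GroupedHit other) {
        int byDomain = domain.compareTo(other.domain);
        if (byDomain != 0) {
            return byDomain;
        }
        return Float.compare(other.score, score);
    }

    public String toString() {
        return domain + " " + score;
    }

    public static void main(String[] args) {
        List<GroupedHit> hits = new ArrayList<GroupedHit>();
        hits.add(new GroupedHit("b.org", 0.9f));
        hits.add(new GroupedHit("a.org", 0.6f));
        hits.add(new GroupedHit("a.org", 1.0f));
        Collections.sort(hits); // uses compareTo above
        System.out.println(hits);
    }
}
```

Note that this single-pass ordering puts the groups in alphabetical order of domain, not in order of their best score; getting the group's top score as the group order needs the extra pass or nested ordering mentioned in the message.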
Re: Grouping results by chosen field
17 Mar 2006 at 21:01, karl wettin wrote:
> I do that. Basically, I marshal the hit documents to Java instances of
> Comparable. Then I just run a plain old Collections.sort() on the object
> representations of the documents.

I just made it complicated after this. Sorry.

I meant to say that grouping on fields works well for me with a comparator. If I get too many results, I only sort the top n hits.
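A comparator-based version that sorts only the top n hits could be sketched as follows. This is an illustration under assumptions, not the poster's implementation: `TopNGrouping`, `Hit`, and `groupTopN` are hypothetical names, the input list is assumed to be in descending score order already, and a first pass over the top n hits records each domain's best score so that groups can be ordered by it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopNGrouping {

    // Hypothetical stand-in for a Lucene hit: grouping value plus score.
    static class Hit {
        final String domain;
        final float score;

        Hit(String domain, float score) {
            this.domain = domain;
            this.score = score;
        }
    }

    // Takes hits in descending score order, keeps only the first n, and
    // returns them grouped by domain, with groups ordered by best score.
    static List<Hit> groupTopN(List<Hit> hitsByScore, int n) {
        int end = Math.min(n, hitsByScore.size());
        List<Hit> top = new ArrayList<Hit>(hitsByScore.subList(0, end));

        // Pass 1: best score per domain among the top n hits.
        final Map<String, Float> best = new HashMap<String, Float>();
        for (Hit h : top) {
            Float b = best.get(h.domain);
            if (b == null || h.score > b) {
                best.put(h.domain, h.score);
            }
        }

        // Pass 2: sort by group best score, then domain, then own score.
        Collections.sort(top, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                int byGroup = Float.compare(best.get(b.domain), best.get(a.domain));
                if (byGroup != 0) {
                    return byGroup;
                }
                // Keeps two groups with the same best score from interleaving.
                int byDomain = a.domain.compareTo(b.domain);
                if (byDomain != 0) {
                    return byDomain;
                }
                return Float.compare(b.score, a.score);
            }
        });
        return top;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit("a.org", 1.0f));
        hits.add(new Hit("b.org", 0.9f));
        hits.add(new Hit("a.org", 0.6f));
        for (Hit h : groupTopN(hits, 500)) {
            System.out.println(h.domain + " " + h.score);
        }
    }
}
```

Capping the work at n hits keeps the sort cost bounded no matter how many documents match the query.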