RE: Grouping results by choosen field

2006-03-21 Thread anton
Nutch implements good grouping by domain. It can serve as a good example of
grouping on a certain field.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Grouping results by choosen field

2006-03-21 Thread Java Programmer
On 3/17/06, Java Programmer [EMAIL PROTECTED] wrote:

> Hello,
> I tried to search for a solution myself, but without any good result, so I
> want to ask the group.
> My problem concerns result grouping. The best example is Google search,
> where results are sorted by relevance and also grouped by domain (grouped
> results have a small indent/margin). In my project I want similar
> functionality without heavy CPU consumption.
>
> Can you share any helpful hints?
>
> Best regards,
> Adr


Hello,
I have written some code to do the sorting for me (it's not perfect,
maybe it's even a very poor solution, but I'm still learning). So if you
have time, please take a look:

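long sort_start = new Date().getTime();

// domain -> indices of the hits in that domain, in score order
Map<String, ArrayList<Integer>> domains =
    new HashMap<String, ArrayList<Integer>>();
// domains in the order of their best-scoring hit
List<String> results = new ArrayList<String>();

int i = 0;

// group at most the first 500 hits
while (i < hits.length() && i < 500) {
    Document doc = hits.doc(i);
    // domain is the name of the stored field that holds the hit's domain
    String url = doc.get(domain);
    if (!domains.containsKey(url)) {
        domains.put(url, new ArrayList<Integer>());
        results.add(url);
    }
    domains.get(url).add(i);
    i++;
}

long sort_end = new Date().getTime();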

So I group the results for each domain in a List to preserve the score
order. I put these ordered groups into a Map, and I put the keys of that
Map into another List to preserve the order of the top-scoring domains.
As a result I get:
- domain A score 1.0
-- domain A score 0.6
- domain B score 0.9
etc.
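
The grouping described above can also be sketched without Lucene at all.
This is a minimal, self-contained illustration, not the code from this
message: the class name DomainGrouping, the method groupByDomain, and the
sample domains are made up for the example, and the hit list is assumed to
arrive already sorted by descending score. It swaps the HashMap plus
separate key List for a LinkedHashMap, which keeps insertion order by
itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DomainGrouping {

    /**
     * Groups hit indices by domain. hitDomains.get(i) is the domain of hit i,
     * and the hits are assumed to be in descending score order. Only the
     * first maxHits hits are examined.
     */
    static Map<String, List<Integer>> groupByDomain(List<String> hitDomains, int maxHits) {
        // LinkedHashMap iterates in insertion order, so the domains come out
        // in the order of their best-scoring hit.
        Map<String, List<Integer>> groups = new LinkedHashMap<String, List<Integer>>();
        int limit = Math.min(hitDomains.size(), maxHits);
        for (int i = 0; i < limit; i++) {
            String domain = hitDomains.get(i);
            List<Integer> group = groups.get(domain);
            if (group == null) {
                group = new ArrayList<Integer>();
                groups.put(domain, group);
            }
            group.add(i); // indices stay in score order within each group
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical hit list: the domain of each hit, best score first.
        List<String> hitDomains =
            Arrays.asList("a.example", "b.example", "a.example", "c.example");
        System.out.println(groupByDomain(hitDomains, 500));
    }
}
```

Iterating over the returned map visits the domains by their best hit, and
each value lists that domain's hit indices from best to worst score.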

I put this code into a servlet (Tomcat 5.5) and it works, but the first
time I run a query, the whole sorting process takes a long time, e.g.
4900 ms. When I run the same query again (e.g. with paging), it runs
very quickly, e.g. 40 ms. Why does this happen? Is there any
optimization in Lucene, or any kind of caching? After restarting the
servlet, queries that were already asked return at once, but new queries
always take a long time to process.

Maybe I missed something when reading the documentation. Can someone
give me an explanation?

Best regards,
Adr




Grouping results by choosen field

2006-03-17 Thread Java Programmer
Hello,
I tried to search for a solution myself, but without any good result, so I
want to ask the group.
My problem concerns result grouping. The best example is Google search,
where results are sorted by relevance and also grouped by domain (grouped
results have a small indent/margin). In my project I want similar
functionality without heavy CPU consumption.

Can you share any helpful hints?

Best Regards,
Adr


Re: Grouping results by choosen field

2006-03-17 Thread Chris Hostetter

I believe the topic you are referring to is typically referred to as
clustering ... you may want to search for that.

I've never really looked at it, but carrot2 seems to be a favorite among
those who do result clustering.


: Date: Fri, 17 Mar 2006 16:36:44 +0100
: From: Java Programmer [EMAIL PROTECTED]
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Grouping results by choosen field
:
: Hello,
: I tried to search for a solution myself, but without any good result, so I
: want to ask the group.
: My problem concerns result grouping. The best example is Google search,
: where results are sorted by relevance and also grouped by domain (grouped
: results have a small indent/margin). In my project I want similar
: functionality without heavy CPU consumption.
:
: Can you share any helpful hints?
:
: Best Regards,
: Adr
:



-Hoss





Re: Grouping results by choosen field

2006-03-17 Thread karl wettin


On 17 Mar 2006, at 16:36, Java Programmer wrote:

> My problem concerns result grouping. The best example is Google search,
> where results are sorted by relevance and also grouped by domain (grouped
> results have a small indent/margin). In my project I want similar
> functionality without heavy CPU consumption.
>
> Can you share any helpful hints?


I do that. Basically, I marshal the hit documents to Java instances
of Comparable. Then I just use plain old Collections.sort() on the
documents' object representation. Each document may contain
classification weights. A weight points at a classification value, and
the classification value points at a clazz.


UML class diagram:

[Persistent]#--- {0..*} -[ClassificationWeight +compareTo()  
+weight:float] {1} -[Classification + compareTo()  
+value:String]--- {1} -[Clazz +fieldName:String + compareTo()]


A clazz in this instance is the group of domain names. The
classification is the actual domain name. You can skip the weight if
you only use domain names; I guess all weights would be 1.


The weight compares to classifications, which compare to the clazz. If
two weights are equal, I use the Lucene score.


You might want to do several passes or a nested order to get the group
top score as the natural order per group cluster.
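
The "several passes" idea can be sketched as follows. This is only an
illustration under assumptions, not the implementation described above:
the Hit class, the GroupedSort class, and the sortGrouped method are
hypothetical names, and the classification weights are left out so that
only the domain and the score take part in the ordering. The first pass
finds the top score of each domain; the second pass sorts with a nested
Comparator so that every group stays together and groups are ordered by
their best hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupedSort {

    /** Hypothetical stand-in for a hit marshalled out of the index. */
    static class Hit {
        final String domain;
        final float score;

        Hit(String domain, float score) {
            this.domain = domain;
            this.score = score;
        }
    }

    /** Returns a sorted copy of hits; the input list is left untouched. */
    static List<Hit> sortGrouped(List<Hit> hits) {
        // Pass 1: find the best score within each domain.
        final Map<String, Float> topScore = new HashMap<String, Float>();
        for (Hit h : hits) {
            Float best = topScore.get(h.domain);
            if (best == null || h.score > best) {
                topScore.put(h.domain, h.score);
            }
        }

        // Pass 2: nested ordering.
        List<Hit> sorted = new ArrayList<Hit>(hits);
        Collections.sort(sorted, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                // Groups with a higher top score come first.
                int byGroup = Float.compare(topScore.get(b.domain), topScore.get(a.domain));
                if (byGroup != 0) {
                    return byGroup;
                }
                // Equal top scores: fall back to the domain name so that
                // each group stays contiguous.
                int byDomain = a.domain.compareTo(b.domain);
                if (byDomain != 0) {
                    return byDomain;
                }
                // Same domain: higher score first.
                return Float.compare(b.score, a.score);
            }
        });
        return sorted;
    }
}
```

The domain-name tie-break is a design choice of this sketch: without it,
two groups whose best hits score the same could interleave.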


It handles 3000+ queries per minute on 12+ documents in RAM, 24/7  
on a dual core at 40% load. And I even use Lucene for persistency  
even though I should not.





Re: Grouping results by choosen field

2006-03-17 Thread karl wettin


On 17 Mar 2006, at 21:01, karl wettin wrote:

> On 17 Mar 2006, at 16:36, Java Programmer wrote:
>
>> My problem concerns result grouping. The best example is Google search,
>> where results are sorted by relevance and also grouped by domain (grouped
>> results have a small indent/margin). In my project I want similar
>> functionality without heavy CPU consumption.
>>
>> Can you share any helpful hints?
>
> I do that. Basically, I marshal the hit documents to Java instances
> of Comparable. Then I just use plain old Collections.sort() on the
> documents' object representation.


I just made it complicated after this. Sorry. I meant to say that
grouping on fields with a comparator works well for me. If I get too
many results, I only sort the top n hits.
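
Sorting only the top n hits could look roughly like this. It is a small
sketch under assumptions: TopNSort and sortTopN are made-up names, and
the hits are assumed to come back from the search already ordered by
score, so the first n elements are the n best.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class TopNSort {

    /**
     * Copies at most the first n elements of hits and sorts only that copy
     * with the given grouping comparator. The remaining hits are ignored.
     */
    static <T> List<T> sortTopN(List<T> hits, int n, Comparator<? super T> order) {
        int limit = Math.max(0, Math.min(n, hits.size()));
        List<T> top = new ArrayList<T>(hits.subList(0, limit));
        Collections.sort(top, order);
        return top;
    }
}
```

The cost of the sort then depends on n rather than on the total number of
hits.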

