On Jul 23, 2010, at 5:06 AM, Karl Wettin wrote:

> 
> On 23 Jul 2010, at 08:30, sk...@sloan.mit.edu wrote:
> 
>> Hi all, I have an interesting problem...instead of going from a query
>> to a document collection, is it possible to come up with the best fit
>> query for a given document collection (results)? "Best fit" being a
>> query which maximizes the hit scores of the resulting document
>> collection.
> 
> It would probably help if you explained what you are trying to achieve by 
> doing this. Are you looking for MoreLikeThis?

MatchAllDocsQuery returns every document in the collection, each with a score 
of 1. Somehow, I don't think this is what you are after. Perhaps you mean: given 
all the queries you've seen in the past, find the "best one"?

> 
>> How should I approach this? All suggestions appreciated.
> 
> 
> How expensive is this operation allowed to be? Can you spend seconds, 
> minutes, hours or days?
> Are there any requirements on the precision and recall?
> 
> No matter what, I would start by looking at the output of a feature 
> selection algorithm fed with the complete corpus divided into the two classes 
> "query factory set" and "all other documents".
> 
> The output will not tell you why the terms are important, only that they 
> are probably useful for deciding whether a document belongs to the query 
> factory set or to all other documents.
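A minimal sketch of that feature selection step, using gain ratio (the measure named later in this message) over the binary feature "document contains term". It is written in plain Python rather than against Lucene; the names `gain_ratio` and `rank_terms` and the toy documents are assumptions for illustration, and each document is modeled simply as a set of terms.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(term, factory_docs, other_docs):
    """Gain ratio of the binary feature 'document contains term'."""
    # Contingency counts: class (factory/other) x term (present/absent).
    fp = sum(1 for d in factory_docs if term in d)
    fa = len(factory_docs) - fp
    op = sum(1 for d in other_docs if term in d)
    oa = len(other_docs) - op
    n = fp + fa + op + oa
    class_entropy = entropy([fp + fa, op + oa])
    # Class entropy that remains after splitting on term presence.
    conditional = ((fp + op) / n) * entropy([fp, op]) \
                + ((fa + oa) / n) * entropy([fa, oa])
    info_gain = class_entropy - conditional
    split_info = entropy([fp + op, fa + oa])
    return info_gain / split_info if split_info > 0 else 0.0

def rank_terms(factory_docs, other_docs):
    """All terms, best separators first (ties broken alphabetically)."""
    vocabulary = set().union(*factory_docs, *other_docs)
    scored = [(gain_ratio(t, factory_docs, other_docs), t) for t in vocabulary]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical toy corpus: each document is a set of terms.
factory_docs = [{"lucene", "query", "score"}, {"lucene", "index"}]
other_docs = [{"cooking", "recipe"}, {"index", "recipe"}]
ranked = rank_terms(factory_docs, other_docs)
```

Note that a high score only says the term separates the two classes: here "recipe" ranks as high as "lucene" even though it marks the other documents, which is why the sets described next are worth building.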
> 
> It's hard to say where to go from there.
> 
> Create a set of selected terms available in the query factory set.
> Create a set of selected terms available in all other documents.
> Create a set of selected terms only available in the query factory set.
> Create a set of selected terms only available in all other documents.
> 
> See if there is a simple strategy based on the above that produces a good 
> result.
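The four sets above can be sketched with plain set operations. This assumes the selected terms are already available as a Python set and that each document is a set of terms; `term_sets` and the sample data are hypothetical names, not part of any Lucene API. The OR query at the end is only one possible "simple strategy", shown in classic query parser syntax.

```python
def term_sets(selected_terms, factory_docs, other_docs):
    """Build the four term sets from the selected terms and the two classes."""
    factory_vocab = set().union(*factory_docs)
    other_vocab = set().union(*other_docs)
    in_factory = selected_terms & factory_vocab
    in_other = selected_terms & other_vocab
    return {
        "in_factory": in_factory,
        "in_other": in_other,
        "only_in_factory": in_factory - in_other,
        "only_in_other": in_other - in_factory,
    }

# Hypothetical toy data: selected terms plus the two document classes.
selected = {"lucene", "index", "recipe"}
factory_docs = [{"lucene", "query", "score"}, {"lucene", "index"}]
other_docs = [{"cooking", "recipe"}, {"index", "recipe"}]

sets = term_sets(selected, factory_docs, other_docs)

# One candidate strategy (an assumption): OR together the terms that
# occur only in the query factory set.
simple_query = " OR ".join(sorted(sets["only_in_factory"]))
```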
> 
> If not, you might want to look into an evolutionary algorithm that executes 
> queries built from permutations of the selected features in order to find 
> the best query. Or, if you have the resources, simply try all permutations.
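A rough sketch of the exhaustive variant only, under several assumptions: a query is modeled as an OR over terms (so term order does not matter and combinations are enough), "running" it is a stand-in set-membership check rather than a Lucene search, and the objective is F1 against the target documents rather than Lucene hit scores. The names `matches`, `f1` and `best_query` are hypothetical.

```python
from itertools import combinations

def matches(query_terms, doc):
    """Stand-in for a search: a document matches if it has any query term."""
    return any(term in doc for term in query_terms)

def f1(query_terms, docs, target_ids):
    """F1 of the hit set against the ids of the documents we want back."""
    hits = {i for i, d in enumerate(docs) if matches(query_terms, d)}
    true_positives = len(hits & target_ids)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(hits)
    recall = true_positives / len(target_ids)
    return 2 * precision * recall / (precision + recall)

def best_query(candidate_terms, docs, target_ids, max_terms=3):
    """Try every combination of up to max_terms terms; keep the best scorer."""
    best = (0.0, ())
    for size in range(1, max_terms + 1):
        for combo in combinations(sorted(candidate_terms), size):
            score = f1(combo, docs, target_ids)
            if score > best[0]:  # strict: prefer the shortest query on ties
                best = (score, combo)
    return best

# Hypothetical toy corpus; documents 0 and 1 form the query factory set.
docs = [
    {"lucene", "query", "score"},
    {"lucene", "index"},
    {"cooking", "recipe"},
    {"index", "recipe"},
]
score, terms = best_query({"lucene", "query", "index"}, docs, {0, 1})
```

The number of combinations grows quickly with the number of candidate terms, so in practice `max_terms` and the candidate list (for example, the top-ranked selected features) would need to stay small.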
> 
> If it works then I think all of the steps above could be optimized, cached or 
> simplified in several ways to make it speedy.
> 
> See Mahout, Weka (which has a good experimenter/explorer GUI), RapidMiner, 
> etc. for machine learning APIs.
> 
> It should not be too complicated to implement a gain ratio feature selector 
> using IndexReader if the term vector space is available.
> 
> 
>       karl
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 

