On Jul 23, 2010, at 5:06 AM, Karl Wettin wrote: > > 23 jul 2010 kl. 08.30 skrev sk...@sloan.mit.edu: > >> Hi all, I have an interesting problem...instead of going from a query >> to a document collection, is it possible to come up with the best fit >> query for a given document collection (results)? "Best fit" being a >> query which maximizes the hit scores of the resulting document >> collection. > > It would probably be helpful if you explained what it is you attempt to > achieve by doing this. Are you looking for MoreLikeThis?
MatchAllDocsQuery returns the document collection all with a score of 1. Somehow, I don't think this is what you are after. Perhaps you mean given all the queries you've seen in the past, find the "best one"? > >> How should I approach this? All suggestions appreciated. > > > How exepensive of an operation is this allowed to be? Can you waste seconds, > minutes, hours or days? > Are there any requirements on the precision and recall? > > I would no matter what start with looking at the output from a feature > selection algorithm fed with the complete corpus divided in the two classes > "query factory set" and "all other documents". > > The output will not tell you why the terms are important, just that they > probably are used when deciding when to classify documents as part of query > factory set or all other documents. > > It's hard to say where to go from there. > > Create a set of selected terms available in the query factory set. > Create a set of selected terms available in all other documents. > Create a set of selected terms only available in the query factory set. > Create a set of selected terms only available in all other documents. > > See if there is a simple strategy based on above that produce a good result. > > If not you might want to look in to some evolving algorithm that execute > queries with permutations of selected features in order to find the best > query. Or if you have the resources, simply create all permutation of queries. > > If it works then I think all of the steps above could be optimized, cached or > simplified in several ways to make it speedy. > > See Mahout, Weka (has a good experimenter/explorer GUI), Rapidminer, etc for > machine learning APIs. > > It should not have to be too complicated to implement a gain ratio feature > selector using IndexReader if the term vector space is available. > > > karl > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org