Hi,

thanks for the hint about Luke. With Luke I found the problem: it has nothing to do with query translation, but with the boost assigned to the documents. Most of the documents in the index had a boost of 0.0, and such documents seem to be dropped by default during query evaluation. With the Luke option "Return all matching results, even low scored (unsorted)" on the HitCollector tab of Search, I got all the documents I'd expect for such a query.
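To illustrate what I think is happening, here is a minimal, self-contained sketch. It is not the actual Lucene or Nutch code: the class `ScoreFilterSketch` and its methods are hypothetical, and it assumes that a hit's score is simply a raw score multiplied by the document boost and that the default collector keeps only hits with a score greater than zero.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the observed behaviour; not Lucene's real classes.
public class ScoreFilterSketch {

    // Assumption: final score = raw score * document boost.
    static float[] applyBoost(float rawScore, float[] boosts) {
        float[] scores = new float[boosts.length];
        for (int i = 0; i < boosts.length; i++) {
            scores[i] = rawScore * boosts[i];
        }
        return scores;
    }

    // Stand-in for a default collector: keeps only hits with score > 0.
    static List<Integer> collectPositiveOnly(float[] scores) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] > 0.0f) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Stand-in for Luke's "return all matching results" option: keeps every match.
    static List<Integer> collectAll(float[] scores) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int doc = 0; doc < scores.length; doc++) {
            hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Five matching documents, three of them with boost 0.0.
        float[] boosts = {0.0f, 1.0f, 0.0f, 0.0f, 2.0f};
        float[] scores = applyBoost(0.3f, boosts);
        System.out.println("score > 0 only: " + collectPositiveOnly(scores));
        System.out.println("all matches:    " + collectAll(scores));
    }
}
```

In this model all five documents match the term, but only the two with a non-zero boost survive the first collector, which would fit the small number of results we see in a standard search.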
How can I tell Nutch to return low-scored documents in a standard search using the NutchBean class? Is there a configuration property for this?

Thanks in advance.

Kind regards,
Martina

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Monday, 23 February 2009 13:41
To: [email protected]
Subject: Re: Indexed terms are not found during search in current trunk

Koch Martina wrote:
> Hi,
>
> for a couple of weeks we have been observing strange behaviour when
> indexing/searching with the current trunk (we use the trunk of Feb 4th with
> some of the major patches applied which were released afterwards).
>
> In an index containing only German documents, we find only about 30 documents
> (of 5,000 documents in the index) when searching for common German terms like
> articles (der, die, das). We don't do stop-word filtering, so we'd expect to
> get almost all documents returned on such a search.
>
> When using Luke or Limo we see that many more documents contain these terms
> in the content field. That means the terms got indexed, but strangely they
> are not returned on searching.
>
> Did anybody observe something similar or have an explanation for what goes
> wrong?

This may be caused by some problem in query translation from Nutch query to Lucene query. Please add some logging in LuceneQueryOptimizer to log the Lucene query just before it's submitted to Lucene IndexSearcher. This should already help you to understand what query is really executed at the Lucene level.

No matter how many results you get, please run the same query in Luke - you should get exactly the same number of results.

-- 
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
