A further piece of information:
I found out about the "url:..." token in the query string, which seems to
restrict the result set to what I want. That is nice, but it brings a
problem of its own.
Say you have a query such as "some text url:foo". The results will be
properly filtered. However, whatever produces the summary that is delivered
for each hit treats 'foo' as a full-text search term as well. So, if 'foo'
also appears in the full text of a result (not only in its URL), the excerpt
chosen as the summary will contain 'foo', which is not at all what I want:
'foo' is only supposed to be relevant in the URL, not in the full text.
As a workaround, I have done this:
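// query_1 carries the url: restriction and is used to find the hits.
Query query_1 = Query.parse("some words url:foo", conf);
// query_2 holds only the full-text terms and is used for the summaries.
Query query_2 = Query.parse("some words", conf);
...
Hits hits = bean.search(query_1, max_results);
Hit[] show = hits.getHits(0, length);
HitDetails[] details = bean.getDetails(show);
// Summarize with query_2 so that 'foo' does not influence the summary.
Summary[] summaries = bean.getSummary(details, query_2);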
Note how I am using query_2 to get the summaries for the hits, not query_1
(which was used to get the hits in the first place). This works because the
code that ultimately builds the summary
(src/java/org/apache/nutch/summary/basic/BasicSummarizer.java) uses the
query that is passed in only to extract the query terms, and for nothing
else. Take a look at the getSummary() function in there: it just calls
query.getTerms().
However, getSummary() there doesn't seem to know about 'special' parts of a
query, such as 'site' or 'url'. For a query such as "foss open url:foo" it
ends up with these terms: [ 'foss', 'open', 'foo' ]. In reality, 'foo'
should not appear.
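Since the two query strings in the workaround differ only in the field
tokens, the second one could be derived from the first rather than typed
twice. The following is only a rough sketch of that idea: the class
SummaryQueryText and the method stripFieldTokens() are hypothetical names,
not part of Nutch, and the sketch assumes that field restrictions always
appear as whitespace-separated "name:value" tokens (quoted phrases are not
handled).

```java
import java.util.ArrayList;
import java.util.List;

public class SummaryQueryText {

    // Hypothetical helper, not part of Nutch: returns the query text with
    // every whitespace-separated "field:value" token removed.
    // Assumption: tokens are split on whitespace only, so quoted phrases
    // that contain spaces or colons are not handled correctly.
    public static String stripFieldTokens(String queryText) {
        List<String> kept = new ArrayList<String>();
        for (String token : queryText.trim().split("\\s+")) {
            // Ignore a leading '+' or '-' prefix, if any, when testing.
            String bare = token;
            if (bare.startsWith("+") || bare.startsWith("-")) {
                bare = bare.substring(1);
            }
            int colon = bare.indexOf(':');
            // A token counts as fielded if text exists on both sides of ':'.
            boolean fielded = colon > 0 && colon < bare.length() - 1;
            if (!token.isEmpty() && !fielded) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```

The returned string would then be passed to Query.parse(..., conf) to build
the query that is handed to bean.getSummary(), in place of the hand-written
second string.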
The workaround of creating two separate query objects works, but it seems
unnecessarily complex and requires an additional query object. A proper
solution, it appears, would be to make the getTerms() function of the query
object more intelligent, so that it returns only those tokens that pertain
to the actual full-text search, not any tokens that have been specified for
things such as 'url' or 'site'.
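To make the idea concrete, here is a minimal sketch of what such a
field-aware term extraction might look like. It does not use Nutch's own
classes: FieldAwareTerms, its nested Clause class and the "DEFAULT" field
marker are stand-ins invented for illustration, on the assumption that a
parsed query is a list of clauses that each record which field they apply
to.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldAwareTerms {

    // Hypothetical stand-in for one parsed query clause; not Nutch's class.
    public static final class Clause {
        // Assumed marker for clauses that search the plain full text.
        public static final String DEFAULT_FIELD = "DEFAULT";
        public final String field;
        public final String term;

        public Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }
    }

    // Collects only the terms of clauses that target the full text and
    // skips clauses bound to a named field such as 'url' or 'site'.
    public static List<String> fullTextTerms(List<Clause> clauses) {
        List<String> terms = new ArrayList<String>();
        for (Clause clause : clauses) {
            if (Clause.DEFAULT_FIELD.equals(clause.field)) {
                terms.add(clause.term);
            }
        }
        return terms;
    }
}
```

With clauses for 'foss' and 'open' on the default field and 'foo' on the
'url' field, fullTextTerms() would return only 'foss' and 'open', which is
what the summarizer should be working with.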
Is there a better workaround? Is there some funky plugin that would take
care of all of this magically?
--
View this message in context:
http://www.nabble.com/Searching-in-sub-section-of-site-tp17479657p17481063.html
Sent from the Nutch - User mailing list archive at Nabble.com.