prashant_nutch wrote:
Hi,
Thanks for your early response.
Finally I got search results using subcollection, but there are still some issues.
1. Can we search on more than one subcollection at the same time?
For example, the command is
subcollection:<subcollection name1> <term for search> .......
Can we extend this to subcollection:<subcollection name1> <term for search>
<subcollection name2> <term for search2>
or is there another way to achieve this?
Actually you can, but it requires a little work. Nutch parses the query
with a predefined syntax using JavaCC-generated classes, namely
NutchAnalysis.java and its grammar NutchAnalysis.jj (also see Query.parse()).
Unfortunately, the query syntax does not allow multiple
terms for a field, and it does not include a boolean OR
operation. So a query like
<query_term> <field1> : <term1>, <term2>
is not possible, and neither is a query like
<query_term> (<field1> :<term1> OR <field1>:<term2>)
So for your case, you can add this functionality to NutchAnalysis and
share it with the community, so that Nutch gains this wanted feature.
Alternatively, you can add the clauses to the Query object
programmatically if you know the field a priori.
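As a rough sketch of what an OR over subcollections could look like, the helper below builds a query string in Lucene's classic query parser syntax. It assumes you query the index directly through Lucene or Solr rather than through Nutch's own parser, which does not accept this syntax. The class name, the method name, and the idea of wrapping the search terms in a required group are illustrative assumptions, not part of Nutch.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch or Lucene.
final class SubcollectionQueryString {
    // Field added by the subcollection indexing extension.
    static final String FIELD = "subcollection";

    // Builds a query that requires the search terms and requires
    // at least one of the given subcollection names to match.
    static String build(String searchTerms, List<String> subcollections) {
        if (subcollections.isEmpty()) {
            return searchTerms;
        }
        String names = subcollections.stream()
                .map(SubcollectionQueryString::quote)
                .collect(Collectors.joining(" OR "));
        return "+(" + searchTerms + ") +" + FIELD + ":(" + names + ")";
    }

    // Quotes a name so that special characters in it are not parsed as syntax.
    private static String quote(String name) {
        return "\"" + name.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

For example, build("nutch", Arrays.asList("A", "B")) yields +(nutch) +subcollection:("A" OR "B"), which matches documents containing the term in either subcollection.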
2. If we want to add URLs to a subcollection after crawling, remove URLs from a
subcollection, or
merge two subcollections, do we have to do a new crawl each time?
Can we manage our subcollections as required without having to
recrawl? (For example, with subcollections A and B, we now want to add some URLs from A
to B.)
As above, this is not an issue of subcollection either, but an issue of
Lucene itself. All the subcollection indexing extension does is add
a subcollection field to the document, whose possible values are the
subcollection names. Thus you can perform whatever operations you like on
the index. I suggest you learn more about Lucene by reading its wiki
or one of the books. You can also check out Solr, which manages the
index more dynamically.
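To make the point concrete, here is a small in-memory model (not Lucene code) in which each indexed document carries a set of subcollection names. Adding a URL to a subcollection or merging two subcollections is then only a change to that field, with no crawl involved. All class and method names are made up for illustration. In a real Lucene index, changing a field means deleting the document and adding a new version of it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for an index with a subcollection field.
final class SubcollectionIndexModel {
    // URL -> values of that document's subcollection field.
    private final Map<String, Set<String>> docs = new TreeMap<>();

    void index(String url, String... names) {
        docs.put(url, new TreeSet<>(Arrays.asList(names)));
    }

    // Adds an already indexed URL to another subcollection.
    void addTo(String url, String name) {
        Set<String> names = docs.get(url);
        if (names == null) {
            throw new IllegalArgumentException("not indexed: " + url);
        }
        names.add(name);
    }

    // Merges subcollection 'from' into subcollection 'into'.
    void merge(String from, String into) {
        for (Set<String> names : docs.values()) {
            if (names.remove(from)) {
                names.add(into);
            }
        }
    }

    // Returns the URLs in the given subcollection, sorted by URL.
    List<String> urlsIn(String name) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : docs.entrySet()) {
            if (e.getValue().contains(name)) {
                urls.add(e.getKey());
            }
        }
        return urls;
    }
}
```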