Stefan Groschupf wrote:

I like the idea, and it is another step in the direction of vertical search, where I personally see the biggest opportunity for Nutch.

How to implement it? Surprisingly, I think that it's very simple: just adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and retrieve these policies by ID. We would then instantiate the policy and call the appropriate methods wherever we currently use the URLFilters and do the score calculations.
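
As a rough illustration of the policyId idea, here is a minimal Java sketch. It does not use the real Nutch classes: CrawlPolicy, HostPolicy, PolicyRegistry and SimpleDatum are hypothetical names invented for this example, and the in-memory map is only an assumed stand-in for whatever mechanism would store and retrieve policies by ID.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-URL policy; NOT an existing Nutch interface.
interface CrawlPolicy {
    boolean accept(String url);                  // the role URLFilters play today
    float adjustScore(String url, float score);  // hook for score calculations
}

// Hypothetical example policy: accept URLs containing a fragment, scale the score.
final class HostPolicy implements CrawlPolicy {
    private final String fragment;
    private final float boost;

    HostPolicy(String fragment, float boost) {
        this.fragment = fragment;
        this.boost = boost;
    }

    @Override
    public boolean accept(String url) {
        return url.contains(fragment);
    }

    @Override
    public float adjustScore(String url, float score) {
        return score * boost;
    }
}

// Hypothetical registry that stores and retrieves policies by ID.
final class PolicyRegistry {
    private final Map<Integer, CrawlPolicy> policies = new HashMap<>();
    private final CrawlPolicy fallback;

    PolicyRegistry(CrawlPolicy fallback) {
        this.fallback = fallback;
    }

    void register(int id, CrawlPolicy policy) {
        policies.put(id, policy);
    }

    // Unknown IDs fall back to the default policy.
    CrawlPolicy lookup(int id) {
        return policies.getOrDefault(id, fallback);
    }
}

// Simplified stand-in for CrawlDatum, carrying only the fields needed here.
final class SimpleDatum {
    final String url;
    final float score;
    final int policyId;

    SimpleDatum(String url, float score, int policyId) {
        this.url = url;
        this.score = score;
        this.policyId = policyId;
    }
}

final class PolicyDemo {
    public static void main(String[] args) {
        PolicyRegistry registry = new PolicyRegistry(new HostPolicy("", 1.0f));
        registry.register(7, new HostPolicy("example.org", 2.0f));

        SimpleDatum datum = new SimpleDatum("http://example.org/a", 1.0f, 7);
        CrawlPolicy policy = registry.lookup(datum.policyId);
        System.out.println(policy.accept(datum.url));
        System.out.println(policy.adjustScore(datum.url, datum.score));
    }
}
```

The only per-URL cost in this sketch is one int on the datum; everything else lives in the registry and is shared by all URLs with the same policy.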


Before we start adding metadata and more metadata, why not add general-purpose metadata to the CrawlDatum once? Then we could have all kinds of plugins that add and process metadata belonging to a URL.
Besides policyId, I see many more candidates for crawl metadata:
last crawl date, URL category, collection key (similar to Google appliance collections), etc.


Hehe... That was what I advocated from the beginning. There is a cost associated with this, though, i.e. any change in CrawlDatum size has a significant impact on most operations' performance.
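
To make the size concern concrete, here is a small Java sketch that serializes a record with and without generic metadata and compares the byte counts. MetaDatum is a hypothetical class made up for this example, and its field layout (a long, a float, then a counted list of string pairs) is an assumption that does not mirror the real CrawlDatum format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record with a generic metadata map; the layout is assumed.
final class MetaDatum {
    final long fetchTime;
    final float score;
    final Map<String, String> metadata = new LinkedHashMap<>();

    MetaDatum(long fetchTime, float score) {
        this.fetchTime = fetchTime;
        this.score = score;
    }

    // Fixed fields first, then the entry count and each key/value pair.
    void write(DataOutputStream out) throws IOException {
        out.writeLong(fetchTime);
        out.writeFloat(score);
        out.writeInt(metadata.size());
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Serializes into memory and reports how many bytes were written.
    int serializedSize() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }
}

final class MetaDatumDemo {
    public static void main(String[] args) {
        MetaDatum datum = new MetaDatum(0L, 1.0f);
        System.out.println("without metadata: " + datum.serializedSize());
        datum.metadata.put("category", "news");
        System.out.println("with one entry:   " + datum.serializedSize());
    }
}
```

In this assumed layout each entry costs its key and value bytes plus two length prefixes, so even a single short property can noticeably grow a record that is otherwise only a handful of bytes.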

All solutions I have seen so far load this kind of metadata at indexing time from a third-party data source (a database) and add it to the index. This works but is very slow.


Well, maybe it makes sense to store the CrawlDatum and its "metadata" separately in two MapFiles, so that you can perform some operations using only the lightweight CrawlDatum, and for other operations you will need to load the properties too...
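
A minimal Java sketch of that split, under stated assumptions: SplitCrawlDb is a hypothetical class, and the two in-memory TreeMaps merely stand in for two sorted MapFiles keyed by URL. No real MapFile API is used here.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical two-store layout; each TreeMap stands in for one sorted MapFile.
final class SplitCrawlDb {
    // Lightweight store: URL -> score only.
    private final TreeMap<String, Float> scores = new TreeMap<>();
    // Heavier store: URL -> properties, present only for URLs that have any.
    private final TreeMap<String, Map<String, String>> properties = new TreeMap<>();

    void put(String url, float score, Map<String, String> props) {
        scores.put(url, score);
        if (!props.isEmpty()) {
            properties.put(url, new TreeMap<>(props));
        }
    }

    // Reads only the lightweight store; the properties store is never touched.
    float totalScore() {
        float sum = 0f;
        for (float value : scores.values()) {
            sum += value;
        }
        return sum;
    }

    // Operations that need metadata look it up by the same URL key.
    Map<String, String> propertiesOf(String url) {
        Map<String, String> props = properties.get(url);
        if (props == null) {
            return Collections.<String, String>emptyMap();
        }
        return Collections.unmodifiableMap(props);
    }
}

final class SplitCrawlDbDemo {
    public static void main(String[] args) {
        SplitCrawlDb db = new SplitCrawlDb();
        Map<String, String> props = new TreeMap<>();
        props.put("category", "news");
        db.put("http://example.org/a", 1.5f, props);
        db.put("http://example.org/b", 2.5f, new TreeMap<String, String>());

        System.out.println(db.totalScore());
        System.out.println(db.propertiesOf("http://example.org/a"));
    }
}
```

Because both stores share the same sorted key, an operation that needs the properties could walk the two in step, while score-only operations pay nothing for the metadata.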

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



