Hi All,
I am new to Apache Nutch and am using it to crawl some sites. I want to filter pages and extract content based on words in the page text, not on the URL. For example:

1. I need to crawl only those sites whose content (text) contains a word such as 'shop' or 'product'. If no such word is present, the links on that page should not be crawled any further.
2. I want to get structured data (JSON fields, e.g. text, url, metadata) instead of unstructured data (the whole page source).

Any help would be appreciated.

Regards,
Muhammad Umer
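
P.S. To make the two requirements concrete, here is a rough sketch of the behaviour I am after. It is plain Python, not Nutch code: the function names `should_follow_links` and `to_record`, the example URL, and the sample metadata are all made up for illustration, and I assume a simple case-insensitive substring match (so 'workshop' would also match 'shop').

```python
import json

# Example keywords taken from the question; the real list may differ.
KEYWORDS = ("shop", "product")


def should_follow_links(page_text, keywords=KEYWORDS):
    """Return True if the page text contains any keyword (case-insensitive)."""
    lowered = page_text.lower()
    return any(word in lowered for word in keywords)


def to_record(url, page_text, metadata):
    """Build the structured record wanted instead of the raw page source."""
    return {"url": url, "text": page_text, "metadata": metadata}


# Hypothetical page, only to show the intended output shape.
record = to_record("http://example.com/", "Welcome to our shop", {"title": "Home"})

if should_follow_links(record["text"]):
    # Only pages that pass the keyword check would be kept and have
    # their outgoing links crawled.
    print(json.dumps(record, indent=2))
```

In Nutch terms, I am asking where a check like `should_follow_links` could be plugged in, and how to get output shaped like `record`.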

