Hi folks, I would greatly appreciate it if someone could shed light on a specific search problem I am trying to solve with Nutch.
Here is what I can currently do: I use Nutch to crawl a directory of HTML files on the local filesystem (Linux). When I run bin/nutch crawl urls -dir crawllocalfs, it crawls the directory successfully, and I can see the search results using the WAR file in Tomcat.

The HTML files contain raw text with the usual HTML tags. The text has useful sections that I would like to capture as separate fields, so that I can run an advanced search restricted to those fields.

I don't understand how to accomplish the following:

1) How to extract specific parts of the HTML so that they can be grouped into specific fields in the Lucene index using Nutch.
2) How to perform an advanced search on those indexed fields, given that Nutch has a very basic search interface.

As you can tell, I am a Nutch newbie, and I would appreciate advice on how to approach this issue.

Regards,
Taknev
