Hi folks,

I would deeply appreciate it if someone could shed light on how to solve a
specific search problem I am trying to accomplish with Nutch.

I am currently ABLE to do the following:

Use Nutch to crawl a directory in the local filesystem (Linux). The local
directory contains HTML files.

When I run bin/nutch crawl urls -dir crawllocalfs, it successfully crawls
the directory, and I can see the search results using the WAR file in Tomcat.

The HTML files are raw text with the usual HTML tags. The HTML text has
useful sections which I would like to capture so that I can run an
advanced search on those fields only.

I don't understand how the following can be accomplished:

1) How to extract specific parts of the HTML so that they can be grouped
into certain fields in the Lucene index using Nutch.

2) How to perform an advanced search on the specific fields that are
indexed in Nutch, given that it has a very basic search interface.

I am a Nutch newbie, as you can tell, and would appreciate advice on how to
approach this issue.


Regards,
Taknev
