Enis - thanks for the pointer.

Enis Soztutar wrote:
You can write index plugins. Please first read the (slightly outdated) tutorial and then check http://wiki.apache.org/nutch/PluginCentral. Optionally, you may want to write HTML parse plugins, depending on the source of the data.
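
To make the index-plugin idea concrete, here is a minimal, self-contained sketch of the augmentation step such a plugin might perform: look up a transcript by page URL and add it as an extra field before the document is indexed. Everything in it is an assumption for illustration. The class name TranscriptAugmenter, the "transcript" field name, and the use of a plain Map as the document are hypothetical stand-ins, not Nutch classes or APIs. In a real plugin this logic would sit inside Nutch's indexing filter extension point, whose exact interface depends on the Nutch version.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT part of Nutch: models the "add transcript at
// index time" step using plain maps instead of Nutch's document classes.
public class TranscriptAugmenter {

    // Transcripts keyed by page URL (assumed to be loaded from wherever
    // the transcripts are stored, e.g. a file or database).
    private final Map<String, String> transcriptsByUrl;

    public TranscriptAugmenter(Map<String, String> transcriptsByUrl) {
        // Defensive copy so later changes to the caller's map are not seen.
        this.transcriptsByUrl = new HashMap<>(transcriptsByUrl);
    }

    // Returns a copy of the document fields, with a "transcript" field
    // added when a transcript exists for the given URL. The input map is
    // left unmodified.
    public Map<String, String> augment(String url, Map<String, String> doc) {
        Map<String, String> result = new HashMap<>(doc);
        String transcript = transcriptsByUrl.get(url);
        if (transcript != null) {
            result.put("transcript", transcript);
        }
        return result;
    }
}
```

Pages without a transcript pass through unchanged, so the same step can run on every page in the crawl. The transcript only becomes searchable if the search side also queries the added field.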

Chris Hane wrote:
I am looking to use Nutch to crawl and index a website. Many of the pages have videos on them. We have transcripts for the videos that we would like to include in the index, but we do not want to put the transcripts on the web pages.

Is there a way to "add" this information to a given web page for indexing purposes as part of the crawl process? Perhaps at another point in the process, before the index is generated? I am hoping there is a point in the crawl process where I can add augmented content to a page in the Nutch segment (a rough thought, based on very limited time spent looking at Nutch).

We are comfortable using Java and can write custom code as needed. I would appreciate any pointers on where to look in the Nutch code.

Thanks in advance,
Chris.....

