I am looking to use Nutch to crawl and index a website. Many of the pages
have videos on them. We have transcripts for the videos that we would like
to include in the index, but we do not want to put the transcripts on
the web pages themselves.
Is there a way to "add" this information to a given web page for indexing
purposes as part of the crawl process, or at some other point before the
index is generated? I am hoping there is a step in the crawl where I can
add the extra content to a page in the Nutch segment (a rough thought,
based on very limited time spent looking at Nutch).
We are comfortable with Java and can write custom code as needed. I would
appreciate any pointers on where to look in the Nutch code.
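To make the idea concrete, here is a rough sketch in plain Java of the kind
of step I have in mind. It does not use any Nutch API; the class name
TranscriptAugmenter, the augment method, and the URL-to-transcript map are
all made up for illustration, and I am assuming the transcripts can be
looked up by page URL.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: merges a video transcript into
// the text extracted from a page before that text is indexed.
class TranscriptAugmenter {
    private final Map<String, String> transcriptsByUrl;

    TranscriptAugmenter(Map<String, String> transcriptsByUrl) {
        // Defensive copy so later changes to the caller's map have no effect here.
        this.transcriptsByUrl = new HashMap<>(transcriptsByUrl);
    }

    // Returns the page text with the transcript appended on a new line,
    // or the page text unchanged when no transcript is known for the URL.
    String augment(String url, String pageText) {
        String transcript = transcriptsByUrl.get(url);
        if (transcript == null || transcript.isEmpty()) {
            return pageText;
        }
        return pageText + "\n" + transcript;
    }
}
```

What I do not know is where in Nutch a call like augment(url, pageText)
would belong, so that the combined text ends up in the index while the
page itself stays unchanged.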
Thanks in advance,
Chris