Enis - thanks for the pointer.
Enis Soztutar wrote:
You can write index plugins. Please first read the (slightly outdated)
tutorial and then check http://wiki.apache.org/nutch/PluginCentral.
Optionally you may want to write html parse plugins depending on the
source of the data.
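
To make the index-plugin idea concrete, here is a rough, self-contained
sketch of the logic such a plugin's filter step might carry out: look up
a transcript by page URL and attach it to the document as an extra field
before it is indexed. It deliberately does not use the Nutch API. The
class name, the "transcript" field name, the URL-to-transcript map, and
the example URL are all hypothetical, and a plain Map stands in for the
index document. In a real plugin this logic would sit inside the
indexing filter extension point described on the PluginCentral page.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for the index document, and the
// transcript store is an in-memory map. All names here are assumptions.
final class TranscriptIndexingSketch {

    // Hypothetical lookup table: page URL -> transcript text.
    private final Map<String, String> transcriptsByUrl;

    TranscriptIndexingSketch(Map<String, String> transcriptsByUrl) {
        // Defensive copy so later changes by the caller do not leak in.
        this.transcriptsByUrl = new HashMap<String, String>(transcriptsByUrl);
    }

    // Adds a "transcript" field to the document when a non-empty
    // transcript is known for the URL; otherwise leaves it unchanged.
    Map<String, String> filter(Map<String, String> doc, String url) {
        String transcript = transcriptsByUrl.get(url);
        if (transcript != null && !transcript.isEmpty()) {
            doc.put("transcript", transcript);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> transcripts = new HashMap<String, String>();
        transcripts.put("http://example.com/video1.html", "Welcome to the demo.");

        TranscriptIndexingSketch sketch = new TranscriptIndexingSketch(transcripts);

        Map<String, String> doc = new HashMap<String, String>();
        doc.put("title", "Video 1");
        sketch.filter(doc, "http://example.com/video1.html");

        System.out.println(doc.get("transcript"));
    }
}
```

The transcript never has to appear on the web page itself. It only needs
to be reachable by the plugin at indexing time, for example from files
or a database keyed by URL.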
Chris Hane wrote:
I am looking to use Nutch to crawl/index a website. A lot of the
pages have videos on them. We have transcripts for the videos that we
would like to be included for indexing, but we do not want to put the
transcripts on the web pages.
Is there a way to "add" this information to a given web page for
purposes of indexing as part of the crawl process? Maybe another
point in the process before the index is generated? I am hoping there
is a point in the crawl process where I can add augmented content to a
page in the Nutch segment (rough thought based on very limited time
spent looking at Nutch).
We are comfortable using Java and can write custom code as needed. I
would appreciate any pointers on where to look in the Nutch code.
Thanks in advance,
Chris.....