import HTML/XML content files into nutch with properties

David Xiao Mon, 16 Apr 2007 08:40:26 -0700

Hello nutch-users,

I have some well-formatted HTML/XML content files. Each file has a
corresponding URL associated with it.


These files were crawled from a BBS that requires logging in with a
cookie, which is why I did not use Nutch's built-in crawler to fetch
them.

Now the problems are:
1) How do I tell Nutch to ingest these files correctly? For XML
files, it needs to decide which parts are the real content.
2) How do I tell Nutch to keep the corresponding URL as an associated
property of each file?

For example, here I have two files on local disk:
con01.html => http://www.somewhere.com/someurl.html
con02.xml => http://www.somewhere.com/url02.xml
I want to add these two files to Nutch and have Nutch remember their
URLs as well, for future searches.
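
To make the mapping concrete, here is a small Python sketch of the
bookkeeping I have in mind. It is not Nutch code and does not call any
Nutch API; the manifest format and the names parse_manifest and url_for
are only assumptions for illustration.

```python
# Hypothetical manifest: one "local file => original URL" pair per line,
# written in the same form as the example above.
MANIFEST = """\
con01.html => http://www.somewhere.com/someurl.html
con02.xml => http://www.somewhere.com/url02.xml
"""


def parse_manifest(text):
    """Return a dict mapping each local file name to its original URL."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, sep, url = line.partition("=>")
        if not sep:
            raise ValueError(f"missing '=>' in line: {line!r}")
        mapping[name.strip()] = url.strip()
    return mapping


# Lookup table an import step could consult for each file on disk.
url_for = parse_manifest(MANIFEST)
```

Whatever import mechanism Nutch offers would then need to take the
content from the local file and the URL from a table like this.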

Thank you in advance for your help. I have read the help
documentation, but it did not explain this well.

Regards,
David
