Hello nutch-users,

I have some well-formed HTML/XML content files. Each file has a corresponding URL that it is associated with.
These files were crawled from a BBS that requires logging in with a cookie, which is why I don't use Nutch's built-in crawler to fetch them at all.

Now my problems are:

1) How do I tell Nutch to ingest these files correctly? For the XML files, it needs to decide which parts are the real content.
2) How do I tell Nutch to treat the corresponding URL as a property associated with each file?

For example, I have two files on local disk:

con01.html => http://www.somewhere.com/someurl.html
con02.xml => http://www.somewhere.com/url02.xml

I want to add these two files to Nutch and have Nutch remember their URLs as well, for future searches.

Thank you in advance for your help. I have read the help documentation, but it doesn't explain this well.

Regards,
David
