I've read through the API docs and would like to confirm
something: most page metadata is not stored in the
segment files. Some metadata, such as the last crawl date,
is stored, but most of it is not. For example, Nutch
has no way of knowing whether a page is HTML, PDF, or
MP3 once that page has been crawled. Is this correct?

I'm thinking of modifying Nutch to store this
metadata. Nutch seems to be nicely designed, with
clean interfaces, so I should be able to change the
segment file format and the fetcher without affecting
the rest of the system.
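
To make the idea concrete, here is a rough sketch of the kind of per-page record I have in mind. Everything in it is hypothetical: PageMetadataSketch, its methods, and the "Content-Type" key are my own illustration, not existing Nutch classes or the actual segment format, and a real change would have to fit whatever the fetcher and segment writer already do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record (not part of Nutch): per-page metadata kept as
// string key/value pairs and serialized as a count followed by the pairs.
public class PageMetadataSketch {
    private final Map<String, String> fields = new LinkedHashMap<>();

    public void put(String key, String value) {
        fields.put(key, value);
    }

    public String get(String key) {
        return fields.get(key);
    }

    // Serialize: number of entries, then each key and value as UTF strings.
    public byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(fields.size());
            for (Map.Entry<String, String> entry : fields.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize a record previously produced by toBytes().
    public static PageMetadataSketch fromBytes(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            PageMetadataSketch metadata = new PageMetadataSketch();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                metadata.put(key, value);
            }
            return metadata;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The fetcher would fill this in at crawl time (assumed usage).
        PageMetadataSketch metadata = new PageMetadataSketch();
        metadata.put("Content-Type", "application/pdf");

        // Round trip through bytes, as if written to and read from a segment.
        PageMetadataSketch copy = PageMetadataSketch.fromBytes(metadata.toBytes());
        System.out.println(copy.get("Content-Type"));
    }
}
```

A generic key/value layout like this would let new fields be added later without another format change, at the cost of a little space per page compared with fixed fields.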

Does anyone have thoughts on this? If I made this
change, would the Nutch project be interested in
integrating it?
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general