I've read through the API docs and would like to confirm
something: most page meta-data is not stored in the
segment files.  Some meta-data, such as the last crawl
date, is stored, but most is not.  For example, Nutch
has no way of knowing whether a page is HTML, a PDF, or
an MP3 once that page has been crawled.  Is this correct?

I'm thinking of modifying Nutch to allow storage of
this meta-data.  Nutch seems to be nicely designed,
with clean interfaces, so I should be able to change
the segment file format and the fetcher without
global effects.
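
To make the idea concrete, here is a rough sketch of the kind of
per-page record I have in mind.  PageMetadata and its methods are
hypothetical names of my own, not existing Nutch classes, and the
byte layout is only an assumption about what could be added to a
segment entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-page meta-data record; NOT part of the Nutch API.
class PageMetadata {
    // Sorted map so the serialized form is deterministic.
    private final Map<String, String> fields = new TreeMap<String, String>();

    void put(String name, String value) {
        fields.put(name, value);
    }

    String get(String name) {
        return fields.get(name);
    }

    // Serialize as: field count, then (name, value) pairs as UTF strings.
    byte[] toBytes() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(fields.size());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.flush();
        return buf.toByteArray();
    }

    // Read back a record written by toBytes().
    static PageMetadata fromBytes(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        PageMetadata meta = new PageMetadata();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            String value = in.readUTF();
            meta.put(name, value);
        }
        return meta;
    }
}
```

The fetcher would fill in something like
put("Content-Type", "application/pdf") at fetch time, and anything
reading the segment later could call get("Content-Type") to tell
HTML, PDF, and MP3 pages apart.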

Does anyone have thoughts on this?  If I made this
change, would the Nutch project be interested in
integrating it?

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general