Vacuum Joe wrote:
I've read through the API docs and I'd like to confirm something: most page meta-data is not stored in the segment files. Meta-data such as the last crawl date is stored, but most other fields are not. For example, Nutch
That's correct. Technically speaking, this is possible to do (ParseData.getMetadata()); we just haven't decided yet how to treat multiple values under the same key.
has no way of knowing if a page is HTML, PDF, or an MP3 once that page has been crawled. Is this correct?
This is incorrect. First, the URL is stored, which contains, among other things, the file name (so you can check the so-called file extension). Second, Content.getContentType() gives you the content type reported by the server.
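To illustrate how the two sources can be combined, here is a rough sketch in plain Java. It does not use any Nutch classes: the class PageTypeGuess and its methods are hypothetical names made up for this example, and the reportedContentType argument simply stands in for whatever string you get back from Content.getContentType(). The extension table is a tiny sample, not a complete mapping.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: guesses a page's type from the
// server-reported content type, falling back to the URL's file extension.
public class PageTypeGuess {

    /** Returns the lower-cased file extension of the URL path, or "" if none. */
    static String extensionOf(String url) {
        String path = URI.create(url).getPath(); // query string is excluded
        if (path == null) {
            return "";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Prefers the reported content type; uses the extension only as a fallback. */
    static String guessType(String url, String reportedContentType) {
        if (reportedContentType != null) {
            // Drop parameters such as "; charset=UTF-8".
            String base = reportedContentType.split(";", 2)[0]
                    .trim().toLowerCase(Locale.ROOT);
            if (!base.isEmpty()) {
                return base;
            }
        }
        switch (extensionOf(url)) {
            case "html":
            case "htm":
                return "text/html";
            case "pdf":
                return "application/pdf";
            case "mp3":
                return "audio/mpeg";
            default:
                return "application/octet-stream";
        }
    }

    public static void main(String[] args) {
        System.out.println(guessType("http://example.com/docs/paper.pdf", null));
        System.out.println(guessType("http://example.com/index", "text/html; charset=UTF-8"));
    }
}
```

The sketch trusts the server-reported type first because many URLs carry no extension at all; whether that is the right order depends on how much you trust the servers you crawl.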
I'm thinking of modifying Nutch to allow storage of this meta-data. It seems like Nutch is very nicely designed with clean interfaces so I could change the segment file format and the fetcher without having global effects.
Please see http://issues.apache.org/jira/browse/NUTCH-62 for a discussion on this subject.
Does anyone have thoughts on this? If I made this change, would Nutch be interested in integrating it?
Sure. Nutch development is driven by the community, so if you come up with something useful for a wider audience, we will gladly integrate it. Please check the above link first, to avoid unnecessary work.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
