Vacuum Joe wrote:
I've read through the API docs and I hope to confirm
something: most page meta-data is not stored in the
segment files.  Meta-data such as last crawl date is
stored, but most others are not.  For example, Nutch

That's correct. Technically this is possible to do (ParseData.getMetadata()); we just haven't decided yet how to treat multiple values under the same key.
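To illustrate the open question, here is a rough sketch of one possible way to keep several values under the same key. It is not how Nutch stores metadata; the class MultiValueMetadata and its methods are hypothetical names made up for this example, and it uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container, NOT part of Nutch: keeps every value added under a key.
final class MultiValueMetadata {
    // Insertion-ordered map from key to the list of values seen for that key.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append a value; earlier values under the same key are kept.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Return the first value added under the key, or null if the key is absent.
    String getFirst(String key) {
        List<String> v = values.get(key);
        return v == null ? null : v.get(0);
    }

    // Return all values under the key as a read-only list (empty if absent).
    List<String> getAll(String key) {
        List<String> v = values.get(key);
        return v == null ? Collections.<String>emptyList()
                         : Collections.unmodifiableList(v);
    }
}
```

The alternative designs (keep only the last value, or join values into one string) are simpler but lose information, which is roughly the trade-off that still has to be decided.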

has no way of knowing if a page is HTML, PDF, or an
MP3 once that page has been crawled.  Is this correct?

This is incorrect. First, the URL is stored, and it contains, among other things, the filename (so you can check the so-called file extension). Second, Content.getContentType() gives you the content type reported by the server.
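As a minimal sketch of combining the two hints, assume you already have the stored URL and the string returned by Content.getContentType(). The class PageTypeGuess and its methods are hypothetical helpers, not part of Nutch, and the sketch takes plain strings so that it stays self-contained. It trusts the server-reported type first and falls back to the file extension.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, NOT part of Nutch: guesses a page type from two hints.
final class PageTypeGuess {

    // Lower-cased extension of the URL path, or "" if there is none.
    // Note: URI.create throws IllegalArgumentException on malformed URLs.
    static String extensionOf(String url) {
        String path = URI.create(url).getPath(); // excludes the query string
        if (path == null) return "";
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // Ignore dots in directory names and a trailing dot.
        if (dot <= slash || dot == path.length() - 1) return "";
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Strip parameters such as "; charset=UTF-8" from a content type value.
    static String baseType(String contentType) {
        if (contentType == null) return "";
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Prefer the server-reported type; fall back to the file extension.
    static String guess(String url, String contentType) {
        switch (baseType(contentType)) {
            case "text/html":       return "HTML";
            case "application/pdf": return "PDF";
            case "audio/mpeg":      return "MP3";
            default:                break;
        }
        switch (extensionOf(url)) {
            case "html":
            case "htm":  return "HTML";
            case "pdf":  return "PDF";
            case "mp3":  return "MP3";
            default:     return "UNKNOWN";
        }
    }
}
```

Neither hint is fully reliable: servers sometimes report a generic type such as application/octet-stream, and extensions can be missing or misleading, which is why the sketch checks both.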


I'm thinking of modifying Nutch to allow storage of
this meta-data.  It seems like Nutch is very nicely
designed with clean interfaces so I could change the
segment file format and the fetcher without having
global effects.

Please see http://issues.apache.org/jira/browse/NUTCH-62 for a discussion on this subject.

Does anyone have thoughts on this?  If I made this
change, would Nutch be interested in integrating it?

Sure. Nutch development is driven by the community, so if you come up with something useful to a wider audience, we will gladly integrate it. Please check the above link first to avoid unnecessary work.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
