> That's correct. Technically speaking, this is possible to do
> (ParseData.getMetadata()), we just didn't decide yet how to treat
> multiple values under the same key.

I have a good idea of how to handle that situation. If there are multiple, conflicting values for important metadata such as the content type, the page is badly broken, and Nutch shouldn't waste effort trying to figure out what's going on. For example, if the HTTP header says "text/plain" and a metadata value says "text/pdf", it's not worth sorting out. Servers that are that badly broken are also unlikely to have content that is interesting, useful, or otherwise valuable.

> > has no way of knowing if a page is HTML, PDF, or an
> > MP3 once that page has been crawled. Is this correct?
>
> This is incorrect. First, the URL is stored, which contains among others
> the filename (so you can check the so called file extension).

That is possible, but going on extensions is also a form of guessing (JSP, ASP or PHP could be almost any kind of format, for example), and it goes against HTTP, which says that the content type is specified in the HTTP headers.

> Second, the Content.getContentType() gives you the content type
> reported by the server.

That is the valuable piece of information. Is there some reasonable way to use it as part of a Query, so I could filter on certain content types only, or give a higher priority to certain types in the results?

> Please see http://issues.apache.org/jira/browse/NUTCH-62 for a
> discussion on this subject.

I'm looking at that. I have a few other ideas that go a bit beyond it to make it more useful for content-oriented searches. Basically, the content type is a very important piece of metadata. There are only a small number of valid content types. A look at a typical mime types file shows only about 400 types, and of those only about 20 or 30 are in common use for search users.
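For what it's worth, the rule for conflicting values that I described at the top could be sketched roughly like this. This is only an illustration, not Nutch code: the class ContentTypeCheck and its methods are hypothetical names, the second declaration is assumed to come from something like an HTML meta tag, and treating a missing header as "skip" is my own assumption.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper, not part of Nutch: applies the "don't guess" rule. */
final class ContentTypeCheck {

    /** Strips parameters such as "; charset=utf-8" and lowercases the type. */
    static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String bare = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the server-reported type, or empty if a second declaration
     * disagrees with it. metaType may be null when the page itself
     * declares nothing.
     */
    static Optional<String> resolve(String headerType, String metaType) {
        // Assumption: a page with no server-reported type is skipped as well.
        if (headerType == null || headerType.trim().isEmpty()) {
            return Optional.empty();
        }
        String fromHeader = normalize(headerType);
        if (metaType != null && !normalize(metaType).equals(fromHeader)) {
            return Optional.empty(); // conflicting values: don't try to guess
        }
        return Optional.of(fromHeader);
    }

    public static void main(String[] args) {
        // Header and second declaration agree once parameters and case are ignored.
        System.out.println(resolve("text/html; charset=utf-8", "TEXT/HTML"));
        // The conflicting example from above: the page is skipped.
        System.out.println(resolve("text/plain", "text/pdf"));
    }
}
```

The point of returning an empty result instead of picking a winner is that the crawler never has to decide which of two contradictory declarations to believe.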

My idea is to add two bytes to the page header in the segment that hold a key for the MIME content type, and then be able to build indexes on that key or otherwise process it. I know that this is at the opposite end of the spectrum from dealing with multiple specifications, as you described in NUTCH-62, but I think this is the right way to handle content types. A document can only have one content type, and that content type should be specified once, by the server itself. I realize this is a big change from the current direction for handling content types, but to me this seems like the sanest way to make it useful.

What do you think?

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
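
P.S. To make the two-byte idea a little more concrete, here is a rough sketch of what I have in mind. It is not based on the actual segment format: the class MimeTypeKeys, the tiny table of types, and the choice of key 0 as "unknown" are all assumptions made up for illustration. Two bytes give 65,536 possible keys, which is far more than the roughly 400 types in a typical mime types file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not Nutch code: a fixed table of MIME types with two-byte keys. */
final class MimeTypeKeys {

    // Illustrative table only; a real one would be built from a mime types file.
    private static final String[] TYPES = {
        "application/octet-stream", // assumption: key 0 doubles as "unknown"
        "text/html",
        "text/plain",
        "application/pdf",
        "audio/mpeg",
    };

    private static final Map<String, Integer> KEY_BY_TYPE = new HashMap<>();
    static {
        for (int i = 0; i < TYPES.length; i++) {
            KEY_BY_TYPE.put(TYPES[i], i);
        }
    }

    /** Returns the key for a type, or 0 if the type is not in the table. */
    static int keyFor(String mimeType) {
        return KEY_BY_TYPE.getOrDefault(mimeType, 0);
    }

    /** Returns the type for a key, or the "unknown" type for an unused key. */
    static String typeFor(int key) {
        return key >= 0 && key < TYPES.length ? TYPES[key] : TYPES[0];
    }

    /** Encodes a key (0..65535) as two bytes, high byte first. */
    static byte[] toBytes(int key) {
        return new byte[] { (byte) (key >>> 8), (byte) key };
    }

    /** Decodes the two bytes written by toBytes back into a key. */
    static int fromBytes(byte[] twoBytes) {
        return ((twoBytes[0] & 0xFF) << 8) | (twoBytes[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] stored = toBytes(keyFor("application/pdf"));
        System.out.println(typeFor(fromBytes(stored)));
    }
}
```

With a small integer key stored per page, filtering or boosting by content type becomes a comparison on a number instead of on a string.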
