> That's correct. Technically speaking, this is possible to do
> (ParseData.getMetadata()), we just didn't decide yet how to treat
> multiple values under the same key.

I have a good idea of how to handle that situation.
If there are multiple, conflicting values for important
meta-data such as the content type, the page is badly
broken, and Nutch shouldn't waste effort trying to
figure out what's going on.  For example, if the HTTP
header says "text/plain" and the page's meta-data says
"application/pdf", it's not worth sorting out.  Servers
that are that badly broken are also unlikely to have
content that is interesting, useful, or otherwise
valuable.
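
To make the rule concrete, here is a rough sketch in plain Java.  It
is not Nutch code: the class ContentTypeCheck and its methods are
names I made up for illustration, and treating a missing header as
"skip the page" is my own assumption.  It compares the type from the
HTTP header with the type declared in the page's meta-data and
returns nothing when the two disagree.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper (not part of Nutch): checks whether declared content types agree. */
final class ContentTypeCheck {

    /** Strips parameters such as "; charset=utf-8" and lower-cases the media type. */
    static String baseType(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the agreed content type, or empty if the page should be skipped.
     * A missing meta-data value (null or blank) is not treated as a conflict.
     * Assumption: a missing HTTP header value also means "skip".
     */
    static Optional<String> resolve(String headerType, String metaType) {
        if (headerType == null || headerType.trim().isEmpty()) {
            return Optional.empty();
        }
        String header = baseType(headerType);
        if (metaType == null || metaType.trim().isEmpty()) {
            return Optional.of(header);
        }
        // Conflicting declarations: give up rather than guess.
        return header.equals(baseType(metaType)) ? Optional.of(header) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=UTF-8", "text/html"));
        System.out.println(resolve("text/plain", "application/pdf"));
    }
}
```

The first call agrees on "text/html"; the second is the broken case
from my example above and comes back empty.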

> > has no way of knowing if a page is HTML, PDF, or an
> > MP3 once that page has been crawled.  Is this correct?
>
> This is incorrect. First, the URL is stored, which contains among others
> the filename (so you can check the so called file extension).

That is possible, but going by extensions is also a
form of guessing (a .jsp, .asp, or .php URL could
return almost any format, for example), and it goes
against HTTP, which says that the content type is
specified in the Content-Type header.

> Second, the Content.getContentType() gives you the content type
> reported by the server.

That is the valuable piece of information.  Is there
some reasonable way to use that as part of a Query, so
I could filter only on certain content types, or give
a higher priority to certain types in the results?
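
To show what I mean by filtering and prioritizing, here is a small
sketch in plain Java.  It does not use the real Nutch query classes
(I don't know yet where this would hook in); ContentTypeRanking, Hit,
and the boost factors are all hypothetical.  It assumes each hit
already carries the content type reported by the server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch (not Nutch code): filter or re-rank hits by content type. */
final class ContentTypeRanking {

    /** Made-up search hit: URL, relevance score, and server-reported content type. */
    static final class Hit {
        final String url;
        final double score;
        final String contentType;

        Hit(String url, double score, String contentType) {
            this.url = url;
            this.score = score;
            this.contentType = contentType;
        }
    }

    /** Keeps only the hits whose content type is in the allowed set. */
    static List<Hit> filter(List<Hit> hits, Set<String> allowed) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            if (allowed.contains(h.contentType)) {
                out.add(h);
            }
        }
        return out;
    }

    /** Multiplies each score by a per-type factor (default 1.0) and sorts best first. */
    static List<Hit> boost(List<Hit> hits, Map<String, Double> boosts) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            double factor = boosts.getOrDefault(h.contentType, 1.0);
            out.add(new Hit(h.url, h.score * factor, h.contentType));
        }
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("http://example.com/a.html", 1.0, "text/html"),
            new Hit("http://example.com/b.pdf", 0.8, "application/pdf"));
        // Doubling the PDF score moves the PDF ahead of the HTML page.
        for (Hit h : boost(hits, Collections.singletonMap("application/pdf", 2.0))) {
            System.out.println(h.url + " " + h.score);
        }
    }
}
```

Filtering drops everything outside the requested types; boosting
keeps all hits but reorders them.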

> Please see http://issues.apache.org/jira/browse/NUTCH-62 for a
> discussion on this subject.

I'm looking at that.  I have a few other ideas that go
a bit beyond that to make it more useful for
content-oriented searches.

Basically, the content type is a very important piece
of meta-data.  There are only a small number of valid
content types.  A look at a typical mime types file
shows only about 400 types, and of those only about 20
or 30 are in common use for search users.  My idea is
to add two bytes to the page header in the segment
that hold a key identifying the MIME content type, and
then to build indexes on that key or otherwise process
it.
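
Here is a rough sketch of the two-byte key in plain Java.  It does
not touch the actual segment format; MimeTypeKey, the reserved code 0
for "unknown", and the big-endian byte order are my own assumptions
for illustration.  The table of types would come from a mime types
file.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Nutch code): maps MIME types to a two-byte key and back. */
final class MimeTypeKey {

    /** Code 0 is reserved (by assumption) for types not in the table. */
    static final short UNKNOWN = 0;

    private final Map<String, Short> codes = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    MimeTypeKey(List<String> knownTypes) {
        // Stay within the positive range of a short to avoid sign handling.
        if (knownTypes.size() >= Short.MAX_VALUE) {
            throw new IllegalArgumentException("too many types for this two-byte key");
        }
        names.add(null); // slot 0 = UNKNOWN
        for (String type : knownTypes) {
            codes.put(type, (short) names.size());
            names.add(type);
        }
    }

    /** Returns the key for a MIME type, or UNKNOWN if it is not in the table. */
    short codeFor(String mimeType) {
        return codes.getOrDefault(mimeType, UNKNOWN);
    }

    /** Returns the MIME type for a key, or null for UNKNOWN or out-of-range keys. */
    String typeFor(short code) {
        return (code > 0 && code < names.size()) ? names.get(code) : null;
    }

    /** Encodes the key as two big-endian bytes, as it might sit in a page header. */
    static byte[] toBytes(short code) {
        return ByteBuffer.allocate(2).putShort(code).array();
    }

    /** Decodes a key from two big-endian bytes. */
    static short fromBytes(byte[] twoBytes) {
        return ByteBuffer.wrap(twoBytes).getShort();
    }

    public static void main(String[] args) {
        MimeTypeKey keys = new MimeTypeKey(
            Arrays.asList("text/html", "application/pdf", "audio/mpeg"));
        byte[] header = toBytes(keys.codeFor("application/pdf"));
        System.out.println(keys.typeFor(fromBytes(header)));
    }
}
```

With a fixed-width key like this, an index or a filter can compare
two bytes instead of parsing a content-type string for every page.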

I know that this is at the opposite end of the
spectrum from handling the multiple specifications
that you described in NUTCH-62, but I think this is the
right way to handle content types.  A document can
only have one content type, and that content type
should be specified once, by the server itself.

I realize this is a big change from the current
direction for handling content types, but to me this
seems like the sanest way to make it useful.  What do
you think?
