Jérôme,
Are you mainly concerned with the charset in Content-Type?
Currently, what happens when Content-Type exists both in the HTTP layer and in a META
tag (if the content is HTML)?
How does Nutch guess the Content-Type, and when does it need to do that?
Is there a situation where the guessed content-type differs from the
content-type in the metadata?
If so, which class uses which?
-kuro

> -----Original Message-----
> From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
> Sent: 2006-4-13 12:57
> To: [email protected]
> Subject: Re: Content-Type inconsistency?
> 
> I would like to come back to this issue:
> The Content object holds two content-types:
> 1. The raw content-type from the protocol layer (the HTTP header in the
> case of HTTP), in the Content's metadata.
> 2. The guessed content-type, in a private content-type field.
> 
> When a ParseData object is created, it takes only the Content's metadata.
> So the ParseData can access only the raw content-type, not the guessed one.
> 
> What I suggest is:
> 1. Add a content-type parameter to the ParseData constructors (so that
> Parsers can pass the guessed content-type to ParseData).
> 2. Have the Content object store the guessed content-type in its metadata,
> in a special attribute named, for instance, GUESSED_CONTENT_TYPE, so that
> ParseData can access it.
> 
> I think 1. is really the cleanest way to implement this, but a lot of
> code is impacted: all the parsers.
> Solution 2. has no impact on the APIs, so the code changes are very small.
> 
> Suggestions? Comments?
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 
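
To make the two options in the quoted proposal concrete, here is a minimal sketch of both side by side. It does not use the real Nutch classes: SketchContent, SketchParseData and all of their methods are hypothetical stand-ins, and only the attribute name GUESSED_CONTENT_TYPE comes from the proposal itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Content; NOT the real Nutch class.
class SketchContent {
    static final String CONTENT_TYPE = "Content-Type";
    // Attribute name suggested in the proposal (option 2).
    static final String GUESSED_CONTENT_TYPE = "GUESSED_CONTENT_TYPE";

    private final Map<String, String> metadata = new HashMap<>();
    private final String guessedType; // the private "guessed" field

    SketchContent(String rawType, String guessedType) {
        this.guessedType = guessedType;
        // Raw value as reported by the protocol layer.
        metadata.put(CONTENT_TYPE, rawType);
        // Option 2: also publish the guessed value through the metadata.
        metadata.put(GUESSED_CONTENT_TYPE, guessedType);
    }

    Map<String, String> getMetadata() { return metadata; }
    String getContentType() { return guessedType; }
}

// Hypothetical stand-in for ParseData; NOT the real Nutch class.
class SketchParseData {
    private final Map<String, String> metadata;
    private final String contentType;

    // Option 1: the parser passes the guessed type explicitly.
    SketchParseData(Map<String, String> metadata, String contentType) {
        this.metadata = metadata;
        this.contentType = contentType;
    }

    // Option 2: constructor signature unchanged; the guessed type
    // is looked up in the metadata under the special attribute.
    SketchParseData(Map<String, String> metadata) {
        this(metadata, metadata.get(SketchContent.GUESSED_CONTENT_TYPE));
    }

    String getContentType() { return contentType; }
    String getRawContentType() { return metadata.get(SketchContent.CONTENT_TYPE); }
}

class ContentTypeSketch {
    public static void main(String[] args) {
        // Raw header says text/html, the guesser decided otherwise.
        SketchContent content =
            new SketchContent("text/html", "application/xhtml+xml");

        SketchParseData viaParameter =
            new SketchParseData(content.getMetadata(), content.getContentType());
        SketchParseData viaMetadata =
            new SketchParseData(content.getMetadata());

        System.out.println(viaParameter.getContentType());
        System.out.println(viaMetadata.getContentType());
        System.out.println(viaMetadata.getRawContentType());
    }
}
```

In this sketch, option 1 changes what every caller of the constructor must supply, which corresponds to the "all the parsers" impact mentioned above, while option 2 only changes what the content object writes into its metadata.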


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
