[ 
http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359717 ] 

Jerome Charron commented on NUTCH-133:
--------------------------------------

I think Doug's proposal is the right way to solve this content-type issue.
The only thing it misses is the "guess mime-type from file extension" step.
In fact, we have 3 clues to guess the content-type of a document:
1. The Content-Type header
2. The file name extension
3. The file content (magic bytes).

We should use all of this information, but dealing with 3 clues is harder than
dealing with only 2.
What I suggest is:
1. Guess the content-type from the file content (magic bytes) [only if the
mime.type.magic property is true].
2. If a content-type is found, return it (magic information is the most
reliable clue, so I suggest there is no need to double/triple-check it against
the file extension and the header).
3. Otherwise:
3.1 Use a double check (as in Doug's piece of code) between the file-extension
content-type and the header content-type. But which clue is more important, the
header or the file extension? On the server side, for static files, the
content-type is generally set from the file extension => the file extension is
then the more reliable information (it provides a way to deal with bad
configuration on the server side). For dynamic files, the content-type is set
by the application server, by the servlet / cgi / ..., or by the page itself,
and the file name can have any (or no) extension => in such a case, the header
is the more reliable clue we have... So what is the best way to resolve the
content-type when no magic-bytes resolution is available?
As a first approach, I suggest using the double check Doug proposed, but based
on the file extension instead of the magic bytes.
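A rough sketch of that resolution order (the class and method names below are
illustrative, not Nutch's actual MimeTypes API):

```java
// Sketch of the proposed resolution order; names are illustrative,
// not the actual Nutch MimeTypes API.
public class ContentTypeResolver {

    /**
     * Resolve the content type from three clues: magic bytes (most
     * reliable), file-name extension, and the Content-Type header.
     */
    public static String resolve(String magicType,
                                 String extensionType,
                                 String headerType) {
        // Steps 1 and 2: magic bytes win outright when available.
        if (magicType != null) {
            return magicType;
        }
        // Step 3.1: no magic match, so prefer the file extension over the
        // header, since for static files the server usually derives the
        // header from the extension anyway.
        if (extensionType != null) {
            return extensionType;
        }
        // Fall back to the (possibly absent) header.
        return headerType != null ? headerType : "";
    }
}
```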

If you are ok with this I can provide a patch of the Content.java file by the 
end of the week.
(Chris, here is my feeling about Nutch performance: the more "intelligence"
(features) we add to Nutch, the more we decrease its performance. True,
performance is a very sensitive point for Nutch, but Nutch's global performance
is more a matter of scalability (by architectural design) than of the raw
performance of any single piece of code.)


> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, 
> Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different
> content and parser types; we worked together to identify their source.
> From our point of view, the problems described here could be the cause of
> many other problems reported daily on the mailing lists.
> A summary of the problems follows.
> Problem:
> Some servers return correctly named but mixed-case header keys, such as
> 'Content-type' or 'content-Length', in the HTTP response header.
> That is why, for example, a get("Content-Type") fails and a page is detected
> as zip by the magic content-type detection mechanism.
> We also note that this is a common reason why PDF parsing fails, since
> get("Content-Length") does not return the correct value.
> Sample:
> a server returns "text/HTML" or "application/PDF", or a "Content-length"
> key; or this URL:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, and later convert all
> keys used to query the metadata to lower case as well.
> e.g. in HttpResponse.java, line 353, use lower case here and for all keys
> used to query header properties (including content-length); change:
>   String key = line.substring(0, colonIndex);
> to:
>   String key = line.substring(0, colonIndex).toLowerCase();
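As an alternative to lower-casing every key, the headers could be stored in a
case-insensitive map, which avoids having to normalise each lookup key. A
minimal sketch (not the actual HttpResponse code):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: a case-insensitive header map sidesteps the mixed-case
// problem entirely -- keys compare equal regardless of case.
public class Headers {
    private final Map<String, String> map =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void put(String key, String value) {
        map.put(key, value);
    }

    public String get(String key) {
        return map.get(key);
    }
}
```

With this, get("Content-Type") finds a header that the server sent as
"content-type" or "Content-type".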
> Problem:
> MimeTypes-based discovery (magic and URL based) is only done when the
> content type was not delivered by the web server. This does not happen that
> often; mostly the problem was mixed-case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information exactly as delivered by the web server, and
> move content-type discovery from the Protocol plugins to the component where
> the parsing is done - the ParserFactory.
> Then just create a list of parsers for both the content type returned by the
> server and the custom detected content type. Finally, iterate over all
> parsers until we get a successful parse status.
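The suggested fallback could be sketched like this (Parser and the candidate
handling are simplified placeholders, not the real Nutch interfaces):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of iterating over candidate parsers for both the server-supplied
// and the detected content type; Parser is a placeholder, not the real
// Nutch plugin interface.
public class ParserFallback {

    interface Parser {
        /** Returns parsed text, or null on failure. */
        String parse(byte[] content);
    }

    /** Merge candidates for two content types, dropping duplicates. */
    public static List<Parser> candidates(List<Parser> forHeaderType,
                                          List<Parser> forDetectedType) {
        Set<Parser> merged = new LinkedHashSet<>(forHeaderType);
        merged.addAll(forDetectedType);
        return new ArrayList<>(merged);
    }

    /** Try each candidate parser until one succeeds. */
    public static String parseWithFallback(byte[] content,
                                           List<Parser> candidates) {
        for (Parser p : candidates) {
            String result = p.parse(content);
            if (result != null) {
                return result;   // first successful parse wins
            }
        }
        return null;             // no parser could handle the content
    }
}
```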
> Problem:
> Content is parsed even if the protocol reports an exception and has a
> non-successful status; in such a case the content is always new byte[0].
> Solution:
> Fetcher.java, line 243.
> Change:
>   if (!Fetcher.this.parsing ) { ..
> to:
>   if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>     // TODO: we probably should not write out empty parse text and parse
>     // data here; I suggest giving outputPage a parameter parsed true/false
>     outputPage(new FetcherOutput(fle, hash, protocolStatus),
>           content, new ParseText(""),
>           new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "",
>               new Outlink[0], new Properties()));
>     return null;
>   }
> Problem:
> Currently the configuration of parsers is done based on plugin ids, but one
> plugin can have several extensions, so a plugin can normally provide several
> parsers. This is not limited by design; the configuration process simply
> uses the wrong values.
> Solution:
> Change plugin id to extension id in the parser configuration file, and also
> change the code in the ParserFactory to use extension ids everywhere.
> Problem:
> There is no clear differentiation between content type and mime type.
> I notice that some plugins call metaData.get("Content-Type") while others
> call content.getContentType();
> In theory these can return different values, since the content type could
> have been detected by the MimeTypes util and thus differ from the one
> delivered in the HTTP response header.
> As mentioned, the content type is currently only detected by the MimeTypes
> util when the header does not contain any content-type information or had
> problems with mixed-case keys.
> Solution:
> Take the content-type property out of the metadata and clearly restrict
> access to it to its own getter method.
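A minimal sketch of that encapsulation (class and field names are
illustrative, not the actual Nutch Content class):

```java
import java.util.Properties;

// Sketch: keep the resolved content type out of the free-form metadata,
// so there is exactly one authoritative accessor for it.
public class ContentSketch {
    private final String contentType;   // resolved once, e.g. by ParserFactory
    private final Properties metadata;  // raw headers etc., no content type

    public ContentSketch(String contentType, Properties metadata) {
        this.contentType = contentType;
        this.metadata = metadata;
    }

    /** The single source of truth for the content type. */
    public String getContentType() {
        return contentType;
    }

    /** Other metadata stays accessible, but not the content type. */
    public String getMetadata(String key) {
        return metadata.getProperty(key);
    }
}
```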
> Problem:
> Most protocol plugins check whether the content type is null and only then
> use the MimeTypes util. Since my patch moves the mime-type detection to the
> ParserFactory - where, from my point of view, it belongs - this code is now
> unnecessary and can be removed from the protocol plugins. I never found a
> case where no content type was returned; only mixed-case keys were used.
> Solution:
> Remove this detection code, since it now lives in the ParserFactory.
> I didn't change this myself: the more code I change, the less chance there
> is of getting the patch into the sources. I suggest we open a low-priority
> issue and remove it once we change the plugins.
> Problem:
> This is not a problem, but a 'code smell' (Martin Fowler): there are empty
> test methods in TestMimeType, for example:
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.
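For instance, testGetExtensions could assert something concrete. The sketch
below uses a stub in place of the real MimeType class, whose actual API may
differ:

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: MimeTypeStub stands in for the real MimeType class,
// whose actual API may differ.
public class TestMimeTypeSketch {

    static class MimeTypeStub {
        private final List<String> extensions;

        MimeTypeStub(String... extensions) {
            this.extensions = Arrays.asList(extensions);
        }

        List<String> getExtensions() {
            return extensions;
        }
    }

    /** What a filled-in testGetExtensions might assert. */
    public static boolean testGetExtensions() {
        MimeTypeStub html = new MimeTypeStub("html", "htm");
        return html.getExtensions().contains("html")
            && html.getExtensions().contains("htm");
    }
}
```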

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
