[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ]

Stefan Groschupf updated NUTCH-133:
-----------------------------------

    Attachment: Parserutil_test_patch.txt

A test that reproduces most of the problems; see the real-world sample URL in 
the issue description.

> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different 
> content and parser types, and we worked together to identify their source.
> From our point of view, the problems described here could be the cause of 
> many other problems reported daily on the mailing lists.
> A summary of the problems follows.
> Problem:
> Some servers return correct but mixed-case header keys such as 
> 'Content-type' or 'content-Length' in the HTTP response header.
> That is why, for example, a get("Content-Type") fails and a page is detected 
> as zip by the magic content type detection mechanism. 
> We also note that this is a common reason why PDF parsing fails, since the 
> Content-Length lookup does not return the correct value. 
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, and then convert all 
> keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> Use lower case here and for all keys used to query header properties 
> (including content-length). Change:
>   String key = line.substring(0, colonIndex);
> to:
>   String key = line.substring(0, colonIndex).toLowerCase();
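The lower-casing idea above can be sketched in isolation. This is only an illustration under stated assumptions: HeaderMapSketch, addLine and get are hypothetical names and are not part of HttpResponse.java or any other Nutch class.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: stores and looks up HTTP headers
// under lower-cased keys so that 'Content-type' and 'Content-Type' match.
class HeaderMapSketch {
    private final Map<String, String> headers = new LinkedHashMap<>();

    // Parses one raw header line such as "Content-type: text/HTML".
    void addLine(String line) {
        int colonIndex = line.indexOf(':');
        if (colonIndex == -1) {
            return; // not a "key: value" line; ignored in this sketch
        }
        // Lower-case the key on write, as proposed in the issue.
        String key = line.substring(0, colonIndex).trim().toLowerCase(Locale.ROOT);
        String value = line.substring(colonIndex + 1).trim();
        headers.put(key, value);
    }

    // Lower-cases the query key as well, so callers may use any casing.
    String get(String key) {
        return headers.get(key.toLowerCase(Locale.ROOT));
    }
}
```

Only the key is normalized here; the value (for example "text/HTML") is stored as the server sent it, so comparing values would need its own normalization.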
> Problem:
> MimeTypes-based discovery (magic and URL based) is only done when the 
> content type was not delivered by the web server. This does not happen 
> often; in most cases the real problem was mixed-case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content type exactly as delivered by the web server, and move 
> content type discovery from the protocol plugins to the component where the 
> parsing is done: the ParserFactory.
> Then create a list of parsers for both the content type returned by the 
> server and the detected content type. Finally, iterate over all of these 
> parsers until one returns a successful parse status.
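A rough sketch of this candidate-list idea follows. It is not the attached patch and does not use the real Nutch Parser API: SketchParser, ParserChainSketch, the registry map and both method names are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical type for illustration only; not the Nutch Parser interface.
interface SketchParser {
    // Returns true if the content was parsed successfully.
    boolean parse(byte[] content);
}

class ParserChainSketch {
    // Collects parsers for the server-declared type first, then for the
    // detected type, keeping the order and dropping duplicates.
    static List<SketchParser> candidates(Map<String, List<SketchParser>> registry,
                                         String declaredType, String detectedType) {
        Set<SketchParser> ordered = new LinkedHashSet<>();
        for (String type : new String[] {declaredType, detectedType}) {
            if (type != null && registry.containsKey(type)) {
                ordered.addAll(registry.get(type));
            }
        }
        return new ArrayList<>(ordered);
    }

    // Tries each candidate in order and stops at the first success.
    static boolean parseWithFirstSuccess(List<SketchParser> parsers, byte[] content) {
        for (SketchParser parser : parsers) {
            if (parser.parse(content)) {
                return true;
            }
        }
        return false;
    }
}
```

Trying the server-declared type first is a choice made for this sketch; the issue text does not say which of the two types should take precedence.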
> Problem:
> Content is parsed even if the protocol reports an exception and has a 
> non-successful status; in that case the content is always new byte[0].
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO: we probably should not write out empty parse text and parse
>        // data here; I suggest giving outputPage a parameter parsed true / false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "",
>                 new Outlink[0], new Properties()));
>         return null;
>       }
> Problem:
> Currently the parsers are configured by plugin ID, but one plugin can have 
> several extensions, so a plugin can normally provide several parsers. This 
> is not a limitation as such; the wrong values are simply used in the 
> configuration process. 
> Solution:
> Change plugin ID to extension ID in the parser configuration file, and also 
> change the code in the parser factory to use extension IDs everywhere.
> Problem:
> There is no clear distinction between content type and MIME type. 
> I noticed that some plugins call metaData.get("Content-Type") or 
> content.getContentType().
> In theory these can return different values, since the content type could 
> be detected by the MimeTypes util and need not be the same as the one 
> delivered in the HTTP response header.
> As mentioned, the content type is currently only detected by the MimeTypes 
> util if the header does not contain any content type information, or if 
> there were problems with mixed-case keys.
> Solution:
> Take the content type property out of the metadata and restrict access to 
> it to its own getter method.
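One way to picture this separation is the sketch below. ContentSketch is a hypothetical class and not the Nutch Content class; the case-insensitive metadata map is likewise an assumption of this sketch rather than something the issue prescribes.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class for illustration; not the Nutch Content class.
class ContentSketch {
    private final String contentType;
    private final Map<String, String> metadata;

    ContentSketch(String contentType, Map<String, String> headers) {
        this.contentType = (contentType == null) ? "" : contentType;
        // Copy the headers into a case-insensitive map and drop the content
        // type entry, so the getter is the only way to read it.
        Map<String, String> copy = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        copy.putAll(headers);
        copy.remove("Content-Type");
        this.metadata = Collections.unmodifiableMap(copy);
    }

    // Single access point for the content type.
    String getContentType() {
        return contentType;
    }

    // Remaining metadata, without any content type entry.
    Map<String, String> getMetadata() {
        return metadata;
    }
}
```

With this shape, a plugin can no longer read a second, possibly different content type from the metadata.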
> Problem:
> Most protocol plugins check whether the content type is null, and only in 
> that case is the MimeTypes util used. Since my patch moves MIME type 
> detection to the parser factory, which from my point of view is the right 
> place, this is now unnecessary code that we can remove from the protocol 
> plugins. I never found a case where no content type was returned, only 
> cases where mixed-case keys were used. 
> Solution:
> Remove this detection code, since it is now in the parser factory.
> I didn't change this because the more code I change, the smaller the chance, 
> I guess, of getting the patch into the sources. I suggest we open a 
> low-priority issue, and once we change the plugins we can remove it.
> Problem:
> This is not a problem but a 'code smell' (Martin Fowler): there are empty 
> test methods in TestMimeType.
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
