[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ]
Stefan Groschupf updated NUTCH-133:
-----------------------------------

    Attachment: Parserutil_test_patch.txt

A test that reproduces most of the problems; see a real-world sample URL in the conclusion above.

> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different
> content and parser types; we worked together to identify the source of the
> problems.
> From our point of view, the problems described here could be the source of
> many other problems reported daily on the mailing lists.
> Find a conclusion of the problems below.
> Problem:
> Some servers return mixed-case but correct header keys like 'Content-type'
> or 'content-Length' in the HTTP response header.
> That's why, for example, a get("Content-Type") fails and a page is detected
> as zip by the magic content type detection mechanism.
> We also note that this is a common reason why PDF parsing fails, since
> Content-Length does not return the correct value.
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, and then convert
> all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower case here and for all keys used to query header properties (also
> content-length). Change:
>   String key = line.substring(0, colonIndex);
> to
>   String key = line.substring(0, colonIndex).toLowerCase();
> Problem:
> MimeTypes-based discovery (magic and URL based) is only done when the
> content type was not delivered by the web server. This does not happen that
> often; mostly it was a problem with mixed-case keys in the header.
> see:
>   public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>         contentType = type.getName();
>       } else {
>         contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as it comes from the web server and move
> the content type discovery from the Protocol plugins to the component where
> the parsing is done: the ParserFactory.
> Then just create a list of parsers for the content type returned by the
> server and the custom detected content type. In the end we can iterate over
> all parsers until we get a successfully parsed status.
> Problem:
> Content will also be parsed if the protocol reports an exception and has a
> non-successful status; in such a case the content is always new byte[0].
> Solution:
> Fetcher.java, line 243.
> Change:
>   if (!Fetcher.this.parsing) { ..
> to
>   if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>     // TODO: maybe we should not write out empty parse text and parse
>     // data here; I suggest giving outputPage a parameter parsed true / false
>     outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                content, new ParseText(""),
>                new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "",
>                              new Outlink[0], new Properties()));
>     return null;
>   }
> Problem:
> Currently the configuration of parsers is done based on plugin ids, but one
> plugin can have several extensions, so normally a plugin can provide several
> parsers. This is not a limitation; it is just that the wrong values are used
> in the configuration process.
> Solution:
> Change plugin id to extension id in the parser configuration file and also
> change the code in the parser factory to use extension ids everywhere.
> Problem:
> There is no clear differentiation between content type and mime type.
> I noticed that some plugins call metaData.get("Content-Type") and others
> content.getContentType().
> In theory these can return different values, since the content type could
> be detected by the MimeTypes util and need not be the same as the one
> delivered in the HTTP response header.
> As mentioned, the content type is currently only detected by the MimeTypes
> util when the header does not contain any content type information or had
> problems with mixed-case keys.
> Solution:
> Take the content type property out of the metadata and clearly restrict
> access to it to its own getter method.
> Problem:
> Most protocol plugins check whether the content type is null, and only in
> that case is the MimeTypes util used. Since my patch moves the mime type
> detection to the parser factory, which from my point of view is the right
> place, this is now unnecessary code that we can remove from the protocol
> plugins. I never found a case where no content type was returned; there were
> just mixed-case keys in use.
> Solution:
> Remove this detection code, since it is now in the parser factory.
> I didn't change this, since the more code I change, the smaller I guess the
> chance of getting the patch into the sources. I suggest we open a
> low-priority issue, and once we change the plugins we can remove it.
> Problem:
> This is not a problem, but a 'code smell' (Martin Fowler): there are empty
> test methods in TestMimeType.
>   /** Test of <code>getExtensions</code> method. */
>   public void testGetExtensions() {
>   }
> Solution:
> Implement these tests or remove the test methods.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
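
The header-key fix from the first Problem/Solution pair (lower-case the key when it is stored, and again when it is looked up) could look roughly like the sketch below. This is only an illustration under assumptions: the class LowerCaseHeaders and its methods addLine and get are hypothetical names and are not part of Nutch or of the attached patch. Locale.ROOT is passed to toLowerCase so that the result does not depend on the JVM's default locale.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: keeps HTTP header values under
// lower-cased keys so that lookups ignore the case sent by the server.
class LowerCaseHeaders {
    private final Map<String, String> values = new HashMap<>();

    // Locale.ROOT avoids locale-specific case mappings (e.g. Turkish dotless i).
    private static String normalize(String key) {
        return key.trim().toLowerCase(Locale.ROOT);
    }

    // Parses one raw header line such as "Content-type: text/HTML".
    // Lines without a colon are ignored.
    void addLine(String line) {
        int colonIndex = line.indexOf(':');
        if (colonIndex < 0) {
            return;
        }
        String key = normalize(line.substring(0, colonIndex));
        String value = line.substring(colonIndex + 1).trim();
        values.put(key, value);
    }

    // Returns the stored value for the key in any case, or null if absent.
    String get(String key) {
        return values.get(normalize(key));
    }

    public static void main(String[] args) {
        LowerCaseHeaders headers = new LowerCaseHeaders();
        headers.addLine("Content-type: application/PDF");
        headers.addLine("content-Length: 1024");
        System.out.println(headers.get("Content-Type"));
        System.out.println(headers.get("CONTENT-LENGTH"));
    }
}
```

Normalizing in one private method used by both the write path and the read path keeps the two sides from drifting apart, which is the failure mode described in the issue.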