[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359603 ]
Chris A. Mattmann commented on NUTCH-133:
-----------------------------------------

Just another comment on the issue. The reported "bug" is listed as follows:

    Problem:
    Actually the configuration of parser is done based on plugin id's, but one plugin can have several extensions, so normally a plugin can provide several parser, but this is no limited just wrong values are used in the configuration process.
    Solution:
    Change plugin id to extension id in the parser configuration file and also change this code in the parser factory to use extension id's everywhere.

In my mind, this is not really a "bug" at all. It's intended behavior. While it is true that we based the redesign of the ParserFactory in NUTCH-88 on having the "pluginId", rather than the "extensionId", used as the key to map parsing plugins to mimeTypes, and while it is also true that a particular pluginId may have several extensions, the way that we have implemented NUTCH-88 is by no means limiting.

Consider the following situation, in which I write a plugin "parse-foo", a "parsing only" plugin that provides several different parser implementations for handling the contentType "foo". Okay, so here's my question. Say I provided parser implementations A, B, C, and D. How do you know which one should get called in which order for the content type foo? Should A get called first, because it's first in the plugin.xml file? Should D get called last, because it is last? Neither of these is the only correct answer; they may be right to some people and not to others.

Thus, the situation that you describe as a "problem" was, in our eyes, never really a problem per se. It all has to do with the way that parsing plugins are implemented in Nutch. To me, parsing plugins are really a special class of plugins.
If you take a look at all the parse-xxx plugins in the $NUTCH_HOME/src/plugin directory, you see the following situation with respect to the parser implementations that each parsing plugin provides:

parse-ext - provides 2, however they are clearly separated based on the contentType (or mimeType)
parse-html - only provides 1
parse-js - only provides 1
parse-mp3 - only provides 1
parse-mspowerpoint - only provides 1
parse-msword - only provides 1
parse-pdf - only provides 1
parse-rss - only provides 1
parse-rtf - only provides 1
parse-text - only provides 1
parse-zip - only provides 1

Thus, all but one of the existing parser plugins provide only 1 parsing implementation. Furthermore, even in the case where a parsing plugin provides 2 implementations, as parse-ext does, the way that NUTCH-88 works right now is still able to deal with that situation, as long as the 2 parsing implementations are different classes and handle different "mimeTypes", or, as they are described in the plugin.xml, "contentTypes". (On the other hand, you can see why this wouldn't be an issue if both parsing implementations used the same class to handle different mimeTypes, as in the case of parse-ext.)

Say we encounter the mimeType "foo", and we have a parse-foo plugin that provides two parsing implementation classes (i.e., classes that implement the org.apache.nutch.parse.Parser extension point interface), A and B, which are different classes. Say that A handles the mimeType "foo2" and that B handles the mimeType "foo". Okay, then consider that in the parse-plugins.xml file we have mapped the mimeType "foo" to the "parse-foo" plugin.
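To make the parse-foo scenario concrete, here is a minimal sketch in Java. It is not the actual ParserFactory code from NUTCH-88: the ParserSelectionSketch class, the ParserExtension holder and the parsersFor method are hypothetical names made up for illustration, and the plain map only stands in for parse-plugins.xml. The sketch assumes the selection keeps only those extensions whose plugin is mapped to the mimeType and which themselves declare that mimeType.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; none of these names are Nutch's real classes.
class ParserSelectionSketch {

    // One Parser extension as it might be declared in a plugin's plugin.xml.
    static final class ParserExtension {
        final String pluginId;
        final String implClass;
        final String contentType;

        ParserExtension(String pluginId, String implClass, String contentType) {
            this.pluginId = pluginId;
            this.implClass = implClass;
            this.contentType = contentType;
        }
    }

    // Keep an extension only if (1) its plugin is mapped to the mimeType and
    // (2) the extension itself claims that mimeType. Results follow the order
    // of the mapped plugin list.
    static List<String> parsersFor(String mimeType,
                                   Map<String, List<String>> mimeToPlugins,
                                   List<ParserExtension> extensions) {
        List<String> result = new ArrayList<>();
        List<String> plugins =
            mimeToPlugins.getOrDefault(mimeType, Collections.<String>emptyList());
        for (String pluginId : plugins) {
            for (ParserExtension ext : extensions) {
                if (ext.pluginId.equals(pluginId) && ext.contentType.equals(mimeType)) {
                    result.add(ext.implClass);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // parse-foo provides A (claims "foo2") and B (claims "foo").
        List<ParserExtension> extensions = Arrays.asList(
            new ParserExtension("parse-foo", "A", "foo2"),
            new ParserExtension("parse-foo", "B", "foo"));

        // Stand-in for parse-plugins.xml: "foo" is mapped to the plugin id.
        Map<String, List<String>> mimeToPlugins = new HashMap<>();
        mimeToPlugins.put("foo", Arrays.asList("parse-foo"));

        System.out.println(parsersFor("foo", mimeToPlugins, extensions)); // prints [B]
    }
}
```

In this sketch the mapping is keyed on the plugin id, yet only B is selected for "foo", because A never claims that mimeType.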
Okay, so what happens now, after NUTCH-88, is this: when foo is encountered by the protocol plugins and the ParserFactory is then called to get a parser and parse the content returned from protocol land, the parser factory obtains a prioritized list of all the Parser implementation classes that belong to the parsing plugins mapped to the mimeType "foo" AND that claim they can handle "foo". So, even if parse-foo in our example provides the A and B parser implementations as a plugin, and even though we mapped the "plugin" "parse-foo" to the mimeType "foo" via parse-plugins.xml, the only parser implementation that will get returned is "B", because "B" is the only one that actually claims it can deal with "foo".

Thus, in my mind, NUTCH-88 still provides the as-intended behavior for dealing with the issue that you claim is a bug. Or am I missing something here?

Thanks,
Chris

> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different content
> and parser types; we worked together to identify the problem source.
> From our point of view, the problems described here could be the source of many
> other problems described daily on the mailing lists.
> Find a summary of the problems below.
> Problem:
> Some servers return mixed-case but correct header keys like 'Content-type'
> or 'content-Length' in the http response header.
> That's why, for example, a get("Content-Type") fails and a page is detected as
> zip using the magic content type detection mechanism.
> Also we note that this is a common reason why pdf parsing fails since
> Content-Length does return the correct value.
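The mixed-case header problem described here can be shown with a plain map. This is only a minimal sketch, not the HttpResponse code from Nutch: the HeaderCaseSketch class and its putHeader and getHeader methods are hypothetical names, and it assumes headers are kept in a simple Map. Keys are lower-cased when stored and again when queried.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only; this is not Nutch's HttpResponse class.
class HeaderCaseSketch {

    // Store a raw "Key: value" header line under a lower-cased key.
    static void putHeader(Map<String, String> headers, String line) {
        int colonIndex = line.indexOf(':');
        if (colonIndex < 0) {
            return; // not a header line, ignore it
        }
        String key = line.substring(0, colonIndex).trim().toLowerCase(Locale.ROOT);
        String value = line.substring(colonIndex + 1).trim();
        headers.put(key, value);
    }

    // Look a header up regardless of the case the caller uses.
    static String getHeader(Map<String, String> headers, String name) {
        return headers.get(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        putHeader(headers, "Content-type: application/pdf");
        putHeader(headers, "content-Length: 1024");

        // An exact-case lookup on the raw map misses the stored key.
        System.out.println(headers.get("Content-Type"));          // prints null
        // Lower-casing the query key as well finds the value.
        System.out.println(getHeader(headers, "Content-Type"));   // prints application/pdf
        System.out.println(getHeader(headers, "Content-Length")); // prints 1024
    }
}
```

The point of the sketch is that lower-casing has to happen on both sides: normalizing only the stored keys still breaks every caller that queries with "Content-Type".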
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, and later convert
> all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower case here and for all keys used to query header properties (also
> content-length). Change:
>   String key = line.substring(0, colonIndex);
> to
>   String key = line.substring(0, colonIndex).toLowerCase();
> Problem:
> MimeTypes-based discovery (magic and url based) is only done in case the
> content type was not delivered by the web server; this does not happen that
> often, and mostly it was a problem with mixed-case keys in the header.
> see:
> public Content toContent() {
>   String contentType = getHeader("Content-Type");
>   if (contentType == null) {
>     MimeType type = null;
>     if (MAGIC) {
>       type = MIME.getMimeType(orig, content);
>     } else {
>       type = MIME.getMimeType(orig);
>     }
>     if (type != null) {
>       contentType = type.getName();
>     } else {
>       contentType = "";
>     }
>   }
>   return new Content(orig, base, content, contentType, headers);
> }
> Solution:
> Use the content-type information as it comes from the webserver and move the
> content type discovery from the Protocol plugins to the component where the
> parsing is done: the ParserFactory.
> Then just create a list of parsers for the content type returned by the
> server and the custom detected content type. In the end we can iterate over
> all parsers until we get a successfully parsed status.
> Problem:
> Content will also be parsed if the protocol reports an exception and has a
> non-successful status; in such a case the content is always new byte[0].
> Solution:
> Fetcher.java, line 243.
> Change:
>   if (!Fetcher.this.parsing ) { ..
> to
>   if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>     // TODO maybe we should not write out empty parse text and parse data
>     // here; I suggest giving outputPage a parameter parsed true / false
>     outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                content, new ParseText(""),
>                new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "",
>                              new Outlink[0], new Properties()));
>     return null;
>   }
> Problem:
> Actually the configuration of parser is done based on plugin id's, but one
> plugin can have several extensions, so normally a plugin can provide several
> parser, but this is no limited just wrong values are used in the
> configuration process.
> Solution:
> Change plugin id to extension id in the parser configuration file and also
> change this code in the parser factory to use extension id's everywhere.
> Problem:
> There is no clear differentiation between content type and mime type.
> I notice that some plugins call metaData.get("Content-Type") or
> content.getContentType();
> Actually, in theory these can return different values, since the content type
> could be detected by the MimeTypes util and is then not the same as delivered
> in the http response header.
> As mentioned, the content type is actually only detected by the MimeTypes util
> in case the header does not contain any content type information or had
> problems with mixed-case keys.
> Solution:
> Take the content type property out of the meta data and clearly restrict the
> access to this meta data to its own getter method.
> Problem:
> Most protocol plugins check whether the content type is null; only in this
> case is the MimeTypes util used. Since my patch moves the mime type detection
> to the parser factory, which from my point of view is the right place, it is
> now unnecessary code we can remove from the protocol plugins. I never found a
> case where no content type was returned; just mixed-case keys were used.
> Solution:
> Remove this detection code, since it is now in the parser factory.
> I didn't change this since the more code I change, I guess, the less chance
> there is to get the patch into the sources. I suggest we open a low priority
> issue, and once we change the plugins we can remove it.
> Problem:
> This is not a problem, but a 'code smell' (Martin Fowler). There are empty
> test methods in TestMimeType:
>   /** Test of <code>getExtensions</code> method. */
>   public void testGetExtensions() {
>   }
> Solution:
> Implement these tests or remove the test methods.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
