[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ]
Stefan Groschupf updated NUTCH-133:
-----------------------------------
Attachment: Parserutil_test_patch.txt
A test that reproduces most of the problems. For a real-world sample URL,
see the summary in the issue description.
> ParserFactory does not work as expected
> ---------------------------------------
>
> Key: NUTCH-133
> URL: http://issues.apache.org/jira/browse/NUTCH-133
> Project: Nutch
> Type: Bug
> Versions: 0.8-dev, 0.7.1, 0.7.2-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
> Attachments: Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different
> content and parser types; we worked together to identify their source.
> From our point of view, the problems described here could be the source of
> many other problems reported daily on the mailing lists.
> A summary of the problems follows.
> Problem:
> Some servers return mixed-case but correct header keys like 'Content-type'
> or 'content-Length' in the HTTP response header.
> That is why, for example, a get("Content-Type") fails and a page is detected
> as zip by the magic content type detection mechanism.
> We also note that this is a common reason why PDF parsing fails, since the
> Content-Length lookup does not return the correct value.
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, then convert all keys
> that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> Use lower case here and for all keys used to query header properties
> (including content-length). Change:
>   String key = line.substring(0, colonIndex);
> to:
>   String key = line.substring(0, colonIndex).toLowerCase();
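The lower-casing idea can be illustrated outside of Nutch with a minimal
sketch. HeaderKeySketch, addLine and get are hypothetical names and not part
of HttpResponse.java; Locale.ROOT is an addition here so that the case
mapping does not depend on the JVM's default locale.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not Nutch code: header keys are stored in lower case
// and lookup keys are lower-cased too, so the server's casing is irrelevant.
class HeaderKeySketch {
  private final Map<String, String> headers = new LinkedHashMap<>();

  // Parses one raw header line such as "Content-type: text/html".
  void addLine(String line) {
    int colonIndex = line.indexOf(':');
    if (colonIndex < 0) {
      return; // not a "key: value" line, ignore it
    }
    String key = line.substring(0, colonIndex).trim().toLowerCase(Locale.ROOT);
    String value = line.substring(colonIndex + 1).trim();
    headers.put(key, value);
  }

  // Looks up a header value regardless of the casing of the requested key.
  String get(String key) {
    return headers.get(key.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    HeaderKeySketch h = new HeaderKeySketch();
    h.addLine("Content-type: application/PDF");
    h.addLine("content-Length: 1024");
    System.out.println(h.get("Content-Type"));
    System.out.println(h.get("Content-Length"));
  }
}
```

Only the keys are normalized; values such as "application/PDF" are stored as
sent and would still need a case-insensitive comparison.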
> Problem:
> MimeTypes-based discovery (magic and URL based) is only done when the
> content type was not delivered by the web server. This does not happen
> often; in most cases the real cause was mixed-case keys in the header.
> See:
>   public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>         contentType = type.getName();
>       } else {
>         contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content type exactly as delivered by the web server, and move
> content type discovery from the protocol plugins to the component where the
> parsing is done: the ParserFactory.
> Then create a list of parsers for the content type returned by the server
> and for the custom detected content type. Finally, iterate over all parsers
> until one of them returns a successful parse status.
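The "list of parsers, first success wins" idea could look roughly like the
sketch below. It is not the ParserFactory API: ParserChainSketch, its Parser
interface and the registry map are hypothetical stand-ins, and the registry
is assumed to use lower-case content types as keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the Nutch ParserFactory API.
class ParserChainSketch {

  // Stand-in for a parser: returns true when parsing succeeded.
  interface Parser {
    boolean parse(byte[] content);
  }

  // Collects parsers for the server-reported type first, then for the
  // detected type, keeping the order and dropping duplicates.
  static List<Parser> candidates(String serverType, String detectedType,
      Map<String, List<Parser>> registry) {
    Set<Parser> ordered = new LinkedHashSet<>();
    for (String type : new String[] {serverType, detectedType}) {
      if (type != null) {
        ordered.addAll(registry.getOrDefault(
            type.toLowerCase(Locale.ROOT), Collections.emptyList()));
      }
    }
    return new ArrayList<>(ordered);
  }

  // Tries each candidate in order and stops at the first success.
  static boolean parseWithFirstSuccess(List<Parser> parsers, byte[] content) {
    for (Parser parser : parsers) {
      if (parser.parse(content)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Parser zipParser = content -> false;  // pretends the zip parse failed
    Parser htmlParser = content -> true;  // pretends the HTML parse worked
    Map<String, List<Parser>> registry = new HashMap<>();
    registry.put("application/zip", Arrays.asList(zipParser));
    registry.put("text/html", Arrays.asList(htmlParser));

    List<Parser> parsers =
        candidates("application/ZIP", "text/html", registry);
    System.out.println(parseWithFirstSuccess(parsers, new byte[0]));
  }
}
```

Putting the server-reported type first is a choice made for this sketch; the
description above does not say which of the two types should be tried first.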
> Problem:
> Content is also parsed if the protocol reports an exception and has a
> non-successful status; in that case the content is always new byte[0].
> Solution:
> Fetcher.java, line 243.
> Change:
>   if (!Fetcher.this.parsing) { ..
> to:
>   if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>     // TODO: we probably should not write out empty parse text and parse
>     // data here; I suggest giving outputPage a parameter parsed true / false
>     outputPage(new FetcherOutput(fle, hash, protocolStatus),
>         content, new ParseText(""),
>         new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "",
>             new Outlink[0], new Properties()));
>     return null;
>   }
> Problem:
> Currently the parsers are configured by plugin IDs, but one plugin can have
> several extensions, so a plugin can normally provide several parsers.
> Nothing prevents this; the configuration process simply uses the wrong
> values.
> Solution:
> Change plugin ID to extension ID in the parser configuration file, and
> change the code in the parser factory to use extension IDs everywhere.
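To illustrate why the extension ID is the more precise key, here is a small
sketch. Every name in it is made up for the example: ExtensionIdSketch, the
plugin "parse-office" and its two extension IDs do not come from the Nutch
sources or from the parser configuration file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every ID and content type below is made up.
class ExtensionIdSketch {
  // Content type -> ordered extension IDs. One plugin may contribute several
  // extensions, so a plugin ID alone cannot name a single parser.
  private final Map<String, List<String>> byContentType = new HashMap<>();

  void register(String contentType, String... extensionIds) {
    byContentType.put(contentType, Arrays.asList(extensionIds));
  }

  // Returns the configured extension IDs, or an empty list if none exist.
  List<String> extensionsFor(String contentType) {
    return byContentType.getOrDefault(contentType, Collections.emptyList());
  }

  public static void main(String[] args) {
    ExtensionIdSketch config = new ExtensionIdSketch();
    // Two parsers from the same imaginary plugin "parse-office":
    config.register("application/msword", "parse-office.WordParser");
    config.register("application/vnd.ms-excel", "parse-office.ExcelParser");
    System.out.println(config.extensionsFor("application/msword"));
  }
}
```

With the plugin ID "parse-office" as the configured value, both content types
would point at the same entry and the factory could not tell the two parsers
apart.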
> Problem:
> There is no clear differentiation between content type and MIME type.
> I noticed that some plugins call metaData.get("Content-Type") while others
> call content.getContentType().
> In theory these can return different values, since the content type could
> be detected by the MimeTypes util and then differ from the one delivered in
> the HTTP response header.
> As mentioned, the content type is currently only detected by the MimeTypes
> util when the header does not contain any content type information or when
> there were problems with mixed-case keys.
> Solution:
> Take the content type property out of the metadata and restrict access to
> it to its own getter method.
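A minimal sketch of that separation, assuming a simplified stand-in for the
Content class: ContentSketch and getMetadata are hypothetical names, and the
real constructor takes more arguments than shown here.

```java
import java.util.Properties;

// Hypothetical sketch, not the Nutch Content class.
class ContentSketch {
  private final String contentType;
  private final Properties metadata = new Properties();

  ContentSketch(String contentType, Properties headers) {
    this.contentType = contentType;
    for (String name : headers.stringPropertyNames()) {
      // Copy every header except the content type, in any casing, so the
      // getter below is the only way to read the content type.
      if (!"content-type".equalsIgnoreCase(name)) {
        metadata.setProperty(name, headers.getProperty(name));
      }
    }
  }

  // The single source of truth for the content type.
  String getContentType() {
    return contentType;
  }

  // Returns any other header value, or null if it is not present.
  String getMetadata(String name) {
    return metadata.getProperty(name);
  }

  public static void main(String[] args) {
    Properties headers = new Properties();
    headers.setProperty("Content-type", "text/html");
    headers.setProperty("Server", "Apache");
    ContentSketch content = new ContentSketch("application/pdf", headers);
    System.out.println(content.getContentType());
    System.out.println(content.getMetadata("Content-type"));
  }
}
```

Because the header copy is dropped, a plugin can no longer read one value
from the metadata while another plugin reads a different one from the getter.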
> Problem:
> Most protocol plugins check whether the content type is null, and only in
> that case is the MimeTypes util used. Since my patch moves MIME type
> detection to the parser factory, which from my point of view is the right
> place, this is now unnecessary code that we can remove from the protocol
> plugins. I never found a case where no content type was returned; only
> mixed-case keys were used.
> Solution:
> Remove this detection code, since it is now in the parser factory.
> I did not change this myself: the more code I change, I guess, the smaller
> the chance of getting the patch into the sources. I suggest we open a
> low-priority issue and remove the code once we change the plugins.
> Problem:
> This is not a problem but a 'code smell' (Martin Fowler): there are empty
> test methods in TestMimeType:
>   /** Test of <code>getExtensions</code> method. */
>   public void testGetExtensions() {
>   }
> Solution:
> Implement these tests or remove the test methods.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira