[ 
http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359610 ] 

Stefan Groschupf commented on NUTCH-133:
----------------------------------------

Jerome: 
Since 3 months or so url extentions and also magic content type detection is 
never used. I suggest to assign this patch asap since many people are working 
with the latest sources nightly builds, we have a kind of responsibility - so 
we should focus on this issue today. 
I also suggest, we should make a maintenance release for the 0.7 asap,  since 
this problem is also part of the latest release. The solution today took us 2 
steps back, because the solution used before first also used url extensions, 
secondly tried all parsers in the list and took the one that was able to parse 
the content. Today we just use the content type returned by the servers (this 
is many many times wrong) and we use just one parser. 
So today there is just no content type detection used at all. 
If you can improve and work on MimeTypes.java that would be great. I see many 
space for improvements. 
For example in case just the char encoding is different than utf8  the mime 
type util detects a html page as gzip application. 
I didn't change anything on your MimeType classes, so you can  straight forward 
improve it and use it with my new ParserFactory.
So pleace focus on this  MimeTypes.java improvements! If you have a version, 
please put it as patch to the jira so all developer can review it before you 
commit it to the svn. 


Chris: 
Thanks for taking the time to take a look. 
First, I only rewrote classes where it was absolutely necessary, also you will 
notice that only rewritten classes has a new author (my IDE  sets the author 
name until create a new File). Secondly only classes that were unused were 
removed. The code does exactly what you guys had described, what a 
parserfactory should do and not just one part of  your concept was changed, now 
it just works and may has less lines of  code. 

The REAL bug is that by today i'm able to parse less than 80 % of pages with my 
testset in a intranet project with a wrong configured ISS we was not able to 
parse anything at all, so don't mix things - this is a serious blocker bug! 
First we talk about that by today the parser selction do not use any kind of 
detection mechanism and than we talk also about improvements.
Regarding your last comment, using a extension id in you parser configuation 
file instead of a plugin id would give you the chance to have a order of 
parsers, also if the parsers ships inside the same plugin, since extension ids 
are unique in one plugin and all plugins.

> ParserFactory does not work as expected
> ---------------------------------------
>
>          Key: NUTCH-133
>          URL: http://issues.apache.org/jira/browse/NUTCH-133
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, 
> Parserutil_test_patch.txt
>
> Marcel Schnippe detect a set of problems until working with different content 
> and parser types, we worked together to identify the problem source.
> From our point of view this described problems could be the source for many 
> other problems daily described in the mailing lists.
> Find a conclusion of the problems below.
> Problem:
> Some servers returns mixed case but correct header keys like 'Content-type' 
> or 'content-Length'  in the http response header.
> That's why for example a get("Content-Type") fails and a page is detected as 
> zip using the magic content type detection mechanism. 
> Also we note that this a common reason why pdf parsing fails since 
> Content-Length does return the correct value. 
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First just write only lower case keys into the properties and later convert 
> all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower cases here and for all keys used to query header properties (also 
> content-length) change:  String key = line.substring(0, colonIndex); to  
> String key = line.substring(0, colonIndex) .toLowerCase();
> Problem:
> MimeTypes based discovery (magic and url based) is only done in case the 
> content type was not delivered by the web server, this happens not that 
> often, mostly this was a problem with mixed case keys in the header.
> see:
>  public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>           contentType = type.getName();
>       } else {
>           contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as it is from the webserver and move the 
> content type discovering from Protocol plugins to the Component where the 
> parsing is done - to the ParseFactory.
> Than just create a list of parsers for the content type returned by the 
> server and the custom detected content type. In the end we can iterate over 
> all parser until we got a successfully parsed status.
> Problem:
> Content will be parsed also if the protocol reports a exception and has a non 
> successful status, in such a case the content is new byte[0] in any case.
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing ) { .. to 
>  if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>        // TODO we may should not write out here emthy parse text and parse 
> date, i suggest give outputpage a parameter parsed true / false
>           outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                 content, new ParseText(""),
>                 new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new 
> Outlink[0], new Properties()));
>         return null;
>       }
> Problem:
> Actually the configuration of parser is done based on plugin id's, but one 
> plugin can have several extentions, so  normally a plugin can provide several 
> parser, but this is no limited just wrong values are used in the 
> configuration process. 
> Solution:
> Change plugin id to  extension id in the parser configuration file and also 
> change this code in the parser factory to use extension id's everywhere.
> Problem:
> there is not a clear differentiation between content type and mime type. 
> I'm notice that some plugins call metaData.get("Content-Type) or 
> content.getContentType();
> Actually in theory this can return different values, since the content type 
> could be detected by the MimesTypes util and is not the same as delivered in 
> the http response header.
> As mentioned actually content type is only detected by the MimeTypes util in 
> case the header does not contains any content type informations or had 
> problems with mixed case keys.
> Solution:
> Take the content type property out of the meta data and clearly restrict the 
> access of this meta data into the own getter method.
> Problem:
> Most protocol plugins  checking if content type is null only in this case the 
> MimeTypes util is used. Since my patch move the mime type detection to the 
> parser factory - where from my point of view - is the right place, it is now 
> unneccary code we can remove from the protocol plugins. I never found a case 
> where no content type was returned just mixed case keys was used. 
> Solution. 
> Remove this detection code, since it is now in the parser factory.
> I didn't change this since more code I  change, I guess there is a  less 
> chance to get the patch into the sources, I suggest we open a low priority 
> issue and once we change the plugins we can remove it.
> Problem:
> This is not a problem, but a 'code smells' (Martin Fowler) There are empty 
> test methods in TestMimeType
>   /** Test of <code>getExtensions</code> method. */
>     public void testGetExtensions() {
>     }
> Solution:
> Implement these tests or remove the test methods.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply via email to