2006/8/27, Chris Mattmann <[EMAIL PROTECTED]>:
> Hi Sami,
>
> I'm not sure I agree that the entire set of mime types you list below should be removed from the default parse-plugins.xml mapping. For instance, if you look at the current mapping file, many of the types below would have no other option for parsing them besides the TextParser. I think it makes a lot of sense to parse some of the documents below with the TextParser because they are, in fact, text documents. A LaTeX document is a plain text document.

Yes, it can contain textual content among other things. However, without proper parsing, the outcome is (at least parts of it) not something I would like to see in search results.

> Text/css is essentially a plain text document.

Yes, the contents are most often ASCII, but is it really something one wants to index by default?

> An rfc822 message is indeed (stripped of headers) a plain text document.

Yes, the contents are most often ASCII, but I guess they are just as often encoded (for example, MIME) in a way that makes them more or less useless in unparsed form.

> There's a careful tradeoff that must be made in terms of having a default config file that allows the greatest coverage of the mime types that are available, and the handling of them with at least *one* parser, in contrast to not including any parser at all for a particular mime type. I struggled with this very issue when I initially created that file, and what you see in there now represents a "best guess" of mime types mapped to the available parsers that exist in Nutch.
>
> The other option with that file is that people can modify it on their own. For instance, in a domain-specific deployment, a user can add or remove whatever mime-type-to-plugin mappings she wants in the parse-plugins.xml file: it was never meant to be something that was "set in stone" per se. It would be good to see some experiments to find out what the best config set for parse-plugins.xml is.

My opinion is that we should not pretend to be able to parse something when we really cannot. We should provide a default config that covers the greatest set of mime types Nutch really can handle. Then again, those two text types of documents you picked out are quite rare and not mainstream, and enabling or disabling them probably doesn't make any real difference in search results.

--
Sami Siren
