Re: Nutch doesn't dive deeper

sami siren Sun, 27 Aug 2006 12:56:55 -0700

2006/8/27, Chris Mattmann <[EMAIL PROTECTED]>:


Hi Sami,

  I'm not sure that I agree that the entire set of mime types that you list
below should be removed from the parse-plugins.xml default mapping. For
instance, if you look at the current mapping file, many of the types below
would have no other option for parsing them besides the TextParser. I think
it makes a lot of sense to parse some of the below documents with the
TextParser because, in fact, they are text documents.




A LaTeX document is a plain text document.



Yes, it can contain textual content among other things. However, without
proper parsing the outcome is (at least parts of it) not something I would
like to see in search results.

Text/css is essentially a plain text document.


Yes, the contents are most often ASCII, but is it really something one wants
to index by default?


An rfc822 message is indeed (stripped of headers), a plain text document.



Yes, the contents are most often ASCII, but I guess they are just as often
encoded (for example, MIME) so as to be more or less useless in unparsed form.

  There's a careful tradeoff that must be made in terms of having a default
config file that allows the greatest coverage of mime types that are
available, and the handling of them with at least *one* parser, in
contrast to not including any parser at all for a particular mime type. I
struggled with this very issue when I initially created that file and what
you see in there now represents a "best guess" of mime types mapped to the
available parsers that exist in Nutch. The other option on that file is that
people can modify it on their own. For instance, in a domain-specific
deployment, a user can add and remove whatever mime type to plugin mappings
she wants from the parse-plugins.xml file: it was never meant to be
something that was "set in stone" per se. It would be good to see some
experiments to see what the best config set for parse-plugins.xml is.




My opinion is that we should not pretend to be able to parse something
when we really cannot. We should give a default config that covers the
greatest set of mime types Nutch really can handle. Then again, those two
text document types you picked out are quite rare and not mainstream, and
enabling or disabling them probably doesn't make any real difference in
search results.

--
Sami Siren
