Re: What to do with items for which is no parser?

Lewis John Mcgibbney Tue, 03 Jan 2012 13:43:47 -0800

It's a good point, Markus. I would imagine that we would want to do one
of two things:

1) Create a parser for the contentType in question (not the aim
of Nutch itself, but geared more towards a Tika contribution...)
2) As you mention, use a parser implementation which flags this
contentType as not parseable, i.e. skips this contentType when it is
encountered again. However, can we not achieve this with a urlfilter
which denies the .flv suffix?
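
As a rough sketch of the urlfilter idea, assuming the regex urlfilter
plugin (urlfilter-regex) is enabled and the videos are served under the
usual .flv suffix, a deny rule in conf/regex-urlfilter.txt might look
like the following. The pattern is my own illustration and has not been
tested against this crawl:

```
# Hypothetical rule (an assumption, not from the original thread):
# skip Flash video URLs by suffix, both case variants listed explicitly
-\.(flv|FLV)$

# accept anything else
+.
```

Rules are tried in order and the first match decides, so the deny line
has to come before the catch-all "+." rule. Note that this only catches
URLs that actually end in the suffix; a video/x-flv response served from
a URL without it (or with a query string after it) would still reach the
parser and raise the same exception.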

On Tue, Jan 3, 2012 at 5:18 PM, Markus Jelsma
<[email protected]> wrote:
> Hi,
>
> Right now the state of the crawldb is set to success for items without a
> parser that throw:
>
> Exception in thread "main" org.apache.nutch.parse.ParseException: parser not
> found for contentType=video/x-flv url=
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
>        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
>
> Should we do that at all? It doesn't seem right. I, for instance, am not
> interested in retrying such a URL again for a very long time.
>
> Thoughts?
> Thanks

-- 
Lewis
