> It's a good point Markus. I would imagine that we would wish to do one
> of two things
> 
> 1) Create a parser to fetch the contentType in question (not the aim
> of Nutch but geared more towards Tika contribution...)
> 2) As you mention, use a parser implementation which stores this
> contentType as false for parsing, i.e. skips this contentType when it is
> encountered again. However, are we not able to achieve this through use
> of a urlfilter which denies the .x-flv suffix?

Indeed, the question is more about the state of the CrawlDB. I think the type 
should still be stored because it is valuable information if one decides to 
parse that type later.

I wonder if a db_gone status would be more appropriate in such a case. We 
cannot filter all URLs by using the suffix filter because sometimes URLs 
just don't have an extension at all but can be of any format.
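
To make the suffix problem concrete, here is a minimal sketch in Java. It is not Nutch's own suffix filter code: the class SuffixFilterSketch, the accept method, the denied ".flv" suffix and the example.org URLs are all assumptions made up for illustration.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of suffix-based URL filtering, not a Nutch plugin.
class SuffixFilterSketch {
    // Assumed deny list: file extensions we do not want to fetch.
    static final List<String> DENIED_SUFFIXES = List.of(".flv");

    // Returns true when the URL passes the filter, i.e. it would be fetched.
    static boolean accept(String url) {
        // Drop the query string and fragment so only the path is compared.
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        for (String suffix : DENIED_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Rejected: the extension gives the format away.
        System.out.println(accept("http://example.org/media/clip.flv"));
        // Accepted: no extension, although the server may still answer
        // with Content-Type video/x-flv.
        System.out.println(accept("http://example.org/media/stream?id=42"));
    }
}
```

The second URL passes the filter, so the content type is only known after the fetch, which is why the CrawlDB state still matters.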

Also, what would the signature of an unparsed file be (sorry, can't check 
right now)? It must not change, and it must not make the fetch scheduler think 
the URL has to be fetched sooner than its interval.
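
A rough sketch of the property in question, assuming the signature is a hash over the raw fetched bytes (which signature Nutch actually computes for unparsed content is not verified here). SignatureSketch, signature and nextInterval are hypothetical names, and the doubling/halving rule only stands in for an adaptive fetch schedule.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch: a content signature that does not depend on parsing.
class SignatureSketch {
    // Hash the raw bytes, so a missing parser cannot influence the result.
    static byte[] signature(byte[] rawContent) {
        try {
            return MessageDigest.getInstance("MD5").digest(rawContent);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Toy adaptive rule (an assumption): an unchanged signature doubles the
    // interval, a changed one halves it, never going below 1.
    static long nextInterval(long interval, byte[] oldSig, byte[] newSig) {
        boolean unchanged = Arrays.equals(oldSig, newSig);
        return unchanged ? interval * 2 : Math.max(1, interval / 2);
    }

    public static void main(String[] args) {
        byte[] video = {0x46, 0x4c, 0x56, 0x01}; // stand-in for fetched bytes
        byte[] before = signature(video);
        byte[] after = signature(video.clone());
        // Same bytes give the same signature, so the interval grows: 60
        System.out.println(nextInterval(30, before, after));
    }
}
```

If the signature of an unparsed item were instead derived from something unstable, every fetch would look like a change and the interval would keep shrinking.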

> 
> On Tue, Jan 3, 2012 at 5:18 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> > Hi,
> > 
> > Right now the state of the crawldb is set to success for items without a
> > parser that throw:
> > 
> > Exception in thread "main" org.apache.nutch.parse.ParseException: parser
> > not found for contentType=video/x-flv url=
> >        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
> >        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
> > 
> > Should we do that at all? It doesn't seem right. I, for instance, am not
> > interested in retrying such a URL again for a very long time.
> > 
> > Thoughts?
> > Thanks
