> It's a good point Markus. I would imagine that we would wish to do one
> of two things
>
> 1) Create a parser to fetch the contentType in question (not the aim
> of Nutch but geared more towards Tika contribution...)
> 2) As you mention, use a parser implementation which stores this
> contentType as false for parsing e.g. skip this contentType when it is
> encountered again. However are we not able to achieve this through use
> of an urlfilter which denies the .x-flv suffix?

Indeed, the question is more about the state of the CrawlDB. I think the type
should still be stored because it is valuable information if one decides to
parse that type later. I wonder if a db_gone status would be more appropriate
in such a case.

We cannot filter all URLs by using the suffix filter because sometimes URLs
just don't have an extension at all but can be of any format.

Also, what would the signature be of an unparsed file (sorry, can't check right
now)? It must not change, or otherwise lead the fetch scheduler to think the
URL must be fetched sooner than the interval.

> > On Tue, Jan 3, 2012 at 5:18 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> > Hi,
> >
> > Right now the state of the crawldb is set to success for items without a
> > parser that throw:
> >
> > Exception in thread "main" org.apache.nutch.parse.ParseException: parser
> > not found for contentType=video/x-flv url=
> >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
> >     at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
> >
> > Should we do that at all? It doesn't seem right. I, for instance, am not
> > interested in retrying such an URL again for a very long time.
> >
> > Thoughts?
> >
> > Thanks
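
To illustrate the point above about suffix filtering, here is a minimal standalone sketch of why a suffix-based deny rule cannot catch every video/x-flv document. It does not use the Nutch API: the class name, the deny list and the example URLs are all hypothetical, and only the convention of returning null for a rejected URL is borrowed from Nutch's URLFilter. A URL without an extension passes the filter regardless of the content type the server eventually returns.

```java
import java.net.URI;
import java.util.Locale;

public class SuffixFilterSketch {

    // Hypothetical deny list; not taken from any real Nutch configuration.
    private static final String[] DENIED_SUFFIXES = {".flv"};

    /**
     * Returns the URL unchanged if it is accepted, or null if it is rejected.
     * Only the path component is inspected, so query strings are ignored.
     */
    public static String filter(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return url; // nothing to match against, so accept
        }
        String lower = path.toLowerCase(Locale.ROOT);
        for (String suffix : DENIED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return null; // rejected by suffix
            }
        }
        return url; // accepted: no denied suffix found
    }

    public static void main(String[] args) {
        // Rejected: the path ends in .flv.
        System.out.println(filter("http://example.com/clip.flv"));
        // Accepted: no extension, even if the server later answers
        // with Content-Type: video/x-flv.
        System.out.println(filter("http://example.com/watch?id=42"));
    }
}
```

The second URL is accepted and would still be fetched, so the question of which CrawlDB status to record for unparseable content remains even with such a filter in place.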