Hi guys,

I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to Nutch 
1.14 and Solr 7.2, and I have come across a few serious issues, of which you 
should be aware:


1.       NUTCH-2071 is still an issue in 1.14, because the returned 
parseResult is never null. If a parser fails to parse a document, it returns an 
empty result rather than null. This means that, in a chain of parser 
candidates, only the first one ever gets a chance to parse the document.

2.       Nutch adopted Tika as a general parsing tool, and stopped supporting 
the "legacy" parsing (OO, MS) plugins. I continued using them and hoped to drop 
them in the next version of Arch, which I am preparing for release, but I still 
can't do it, because Tika fails to parse too many documents on our site. When 
I reinforce Tika with the legacy parsers, however, I achieve an almost 100% 
parsing success rate. This is why NUTCH-2071 is important for Arch. I think you 
should bring the legacy parsers back to Nutch, because the quality of parsing 
of "real life" data, such as ours, is not great without them.

3.       The lines defining the fall-back (*) plugin in parse-plugins.xml have 
no effect: they are ignored as long as at least one plugin claims * in its 
plugin.xml file. In some cases, Nutch assigns the * capability to plugins that 
don't even claim it. For example, I can't understand why the Arch content 
blocking plugin gets it.

4.       In earlier versions of Nutch, using the native libraries really 
helped. It reduced the crawl time for our site from a couple of days to 6-7 
hours. In Nutch 1.14, I don't see this effect. I've obtained the Hadoop 
libraries, placed them where they are expected, and even inserted an explicit 
load library call in my code, but I still don't notice any significant time 
savings.

5.       The Feed plugin seems to have a major problem. Line 102 in 
FeedIndexingFilter.java threw a NumberFormatException (which caused the 
failure of the entire crawling process!) because it was trying to parse a date 
string as a number. Given that this piece of metadata was generated by the feed 
parser (same plugin), it seems that the plugin is in disagreement with itself.

6.       This is less important, but when Tika fails to parse a document, it 
generates a scary error message and an ugly stack trace. I think this should be 
a one-line warning, because other parsers may still parse this document 
successfully.
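
To make issue 1 concrete, here is a minimal sketch of the fall-through 
behaviour I would expect from a parser chain. It is not Nutch code: the Result 
and Parser types and the parseWithFallback method are hypothetical stand-ins I 
made up for illustration. The only point is that an empty result is treated 
the same way as null, so the next candidate still gets its turn.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class ParserChainSketch {

    /** Hypothetical stand-in for a parse result; NOT the Nutch ParseResult class. */
    record Result(String parserName, String text) {
        boolean isEmpty() { return text == null || text.isEmpty(); }
    }

    /** Hypothetical parser: a name plus a function from raw content to a Result. */
    record Parser(String name, Function<String, Result> fn) {}

    /**
     * Tries each candidate in order and returns the first non-empty result.
     * A null result, an empty result and a thrown exception all count as
     * failure, so the next candidate in the chain is still consulted.
     */
    static Optional<Result> parseWithFallback(List<Parser> candidates, String content) {
        for (Parser p : candidates) {
            try {
                Result r = p.fn().apply(content);
                if (r != null && !r.isEmpty()) {
                    return Optional.of(r);
                }
            } catch (RuntimeException e) {
                // One-line note instead of a stack trace; a later parser may succeed.
                System.err.println("WARN " + p.name() + " failed: " + e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Parser> chain = List.of(
            new Parser("first", c -> new Result("first", "")),   // empty, not null
            new Parser("second", c -> new Result("second", c.toUpperCase())));
        // prints "second"
        System.out.println(
            parseWithFallback(chain, "doc").map(Result::parserName).orElse("none"));
    }
}
```

The catch block also shows the kind of one-line warning I have in mind for 
issue 6.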
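
For issue 5, a defensive approach could look roughly like the sketch below. I 
am assuming here that the indexing filter expects epoch milliseconds and that 
the feed parser stored an RFC 1123 style date string; both are guesses on my 
part, and the class and method names are hypothetical, not taken from the 
plugin. An unparseable value yields an empty Optional instead of an exception 
that kills the crawl.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

public class LenientDateSketch {

    /**
     * Accepts either epoch milliseconds (assumed numeric form) or an
     * RFC 1123 date string (assumed textual form). Returns empty when the
     * value matches neither, so the caller can skip the field.
     */
    static Optional<Instant> parseTimestamp(String raw) {
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        String s = raw.trim();
        try {
            return Optional.of(Instant.ofEpochMilli(Long.parseLong(s)));
        } catch (NumberFormatException notANumber) {
            // Not numeric; fall through and try the textual form.
        }
        try {
            return Optional.of(
                OffsetDateTime.parse(s, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant());
        } catch (DateTimeParseException notRfc1123) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("Tue, 3 Jun 2008 11:05:30 GMT"));
        System.out.println(parseTimestamp("not a date"));
    }
}
```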

Hope this helps.

Regards,

Arkadi
