Jérôme Charron wrote:
Hello Jon, and sorry for the late response,
I'd appreciate any thoughts. Perhaps something for parser policy. I've
traced the source code a bit and nothing jumped out at me...
There are some known issues with the parser policy (i.e.
ParserFactory), and we are actively working on them.
I don't understand why the parse-ext plugin is called in your case, when
it should be the parse-pdf or parse-html plugin.
Here's a workaround: if you don't need parse-ext (the plugin used to
perform parsing with external commands), simply remove it and all
should be ok.
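If you would rather not delete the plugin directory, a rough alternative is to leave parse-ext out of the plugin.includes property in conf/nutch-site.xml. This is only a sketch: it assumes your build reads plugin.includes, and the value below is an illustrative list rather than your actual configuration, so adjust it to the plugins you really use.

```xml
<!-- Sketch only: goes inside the root element of conf/nutch-site.xml. -->
<!-- The value is an assumed example; the point is that parse-ext is absent. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```

With a value like this, only the listed parse plugins are candidates, so parse-ext should no longer be picked up.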
Could you please send me your /usr/local/nutch/plugins/parse-ext/plugin.xml
file so that I can check whether something is wrong in it?
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
"should be ok" ... as in the content will be parsed correctly, or that we
will not see the error message? Lack of an error message does not mean
things are ok. :)
Pasted below is the file. This is from the release-0.7 build with
patches, as 0.7.1 is being prepared.
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-ext"
   name="External Parser Plug-in"
   version="1.0.0"
   provider-name="nutch.org">
   <runtime>
      <library name="parse-ext.jar">
         <export name="*"/>
      </library>
   </runtime>
   <extension id="org.apache.nutch.parse.ext"
              name="ExtParse"
              point="org.apache.nutch.parse.Parser">
      <implementation id="ExtParser"
                      class="org.apache.nutch.parse.ext.ExtParser"
                      contentType="application/vnd.nutch.example.cat"
                      pathSuffix=""
                      command="./build/plugins/parse-ext/command"
                      timeout="10"/>
      <implementation id="ExtParser"
                      class="org.apache.nutch.parse.ext.ExtParser"
                      contentType="application/vnd.nutch.example.md5sum"
                      pathSuffix=""
                      command="./build/plugins/parse-ext/command"
                      timeout="20"/>
   </extension>
</plugin>