Hello again,
I've added a printStackTrace() call at the point where the fetcher throws the exception:
java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:375)
    at org.apache.hadoop.io.Text.encode(Text.java:356)
    at org.apache.hadoop.io.Text.writeString(Text.java:396)
    at org.apache.nutch.parse.ParseData.write(ParseData.java:159)
    at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:55)
    at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:61)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:315)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:397)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:163)
Here is the snippet from the end of my code:

    // don't need those fields...
    String resultText = null;
    String resultTitle = null;
    Outlink[] outlinks = null;
    final ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
        resultTitle, outlinks, content.getMetadata());
    return new ParseImpl(resultText, parseData);
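
Judging from the trace, the NullPointerException seems to be raised inside Text.writeString while ParseData is being serialized, so the null title (and possibly the null text and outlinks) may be the cause. Below is a minimal, self-contained sketch of that failure mode. Note that writeString and orEmpty are hypothetical stand-ins written for illustration only; they are not the Hadoop or Nutch API, they just fail on null in a similar way:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class NullSafeParseFields {

    // Hypothetical stand-in for a length-prefixed string writer.
    // It is NOT Hadoop's Text.writeString; it only mimics the null failure.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // NPE when s == null
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Replace a null field with an empty string before it is serialized.
    static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    // Returns true if serializing the value throws a NullPointerException.
    static boolean failsToSerialize(String s) {
        try {
            writeString(new DataOutputStream(new ByteArrayOutputStream()), s);
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String resultTitle = null;
        // prints "null title fails: true"
        System.out.println("null title fails: " + failsToSerialize(resultTitle));
        // prints "empty title fails: false"
        System.out.println("empty title fails: " + failsToSerialize(orEmpty(resultTitle)));
    }
}
```

If that reading of the trace is right, passing "" for the title and text and an empty Outlink[] instead of null might be enough, but I have not verified this against nutch-0.9.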
On 10/18/07, eyal edri <[EMAIL PROTECTED]> wrote:
>
> I found out how to associate multiple content types with a single plugin:
>
> just add each content type to the conf/parse-plugins.xml file (a plugin
> can handle more than one type):
>
> <mimeType name="application/x-dosexec">
>   <plugin id="parse-exe" />
> </mimeType>
>
> On 10/18/07, eyal edri <[EMAIL PROTECTED]> wrote:
> >
> > Excellent !! :)
> >
> > that did the trick!
> >
> > Is there any chance of creating a new page on the plugin central for
> > writing a nutch-0.9 plugin, stating the checklist (written below)?
> > (I would have uploaded it myself, but I don't have the rights to create
> > a new page.)
> >
> > The checklist (relevant for a parse plugin implementing the Parser
> > extension point):
> >
> > 1. Create a new directory $NUTCH_HOME/src/plugin/parse-XXX
> > 2. Create a new $NUTCH_HOME/src/plugin/parse-XXX/plugin.xml
> > [displayed below]
> > 3. Create a new $NUTCH_HOME/src/plugin/parse-XXX/build.xml
> > [displayed below]
> > 4. Write the Java code in
> > $NUTCH_HOME/src/plugin/parse-XXX/src/java/org/apache/nutch/parse/XXX/XXXParser.java
> > 5. Add "nutch-extensionpoints" and "parse-XXX" to the
> > 'plugin.includes' property in $NUTCH_HOME/conf/nutch-site.xml
> > 6. Add code to $NUTCH_HOME/conf/parse-plugins.xml [written
> > below] (new mime type and alias)
> > 7. Add code to $NUTCH_HOME/src/plugin/build.xml [written
> > below]
> > 8. Copy $NUTCH_HOME/build/plugins/parse-XXX/parse-XXX.jar to
> > $NUTCH_HOME/plugins/parse-XXX
> > 9. Run ant (build successful)
> >
> > I've got a few more questions, just to tie up the loose ends:
> >
> > 1. The exe extension has several content types related to it (e.g.
> > application/(x-exe|x-msdos|x-msdownload|octet-stream)).
> > How can I configure parse-exe to capture all of them? (solved)
> > 2. I've noticed that after every build I need to copy
> > build/parse-exe/parse-exe.jar to plugins/parse-exe. Is there any way to
> > have it built directly into plugins/parse-exe?
> > 3. I get a NullPointerException from the fetcher after parse-exe runs.
> > Can you guide me on what I should return from parse-exe?
> >
> > The parse-exe plugin (the getParse function):
> >
> > public class ExeParser implements Parser {
> >   public static final Log LOG = LogFactory.getLog(ExeParser.class);
> >   private Configuration conf;
> >   public static final String DOWNLOAD_DIR =
> >       "/home/eyale/HTTPSEC/nutch-0.9/DOWNLOADS/";
> >
> >   public ExeParser() {
> >     LOG.info("EDRI:: created exe-parser object");
> >   }
> >
> >   public Parse getParse(Content content) {
> >     String resultText = null;
> >     String resultTitle = null;
> >     Outlink[] outlinks = null;
> >
> >     try {
> >       byte[] raw = content.getContent();
> >
> >       // enter here my code
> >
> >       String contentLength =
> >           content.getMetadata().get(Response.CONTENT_LENGTH);
> >       if (contentLength != null
> >           && raw.length != Integer.parseInt(contentLength)) {
> >         return new ParseStatus(ParseStatus.FAILED,
> >             ParseStatus.FAILED_TRUNCATED,
> >             "Content truncated at " + raw.length
> >             + " bytes. Parser can't handle incomplete exe file.")
> >             .getEmptyParse(getConf());
> >       }
> >       // download the file (private void function)
> >       downloadContentType(content);
> >
> >     } catch (Exception e) { // run time exception
> >       if (LOG.isWarnEnabled()) {
> >         LOG.warn("General exception in EXE parser: " + e.getMessage());
> >         e.printStackTrace(LogUtil.getWarnStream(LOG));
> >       }
> >       return new ParseStatus(ParseStatus.FAILED,
> >           "Can't be handled as exe document. " + e)
> >           .getEmptyParse(getConf());
> >     }
> >
> >     // this is where i don't know what to return....
> >     final ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
> >         resultTitle, outlinks, content.getMetadata());
> >     return new ParseImpl(resultText, parseData);
> >   }
> >
> > Thanks!!!
> >
> > On 10/17/07, eyal edri <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi all,
> > >
> > > I'm trying to write a new plugin that will download files with
> > > content type application/x-dosexec (EXE).
> > > I've followed the "write your own plugin" tutorial on the wiki and
> > > taken the following steps (some of them are not mentioned in the
> > > tutorial):
> > >
> > >
> > > 1. Created a new directory $NUTCH_HOME/src/plugin/parse-exe
> > > 2. Created a new $NUTCH_HOME/src/plugin/parse-exe/plugin.xml
> > > [displayed below]
> > > 3. Created a new $NUTCH_HOME/src/plugin/parse-exe/build.xml
> > > [displayed below]
> > > 4. Wrote the Java code in
> > > $NUTCH_HOME/src/plugin/parse-exe/src/java/org/apache/nutch/parse/exe/ExeParser.java
> > > 5. Added "nutch-extensionpoints" and "parse-exe" to the
> > > 'plugin.includes' property in $NUTCH_HOME/conf/nutch-site.xml
> > > 6. Added code to $NUTCH_HOME/conf/parse-plugins.xml [written
> > > below]
> > > 7. Added code to $NUTCH_HOME/src/plugin/build.xml [written
> > > below]
> > > 8. Copied $NUTCH_HOME/build/plugins/parse-exe/parse-exe.jar to
> > > $NUTCH_HOME/plugins/parse-exe
> > > 9. Ran ant (build successful)
> > >
> > > The log shows that Nutch recognizes the plugin:
> > >
> > > 2007-10-17 15:15:55,657 INFO plugin.PluginRepository - Registered Plugins:
> > > 2007-10-17 15:15:55,657 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> > > 2007-10-17 15:15:55,657 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> > > 2007-10-17 15:15:55,657 INFO plugin.PluginRepository - Exe Parse Plug-in (parse-exe)
> > >
> > > But when the fetcher encounters an x-dosexec file, it throws an
> > > exception:
> > >
> > > 2007-10-17 15:17:16,146 WARN parse.ParseUtil - No suitable parser found when trying to parse content http://www.foo.com/yyy/foo.exe of type application/x-dosexec
> > > 2007-10-17 15:17:16,146 WARN fetcher.Fetcher - Error parsing: http://www.foo.com/yyy/foo.exe: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-dosexec url=http://www.foo.com/yyy/movie30.exe
> > >
> > > (Sorry, but the URL has been masked for security reasons.)
> > >
> > > Am I missing something?
> > >
> > > Thanks!
> > >
> > >
> > >
> > > [$NUTCH_HOME/src/plugin/build.xml]
> > >
> > > <ant dir="parse-exe" target="deploy"/>
> > >
> > > [parse-plugins.xml]
> > >
> > > <mimeType name="application/x-dosexec">
> > >   <plugin id="parse-exe" />
> > > </mimeType>
> > >
> > >
> > > [plugin.xml] // copied and changed from parse-pdf
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <plugin
> > >    id="parse-exe"
> > >    name="Exe Parse Plug-in"
> > >    version="1.0.0"
> > >    provider-name="nutch.org">
> > >
> > >   <runtime>
> > >     <library name="parse-exe.jar">
> > >       <export name="*"/>
> > >     </library>
> > >   </runtime>
> > >
> > >   <requires>
> > >     <import plugin="nutch-extensionpoints"/>
> > >     <import plugin="lib-log4j"/>
> > >   </requires>
> > >
> > >   <extension id="org.apache.nutch.parse.exe"
> > >              name="ExeParse"
> > >              point="org.apache.nutch.parse.Parser">
> > >
> > >     <implementation id="org.apache.nutch.parse.exe.ExeParse"
> > >                     class="org.apache.nutch.parse.exe.ExeParse">
> > >       <parameter name="contentType" value="application/x-dosexec"/>
> > >       <parameter name="pathSuffix" value=""/>
> > >     </implementation>
> > >   </extension>
> > >
> > > </plugin>
> > >
> > >
> > > -----------------------------------------------------------------------------------------------------------------
> > >
> > > [build.xml]
> > >
> > > <?xml version="1.0"?>
> > >
> > > <project name="parse-exe" default="jar-core">
> > >
> > >   <import file="../build-plugin.xml"/>
> > >
> > > </project>
> > >
> > >
> > > ------------------------------------------------------------------------
> > > [ExeParser.java]
> > >
> > > public class ExeParser implements Parser {
> > >   public static final Log LOG =
> > >       LogFactory.getLog("org.apache.nutch.parse.exe");
> > >   private Configuration conf;
> > >
> > >   public Parse getParse(Content content) {
> > >
> > >     try {
> > >       byte[] raw = content.getContent();
> > >
> > >       // enter here my code (i will replace this with real code)
> > >       LOG.info("EDRI:: you have reached the parse-exe plugin!");
> > >       System.out.println("EDRI:: system.out.print... parse-exe");
> > >
> > >       String contentLength =
> > >           content.getMetadata().get(Response.CONTENT_LENGTH);
> > >       if (contentLength != null
> > >           && raw.length != Integer.parseInt(contentLength)) {
> > >         return new ParseStatus(ParseStatus.FAILED,
> > >             ParseStatus.FAILED_TRUNCATED,
> > >             "Content truncated at " + raw.length
> > >             + " bytes. Parser can't handle incomplete exe file.")
> > >             .getEmptyParse(getConf());
> > >       }
> > >
> > >     } catch (Exception e) { // run time exception
> > >       if (LOG.isWarnEnabled()) {
> > >         LOG.warn("General exception in EXE parser: " + e.getMessage());
> > >         e.printStackTrace(LogUtil.getWarnStream(LOG));
> > >       }
> > >       return new ParseStatus(ParseStatus.FAILED,
> > >           "Can't be handled as exe document. " + e)
> > >           .getEmptyParse(getConf());
> > >     }
> > >
> > >     // i'm not sure what to return here if i only need to d/l the file
> > >     ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
> > >         "", null, null, null);
> > >     parseData.setConf(this.conf);
> > >     return new ParseImpl("", parseData);
> > >   }
> > >
> > >   public void setConf(Configuration conf) {
> > >     this.conf = conf;
> > >   }
> > >
> > >   public Configuration getConf() {
> > >     return this.conf;
> > >   }
> > > }
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Eyal Edri
> >
> >
> >
> >
> > --
> > Eyal Edri
> >
>
>
>
> --
> Eyal Edri
--
Eyal Edri