Hello. You should look at Crawl.java in the package org.apache.nutch.crawl.
It is a good crawl example, with some comments, and clear enough (I think).
It is the code that runs when you use Nutch from the command line. I hope
this helps.
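
In case it is useful, here is a rough sketch of driving that class from your
own Java code instead of from the shell. It is only a sketch under some
assumptions: that the Nutch jars and conf directory are on the classpath, that
Crawl has the usual main(String[]) entry point, and that it takes the same
-dir, -depth and -topN options as "bin/nutch crawl". The names CrawlLauncher
and crawlArgs, and the "urls" and "crawl" directories, are made up for
illustration. Reflection is used only so the sketch compiles without Nutch.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class CrawlLauncher {

    // Builds the argument list that "bin/nutch crawl" would receive.
    // urlDir: directory holding the seed URL file(s); outDir: crawl output.
    static String[] crawlArgs(String urlDir, String outDir, int depth, int topN) {
        List<String> args = new ArrayList<>();
        args.add(urlDir);
        args.add("-dir");
        args.add(outDir);
        args.add("-depth");
        args.add(Integer.toString(depth));
        args.add("-topN");
        args.add(Integer.toString(topN));
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) throws Exception {
        String[] args = crawlArgs("urls", "crawl", 3, 50);
        try {
            // Assumption: Crawl exposes a static main(String[]) method.
            Class<?> crawl = Class.forName("org.apache.nutch.crawl.Crawl");
            Method main = crawl.getMethod("main", String[].class);
            main.invoke(null, (Object) args);
        } catch (ClassNotFoundException e) {
            // Nutch is not on the classpath: just show what would have run.
            System.out.println("Nutch not found; would run: crawl "
                    + String.join(" ", args));
        }
    }
}
```

Without Nutch on the classpath this only prints the command it would have
run. With Nutch present, the crawl output should end up under the directory
passed after -dir.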


2008/11/10 Lukas, Ray <[EMAIL PROTECTED]>

> Thanks Hasan:
>
> Forgive me.. First your generosity is greatly appreciated. Please accept
> my thanks.. I might be wrong, but... Hmm.. I think that we are missing
> a few things here that I also need and that are, in fact, why I selected
> Nutch.
> Nutch does some things .. like.. perform the post event, gather up and
> parse HTML, discover, and follow, the nested url links recursively, and
> countless other things as well. Maintain a database of what was scanned,
> and what should be scanned (WebDB), and I will let the experts expand on
> my limited feature list.
> If I do this.. I am losing all of those things. I am thinking that
> there is a chunk of code that demonstrates how to call and use the Nutch
> Crawl object and indexers.
> I am, this very moment, going through the sample code (which needs to be
> commented by the way, no offense to anyone) in hopes of understanding
> how this all works together.
>
> I am up on Lucene, well kind of, Nutch is the boulder I have to move, or
> get around.. you see.. Do not be discouraged, kindness, and helping
> other people, is never a mistake.
>
>
> -----Original Message-----
> From: Hasan Diwan [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 10, 2008 1:55 PM
> To: [email protected]
> Subject: Re: Example in Java Please
>
> Ray,
> I am feeling charitable this morning, so have posted code to do what
> you desire at the end.
> 2008/11/10 Lukas, Ray <[EMAIL PROTECTED]>:
> > If you could, please. I am, as you probably are, or have been in the
> > recent past, short on time for my project. I need something very
> > simple. An example that goes to a single URL, parses the pages under
> > it, gathers up all the words (terms) and returns me a Lucene index of
> > them so that I can then say "do any of the words I am thinking (terms
> > from my Oracle database) appear in this index and how many times do
> > they appear". That is it, very simple. I would like to use Nutch.
> > I am going through the Nutch source code examples which require
> > someone to understand Hadoop. I would love to, if I had the time,
> > which I do not. So can someone post or point me to an example.
> > Sorry to bother you, but time is a problem, I hope that you
> > understand,
>
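> import org.cyberneko.pull.util.DefaultHandler;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
>
> public class HTMLParser extends DefaultHandler {
>     // Accumulates the character data seen while parsing the page.
>     private final StringBuilder text = new StringBuilder();
>
>     void handleCharacters(CharacterEvent event) {
>         text.append(event.text.toString());
>     }
>
>     void handleEndDocument(DocumentEvent event) {
>         // Put all of the page text into one stored, tokenized field.
>         Document doc = new Document();
>         doc.add(new Field("all", text.toString(), Field.Store.YES,
>                 Field.Index.TOKENIZED));
>         // doc still has to be added to a Lucene index (not shown here).
>     }
> }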
>
> This should get you started.
> --
> Cheers,
> Hasan Diwan <[EMAIL PROTECTED]>
>
