Thanks Hasan:

Forgive me. First, your generosity is greatly appreciated; please accept
my thanks. I might be wrong, but I think we are missing a few things
here that I also need and that are, in fact, why I selected Nutch.
Nutch does things like perform the post event, gather up and parse the
HTML, and discover and follow the nested URL links recursively, among
countless other things. It also maintains a database of what was scanned
and what should be scanned (WebDB). I will let the experts expand on
my limited feature list.
If I do this, I lose all of those things. I am thinking that there must
be a chunk of code that demonstrates how to call and use the Nutch
Crawl object and indexers.
I am, at this very moment, going through the sample code (which needs to
be commented, by the way, no offense to anyone) in hopes of understanding
how this all works together.

I am up on Lucene, well, kind of. Nutch is the boulder I have to move or
get around, you see. Do not be discouraged; kindness and helping
other people are never a mistake.
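
To make the crawl bookkeeping described above concrete (a record of what was scanned, a list of what should be scanned next, and links followed recursively), here is a minimal sketch in plain Java. It is not Nutch's API and does not use WebDB; the class name CrawlFrontierSketch, the method crawl, and the in-memory linkGraph (which stands in for "fetch the page and parse out its links") are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: this is NOT Nutch code and not the Nutch API.
class CrawlFrontierSketch {

    // Visits pages level by level, starting at seed and following links
    // for at most maxDepth hops. linkGraph is an assumed stand-in for
    // fetching a page and extracting its outgoing links.
    static List<String> crawl(String seed,
                              Map<String, List<String>> linkGraph,
                              int maxDepth) {
        Set<String> scanned = new LinkedHashSet<>(); // what was scanned
        List<String> toScan = new ArrayList<>();     // what should be scanned
        toScan.add(seed);

        for (int depth = 0; depth <= maxDepth && !toScan.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String url : toScan) {
                if (!scanned.add(url)) {
                    continue; // already scanned, skip it
                }
                for (String link : linkGraph.getOrDefault(url, List.of())) {
                    if (!scanned.contains(link)) {
                        next.add(link); // newly discovered link for the next level
                    }
                }
            }
            toScan = next;
        }
        return new ArrayList<>(scanned);
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"),
            "d", List.of("e"));
        System.out.println(crawl("a", graph, 1));
    }
}
```

A real crawler also has to handle fetching, politeness, parsing, and persistence, which is the part Nutch provides and this sketch leaves out.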


-----Original Message-----
From: Hasan Diwan [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 10, 2008 1:55 PM
To: [email protected]
Subject: Re: Example in Java Please

Ray,
I am feeling charitable this morning, so have posted code to do what
you desire at the end.
2008/11/10 Lukas, Ray <[EMAIL PROTECTED]>:
> If you could, please. I am, as you probably are, or have been in the
> recent past, short on time for my project. I need something very simple.
> An example that goes to a single URL, parses the pages under it, gathers
> up all the words (terms) and returns me a Lucene index of them so that I
> can then say "do any of the words I am thinking (terms from my Oracle
> database) appear in this index and how many times do they appear". That
> is it, very simple. I would like to use Nutch.
> I am going through the Nutch source code examples which require someone
> to understand Hadoop. I would love to, if I had the time, which I do
> not. So can someone post or point me to an example.
> Sorry to bother you, but time is a problem, I hope that you understand,

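import org.cyberneko.pull.util.DefaultHandler;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HTMLParser extends DefaultHandler {
     // Accumulates the character data seen so far in the page.
     private final StringBuilder text = new StringBuilder();

     void handleCharacters(CharacterEvent event) {
        this.text.append(event.text.toString());
     }

     void handleEndDocument(DocumentEvent event) {
          // Put the whole page text into one stored, tokenized field named "all".
          Document doc = new Document();
          doc.add(new Field("all", this.text.toString(), Field.Store.YES, Field.Index.TOKENIZED));
          // NOTE: doc is only built here; it still has to be handed to an
          // IndexWriter before anything ends up in a Lucene index.
      }
}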

This should get you started.
-- 
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
