Ray,
I am feeling charitable this morning, so I have posted code to do what
you desire at the end of this message.
2008/11/10 Lukas, Ray <[EMAIL PROTECTED]>:
> If you could, please. I am, as you probably are, or have been in the
> recent past, short on time for my project. I need something very simple.
> An example that goes to a single URL, parses the pages under it, gathers
> up all the words (terms) and returns me a Lucene index of them so that I
> can then say "do any of the words I am thinking (terms from my Oracle
> database) appear in this index and how many times do they appear". That
> is it, very simple. I would like to use Nutch.
> I am going through the Nutch source code examples which require someone
> to understand Hadoop. I would love to, if I had the time, which I do
> not. So can someone post or point me to an example?
> Sorry to bother you, but time is a problem, I hope that you understand,

import org.cyberneko.pull.util.DefaultHandler;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

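// Note: CharacterEvent and DocumentEvent must also be imported from the
// pull-parser library; their package is not shown in this snippet.
public class HTMLParser extends DefaultHandler {
     // Accumulates the character data seen while parsing the page.
     private final StringBuilder text = new StringBuilder();

     void handleCharacters(CharacterEvent event) {
          this.text.append(event.text.toString());
     }

     void handleEndDocument(DocumentEvent event) {
          // One Lucene document per page, with all text in the "all" field.
          Document doc = new Document();
          doc.add(new Field("all", this.text.toString(), Field.Store.YES,
                    Field.Index.TOKENIZED));
          // doc still has to be added to an IndexWriter to end up in an index.
     }
}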

This should get you started.
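
For the "do any of my words appear, and how many times" part, here is a
rough plain-Java sketch of the counting step. It is only a stand-in to
show the idea: it does not use Lucene or Nutch, the class name TermCounter
and the method countTerms are made up for illustration, and the tokenizer
is a crude split on non-alphanumeric characters rather than a Lucene
analyzer. In a real setup you would ask the index for these counts instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Nutch.
final class TermCounter {

    /** Counts how often each wanted term occurs in the text (case-insensitive). */
    public static Map<String, Integer> countTerms(String text, List<String> wanted) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String term : wanted) {
            counts.put(term.toLowerCase(Locale.ROOT), 0);
        }
        // Crude tokenizer: split on anything that is not a letter or a digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            Integer seen = counts.get(token);
            if (seen != null) {
                counts.put(token, seen + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the text collected by the parser above.
        String pageText = "Lucene indexes text. Nutch crawls pages, and Lucene searches them.";
        // Stand-in for the terms pulled from the database.
        List<String> wanted = Arrays.asList("lucene", "hadoop");
        System.out.println(countTerms(pageText, wanted)); // prints {lucene=2, hadoop=0}
    }
}
```

A term with a count of 0 simply does not appear in the page text.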
-- 
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
