Ray,
I am feeling charitable this morning, so I have posted code at the end
that should do what you want.
2008/11/10 Lukas, Ray <[EMAIL PROTECTED]>:
> If you could, please. I am, as you probably are, or have been in the
> recent past, short on time for my project. I need something very simple.
> An example that goes to a single URL, parses the pages under it, gathers
> up all the words (terms) and returns me a Lucene index of them so that I
> can then say "do any of the words I am thinking (terms from my Oracle
> database) appear in this index and how many times do they appear". That
> is it, very simple. I would like to use Nutch.
> I am going through the Nutch source code examples which require someone
> to understand Hadoop. I would love to, if I had the time, which I do
> not. So can someone post or point me to an example?
> Sorry to bother you, but time is a problem, I hope that you understand,
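import org.cyberneko.pull.util.DefaultHandler;
import org.cyberneko.pull.event.CharacterEvent;
import org.cyberneko.pull.event.DocumentEvent;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HTMLParser extends DefaultHandler {
    // Accumulates all character data seen while parsing the page.
    private final StringBuilder text = new StringBuilder();
    private Document doc; // built once the end of the page is reached

    void handleCharacters(CharacterEvent event) {
        text.append(event.text.toString());
    }

    void handleEndDocument(DocumentEvent event) {
        doc = new Document();
        // Put the whole page text into one stored, tokenized field named "all".
        doc.add(new Field("all", text.toString(), Field.Store.YES,
                Field.Index.TOKENIZED));
    }

    public Document getDocument() { return doc; } // pass this to an IndexWriter
}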
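
For the "how many times do they appear" part, here is a rough sketch in plain Java of the counting you described. It is not Lucene or Nutch code: it uses an in-memory map as a stand-in for the index, and the names TermCounter, countTerms and lookup are made up for illustration. It assumes lower-casing and splitting on non-alphanumeric characters is good enough for your terms.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene or Nutch.
final class TermCounter {

    // Lower-case the text, split on runs of non-letter/non-digit characters,
    // and count how often each resulting token occurs.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split can yield an empty leading token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Look up each wanted term (e.g. the ones from your database);
    // terms that never appear in the text map to 0.
    static Map<String, Integer> lookup(Map<String, Integer> counts, List<String> wanted) {
        Map<String, Integer> hits = new HashMap<>();
        for (String term : wanted) {
            hits.put(term, counts.getOrDefault(term.toLowerCase(Locale.ROOT), 0));
        }
        return hits;
    }
}
```

You would feed countTerms the text collected by the parser above, then call lookup with your list of terms. With a real Lucene index you would ask the index for term frequencies instead of keeping your own map.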
This should get you started.
--
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>