I really, really recommend buying a copy of "Lucene in Action", especially if you don't already have a good grasp of what indexing and searching are all about. It's well worth the effort to read. Sure, you'll have a ton of questions after you're done with it (I certainly did, and still do), but at least you'll have better context...
In your case, it would be worth the cost of the book, oh, about 1,000 times over just for Chapter 7, titled "Parsing common document formats", which shows how to index XML, PDF, HTML, Word, RTF and plain text documents. The folks on this list have been extraordinarily generous with their time and advice, but I have to think it's more fun for them when they can answer as specific a question as possible, and you'll have much more specific questions after reading the manual.

Best,
Erick Erickson