Funnily enough, I had to solve this exact problem 10 minutes ago... Take a look at javax.swing.text.html.HTMLEditorKit and javax.swing.text.html.HTMLDocument. Here's a URL that I found helpful (the site is in Japanese, but the source code is still Java):
http://java-house.jp/ml/archive/j-h-b/037727.html?#_body

"Lichty, Kent" wrote:
> We have a web application that builds pages "on the fly" by reading directly
> from a database. The database contains both normal content and HTML. We use
> Lucene as our search engine, but I need to figure out how to cause it to NOT
> include content that is within HTML tags. I assume that this entails the
> creation of a custom Analyzer. Are there any existing Analyzers already out
> there that work like this? Thanks!
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>
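
For what it's worth, here is a rough sketch of the kind of thing those Swing classes allow. It is not the code from the linked page; it assumes you are happy to strip the markup before the text ever reaches Lucene, rather than writing a custom Analyzer. The class name HtmlTextExtractor and the method extractText are made up for illustration. It uses HTMLEditorKit.ParserCallback together with javax.swing.text.html.parser.ParserDelegator, and keeps only what the parser reports as text:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (name is an assumption, not from the linked page):
// collects the text content of an HTML string and drops tags and attributes.
final class HtmlTextExtractor {

    private HtmlTextExtractor() {
    }

    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once per run of character data it finds
        // between tags; tag names and attribute values never arrive here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // keep adjacent runs from fusing together
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // Third argument: ignore any charset directive in the markup,
            // since the input is already a decoded Java String.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }
}
```

You would call HtmlTextExtractor.extractText(...) on each database value and hand the returned string to whatever field you index with Lucene. The exact whitespace in the result depends on the Swing parser, so treat the output as text for tokenizing rather than for display.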