Hello Terrence, Ah, you got me. I guess I need a bit of both. I need to just strip HTML and get raw body text so that I can stick it in Lucene's index. I would also like something that can extract at least the <title>...</title> stuff, so that I can stick that in a separate field in Lucene index. While doing that I, like you, need to be able to handle poorly formatted web pages.
In a future I may need something that has the ability to extract HREFs, but I'll stick to one of the XP principles and just look for something that meets current needs :) I looked for ANTLR-based HTML parser a few days ago, but must have missed the one you pointed out. I'll take a look at it now. Can you share or describe your stripHTML method? Simple java that looks for <s and >s or something smarter? Thanks, Otis P.S. This type of thing makes me wish I can use Perl or Python :) --- Terence Parr <[EMAIL PROTECTED]> wrote: > > On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: > > > Hello, > > > > I need to select an HTML parser for the application that I'm > writing > > and I'm not sure what to choose. > > The HTML parser included with Lucene looks flimsy, JTidy looks like > a > > hack and an overkill, using classes written for Swing > > (javax.swing.text.html.parser) seems wrong, and I haven't tried > David > > McNicol's parser (included with Spindle). > > > > Somebody on this list must have done some research on this subject. > > Can anyone share some experiences? > > Have you found a better HTML parser than any of those I listed > above? > > If your application deals with HTML, what do you use for parsing > it? > > Hi Otis, > > I have an HTML parser built for ANTLR, but it's pretty strict in what > it > accepts. Not sure how useful it will be for you, but here it is: > > http://www.antlr.org/grammars/HTML > > I am not sure what your goal is, but I personally have to scarf all > sorts of HTML from various websites to such them into the jGuru > search > engine. I use a simple stripHTML() method I wrote to handle it. > Works > great. Kills everything but the text. is that the kind of thing you > > are looking for or do you really want to parse not filter? > > Terence > -- > Co-founder, http://www.jguru.com > Creator, ANTLR Parser Generator: http://www.antlr.org > > > -- > To unsubscribe, e-mail: > <mailto:[EMAIL PROTECTED]> > For additional commands, e-mail: > <mailto:[EMAIL PROTECTED]> > __________________________________________________ Do You Yahoo!? Yahoo! Tax Center - online filing with TurboTax http://taxes.yahoo.com/ -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
