david.stu...@progressivealliance.co.uk
Thu, 03 Dec 2009 14:03:18 -0800
Yep, that was it, thanks. One more quick one:

The BodyContentHandler seems to return just the text. I would like it to
return everything inside the body tag (including the HTML).

Doc:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="DC.title" lang="en" content="The Strategies" />
<title>Dave TEst</title>
</head>
<body> <p>The quick brown fox</p></body>
</html>

Currently returning:

The quick brown fox

Wanting to return:

<p>The quick brown fox</p>

Any ideas? Thanks for your help.

On 03 December 2009 at 21:52 Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> Hi,
>
> On Thu, Dec 3, 2009 at 9:03 PM, david.stu...@progressivealliance.co.uk
> <david.stu...@progressivealliance.co.uk> wrote:
> > I compile is javac TikaParseHtml.java then when I run
> > the command
> > java -classpath [path to jar dir]/tika-core-0.5.jar TikaParseHtml i.html
> >
> > should I add other tika jars?
>
> Yes. By itself the tika-core jar only contains the core classes of
> Tika but none of the wrappers around the external parser libraries
> that are used to do the actual parsing of the various document
> formats. See [1] for a description of the various jars and what to
> include. The easiest way to get started is simply to replace the
> tika-core jar with the tika-app jar that contains everything you need.
>
> [1] http://lucene.apache.org/tika/gettingstarted.html
>
> BR,
>
> Jukka Zitting
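
To illustrate the difference between "just the text" and "everything inside the body tag": a SAX content handler receives element events as well as character events, so a handler that writes the tags back out keeps the markup. The sketch below is one possible approach, not Tika's own API. It uses only the JDK's SAX parser, and the names BodyMarkupSketch, BodyMarkupHandler and bodyMarkup are hypothetical names made up for this example. It assumes well-formed XHTML input like the sample document above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyMarkupSketch {

    /** Hypothetical handler: re-serializes the SAX events seen inside <body> as markup. */
    static class BodyMarkupHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        // 0 = outside <body>, 1 = directly inside <body>, 2+ = inside nested elements
        private int depthInBody = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depthInBody > 0) {
                // Write the start tag and its attributes back out
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append(' ').append(atts.getQName(i)).append("=\"")
                       .append(escape(atts.getValue(i))).append('"');
                }
                out.append('>');
                depthInBody++;
            } else if ("body".equals(qName)) {
                depthInBody = 1; // start collecting, but do not emit <body> itself
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depthInBody > 1) {
                out.append("</").append(qName).append('>');
                depthInBody--;
            } else if (depthInBody == 1) {
                depthInBody = 0; // this is the closing </body>
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depthInBody > 0) {
                out.append(escape(new String(ch, start, length)));
            }
        }

        // Minimal escaping so the collected markup stays well-formed
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
        }
    }

    /** Parses well-formed XHTML and returns the markup found inside <body>. */
    static String bodyMarkup(String xhtml) throws Exception {
        BodyMarkupHandler handler = new BodyMarkupHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<title>Dave TEst</title></head>"
                + "<body><p>The quick brown fox</p></body></html>";
        System.out.println(bodyMarkup(doc));
    }
}
```

Since Tika parsers report their output as XHTML SAX events, a handler along these lines could in principle be handed to the parser in place of the BodyContentHandler. That is an untested assumption here, and it would need the full set of Tika jars on the classpath as described in the quoted reply.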