tika-user  

Re: Simple implementation help

david.stu...@progressivealliance.co.uk
Thu, 03 Dec 2009 14:03:18 -0800

Yep, that was it, thanks.

One more quick one:
The BodyContentHandler seems to return just the text. I would like it to
return everything inside the body tag (including the HTML).
Doc:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
 <meta name="DC.title" lang="en" content="The Strategies" />
 <title>Dave TEst</title>
</head>
<body>
 <p>The quick brown fox</p></body>
</html>

Currently returning:

The quick brown fox

Wanting to return:

<p>The quick brown fox</p>

Any ideas?

Thanks for your help.
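
For illustration, here is a minimal sketch of the kind of handler that would
give the wanted output: a SAX ContentHandler that writes the tags back out
instead of keeping only the character data. It uses only JDK classes and feeds
the sample document through the JDK's own SAX parser, so it does not exercise
Tika at all. The names BodyMarkupSketch, BodyMarkupHandler and bodyMarkup are
made up for this example. With Tika, a handler like this could presumably be
wrapped in a BodyContentHandler via its ContentHandler-taking constructor, but
that wiring is an assumption and is not shown or tested here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyMarkupSketch {

    /** Collects everything inside <body> as markup rather than plain text. */
    static class BodyMarkupHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        // 0 = outside <body>, 1 = directly inside it, 2+ = inside a child element
        private int depth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                depth++;
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append(' ').append(atts.getQName(i))
                       .append("=\"").append(escape(atts.getValue(i))).append('"');
                }
                out.append('>');
            } else if ("body".equals(qName)) {
                depth = 1; // entering <body>; the tag itself is not written
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 1) {
                out.append("</").append(qName).append('>');
            }
            if (depth > 0) {
                depth--; // depth 1 -> 0 is the closing </body>
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                out.append(escape(new String(ch, start, length)));
            }
        }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
        }

        /** Returns the collected markup with surrounding whitespace removed. */
        String markup() {
            return out.toString().trim();
        }
    }

    /** Parses well-formed XHTML and returns the markup found inside <body>. */
    static String bodyMarkup(String xhtml) throws Exception {
        BodyMarkupHandler handler = new BodyMarkupHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.markup();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Dave TEst</title></head>"
                + "<body>\n <p>The quick brown fox</p></body></html>";
        System.out.println(bodyMarkup(doc)); // prints <p>The quick brown fox</p>
    }
}
```

The sketch matches on the qualified element name and ignores namespaces, which
is enough for the sample document above but would need revisiting for prefixed
XHTML.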


On 03 December 2009 at 21:52 Jukka Zitting <jukka.zitt...@gmail.com> wrote:

> Hi,
> 
> On Thu, Dec 3, 2009 at 9:03 PM, david.stu...@progressivealliance.co.uk
> <david.stu...@progressivealliance.co.uk> wrote:
> > I compile with javac TikaParseHtml.java then when I run
> > the command
> > java  -classpath [path to jar dir]/tika-core-0.5.jar TikaParseHtml i.html
> >
> > should I add other tika jars?
> 
> Yes. By itself the tika-core jar only contains the core classes of
> Tika but none of the wrappers around the external parser libraries
> that are used to do the actual parsing of the various document
> formats. See [1] for a description of the various jars and what to
> include. The easiest way to get started is simply to replace the
> tika-core jar with the tika-app jar that contains everything you need.
> 
> [1] http://lucene.apache.org/tika/gettingstarted.html
> 
> BR,
> 
> Jukka Zitting
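
As a rough way to see the point about tika-core alone lacking the parser
wrappers, a small check like the following could be run with the same
-classpath argument as the program. It only uses Class.forName from the JDK.
ClasspathCheck and isAvailable are hypothetical names, and the two Tika class
names in main are assumptions about what the core jar and the parser jar
contain in this release.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be found on the current classpath. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only look the class up, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names: the first is expected in the core jar,
        // the second in the jar that wraps the external HTML parser.
        String[] names = {
            "org.apache.tika.parser.AutoDetectParser",
            "org.apache.tika.parser.html.HtmlParser"
        };
        for (String name : names) {
            System.out.println(name + " available: " + isAvailable(name));
        }
    }
}
```

If the second line reports false while the first reports true, the classpath
most likely holds only the core jar.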