Anne Blankert
Fri, 04 Dec 2009 02:34:05 -0800
On 2009-12-03 23:02, david.stu...@progressivealliance.co.uk wrote:
Yep, that was it, thanks. One more quick one: the BodyContentHandler handler seems to return just the text. I would like it to return everything inside the body tag (including the html).
If I am correct, the whole point of the Tika library is that it converts all kinds of documents to plain text. Tika extracts the human-language content from documents and strips markup, layout and other non-language instructions.
If you are just interested in the html content of the html <body>, you could consider loading the html file into a String and cutting out the part that starts after the opening "<body>" tag (which may carry attributes) and ends before "</body>".
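A minimal sketch of that String-cutting approach, using only the JDK. The class name BodySlice and the method extractBody are made up for illustration; this is not part of Tika. It assumes a single <body> element and no ">" character inside the attribute values of the opening tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodySlice {
    // Matches an opening <body> tag (with optional attributes), captures
    // everything up to the first </body>, ignoring case and spanning lines.
    private static final Pattern BODY = Pattern.compile(
            "<body(?:\\s[^>]*)?>(.*?)</body>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the markup between the body tags, or null if no body is found.
    public static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

Calling BodySlice.extractBody(html) on the file contents gives you the inner markup untouched, tags included. It does no real parsing, so it is only suitable when you trust the input to be reasonably well formed.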
For more complex elements than "<body>", such as elements that contain sub-elements with the same html tag, you could also extend BodyContentHandler and override the methods startElement(), endElement() and toString(). The startElement and endElement methods get called for many (but not all!) html elements. In those methods you could re-insert the html tags into the output that toString() returns.
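To illustrate the idea of re-inserting tags from startElement() and endElement(), here is a rough sketch. Note the assumptions: it extends the plain JDK class DefaultHandler rather than BodyContentHandler, and it feeds the handler from the JDK SAX parser with well-formed XHTML, so that it runs without Tika on the classpath. The names MarkupKeepingHandler and extract are invented for this example. With Tika the same overrides would go into your own handler, but which elements actually reach it depends on the Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MarkupKeepingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0; // 0 = outside body, 1 = directly in body, >1 = nested

    // Non-namespace-aware parsers fill qName, namespace-aware ones localName.
    private static String name(String localName, String qName) {
        return qName == null || qName.isEmpty() ? localName : qName;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String tag = name(localName, qName);
        if (depth > 0) {
            // Inside body: write the opening tag back into the output.
            out.append('<').append(tag);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(name(atts.getLocalName(i), atts.getQName(i)))
                   .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            out.append('>');
            depth++;
        } else if ("body".equalsIgnoreCase(tag)) {
            depth = 1; // the body tag itself is not written
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 1) {
            out.append("</").append(name(localName, qName)).append('>');
        }
        if (depth > 0) {
            depth--; // counting depth handles nested elements with the same tag
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            out.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public String toString() {
        return out.toString();
    }

    // Convenience wrapper: parses well-formed XHTML and returns the body markup.
    public static String extract(String xhtml) {
        MarkupKeepingHandler handler = new MarkupKeepingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.toString();
    }
}
```

The depth counter is what makes nested elements with the same tag work: the handler only stops collecting when the element that opened the region is closed, not at the first matching end tag.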
Anne