Hi,

On Mon, Jun 4, 2012 at 2:21 PM, andrewtr <[email protected]> wrote:
> While I am parsing the PDF or Word document using AutoDetectParser the <li>,
> <ul> tags are converted as <p> tags. I need the exact HTML content what is
> been there for PDF or Word Document.

<li> and <ul> tags in PDF or Word? I assume you rather mean the native
list formatting of those document types?

The Tika parsers for PDF and Office documents could/should automatically
map such formatting to equivalent XHTML constructs, but I don't think they
currently do. You'll need to look into the source code to see how to make
that happen.

BR,

Jukka Zitting
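
For illustration only, here is a rough sketch of what "mapping list formatting to XHTML constructs" could look like at the SAX level. It is not taken from the Tika source and uses only JDK classes. The class name ListToXhtmlSketch and the helpers emitList and render are hypothetical, and the sketch assumes the parser has already recognised the list items as plain strings, which is the hard part in practice.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch, not Tika code: shows list items being reported
// to a SAX ContentHandler as <ul>/<li> elements instead of <p> elements.
public class ListToXhtmlSketch {

    // Assumption: the caller has already detected the list and extracted
    // the text of each item. No namespace is used, to keep the sketch short.
    static void emitList(ContentHandler handler, List<String> items)
            throws SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startElement("", "ul", "ul", noAttrs);
        for (String item : items) {
            handler.startElement("", "li", "li", noAttrs);
            handler.characters(item.toCharArray(), 0, item.length());
            handler.endElement("", "li", "li");
        }
        handler.endElement("", "ul", "ul");
    }

    // Serialises the SAX events to a string using the JDK's identity transformer.
    static String render(List<String> items) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        emitList(handler, items);
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render(Arrays.asList("first item", "second item")));
    }
}
```

A real change would have to live inside the individual PDF and Office parsers, where the native list structure is (or is not) available, so treat the above only as a picture of the target output.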
