Re: HTML styles and tags are ignored

Jukka Zitting Mon, 04 Jun 2012 05:37:06 -0700

Hi,

On Mon, Jun 4, 2012 at 2:21 PM, andrewtr <[email protected]> wrote:
> When I parse a PDF or Word document using AutoDetectParser, the <li> and
> <ul> tags are converted to <p> tags. I need the exact HTML content that is
> in the PDF or Word document.


<li> and <ul> tags in PDF or Word? I assume you rather mean the native
list formatting of those document types?

The Tika parsers for PDF and Office documents could/should
automatically map such formatting to equivalent XHTML constructs, but
I don't think they currently do. You'll need to look into the source
code to see how to make that happen.
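Until the parsers do that mapping themselves, one possible workaround is to post-process the SAX events on the way out. The following is only a rough sketch under stated assumptions, not Tika code: ListRewriter is a hypothetical class built on the JDK's own SAX API, and it assumes that list items arrive as <p> elements whose text starts with a bullet glyph (U+2022). Whether the PDF or Office parsers actually leave such a glyph in the text depends on the document, so treat the heuristic as a guess to verify against your own output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): rewrites runs of <p> elements
// whose text starts with a bullet glyph into <ul><li>...</li></ul>.
// Simplification: element attributes are dropped from the output.
class ListRewriter extends DefaultHandler {
    private static final String BULLET = "\u2022"; // assumed list marker
    private final StringBuilder out = new StringBuilder();
    private StringBuilder para; // non-null while inside a <p>
    private boolean inList;     // true while a <ul> is open in the output

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (para == null && qName.equals("p")) {
            para = new StringBuilder(); // buffer the paragraph until it ends
            return;
        }
        if (para == null) {
            closeList(); // any non-paragraph element ends a running list
        }
        target().append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (para != null && qName.equals("p")) {
            String text = para.toString();
            para = null;
            String trimmed = text.trim();
            if (trimmed.startsWith(BULLET)) {
                if (!inList) {
                    out.append("<ul>");
                    inList = true;
                }
                String item = trimmed.substring(BULLET.length()).trim();
                out.append("<li>").append(item).append("</li>");
            } else {
                closeList();
                out.append("<p>").append(text).append("</p>");
            }
            return;
        }
        if (para == null) {
            closeList();
        }
        target().append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String s = new String(ch, start, length);
        target().append(s.replace("&", "&amp;").replace("<", "&lt;"));
    }

    @Override
    public void endDocument() {
        closeList();
    }

    // Text and inline markup go to the paragraph buffer while inside a <p>.
    private StringBuilder target() {
        return para != null ? para : out;
    }

    private void closeList() {
        if (inList) {
            out.append("</ul>");
            inList = false;
        }
    }

    // Convenience entry point: parse an XHTML string and return the rewrite.
    static String rewrite(String xhtml) {
        ListRewriter handler = new ListRewriter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.out.toString();
    }
}
```

Since ListRewriter is an ordinary SAX ContentHandler, it could in principle be passed to a parser directly instead of going through rewrite(). It matches on qName only, so check that this fits the events you actually receive.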

BR,

Jukka Zitting
