Re: [poppler] What Triggers PDFtoHtml to convert pdf page to image?

Leonard Rosenthol Wed, 04 Aug 2021 07:24:09 -0700

Is it possible that the PDF itself is just a bunch of images (i.e., a scanned 
PDF) rather than a born-digital document? PDFtoHTML doesn’t do things like OCR 
or content analysis; it just outputs what is already there. Can’t get blood 
from a stone…

Leonard
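
One way to check the scanned-PDF hypothesis is to see whether each page has any
extractable text at all. The following is a minimal sketch, not part of the
original reply. It assumes poppler's pdftotext is on the PATH and relies on
pdftotext ending each page with a form feed character. The function names and
the min_chars threshold are hypothetical, chosen for illustration.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run poppler's pdftotext and return the text it writes to stdout.

    Assumes the pdftotext binary is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def split_pages(pdftotext_output: str) -> list[str]:
    """Split pdftotext output into pages on the form feed separator."""
    pages = pdftotext_output.split("\f")
    # The output ends with a form feed, which leaves one empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pages_without_text(pdftotext_output: str, min_chars: int = 1) -> list[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [
        number
        for number, page in enumerate(split_pages(pdftotext_output), start=1)
        if len(page.strip()) < min_chars
    ]
```

Calling pages_without_text(extract_text("input.pdf")) lists the pages with no
text layer. If that covers the pages that came out as bare images, the file is
most likely scanned and would need OCR before any tool could emit text.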

From: poppler <[email protected]> on behalf of Alfredo Jr. 
Go <[email protected]>
Date: Wednesday, August 4, 2021 at 10:19 AM
To: [email protected] <[email protected]>
Subject: [poppler] What Triggers PDFtoHtml to convert pdf page to image?

Hi,

I am trying to convert PDF files to HTML. Running pdftohtml -c -s input output 
works fine on simple PDFs: PDFtoHTML converts the file properly into the 
intended HTML file with paragraph tags.

But when I tested it on other PDF files (court documents), PDFtoHTML just 
converted each page into a PNG file and linked it in the output HTML file. So 
I have an HTML file that just links to images.

Sample:
<!-- Page 5 -->
<a name="5"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}  
.ft519{font-size:27px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page5-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245005.png" alt="background image"/>
</div>
<!-- Page 6 -->
<a name="6"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}  
.ft620{font-size:13px;font-family:Times;color:#000000;}
    .ft621{font-size:10px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page6-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245006.png" alt="background image"/>
</div>
<!-- Page 7 -->
<a name="7"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}  
.ft722{font-size:15px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page7-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245007.png" alt="background image"/>
</div>
<!-- Page 8 -->
<a name="8"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}  
.ft823{font-size:28px;font-family:Helvetica;color:#000000;}
    .ft824{font-size:72px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page8-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245008.png" alt="background image"/>
</div>
<!-- Page 9 -->
<a name="9"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}-->
</style>
<div id="page9-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245009.png" alt="background image"/>
</div>

What triggers this behavior? I was hoping that it would convert the PDFs to an 
HTML file with the text in tags, but it just converts the pages into images and 
links them in the output HTML file.

I am not allowed to share the PDF files since they are legal documents.

Regards,
Fred.
