Re: [poppler] What Triggers PDFtoHtml to convert pdf page to image?

Alfredo Jr. Go Wed, 04 Aug 2021 07:53:40 -0700

Yeah, the files are scanned by the courts, so they are scanned PDFs. I am
assuming that running OCR on the files still won't change anything?


I guess that makes sense. Thanks, Leonard.
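
For anyone hitting the same thing, one way to confirm the "scanned PDF" diagnosis is to ask poppler's pdftotext for the text of a page and see whether anything comes back. The sketch below is only a rough heuristic under stated assumptions: the function names and the 20-character threshold are made up for illustration, and it assumes pdftotext is installed and on PATH.

```python
import subprocess


def looks_like_scanned_page(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed threshold): a page with almost no extractable
    characters is probably just an image with no text layer."""
    # Drop all whitespace, including the form feed pdftotext emits per page.
    visible = "".join(page_text.split())
    return len(visible) < min_chars


def extract_page_text(pdf_path: str, page: int) -> str:
    """Return the text pdftotext finds on one page (1-based page number).

    Requires poppler's pdftotext on PATH; "-" sends the output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling looks_like_scanned_page(extract_page_text("input.pdf", 5)) on a page like the ones in the sample would be expected to return True if the page really is image-only; "input.pdf" here is a placeholder file name.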

On Wed, Aug 4, 2021 at 10:23 PM Leonard Rosenthol <[email protected]>
wrote:

> Is it possible that the PDF itself is just a bunch of images (aka a
> Scanned PDF) instead of a born digital document?   PDFtoHTML doesn’t do
> things like OCR or content analysis – it just outputs what is already
> there.  Can’t get blood from a stone…
>
>
>
> Leonard
>
>
>
> *From: *poppler <[email protected]> on behalf of
> Alfredo Jr. Go <[email protected]>
> *Date: *Wednesday, August 4, 2021 at 10:19 AM
> *To: *[email protected] <[email protected]>
> *Subject: *[poppler] What Triggers PDFtoHtml to convert pdf page to image?
>
> Hi,
>
>
>
> I am trying to convert pdf files to html. Running it with pdftohtml -c -s
> input output works fine on simple PDFs. PDFtoHTML converts the file
> properly into the intended html file with para tags.
>
> But when I tested it on other PDF files (court documents), PDFtoHTML
> just converted each page into a PNG file and linked it in the output HTML
> file. So I have an HTML file that just links to images.
>
> Sample:
>
> <!-- Page 5 -->
> <a name="5"></a>
> <style type="text/css">
> <!--
>     p {margin: 0; padding: 0;}  .ft519{font-size:27px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page5-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245005.png" alt="background image"/>
> </div>
>
> <!-- Page 6 -->
> <a name="6"></a>
> <style type="text/css">
> <!--
>     p {margin: 0; padding: 0;}  .ft620{font-size:13px;font-family:Times;color:#000000;}
>     .ft621{font-size:10px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page6-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245006.png" alt="background image"/>
> </div>
>
> <!-- Page 7 -->
> <a name="7"></a>
> <style type="text/css">
> <!--
>     p {margin: 0; padding: 0;}  .ft722{font-size:15px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page7-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245007.png" alt="background image"/>
> </div>
>
> <!-- Page 8 -->
> <a name="8"></a>
> <style type="text/css">
> <!--
>     p {margin: 0; padding: 0;}  .ft823{font-size:28px;font-family:Helvetica;color:#000000;}
>     .ft824{font-size:72px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page8-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245008.png" alt="background image"/>
> </div>
>
> <!-- Page 9 -->
> <a name="9"></a>
> <style type="text/css">
> <!--
>     p {margin: 0; padding: 0;}-->
> </style>
> <div id="page9-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245009.png" alt="background image"/>
> </div>
>
>
> What triggers this behavior? I was hoping that it would convert the
> PDFs to an HTML file with the text in tags, but it just converts the pages
> into images and links them in the output HTML file.
>
>
>
> I am not allowed to share the PDF files since they are legal documents.
>
>
>
> Regards,
>
> Fred.
>
