Running OCR would provide you with the text, which is what I am assuming you are trying to get out here…
From: Alfredo Jr. Go <frederick0...@gmail.com>
Date: Wednesday, August 4, 2021 at 10:53 AM
To: Leonard Rosenthol <lrose...@adobe.com>
Cc: poppler@lists.freedesktop.org <poppler@lists.freedesktop.org>
Subject: Re: [poppler] What Triggers PDFtoHtml to convert pdf page to image?

Yeah the files are scanned by the courts so they are scanned PDFs. I am assuming that running OCR on the files still won't change anything? I guess that makes sense.

Thanks Leonard.

On Wed, Aug 4, 2021 at 10:23 PM Leonard Rosenthol <lrose...@adobe.com> wrote:

Is it possible that the PDF itself is just a bunch of images (aka a Scanned PDF) instead of a born digital document? PDFtoHTML doesn’t do things like OCR or content analysis – it just outputs what is already there.

Can’t get blood from a stone…

Leonard

From: poppler <poppler-boun...@lists.freedesktop.org> on behalf of Alfredo Jr. Go <frederick0...@gmail.com>
Date: Wednesday, August 4, 2021 at 10:19 AM
To: poppler@lists.freedesktop.org <poppler@lists.freedesktop.org>
Subject: [poppler] What Triggers PDFtoHtml to convert pdf page to image?

Hi,

I am trying to convert pdf files to html. Running it with pdftohtml -c -s input output works fine on simple PDFs. PDFtoHTML converts the file properly into the intended html file with para tags.

But, when I tried testing it on PDF files (court documents), PDFtoHTML just converts them into a PNG file and then links them in the output html file. So I have an HTML file that just links an image.
Sample:

<!-- Page 5 -->
<a name="5"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}
    .ft519{font-size:27px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page5-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245005.png" alt="background image"/>
</div>
<!-- Page 6 -->
<a name="6"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}
    .ft620{font-size:13px;font-family:Times;color:#000000;}
    .ft621{font-size:10px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page6-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245006.png" alt="background image"/>
</div>
<!-- Page 7 -->
<a name="7"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}
    .ft722{font-size:15px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page7-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245007.png" alt="background image"/>
</div>
<!-- Page 8 -->
<a name="8"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}
    .ft823{font-size:28px;font-family:Helvetica;color:#000000;}
    .ft824{font-size:72px;font-family:Helvetica;color:#000000;}
-->
</style>
<div id="page8-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245008.png" alt="background image"/>
</div>
<!-- Page 9 -->
<a name="9"></a>
<style type="text/css">
<!--
    p {margin: 0; padding: 0;}-->
</style>
<div id="page9-div" style="position:relative;width:918px;height:1188px;">
<img width="918" height="1188" src="SEP272019_02A6245009.png" alt="background image"/>
</div>

What triggers this behavior? I was hoping that it would try to convert the PDFs to a HTML file with text in tags but it just converts them into images and links them in the output html file.

I am not allowed to share the PDF files since they are legal documents.

Regards,
Fred.
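
For anyone hitting the same symptom, here is a rough Python sketch for spotting which pages in the pdftohtml output came out as image-only, i.e. pages that probably need OCR first. It only looks at the generated HTML and relies on the `pageN-div` wrapper and `<img>` background seen in the sample above. The names `PageScanner` and `image_only_pages` are made up for this illustration and are not part of poppler, and the assumption that all of a page's text sits inside its `pageN-div` is a guess based on that sample, not something confirmed against the pdftohtml source.

```python
import re
from html.parser import HTMLParser

# Assumed id pattern for per-page wrappers, taken from the sample output.
PAGE_ID = re.compile(r"page(\d+)-div")


class PageScanner(HTMLParser):
    """Counts images and visible text characters inside each page div."""

    def __init__(self):
        super().__init__()
        self.current = None   # page number we are currently inside, if any
        self.depth = 0        # nesting depth of <div> within the current page
        self.pages = {}       # page number -> {"images": int, "text_chars": int}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            match = PAGE_ID.fullmatch(attrs.get("id") or "")
            if match:
                self.current = int(match.group(1))
                self.depth = 1
                self.pages[self.current] = {"images": 0, "text_chars": 0}
            elif self.current is not None:
                self.depth += 1
        elif tag == "img" and self.current is not None:
            self.pages[self.current]["images"] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.current is not None:
            self.depth -= 1
            if self.depth == 0:
                self.current = None

    def handle_data(self, data):
        # Text outside a page div (e.g. the <style> body) is ignored.
        if self.current is not None:
            self.pages[self.current]["text_chars"] += len(data.strip())


def image_only_pages(html):
    """Return sorted page numbers that contain an image but no text."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return sorted(
        number
        for number, stats in scanner.pages.items()
        if stats["images"] > 0 and stats["text_chars"] == 0
    )
```

Feeding it the contents of the generated HTML file (for example `image_only_pages(open("output.html").read())`, where the file name is a placeholder) returns the page numbers that contain nothing but a background image. If that list covers the whole document, the PDF is most likely a scan with no text layer, which matches Leonard's explanation.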