Yeah, the files are scanned by the courts, so they are scanned PDFs. I assume running OCR on the files still won't change anything?
I guess that makes sense. Thanks Leonard.

On Wed, Aug 4, 2021 at 10:23 PM Leonard Rosenthol <[email protected]> wrote:

> Is it possible that the PDF itself is just a bunch of images (aka a
> Scanned PDF) instead of a born digital document? PDFtoHTML doesn’t do
> things like OCR or content analysis – it just outputs what is already
> there. Can’t get blood from a stone…
>
> Leonard
>
> *From: *poppler <[email protected]> on behalf of
> Alfredo Jr. Go <[email protected]>
> *Date: *Wednesday, August 4, 2021 at 10:19 AM
> *To: *[email protected] <[email protected]>
> *Subject: *[poppler] What Triggers PDFtoHtml to convert pdf page to image?
>
> Hi,
>
> I am trying to convert pdf files to html. Running it with pdftohtml -c -s
> input output works fine on simple PDFs. PDFtoHTML converts the file
> properly into the intended html file with para tags.
>
> But, when I tried testing it on PDF files (court documents), PDFtoHTML
> just converts them into a PNG file and then links them in the output html
> file. So I have an HTML file that just links an image.
>
> Sample:
>
> <!-- Page 5 -->
> <a name="5"></a>
> <style type="text/css">
> <!--
> p {margin: 0; padding: 0;} .ft519{font-size:27px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page5-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245005.png" alt="background image"/>
> </div>
> <!-- Page 6 -->
> <a name="6"></a>
> <style type="text/css">
> <!--
> p {margin: 0; padding: 0;} .ft620{font-size:13px;font-family:Times;color:#000000;}
> .ft621{font-size:10px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page6-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245006.png" alt="background image"/>
> </div>
> <!-- Page 7 -->
> <a name="7"></a>
> <style type="text/css">
> <!--
> p {margin: 0; padding: 0;} .ft722{font-size:15px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page7-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245007.png" alt="background image"/>
> </div>
> <!-- Page 8 -->
> <a name="8"></a>
> <style type="text/css">
> <!--
> p {margin: 0; padding: 0;} .ft823{font-size:28px;font-family:Helvetica;color:#000000;}
> .ft824{font-size:72px;font-family:Helvetica;color:#000000;}
> -->
> </style>
> <div id="page8-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245008.png" alt="background image"/>
> </div>
> <!-- Page 9 -->
> <a name="9"></a>
> <style type="text/css">
> <!--
> p {margin: 0; padding: 0;}-->
> </style>
> <div id="page9-div" style="position:relative;width:918px;height:1188px;">
> <img width="918" height="1188" src="SEP272019_02A6245009.png" alt="background image"/>
> </div>
>
> What triggers this behavior? I was hoping that it would try to convert the
> PDFs to a HTML file with text in tags but it just converts them into images
> and links them in the output html file.
>
> I am not allowed to share the PDF files since they are legal documents.
>
> Regards,
>
> Fred.
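
Following up on Leonard's point that pdftohtml only outputs what is already in the file, here is a minimal sketch (not from the thread) of one way to check ahead of time which pages carry no extractable text. It assumes poppler's pdftotext is on the PATH and relies on pdftotext ending each page with a form feed. The helper names and the 20-character threshold are hypothetical and chosen only for illustration.

```python
import subprocess
from typing import List


def extract_text(pdf_path: str) -> str:
    """Run poppler's pdftotext and return everything it writes to stdout.

    Assumes the pdftotext binary is installed; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_pages(pdftotext_output: str) -> List[str]:
    """Split pdftotext output into per-page strings on the form feed."""
    pages = pdftotext_output.split("\f")
    # The final form feed leaves one empty trailing piece; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def image_only_pages(pages: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages with almost no extractable text.

    min_chars is an arbitrary illustrative threshold, not a poppler setting.
    """
    return [
        number
        for number, text in enumerate(pages, start=1)
        if len("".join(text.split())) < min_chars
    ]


if __name__ == "__main__":
    import sys

    page_texts = split_pages(extract_text(sys.argv[1]))
    print("Pages with little or no text:", image_only_pages(page_texts))
```

Pages reported by this check are likely candidates for ending up as a bare background image in the pdftohtml output, since there is no text in them to place in tags.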
