Reinier Olislagers wrote:
On 23-7-2012 12:16, Mark Morgan Lloyd wrote:

Do you ever have a situation where you need to index into a PDF, and if
so how do you cope?

I'm looking at something where it would be beneficial to have a
collection of electronics info, and I'd want immediate access to a given
section. Allowing for the dominance of PDF in that industry, the best I
can think of so far is to burst each PDF into pages, and then convert
each page back to PDF.
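
A rough sketch of the bursting step, assuming the pdfseparate tool from poppler-utils is installed. The helper names and the "-p%d" naming scheme are made up for illustration; they are not from any existing project.

```python
import subprocess


def page_pattern(pdf_path):
    """Output pattern for pdfseparate; %d is replaced by the page number."""
    stem = pdf_path[:-4] if pdf_path.lower().endswith(".pdf") else pdf_path
    # Hypothetical naming scheme: datasheet.pdf -> datasheet-p1.pdf, ...
    return f"{stem}-p%d.pdf"


def burst_command(pdf_path):
    """Command line that splits a PDF into one single-page PDF per page."""
    return ["pdfseparate", pdf_path, page_pattern(pdf_path)]


def page_file(pdf_path, page_no):
    """Name of the single-page file for a given 1-based page number."""
    return page_pattern(pdf_path) % page_no


def burst(pdf_path):
    """Run pdfseparate; raises CalledProcessError if the tool fails."""
    subprocess.run(burst_command(pdf_path), check=True)
```

After burst("datasheet.pdf"), page 12 would be served from the file named by page_file("datasheet.pdf", 12).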

Are you talking about PDFs with digital text or images with hidden text
("searchable PDFs" as they are sometimes called) as opposed to images only?

Any sort of PDF, including murky scans from Bitsavers.

If so, it may help to extract the text using e.g. pdftotext (e.g. page by
page), find the text you want, and instruct a PDF reader to open at that
page, if the reader supports it.
E.g. for the Sumatra PDF reader (IIRC, only available on Windows):
sumatrapdf -page <pageno> <file.pdf>

Depending on the reader, you could do more, see e.g.:
https://code.google.com/p/sumatrapdf/wiki/CommandLineArguments
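
The extract-and-search idea could look roughly like this. It is only a sketch: it assumes pdftotext (poppler/xpdf) is on the PATH and that it ends each page with a form feed, which is its normal behaviour unless page breaks are switched off. The function names are illustrative.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into per-page strings on form feeds."""
    pages = text.split("\f")
    # Each page ends with a form feed, so the final piece is normally empty.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def extract_pages(pdf_path):
    """Run pdftotext on the whole file; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True,
    )
    return split_pages(result.stdout.decode("utf-8", errors="replace"))


def find_pages(pages, term):
    """1-based numbers of the pages containing term (case-insensitive)."""
    needle = term.lower()
    return [n for n, page in enumerate(pages, start=1) if needle in page.lower()]


def sumatra_command(pdf_path, page_no):
    """Command line that opens Sumatra PDF at the given page."""
    return ["sumatrapdf", "-page", str(page_no), pdf_path]
```

For image-only scans pdftotext returns little or nothing, so find_pages() would simply come back empty until the file has been OCRed.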

That is highly non-portable, though. My thinking was to keep the documentation on a server, available to anybody using a specialist Lazarus app. The whole thing would be spoilt if, as well as loading the app (tentatively Borg-UI) and possibly lhelp and its associated CHMs, I required users to find a specific PDF reader and possibly integrate it with their browser.

If the PDFs are images only, the only feasible way I can see would be to
extract the images, OCR them and take it from there; e.g. rebuild the
PDFs, adding the resulting text as a hidden text layer, e.g. with the Linux
hocr2pdf tool provided by the exactimage package:
http://www.exactcode.com/site/open_source/exactimage/

Even if the original wasn't a bitmap from Bitsavers, though, electronics material with lots of tabular content (control registers and the like) is notoriously difficult to reformat. I /hope/ to be able to get the docs into a database (PostgreSQL handles binaries fairly well, which would simplify some of the replication/dissemination issues) even if they were saved as files local to each HTTP daemon.

FYI, I'm actually in the process of building a fairly simple document
scanning solution (for now on Linux) that:
- scans the document using sane, outputting tiff files
- performs optional cleanup, e.g. with the scantailor package
- recognizes text with tesseract, outputting layout/text info in hocr format
- combines the image and text into "searchable PDFs" using hocr2pdf
- adds metadata to the PDF
- adds the text to either a database or more probably some kind of full
text search package
- registers the scan into a database
- provides a viewer to search for documents, print them etc; in future
perhaps to be used for correcting OCR results
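
The scan, OCR and combine steps above can be sketched as a list of command lines. This is not the papertiger code, just an illustration under stated assumptions: scanimage, tesseract and hocr2pdf are installed, cleanup is skipped, and the hOCR file extension is a parameter because it differs between tesseract versions (older releases write .html).

```python
import subprocess


def build_steps(stem, lang="eng", hocr_ext=".hocr"):
    """Return (argv, stdin_file, stdout_file) tuples for one scanned page."""
    tiff = f"{stem}.tif"
    hocr = f"{stem}{hocr_ext}"  # assumption: depends on the tesseract version
    pdf = f"{stem}.pdf"
    return [
        # scanimage writes the image data to stdout
        (["scanimage", "--format=tiff"], None, tiff),
        # tesseract <image> <outputbase> -l <lang> hocr
        (["tesseract", tiff, stem, "-l", lang, "hocr"], None, None),
        # hocr2pdf reads the hOCR from stdin and the image from -i
        (["hocr2pdf", "-i", tiff, "-o", pdf], hocr, None),
    ]


def run_steps(steps):
    """Run each step in order, wiring up file redirection where requested."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle is not None:
                    handle.close()
```

Calling run_steps(build_steps("scan0001")) would leave scan0001.tif, the hOCR file and scan0001.pdf in the current directory; metadata, indexing and database registration are left out here.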

It hasn't got very far yet, but:
https://bitbucket.org/reiniero/papertiger/overview

For Dutch speaking readers: yes, the project name is intentional ;)

I'm afraid I don't speak Dutch, although my mother used to quote "Bûter, brea, en griene tsiis" out of context every few months. But I like the palindrome :-)

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus