Hi,

In general, PDF input support is very important and useful, so contributions are appreciated.

Full, high-quality PDF import depends on the kind of PDF.

Many PDFs are just collections of scanned page images. In those cases, the best thing to do is to extract the page images and hand them to OCRopus directly. If those images contain OCR text from Distiller, that, too, is potentially useful, and it would be good to extract it so that OCRopus can combine it with its own results.

Other PDFs are purely typeset and contain perfect textual information. In those cases, OCRopus layout analysis is still useful, but we should find a way of extracting the text directly and skipping the OCR step.

And then there are many other kinds of PDFs; for those, it is probably best to render them at 300 dpi and then hand the rendered images over to OCRopus like TIFF input.

I haven't had a look at the patch itself, but if it does any of those three things, that's the right way. If it does something different, maybe you could describe it?

Eventually, I hope we can support all three approaches, select automatically in the usual case, and provide manual overrides.

Tom

> http://code.google.com/p/ocropus/issues/detail?id=146 discusses the
> possibility of creating searchable image PDFs. I'm not sure how well that is
> going to work out, but basic PDF support seems like it might be useful.
>
> As a first cut (not intended to be applied at this stage), I've added support
> for reading PDF files to iulib and ocropus. It relies on the poppler library.
> I recognise it isn't complete and that RGB doesn't work yet, but it is enough
> to get ocropus book2pages to run on a multiple page PDF file.
>
> I've based the support on the TIFF implementation - much of the patch is
> fairly mechanical (well, it was once I understood what was going on, anyway).
>
> From the comment in issue 146, I'm assuming that this is the sort of thing
> you'd like to see added to ocropus. However, is this the sort of
> implementation you'd expected?
>
> Also, the tests for this are fairly messy, because I can't write a PDF file.
> Instead, I rely on an existing PDF file and a set of "known answer" PNG
> files.
> See test-io_pdf.cc (attached) for the test as it currently exists. I've tried
> to choose a fairly small PDF file, which gives:
> $ ls -go orientation*
> -rw-r--r-- 1 12033 2009-07-12 21:41 orientation-0.png
> -rw-r--r-- 1 13078 2009-07-13 08:30 orientation-1.png
> -rw-r--r-- 1 14030 2009-07-13 08:26 orientation-2.png
> -rw-r--r-- 1 13403 2009-07-13 08:30 orientation-3.png
> -rw-r--r-- 1 14675 2009-07-11 22:08 orientation.pdf
>
> That 67K of test files just covers the simple gray case. I'll try to make the
> other examples just one page, but the png images will still take a bit of
> space. Is that OK for iulib?
> As an alternative, we could just check that the results were the right size.
> There are a reasonable number of things that could go wrong in such a case
> though (for example, that probably won't pick up endianness problems in the
> image).
>
> Thoughts? Comments?
>
> Attachments:
> ocropus-pdf-read-2009-07-13.patch (4K)
> iulib-pdf-read-2009-07-13.patch (14K)
> test-io_pdf.cc (2K)
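
For what it's worth, the three import approaches from the reply (extract page images, extract text, render at 300 dpi) could be selected automatically along roughly the following lines. This is only a sketch: `PdfSummary`, `ImportMode`, `ImportPlan`, `choose_plan` and `points_to_pixels` are hypothetical names, not part of iulib, OCRopus, or poppler, and how the summary flags would be filled in from a real PDF is left open. The only hard fact used is that PDF page sizes are measured in points of 1/72 inch.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-document summary; a real importer would have to
// derive these flags from the PDF library it uses.
struct PdfSummary {
    bool scanned_page_images;  // every page is just a scanned page image
    bool purely_typeset;       // pages are typeset, no scanned content
    bool has_embedded_text;    // the PDF carries extractable text
};

enum class ImportMode { ExtractImages, ExtractText, Render };

struct ImportPlan {
    ImportMode mode;
    bool keep_embedded_text;  // pass existing text along to OCRopus
    bool run_ocr;             // whether the OCR step is still needed
};

// Pick one of the three approaches described in the reply.
ImportPlan choose_plan(const PdfSummary &s) {
    if (s.scanned_page_images) {
        // Hand the page images to OCRopus directly; keep any existing
        // OCR text so it can be combined with OCRopus's own results.
        return {ImportMode::ExtractImages, s.has_embedded_text, true};
    }
    if (s.purely_typeset && s.has_embedded_text) {
        // Text is already perfect: layout analysis only, skip OCR.
        return {ImportMode::ExtractText, true, false};
    }
    // Everything else: render and treat the result like TIFF input.
    return {ImportMode::Render, false, true};
}

// Convert a page dimension in PDF points (1/72 inch) to pixels.
int points_to_pixels(double points, int dpi = 300) {
    assert(dpi > 0);
    return static_cast<int>(std::lround(points * dpi / 72.0));
}
```

With these definitions, a US Letter page (612 x 792 points) rendered at 300 dpi comes out at 2550 x 3300 pixels. A manual override could be layered on top by letting the caller pass a fixed `ImportMode` instead of calling `choose_plan`.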

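
On the question of checking only the result size versus comparing against "known answer" images, the following small sketch illustrates why a size-only check would probably miss endianness problems. It assumes pixels packed into 32-bit words; `PackedImage` and the helper functions are made up for illustration and are not the iulib image types or the code in test-io_pdf.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical image type: one packed 32-bit value per pixel.
struct PackedImage {
    int width;
    int height;
    std::vector<std::uint32_t> pixels;
};

// The cheap check: only the dimensions are compared.
bool same_size(const PackedImage &a, const PackedImage &b) {
    return a.width == b.width && a.height == b.height;
}

// The "known answer" check: dimensions and every pixel must match.
bool same_pixels(const PackedImage &a, const PackedImage &b) {
    return same_size(a, b) && a.pixels == b.pixels;
}

// Reverse the byte order of one 32-bit value.
std::uint32_t byteswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Simulate a reader that gets the byte order of each pixel wrong.
PackedImage with_swapped_bytes(PackedImage img) {
    assert(img.pixels.size() ==
           static_cast<std::size_t>(img.width) * img.height);
    for (std::uint32_t &p : img.pixels) p = byteswap32(p);
    return img;
}

// A tiny 2x1 reference image for the comparison.
PackedImage sample_image() {
    return PackedImage{2, 1, {0x00112233u, 0x00AABBCCu}};
}
```

Here `same_size(sample_image(), with_swapped_bytes(sample_image()))` is true while `same_pixels` on the same pair is false, so only the pixel comparison notices the byte-order bug. That is the trade-off against the extra space the reference PNG files take up.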