On Fri, Jul 17, 2009 at 11:44, Brad Hards <[email protected]> wrote:
>
> On Friday 17 July 2009 17:26:56 tmbdev wrote:
>> Full, high quality PDF import actually depends on the kind of PDF.
> Indeed.
>
> [Note: re-ordered paragraphs follow]
>> And then there are many other different kinds of PDFs; for those, it
>> is probably best to render them at 300dpi and then hand the rendered
>> images over to OCRopus like TIFF input.
> This is basically what I'm doing (although I chose 150dpi - it's a parameter
> though).

Generally, OCR works best at 300dpi and above (OCRopus as well as
others).  Above 600dpi, there probably isn't much of a difference.

>> Many PDFs are just collections of scanned page images.  In those
>> cases, the best thing to do is to extract the page images and hand
>> them to OCRopus directly.  If those images contain OCR text from
>> Distiller, that, too, is potentially useful and it would be good to
>> extract that so that OCRopus can combine it with its own results.
> I don't do that yet. I render, rather than extracting images. It might not be
> hard to detect this though. We'd also need to decide how to match up page
> images and text.

Match up in what sense?

>> Other PDFs are purely typeset and contain perfect textual
>> information.  In those cases, OCRopus layout analysis is still useful,
>> but we should find a way of extracting the text itself and skipping
>> the OCR step.
> This isn't necessarily too hard either. Presumably we want to get the text,
> and the bounding box / position for each character?

Yes, it's not hard to get the text, but something needs to be done
with it afterwards.

The new book level representation makes this possible now.  What needs
to happen is roughly the following:

-- the PDF import extracts characters and bounding boxes and puts that
into book/0000.chars etc.

-- the PDF import also generates binary page images book/0000.png

-- the layout analysis is run on the binary page images and generates
book/0000.pseg.png

-- a new command line tool combines book/0000.chars and
book/0000.pseg.png and generates book/0000/010001.txt and (maybe)
book/0000/010001.cseg.png; this replaces the usual OCR step.  It
involves iterating through the line bounding boxes in
book/0000.pseg.png, finding all the characters from the
book/0000.chars file that are inside each line bounding box, and then
sorting them left-to-right (some scripts may require more complicated
solutions, but that's good enough for starters).

After that, OCRopus tools for building HTML and PDF from the data can
take over again.

This would be great because it would give us conversions from untagged
PDF to tagged PDF, and from untagged PDF to linear HTML (there is a
nice tool called pdftohtml, but it reproduces the original PDF
layout).

Tom

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"ocropus" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/ocropus?hl=en
-~----------~----~----~----~------~----~------~--~---