Jussi: sorry for the double post to your address; somehow Gmail managed to merge your address with the Launchpad one, again.

Hi again,

On Sat, Sep 6, 2008 at 4:09 PM, René Rebe <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Tue, Sep 2, 2008 at 1:22 PM, René Rebe <[EMAIL PROTECTED]> wrote:
>> Hi all,
>>
>> I plan to add PDF writing to create searchable PDFs from cuneiform on Linux.
>>
>> So far I "only" did some light scrolling and grep'ing back and forth over
>> the code, and as I have not yet fully memorized its structure I wanted to
>> ask the ones already more familiar with the code before I start about the
>> best place to add such code.
>>
>> So far I identified: cuneiform_src/Kern/rout/src/
>>
>> In which I would start by making a copy of html.cpp to add the corresponding
>> PDF tag writeouts, using ExactImage
>> (http://www.exactcode.de/site/open_source/exactimage/)
>> for the actual PDF structure generation. (ExactImage SVN:HEAD only includes
>> very static pure image writing, but I already rewrote that part and have
>> vector, font, image and multi-page writing in my local working copy
>> already.)
>>
>> Any hints welcome,
>
> Ok, a debugger was not too helpful with all the pointers to handles, sigh.
>
> Anyway, I found how to get the layout bounding boxes while within the
> HTML writer:
>
> {
>   char buf[256] = "";
>   edRect r = CED_GetCharLayout(hObject);
>   if (r.left != -1) {
>     sprintf (buf, "<span title=\"bbox %d %d %d %d\">",
>              r.left, r.top, r.right, r.bottom);
>     PUT_STRING(buf);
>   }
> }
> ...
>
> One issue I found during testing is that the engine does not appear
> to generate line breaks deterministically. For my 2 column test text
> another issue arises: sometimes (not always) hyphens are skipped from
> the output when a word break is recognized. Of course for writing a
> useful PDF some form of "soft hyphen" needs to be generated,
> especially at line breaks, in order to format the text at the correct
> location.
>
> One workaround that immediately comes to mind is to use some form of
> post-processing, where the x position is tracked and missing line
> breaks are inserted where the content flow wraps around, depending on
> the writing direction. Probably soft hyphens can be inserted when a
> line break is "auto detected" in the middle of a word.
>
> I'll post more complete patches when I get somewhere there.
>
> Are there any preferences where to add this hOCR related HTML
> annotation? Conditionalized into the existing html writer, or as a
> second copy of it adding those bounding boxes and possibly some post
> processing for the issues mentioned above?
>
> Have a nice weekend,
> René

While adding paragraph bounding boxes I had to notice that currently all
paragraphs are apparently created with the layout (bounding box)
coordinates set to -1.

With CEDSection::CreateParagraph instrumented to log the layout I
only see calls like this:

CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1

These -1 values stay until they reach my modified HTML writer,
writing out the BROWSE_PARAGRAPH_START.

I took a look at the paragraph creation, and all places appear to not
initialize the paragraph with real values :-(

I guess I'll work with bounding boxes for each character glyph for now
and construct the line and paragraph boxes outside of cuneiform for
the beginning :-(

_______________________________________________
Mailing list: https://launchpad.net/~cuneiform
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~cuneiform
More help   : https://help.launchpad.net/ListHelp
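
For illustration, here is a rough sketch of how the post-processing outside of cuneiform might look: lines are rebuilt from the per-character boxes by watching for the x position to jump back, each line box is the union of its glyph boxes, and a soft hyphen is flagged when the wrap falls between two letters. Everything here (Box, Glyph, Line, unite, build_lines) is a hypothetical stand-in and not part of the cuneiform sources; only the left/top/right/bottom fields and the -1 "no layout" convention are taken from the observations above, and left-to-right writing is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-ins, not cuneiform types. left == -1 means "no layout".
struct Box { int left, top, right, bottom; };
const Box kNoBox = { -1, -1, -1, -1 };

struct Glyph { char ch; Box box; };  // one recognized character

struct Line {
    std::string text;
    Box box;           // union of all glyph boxes on this line
    bool soft_hyphen;  // true if the line ended in the middle of a word
};

// Smallest box covering both a and b; an unset box is ignored.
static Box unite(const Box& a, const Box& b) {
    if (a.left == -1) return b;
    if (b.left == -1) return a;
    return { std::min(a.left, b.left), std::min(a.top, b.top),
             std::max(a.right, b.right), std::max(a.bottom, b.bottom) };
}

// Split a left-to-right glyph stream into lines. A new line starts whenever
// a glyph begins to the left of the previous boxed glyph (the flow wrapped).
std::vector<Line> build_lines(const std::vector<Glyph>& glyphs) {
    std::vector<Line> lines;
    int prev_left = -1;  // left edge of the last glyph that had a box
    for (const Glyph& g : glyphs) {
        const bool has_box = g.box.left != -1;
        const bool wrapped = has_box && prev_left != -1 && g.box.left < prev_left;
        if (lines.empty() || wrapped) {
            if (wrapped) {
                // Letter directly followed by a letter across the wrap:
                // assume the hyphen was dropped and mark a soft hyphen.
                const std::string& t = lines.back().text;
                lines.back().soft_hyphen = !t.empty()
                    && std::isalpha(static_cast<unsigned char>(t.back()))
                    && std::isalpha(static_cast<unsigned char>(g.ch));
            }
            lines.push_back(Line{ std::string(), kNoBox, false });
        }
        assert(!lines.empty());  // a current line always exists at this point
        lines.back().text += g.ch;
        lines.back().box = unite(lines.back().box, g.box);
        if (has_box) prev_left = g.box.left;
    }
    return lines;
}
```

Paragraph boxes could be built the same way by uniting the line boxes. Note that tracking x alone will not see the jump from the bottom of one column to the top of the next, so for the 2 column case a vertical check would probably be needed as well.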

