Hi Tony,

First of all, thanks for investigating this subject. Please attach your patch to PDFBOX-508 if possible so that we can compare Dmitry's solution and yours. Perhaps a combination of both will solve all parts of that issue.

Thanks in advance,
Andreas Lehmkühler

Tony Scerri wrote:
> Following on from this, I now have the character spacing and word spacing
> being done in image writing, and the output looks almost identical to the PDF
> viewed in Adobe Reader (with respect to text rendering, including layout). It
> was a bit of a desperate approach, but it shows the results can be achieved. It
> appears to be a similar fix to the one suggested in Jira PDFBOX-508, but I only
> needed to modify the PDFStreamEngine.java class. I changed the
> processEncodedText method to simply process the text position of each
> character found in the stream.
>
> The only undesirable consequence would be performance, as this will
> trigger one callback to processTextPosition for each character rather than
> one for a sequence, but given that this appears to be the only reliable way to
> establish where each character should be placed, I'm not sure what the
> alternative would be.
>
> Like I said, I didn't modify anything else to get this going, and text
> extraction wasn't affected when sorting by position for horizontal text. For
> diagonal text going up from bottom left to top right things changed, but the
> original wasn't perfect and it came from text pieces in an embedded image
> (EPS). What I got out after the change was the text being read from bottom to
> top, going left to right, so a vertical read, and the characters came out in
> the right order by position in that orientation, so that would be a different
> problem to solve.
>
> On Mon, Sep 7, 2009 at 4:40 PM, Tony Scerri <tony.sce...@gmail.com> wrote:
>
>> Not sure if this is a possible cause for issues others have reported. I
>> found that when creating images from PDFs I was getting a lot of jumbled
>> text, bits overlapping others etc., and generally it looked wrong. It turns
>> out, after much digging and tinkering, that the FontManager was returning the
>> wrong font even for standard fonts available in most environments.
>>
>> The fix I put in was inside the iteration over the available AWT fonts
>> inside the loadFonts method of FontManager. As the last line of the for loop
>> I added:
>>
>> envFonts.put(normalizeFontname(font.getPSName()),font);
>>
>> This puts in the PostScript name, which is quite often used inside PDFs
>> from what I have been seeing lately in my work. This now has a much better
>> chance of looking up the correct font. I no longer get overlapped words etc.
>> because the font's metrics are a much better match for what was expected.
>>
>> I think this problem may be more prevalent in PDFs where the text has been
>> fully justified. I have run into a subsequent issue that I am still plodding
>> my way through: I'm now left with large gaps in lines in the middle of words
>> because PDFBox isn't rendering the word spacing correctly (it might also be
>> character spacing). This is all down to the use of AWT rendering of fonts,
>> which as far as I can tell won't allow for the kind of control required when
>> rendering a whole string. The alternative seems to be to render each
>> character one by one with the appropriate displacement between each glyph.
>>
>> Tony
>>
>>
>> On Wed, Sep 2, 2009 at 6:47 AM, Andreas Lehmkühler (JIRA) <j...@apache.org> wrote:
>>>
>>> [ https://issues.apache.org/jira/browse/PDFBOX-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Andreas Lehmkühler resolved PDFBOX-302.
>>> ---------------------------------------
>>>
>>> Resolution: Fixed
>>> Fix Version/s: 0.8.0-incubator
>>>
>>> AFAIK there aren't any issues with this improvement, so I'll set this
>>> to resolved.
>>>
>>> For now there aren't any mappings missing. If we find some later, it'll be
>>> no problem to add them.
>>>
>>>> Improve font handling (was: layout print problem)
>>>> -------------------------------------------------
>>>>
>>>> Key: PDFBOX-302
>>>> URL: https://issues.apache.org/jira/browse/PDFBOX-302
>>>> Project: PDFBox
>>>> Issue Type: Improvement
>>>> Components: PDFReader
>>>> Reporter: Jukka Zitting
>>>> Assignee: Andreas Lehmkühler
>>>> Priority: Minor
>>>> Fix For: 0.8.0-incubator
>>>>
>>>>
>>>> [imported from SourceForge]
>>>> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1787501
>>>> Originally submitted by gjniewenhuijse on 2007-09-04 00:24.
>>>> When I print the attached file, some things are not printed well:
>>>> - The gray box at the top
>>>> - and the fonts are printed bold, and that's not right.
>>>> Is there any solution for now, or for later?
>>>> When I open and print this file with Adobe Reader, everything is fine,
>>>> but with PDFBox I've got a layout problem.
>>>> I used the newest PDFBox version (also tested the nightly build).
>>>> [attachment on SourceForge]
>>>> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1787501&file_id=244104
>>>> orarrp.pdf (application/pdf), 7871 bytes
>>>> pdf with print problem
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
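
For anyone trying the per-character placement described in this thread, the displacement between glyphs can be sketched roughly as follows. This is not the PDFBOX-508 patch and not PDFBox code: the class GlyphOffsets and its parameter names are made up for illustration, and the sketch assumes horizontal text, single-byte character codes, and no TJ position adjustments. The advance per glyph follows the PDF reference formula for horizontal writing, (w0 * Tfs + Tc + Tw) * Th, where word spacing is applied only to character code 32.

```java
/**
 * Illustrative sketch only (hypothetical class, not part of PDFBox).
 * Computes where each glyph of a horizontal text run starts, so that
 * glyphs can be drawn one by one instead of as a whole string.
 */
final class GlyphOffsets {

    /**
     * @param codes             single-byte character codes of the run
     * @param widths            glyph widths in glyph space (1/1000 of a text unit)
     * @param fontSize          Tfs, the font size
     * @param charSpacing       Tc, added after every glyph
     * @param wordSpacing       Tw, added only after the space character (code 32)
     * @param horizontalScaling Tz in percent (100 means unscaled)
     * @return x offset of each glyph relative to the start of the run
     */
    static double[] offsets(int[] codes, double[] widths, double fontSize,
                            double charSpacing, double wordSpacing,
                            double horizontalScaling) {
        if (codes.length != widths.length) {
            throw new IllegalArgumentException("codes and widths must have the same length");
        }
        double th = horizontalScaling / 100.0;
        double[] result = new double[codes.length];
        double x = 0.0;
        for (int i = 0; i < codes.length; i++) {
            result[i] = x;
            // Advance for this glyph: scaled width plus character spacing,
            // plus word spacing if the glyph is a space.
            double advance = widths[i] / 1000.0 * fontSize + charSpacing;
            if (codes[i] == 32) {
                advance += wordSpacing;
            }
            x += advance * th;
        }
        return result;
    }
}
```

Each returned offset would then be handed to whatever draws a single glyph (for example one drawString call per character), which is where the extra per-character callback cost mentioned above comes from.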