[ 
https://issues.apache.org/jira/browse/PDFBOX-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977771#comment-14977771
 ] 

John Hewson commented on PDFBOX-3056:
-------------------------------------

Yes, sorry, I meant showGlyph (onGlyph is an API from a work project).

{quote}How do PageDrawer and PDFStreamEngine relate?{quote}

PDFStreamEngine handles the generic logic of parsing a PDF content stream and 
maintaining the graphics and text states. It has no functionality of its own 
beyond this; to add behaviour, one must subclass it and override its methods. 
PageDrawer is one such subclass: it draws a page to an AWT Graphics2D.
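
To illustrate the relationship, here is a toy model of the pattern. It is not 
PDFBox code: ToyStreamEngine, ToyRecorder and their methods are hypothetical 
stand-ins, and the "content stream" is just a string. The base class owns the 
iteration and the state, while the subclass only overrides a hook.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stream engine: it walks the input and tracks
// state, but does nothing visible with the glyphs it encounters.
abstract class ToyStreamEngine {
    protected double x; // toy "text state": current horizontal position

    // Generic logic: iterate over the glyphs, call the hook, update state.
    final void process(String text, double advance) {
        for (char c : text.toCharArray()) {
            showGlyph(c, x);
            x += advance;
        }
    }

    // Hook with no behaviour of its own; subclasses add the functionality.
    protected void showGlyph(char glyph, double xPosition) {
    }
}

// One possible subclass: it records each glyph where a renderer would draw it.
class ToyRecorder extends ToyStreamEngine {
    final List<String> drawn = new ArrayList<>();

    @Override
    protected void showGlyph(char glyph, double xPosition) {
        drawn.add(glyph + "@" + xPosition);
    }
}

class ToyEngineDemo {
    public static void main(String[] args) {
        ToyRecorder recorder = new ToyRecorder();
        recorder.process("Hi", 5.0);
        System.out.println(recorder.drawn);
    }
}
```

A text extractor would be another subclass of the same base, overriding the 
same hook but collecting text instead of drawing.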

{quote}
The part that I most frequently see issues with is pdfbox deciding whether or 
not to put a space or a new line, which is where I was going to focus my 
efforts. I really liked the TextPosition that's used in PDFTextStreamEngine 
because it gives me things to work with that I immediately understand without 
knowing the pdf spec at all. I was hoping I could just work with that assuming 
that the TextPosition was typically created correctly.
{quote}

As you've probably figured out, a TextPosition is not created correctly. If you 
encapsulate the parameters of the showGlyph method in a new class, then you'll 
have all you need, i.e. something which represents a glyph on the page. You 
could keep the TextPosition API to do this if you want, but it's not doing 
anything that's fundamentally necessary.
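
A minimal sketch of such a class follows. It assumes that showGlyph receives 
roughly a text rendering matrix, a font, a character code, a Unicode string and 
a displacement vector; check the actual signature in your PDFBox version. The 
class name GlyphOnPage is hypothetical, and java.awt.geom.AffineTransform and a 
plain font name stand in for the PDFBox matrix and font types so that the 
example compiles on its own.

```java
import java.awt.geom.AffineTransform;

// Hypothetical value class: one glyph as handed to a showGlyph-style callback.
final class GlyphOnPage {
    private final AffineTransform textRenderingMatrix; // stand-in for the matrix argument
    private final String fontName;                     // stand-in for the font argument
    private final int code;                            // character code in the font
    private final String unicode;                      // may be null if there is no mapping
    private final double displacementX;                // glyph advance, x component
    private final double displacementY;                // glyph advance, y component

    GlyphOnPage(AffineTransform textRenderingMatrix, String fontName, int code,
                String unicode, double displacementX, double displacementY) {
        // Defensive copy, because AffineTransform is mutable.
        this.textRenderingMatrix = new AffineTransform(textRenderingMatrix);
        this.fontName = fontName;
        this.code = code;
        this.unicode = unicode;
        this.displacementX = displacementX;
        this.displacementY = displacementY;
    }

    AffineTransform getTextRenderingMatrix() {
        return new AffineTransform(textRenderingMatrix);
    }

    String getFontName() { return fontName; }
    int getCode() { return code; }
    String getUnicode() { return unicode; }
    double getDisplacementX() { return displacementX; }
    double getDisplacementY() { return displacementY; }

    // Where the glyph origin (0,0) lands after applying the matrix.
    double getX() { return textRenderingMatrix.getTranslateX(); }
    double getY() { return textRenderingMatrix.getTranslateY(); }
}
```

A showGlyph override could then build one GlyphOnPage per call and hand the 
list to whatever does the space and line-break detection.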

{quote}
Is PDFTextStreamEngine fixable? It sounds like the main issue with fixing it is 
that PDFTextStripper also would need to be updated?
{quote}

Yes, absolutely. PDFStreamEngine already performs all the necessary 
calculations correctly. Fixing PDFTextStreamEngine is trivial - just delete the 
overridden calculations - but it will cause regressions in PDFTextStripper. 
Those might not be too much work to fix, but I haven't tried.

{quote}
If someone can make an alternate version of PDFTextStreamEngine that works as 
it should then I might be able to make PDFTextStripper use it. 
{quote}

Just delete everything in the showGlyph override before the comment "// use our 
additional glyph list for Unicode mapping". You'll need to figure out a new set 
of parameters for TextPosition as most of the current arguments are erroneous.

{quote}
I'm afraid I don't quite know what to do with a text rendering matrix and if I 
have to treat type 3 fonts differently, etc.
{quote}

The text rendering matrix (TRM) is a composite transform of the text matrix 
(TM), the current transformation matrix (CTM) and the text position, so 
conceptually a glyph is drawn at (0,0) in a space defined by the TRM. The PDF 
spec provides more details. Type 3 fonts should be handled the same as the 
others, as PDFont abstracts over the differences for you.
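
As a worked illustration, the PDF spec defines the text rendering matrix as the 
product of a text-state parameter matrix (font size, horizontal scaling, rise), 
the text matrix and the CTM. The sketch below reproduces that product with 
java.awt.geom.AffineTransform rather than the PDFBox matrix class; the class 
and method names are hypothetical and the numbers are made up.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class TextRenderingMatrixSketch {

    // Builds Trm = [fontSize*hScale 0 0 fontSize 0 rise] x Tm x CTM.
    // PDF multiplies row vectors from the left, so the parameter matrix is
    // applied to a point first and the CTM last.
    static AffineTransform textRenderingMatrix(double fontSize, double horizontalScaling,
                                               double rise, AffineTransform textMatrix,
                                               AffineTransform ctm) {
        AffineTransform params =
                new AffineTransform(fontSize * horizontalScaling, 0.0, 0.0, fontSize, 0.0, rise);
        AffineTransform trm = new AffineTransform(ctm);
        trm.concatenate(textMatrix); // applied to a point before the CTM
        trm.concatenate(params);     // applied to a point first of all
        return trm;
    }

    // A glyph is conceptually drawn at (0,0) in the space defined by the TRM.
    static Point2D glyphOrigin(AffineTransform trm) {
        return trm.transform(new Point2D.Double(0, 0), null);
    }

    public static void main(String[] args) {
        AffineTransform tm = AffineTransform.getTranslateInstance(100, 700); // text matrix
        AffineTransform ctm = AffineTransform.getScaleInstance(2, 2);        // page scaled 2x
        AffineTransform trm = textRenderingMatrix(12, 1.0, 0, tm, ctm);
        System.out.println(glyphOrigin(trm));
    }
}
```

With these values the glyph origin lands at (200, 1400): the text matrix moves 
it to (100, 700) and the CTM then doubles both coordinates.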

> Make PDFTextStreamEngine public
> -------------------------------
>
>                 Key: PDFBOX-3056
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3056
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>
> I'd like to experiment with writing my own text extractor that works better 
> for my use case than PDFTextStripper does. Hopefully I can port some of the 
> improvements to PDFTextStripper if I get something working well. I'd really 
> like PDFTextStreamEngine and its constructor to have increased visibility as 
> I believe that it's pretty reasonable for people to be able to write their own 
> PDFTextStrippers without having to reimplement showGlyph.



