[
https://issues.apache.org/jira/browse/PDFBOX-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977771#comment-14977771
]
John Hewson commented on PDFBOX-3056:
-------------------------------------
Yes, sorry I meant showGlyph (onGlyph is an API from a work project).
{quote}How do PageDrawer and PDFStreamEngine relate?{quote}
PDFStreamEngine handles the generic logic of parsing a PDF content stream and
maintaining the graphics and text states. It has no functionality of its own
beyond this; you must subclass it and override its methods to add any.
PageDrawer is one such subclass: it draws a page to an AWT Graphics2D.
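To illustrate that relationship, here is a toy sketch of the pattern. It does not use PDFBox at all: ToyStreamEngine and ToyTextCollector are hypothetical names, and the "content stream" is just a string. The base class only walks the input and keeps state; the subclass decides what a glyph means.

```java
/** Toy stand-in for a stream engine: walks the input and tracks state, nothing more. */
abstract class ToyStreamEngine {
    private int glyphIndex; // state maintained by the base class

    /** Walks the input and reports each character through the showGlyph hook. */
    final void process(String text) {
        glyphIndex = 0;
        for (int i = 0; i < text.length(); i++) {
            showGlyph(glyphIndex++, text.charAt(i));
        }
    }

    /** Hook with no behaviour of its own; subclasses override it to add functionality. */
    protected void showGlyph(int index, char glyph) {
    }
}

/** One possible subclass: collects text. Another subclass could draw instead. */
final class ToyTextCollector extends ToyStreamEngine {
    final StringBuilder text = new StringBuilder();

    @Override
    protected void showGlyph(int index, char glyph) {
        text.append(glyph);
    }
}
```

A drawing subclass would override the same hook and paint instead of appending, which is roughly the role PageDrawer plays.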
{quote}
The part that I most frequently see issues with is pdfbox deciding whether or
not to put a space or a new line, which is where I was going to focus my
efforts. I really liked the TextPosition that's used in PDFTextStreamEngine
because it gives me things to work with that I immediately understand without
knowing the pdf spec at all. I was hoping I could just work with that assuming
that the TextPosition was typically created correctly.
{quote}
As you've probably figured out, a TextPosition is not created correctly. If you
encapsulate the parameters of the showGlyph method into a new class then you'll
have all you need, i.e. something which represents a glyph on the page. You
could keep the TextPosition API to do this if you want, but it's not doing
anything that's fundamentally necessary.
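A minimal sketch of such a class follows. The field list assumes showGlyph receives a text rendering matrix, a font, a character code, a Unicode string and a displacement vector; check that against the actual method signature. GlyphOnPage is a hypothetical name, and the JDK's AffineTransform, String and Point2D stand in for the PDFBox matrix, font and vector types so the sketch compiles on its own.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical value class: one glyph as reported to a showGlyph-style callback. */
final class GlyphOnPage {
    final AffineTransform textRenderingMatrix; // stand-in for the PDFBox matrix type
    final String fontName;                     // stand-in for the font object
    final int code;                            // character code in the font's encoding
    final String unicode;                      // mapped text; may be null if unmapped
    final Point2D displacement;                // stand-in for the displacement vector

    GlyphOnPage(AffineTransform textRenderingMatrix, String fontName, int code,
                String unicode, Point2D displacement) {
        // Defensive copies, because AffineTransform and Point2D are mutable.
        this.textRenderingMatrix = new AffineTransform(textRenderingMatrix);
        this.fontName = fontName;
        this.code = code;
        this.unicode = unicode;
        this.displacement = new Point2D.Double(displacement.getX(), displacement.getY());
    }

    @Override
    public String toString() {
        return "GlyphOnPage[code=" + code + ", unicode=" + unicode + ", font=" + fontName + "]";
    }
}
```

A showGlyph override would then construct one of these per call and hand it to whatever does the space and line-break analysis.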
{quote}
Is PDFTextStreamEngine fixable? It sounds like the main issue with fixing it is
that PDFTextStripper also would need to be updated?
{quote}
Yes, absolutely. PDFStreamEngine already performs all the necessary
calculations correctly. Fixing PDFTextStreamEngine is trivial - just delete the
overridden calculations - but it will cause regressions in PDFTextStripper.
That might not be too much work to fix, but I haven't tried.
{quote}
If someone can make an alternate version of PDFTextStreamEngine that works as
it should then I might be able to make PDFTextStripper use it.
{quote}
Just delete everything in the showGlyph override before the comment "// use our
additional glyph list for Unicode mapping". You'll need to figure out a new set
of parameters for TextPosition as most of the current arguments are erroneous.
{quote}
I'm afraid I don't quite know what to do with a text rendering matrix and if I
have to treat type 3 fonts differently, etc.
{quote}
The text rendering matrix (TRM) is a composite transform of the text matrix
(TM), the current transformation matrix (CTM) and the text position, so
conceptually a glyph is drawn at (0,0) in a space defined by the TRM. The PDF
spec provides more details. Type 3 fonts should be handled the same as the
others, as PDFont is abstracting over things for you.
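As a rough illustration of that composition, the sketch below builds a TRM with the JDK's AffineTransform instead of any PDFBox type. It follows the formula in the PDF spec as I read it (text-state parameters, then the text matrix, then the CTM); the class and method names are hypothetical, and PDFStreamEngine already does this work for you.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper showing how a text rendering matrix is composed. */
final class TextRenderingMatrixSketch {

    /** Applies the text-state parameters first, then the text matrix, then the CTM. */
    static AffineTransform compose(double fontSize, double horizontalScaling, double rise,
                                   AffineTransform textMatrix, AffineTransform ctm) {
        // PDF matrix [a b c d e f] maps to AffineTransform(m00, m10, m01, m11, m02, m12).
        AffineTransform params =
                new AffineTransform(fontSize * horizontalScaling, 0, 0, fontSize, 0, rise);
        AffineTransform trm = new AffineTransform(ctm);
        trm.concatenate(textMatrix); // text matrix is applied before the CTM
        trm.concatenate(params);     // parameters are applied before the text matrix
        return trm;
    }

    /** The glyph origin is (0,0) mapped through the TRM. */
    static Point2D glyphOrigin(AffineTransform trm) {
        return trm.transform(new Point2D.Double(0, 0), null);
    }
}
```

With the origin of each glyph in a common coordinate space, deciding on spaces and line breaks becomes a matter of comparing positions rather than interpreting PDF operators.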
> Make PDFTextStreamEngine public
> -------------------------------
>
> Key: PDFBOX-3056
> URL: https://issues.apache.org/jira/browse/PDFBOX-3056
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Ben McCann
> Fix For: 2.0.0
>
>
> I'd like to experiment with writing my own text extractor that works better
> for my use case than PDFTextStripper does. Hopefully I can port some of the
> improvements to PDFTextStripper if I get something working well. I'd really
> like PDFTextStreamEngine and its constructor to have increased visibility as
> I believe that it's pretty reasonable for people to be able to write their own
> PDFTextStrippers without having to reimplement showGlyph.