[ 
https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194697#comment-13194697
 ] 

Timo Boehme commented on PDFBOX-1213:
-------------------------------------

In my opinion the proposed changes to PDFTextStripper are too closely tied to 
this one use case. I think we need a more general solution here, because 
additional parameters can sometimes be extracted from the font definitions.

I would propose a fontChanged notification, perhaps implemented with the 
listener pattern, because when no listeners are registered we can skip the 
cost of extracting font information:

interface FontChangedListener {
    public void fontChanged( FontInformation _fInfo );
}

interface FontInformation {
    boolean isBold();
    boolean isItalic();
    boolean isRoman();
    boolean isSansSerif();
    String getFontName();
    float getFontSizePt();
}

class PDFTextStripper {
...
   protected List<FontChangedListener> fontListeners = new LinkedList<FontChangedListener>();
...
   public void registerFontListener( FontChangedListener listener ) {
      fontListeners.add( listener );
   }

   protected void writePage() throws IOException {
      ...
      if ( ! fontListeners.isEmpty() ) {
         // test for font changes and notify listeners
      }
      ...
   }
}
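
To make the "test for font changes and notify listeners" step concrete, here is 
a minimal self-contained sketch. The names FontInfo, FontChangeListener, 
FontChangeNotifier and onText are hypothetical and are not part of PDFBox; the 
sketch only assumes that the stripper can supply the font name, size and style 
flags for each piece of text it writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical immutable snapshot of the active font (not a PDFBox class). */
final class FontInfo {
    final String name;
    final float sizePt;
    final boolean bold;
    final boolean italic;

    FontInfo(String name, float sizePt, boolean bold, boolean italic) {
        this.name = name;
        this.sizePt = sizePt;
        this.bold = bold;
        this.italic = italic;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FontInfo)) {
            return false;
        }
        FontInfo other = (FontInfo) o;
        return Objects.equals(name, other.name)
                && Float.compare(sizePt, other.sizePt) == 0
                && bold == other.bold
                && italic == other.italic;
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, sizePt, bold, italic);
    }
}

/** Hypothetical listener, mirroring the fontChanged idea above. */
interface FontChangeListener {
    void fontChanged(FontInfo info);
}

/** Hypothetical helper that detects font changes and notifies listeners. */
final class FontChangeNotifier {
    private final List<FontChangeListener> listeners = new ArrayList<>();
    private FontInfo current; // null until the first text has been seen

    void addListener(FontChangeListener listener) {
        listeners.add(listener);
    }

    /** Assumed to be called once for each piece of text being written. */
    void onText(String name, float sizePt, boolean bold, boolean italic) {
        if (listeners.isEmpty()) {
            return; // nobody is interested, so skip the comparison entirely
        }
        FontInfo next = new FontInfo(name, sizePt, bold, italic);
        if (next.equals(current)) {
            return; // same font as before, nothing to report
        }
        current = next;
        for (FontChangeListener listener : listeners) {
            listener.fontChanged(next);
        }
    }
}
```

With this shape, a converter such as PDFText2HTML would register one listener 
and receive a callback only when the font actually changes, not for every 
character.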

In PDFText2HTML you have to keep track of whether a span was opened with font 
style information and close it before closing other tags.
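
The span bookkeeping could look roughly like the sketch below. SpanTracker and 
its methods are hypothetical names for illustration, not the existing 
PDFText2HTML code, and the style string is assumed to be safe to place in an 
HTML attribute.

```java
/** Hypothetical helper that remembers whether a style span is open. */
final class SpanTracker {
    private final StringBuilder out;
    private boolean spanOpen;

    SpanTracker(StringBuilder out) {
        this.out = out;
    }

    /** Opens a new style span, closing the previous one first if needed. */
    void openSpan(String css) {
        closeSpan();
        // css is assumed to be already escaped for use in an attribute value
        out.append("<span style=\"").append(css).append("\">");
        spanOpen = true;
    }

    /** Closes the current span; does nothing when no span is open. */
    void closeSpan() {
        if (spanOpen) {
            out.append("</span>");
            spanOpen = false;
        }
    }

    /** Closes an enclosing tag, making sure the span is closed before it. */
    void closeTag(String tag) {
        closeSpan();
        out.append("</").append(tag).append('>');
    }
}
```

Routing every closing tag through closeTag keeps the output properly nested 
even when a paragraph ends while a styled span is still open.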

                
> Adding style information to the PDF to HTML converter
> -----------------------------------------------------
>
>                 Key: PDFBOX-1213
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1213
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 1.6.0
>            Reporter: Enrique Pérez
>         Attachments: diff.patch
>
>
> This patch modifies the PDF to HTML conversion in order to add style 
> information (bold, italic and font size) to the resulting file. Moreover, we 
> have deleted the "DOCTYPE" header because some parsers throw the following 
> exception:
> [Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version" 
> must end with '>'.
> org.xml.sax.SAXParseException: The declaration for the entity "HTML.Version" 
> must end with '>'.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

