[jira] Commented: (PDFBOX-213) Text Extraction with Formatting

Mel Martinez (JIRA) Tue, 11 Jan 2011 08:20:09 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980161#action_12980161
 ]


Mel Martinez commented on PDFBOX-213:
-------------------------------------

In theory, with a lot of work, one could 'render' the PDF using HTML tags but 
that would be non-trivial.

The current (1.4) version of the text extraction 
(org.apache.pdfbox.util.PDFTextStripper) does allow one to instrument the 
demarcation of the following document points:

page start
page end
article start
article end
paragraph start
paragraph end
line separation
word separation
characters

By default the word separation is a space and the line separation is a simple 
newline character; the other demarcations just reuse the latter.

Each of these can be modified either by using the corresponding setter (for 
example 'setPageStart(String)') or by subclassing the PDFTextStripper class 
and overriding the corresponding getter methods and/or the 'writeXXX()' 
methods.  Which to do (use the setters or subclass) depends on whether you 
are using static strings for demarcation or instead need to carry state from 
one demarcation to the next.  For example, if you'd like to put a page number 
into each page demarcation, then you'd need to keep a page count variable and 
update it as the pages are written.

For example, an XML formatter might override getPageStart() like so:

    // CR (the line terminator) and getPageNum() are defined by the subclass.
    @Override
    public String getPageStart() {
        return "<page num=\"" + getPageNum() + "\">" + CR;
    }

where 'getPageNum()' returns the current page number, which is incremented at 
the end of each 'writePageEnd()' call.
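
To make the counter bookkeeping concrete, here is a minimal, self-contained 
sketch.  It does not use PDFBox: 'MarkerStripperSketch' is a hypothetical 
stand-in that only mimics the page start/end hooks described above, and 
'XmlPageMarkers', 'getPageNum()' and 'CR' are illustrative names.  In real 
code you would extend PDFTextStripper instead of the stand-in.

```java
// Hypothetical stand-in for PDFTextStripper: it only mimics the page hooks
// discussed above so that the sketch compiles without PDFBox.
class MarkerStripperSketch {
    protected final StringBuilder out = new StringBuilder();

    public String getPageStart() { return "\n"; }
    public String getPageEnd()   { return "\n"; }

    protected void writePageStart() { out.append(getPageStart()); }
    protected void writePageEnd()   { out.append(getPageEnd()); }

    // Drives the hooks once per page; real extraction is far more involved.
    public String extract(String[] pageTexts) {
        out.setLength(0);
        for (String text : pageTexts) {
            writePageStart();
            out.append(text);
            writePageEnd();
        }
        return out.toString();
    }
}

// Subclass that needs state (the page number) between demarcations.
class XmlPageMarkers extends MarkerStripperSketch {
    private static final String CR = "\n";
    private int pageNum = 1;

    public int getPageNum() { return pageNum; }

    @Override
    public String getPageStart() {
        return "<page num=\"" + getPageNum() + "\">" + CR;
    }

    @Override
    public String getPageEnd() {
        return CR + "</page>" + CR;
    }

    @Override
    protected void writePageEnd() {
        super.writePageEnd();
        pageNum++; // incremented at the end of each writePageEnd() call
    }
}
```

Calling 'new XmlPageMarkers().extract(new String[] {"first", "second"})' 
wraps each page text in its own numbered <page> element.  A static marker 
set through a setter could not do this, because the string changes from 
page to page.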

Thus it is straightforward to create an 'XML' format extractor that outputs the 
text in a format something like so:

<?xml ......?>
<document>
...
<page num="1">
  <article num="1">
    <paragraph num="1"><![CDATA[Some text that was extracted from the document
that belongs in this paragraph.
    ]]></paragraph>
    <paragraph num="2"><![CDATA[Some more text ...blah blah ....
    ]]></paragraph>
  </article>
  <article num="2">
    ...
  </article>
</page>
<page num="2">
  ...
</page>
</document>
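
As a rough sketch of how such output could be assembled, the helper below 
builds the page/article/paragraph layout from text that has already been 
extracted.  'ExtractedXmlWriter' and its methods are illustrative names, not 
part of PDFBox.  The one subtle point is the CDATA wrapping: a literal "]]>" 
in the extracted text would end the section early, so the sketch splits it 
across two CDATA sections.

```java
import java.util.List;

// Illustrative helper (not part of PDFBox) that emits the page/article/
// paragraph layout shown above from already-extracted paragraph text.
final class ExtractedXmlWriter {
    private static final String CR = "\n";

    // Wraps text in CDATA. A literal "]]>" inside the text would close the
    // section early, so it is split across two CDATA sections.
    static String cdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // One article holding paragraphs numbered from 1.
    static String article(int num, List<String> paragraphs) {
        StringBuilder sb = new StringBuilder();
        sb.append("  <article num=\"").append(num).append("\">").append(CR);
        for (int i = 0; i < paragraphs.size(); i++) {
            sb.append("    <paragraph num=\"").append(i + 1).append("\">")
              .append(cdata(paragraphs.get(i)))
              .append("</paragraph>").append(CR);
        }
        return sb.append("  </article>").append(CR).toString();
    }

    // One page holding articles numbered from 1.
    static String page(int num, List<List<String>> articles) {
        StringBuilder sb = new StringBuilder();
        sb.append("<page num=\"").append(num).append("\">").append(CR);
        for (int i = 0; i < articles.size(); i++) {
            sb.append(article(i + 1, articles.get(i)));
        }
        return sb.append("</page>").append(CR).toString();
    }
}
```

In a stripper subclass the same strings would come from the start/end 
getters rather than from a separate helper; the helper form is used here 
only so the sketch stands on its own.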

Once you've got it in XML form, you can then apply an XSL stylesheet to 
transform the results as you please, such as into HTML or WML.
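
For illustration, the transformation step could look like the sketch below, 
which uses the JDK's built-in javax.xml.transform API.  The stylesheet is a 
made-up minimal one (each page becomes a <div>, each paragraph a <p>) and 
'ExtractedXmlToHtml' is an illustrative name; the element names are assumed 
to match the XML layout shown above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ExtractedXmlToHtml {
    // Illustrative stylesheet: pages become <div>, paragraphs become <p>.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/document\">"
      + "<html><body><xsl:apply-templates select=\"page\"/></body></html>"
      + "</xsl:template>"
      + "<xsl:template match=\"page\">"
      + "<div class=\"page\"><xsl:apply-templates select=\"article/paragraph\"/></div>"
      + "</xsl:template>"
      + "<xsl:template match=\"paragraph\">"
      + "<p><xsl:value-of select=\"normalize-space(.)\"/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the extracted XML and returns the HTML text.
    static String toHtml(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter html = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transformation failed", e);
        }
    }
}
```

Keeping the presentation in the stylesheet means a WML or other target only 
needs a different stylesheet, not a different extractor.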

Finally, as Ben indicates above, if you want to capture individual character 
styling you would need to override the 'writeCharacters(TextPosition)' method.  
That may require some smarts in order to tag multi-character runs of a given 
attribute; otherwise every character gets its own tags and the output ends up 
extremely weighed down with markup.
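
One possible shape for those "smarts" is sketched below: remember the style 
of the previous character and emit tags only when the style changes.  The 
sketch is self-contained and makes assumptions: the style flags and the 
'StyleRunTagger' class are hypothetical, and how bold or italic is derived 
from the PDF font information is left out entirely.

```java
final class StyleRunTagger {
    // Hypothetical style flags for one character; deriving them from the
    // PDF font information is outside the scope of this sketch.
    static final int PLAIN = 0, BOLD = 1, ITALIC = 2;

    // chars and styles are parallel: styles[i] is the style of chars.charAt(i).
    // Tags are written only where the style changes, so a run of characters
    // sharing a style is wrapped once instead of once per character.
    static String tag(String chars, int[] styles) {
        StringBuilder out = new StringBuilder();
        int current = PLAIN;
        for (int i = 0; i < chars.length(); i++) {
            if (styles[i] != current) {
                out.append(close(current)).append(open(styles[i]));
                current = styles[i];
            }
            out.append(chars.charAt(i));
        }
        return out.append(close(current)).toString();
    }

    private static String open(int style) {
        return ((style & BOLD) != 0 ? "<b>" : "")
             + ((style & ITALIC) != 0 ? "<i>" : "");
    }

    // Closes in the reverse order of open() so the tags nest properly.
    private static String close(int style) {
        return ((style & ITALIC) != 0 ? "</i>" : "")
             + ((style & BOLD) != 0 ? "</b>" : "");
    }
}
```

In a stripper subclass the same comparison would happen inside 
'writeCharacters(TextPosition)', with the current style kept in a field.  
The sketch does not escape characters such as '<' or '&', which real 
output would need.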

This is just an example of what is possible.  Obviously you could also create 
an extractor that outputs directly to HTML, using <div> elements for 
separation.  In my case I need to work with the intermediate XML DOM form, so 
I opted for this strategy: after outputting to the above abstract XML form, I 
load it into a DOM with an XML parser (JDOM).
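
As a sketch of that last step, the snippet below loads the extracted XML 
into a DOM and pulls out the paragraph text.  Note the substitution: it uses 
the JDK's built-in javax.xml.parsers DOM API rather than JDOM, purely so the 
example has no external dependency.  'ExtractedXmlDom' is an illustrative 
name, and the element names are assumed to match the layout shown above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ExtractedXmlDom {
    // Parses the extracted XML text into a W3C DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns the trimmed text of every <paragraph>, in document order.
    static String[] paragraphTexts(Document doc) {
        NodeList nodes = doc.getElementsByTagName("paragraph");
        String[] texts = new String[nodes.getLength()];
        for (int i = 0; i < texts.length; i++) {
            texts[i] = nodes.item(i).getTextContent().trim();
        }
        return texts;
    }
}
```

Parsing also doubles as a well-formedness check on the extractor's output: 
an unclosed CDATA section or a mismatched tag fails here rather than later 
in the pipeline.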

I hope these suggestions provide ideas.


> Text Extraction with Formatting
> -------------------------------
>
>                 Key: PDFBOX-213
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-213
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1589018
> Originally submitted by cetinsert on 2006-11-01 17:50.
> Is it possible to extract text from a PDF without
> ignoring the formatting?
> HTML tags might be used for example. I thought the
> PDFText2Html class would do the trick but it does not.
> Thank you for reading.
> [comment on SourceForge]
> Originally sent by rrufai.
> Logged In: YES 
> user_id=1776491
> Originator: NO
> It's sent.
> [comment on SourceForge]
> Originally sent by rrufai.
> Logged In: YES 
> user_id=1776491
> Originator: NO
> What email address should I send it to? 
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> Originator: YES
> @ rruffai
> > You might send a compiled 32-bit windows or linux binary personally to me.
> > (I'm a user of pdftohtml.)
> I messed things up. This was also PDFBox. Hehe, sorry.
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> Originator: YES
> @ rrufai
> what is the trouble you have with handling underlines?
> You might send a compiled 32-bit windows or linux binary personally to me. 
> (I'm a user of pdftohtml.)
> [comment on SourceForge]
> Originally sent by rrufai.
> Logged In: YES 
> user_id=1776491
> Originator: NO
> Hi Ben,
> I've extended PDFText2Html to handle bold, new lines (with <br> tags). 
> However, I'm having trouble figuring out how to handle underlines.
> Also, I don't know how to post updates. 
> Regards,
> Raimi
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> Uhmm... well bold, italic, underlined etc... would be a good
> beginning but my ultimate wish would be something like
> quoted below:
> <?xml version="1.0" encoding="ISO-8859-1"?>
> <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
> <pdf2xml>
> <page number="1" position="absolute" top="0" left="0"
> height="1262" width="892">
>  <fontspec id="0" size="16" family="Times" color="#000000"/>
>  <fontspec id="1" size="16" family="Times" color="#000000"/>
>  <fontspec id="2" size="16" family="Times" color="#000000"/>
> <text top="110" left="106" width="137" height="18"
> font="0"><i>She </i>told <b>me</b>.</text>
> </page>
> </pdf2xml>
> I think I have made a mistake by naming it "Text Extraction
> with Formatting"... I should have put my question under a
> more fitting title, something like "PDF to (HTML/)XML
> Conversion with formatting".
> Thank you very much for your prompt replies. ^_^
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Specifically are you looking only for bold & italic or other things?
> [comment on SourceForge]
> Originally sent by cetinsert.
> Logged In: YES 
> user_id=1562185
> That's exactly what I am looking for. But is this not a
> priority issue for the PDFBox package? It would take me
> quite a time to extend the stripper on my own. One of the
> PDFBox developers might do it better I think.
> If you insist that it's a user's issue and PDFBox developers
> would not invest their time in such an extension, could you
> at least tell me whether you have any links to any
> information regarding this matter?
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> HTML tags are not used to format a PDF document.  Font information is 
> available but can be tricky to get what you 
> want.  You will need to extend PDFTextStripper and override writeCharacters 
> to get formatting such as bold/italic.  
> Is that what you are looking for?
> Ben

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
