[jira] [Commented] (PDFBOX-4532) PDFTextStripper replacing the decimal with white space

Tilman Hausherr (JIRA) Fri, 03 May 2019 00:29:25 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832306#comment-16832306
 ]


Tilman Hausherr commented on PDFBOX-4532:
-----------------------------------------

Here's some code for anybody who wants to understand or work on the problem:
{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.contentstream.operator.markedcontent.BeginMarkedContentSequenceWithProperties;
import org.apache.pdfbox.contentstream.operator.markedcontent.EndMarkedContentSequence;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBox4532ExtractText extends PDFTextStripper
{
    public static void main(String[] args) throws IOException
    {
        // Close the document when done; getText() triggers the callbacks below.
        try (PDDocument doc = PDDocument.load(new File("PDFBOX-4532-reduced.pdf")))
        {
            PDFTextStripper stripper = new PDFBox4532ExtractText();
            stripper.getText(doc);
        }
    }
    
    public PDFBox4532ExtractText() throws IOException
    {
        addOperator(new BeginMarkedContentSequenceWithProperties());
        addOperator(new EndMarkedContentSequence());
    }
    
    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
    {
        System.out.println("BMC: tag: " + tag.getName() + ", properties: " + properties);
        if (properties != null && properties.containsKey(COSName.ACTUAL_TEXT))
        {
            System.out.println("BMC: TextMatrix: " + getTextMatrix());
            System.out.println("BMC: ActualText: " + properties.getString(COSName.ACTUAL_TEXT));
        }
        super.beginMarkedContentSequence(tag, properties);
    }
    @Override
    public void endMarkedContentSequence()
    {
        System.out.println("EMC: TextMatrix: " + getTextMatrix());
        System.out.println("EMC: CharactersByArticle: " + charactersByArticle);
        super.endMarkedContentSequence();
    }
}{code}
The output is:
{noformat}
BMC: tag: Span, properties: COSDictionary{COSName{Lang}:COSString{en-US};COSName{MCID}:COSInt{8};}
BMC: tag: Span, properties: COSDictionary{COSName{ActualText}:COSString{.};}
BMC: TextMatrix: [50.0,0.0,0.0,50.0,125.0,700.0]
BMC: ActualText: .
EMC: TextMatrix: [50.0,0.0,0.0,50.0,137.5,700.0]
EMC: CharactersByArticle: [[0,  ]]
EMC: TextMatrix: null
EMC: CharactersByArticle: [[0,  , 1]]
{noformat}
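
To make the output above easier to read: the glyph inside the ActualText span ends up in the character list as a space, while the span's ActualText is ".". The following is only a minimal stand-alone sketch of what honoring ActualText would mean for the extracted string. It is a toy model, not PDFBox code and not a proposed patch; the names `ActualTextSketch`, `Run`, `extract` and `honorActualText` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not PDFBox API: each Run is a piece of decoded glyph text,
// optionally enclosed in a marked-content span that carries an /ActualText entry.
final class ActualTextSketch
{
    static final class Run
    {
        final String glyphText;  // what the glyphs decode to (here " " for the decimal point)
        final String actualText; // /ActualText of the enclosing span, or null if there is none

        Run(String glyphText, String actualText)
        {
            this.glyphText = glyphText;
            this.actualText = actualText;
        }
    }

    // Concatenate the runs; when honorActualText is set, a run's ActualText
    // replaces its decoded glyph text.
    static String extract(List<Run> runs, boolean honorActualText)
    {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs)
        {
            boolean replace = honorActualText && run.actualText != null;
            sb.append(replace ? run.actualText : run.glyphText);
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("0", null));
        runs.add(new Run(" ", ".")); // the Span with ActualText "." from the output above
        runs.add(new Run("1", null));

        System.out.println(extract(runs, false)); // prints "0 1"
        System.out.println(extract(runs, true));  // prints "0.1"
    }
}
```

The first line printed corresponds to what the stripper currently collects in charactersByArticle; the second is what a reader copying the text from a viewer would expect.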


> PDFTextStripper replacing the decimal with white space
> ------------------------------------------------------
>
>                 Key: PDFBOX-4532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Akash Gupta
>            Priority: Major
>              Labels: ActualText
>         Attachments: FSUSA00BDD.pdf, PDFBOX-4532-reduced.pdf, 
> code_textStripper.PNG, numbers_without_decimal.PNG
>
>
> I'm using PDFTextStripperByArea, to be specific, and trying to extract a 
> particular area from the document. 
> In the output, most of the numbers (all but one) have their decimal point 
> replaced by a white space. When I copy and paste the text using Adobe 
> Reader or Chrome, the decimal points are preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
