[ 
https://issues.apache.org/jira/browse/PDFBOX-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4482:
------------------------------------
    Component/s: Text extraction

> True Type vs Embedded Text Output
> ---------------------------------
>
>                 Key: PDFBOX-4482
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4482
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.14
>         Environment: Windows or Linux
>            Reporter: Stev Dempsey
>            Priority: Trivial
>         Attachments: 581.pdf, 582.pdf, 583.pdf, 584.pdf, 585.pdf
>
>
> Kinda difficult to describe, but here goes.
> We use the TinyMCE editor and then process that HTML document through MS Word to 
> create a PDF. All is good there. Once we have the PDF, we need to send the 
> info to another system that only recognizes text. We need to preserve the 
> vertical spacing between parts of the document.
> If the Arial font is used, all works well.
> If the Times font is used, the <P> paragraphs are messed up.
> Source HTML for Times is here:
> <p><span style="font-family: times new roman,times; font-size: 
> 10pt;">COMPARISON: None</span></p>
> <p><span style="font-family: times new roman,times; font-size: 
> 10pt;">TECHNIQUE: Axial CT images obtained of the spine with sagittal and 
> coronal reconstructions.  This is the technique section.  This is the 
> technique section.  This is the technique section.  This is the technique 
> section.  This is the technique section.  This is the technique section.  
> This is the technique section.  This is the technique section.  This is the 
> technique section.  This is the technique section.  This is the technique 
> section.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 
> 10pt;">FINDINGS: No acute fracture, dislocation or abnormal lesion is shown. 
> There is loss of the cervical lordosis. Osteophyte is noted at C4-5, and C5-6 
> levels with moderate disc space narrowing. The spinal cord is normal. No 
> Chiari malformation. No extradural soft tissue masses or paraspinal soft 
> tissue masses. Upper thoracic spine is normal.  This is the findings section. 
>  This is the findings section.  This is the findings section.  This is the 
> findings section.  This is the findings section.  This is the findings 
> section.  This is the findings section.  This is the findings section.  This 
> is the findings section.  This is the findings section.  The previous lines 
> are all one paragraph and should not have any breaks.  The next several lines 
> are one line (one paragraph) per spine section.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C2-3: 
> No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C3-4: 
> Shallow central and right paracentral disc herniation. No central stenosis or 
> neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C4-5: 
> Diffuse disc bulge and right posterior lateral osteophyte with uncovertebral 
> joint hypertrophy and moderate bilateral neural foraminal stenosis. No 
> central stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C5-6: 
> Shallow central and right posterior lateral disc herniation with mass-effect 
> along the cervical cord, right lateral recess stenosis and bilateral neural 
> foraminal stenosis, moderate on the left and mild on the right. No central 
> stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C6-7: 
> No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C7-T1: 
> No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;"><span 
> style="font-family: times new roman,times; font-size: 10pt;">IMPRESSION: 
> </span><br /><span style="font-family: times new roman,times; font-size: 
> 10pt;">1. Moderate degenerative disc disease with loss of cervical lordosis.  
> This is line 1 of the impression.</span><br /><br /><span style="font-family: 
> times new roman,times; font-size: 10pt;">2. C3-4 level with shallow central 
> and right paracentral disc herniation but no stenosis.  This is line 2 of the 
> impression.</span><br /><br /><span style="font-family: times new 
> roman,times; font-size: 10pt;">3. C4-5 level with diffuse disc bulge, right 
> posterior lateral osteophyte and uncovertebral joint hypertrophy with 
> moderate bilateral neural foraminal stenosis.  This is line 3 of the 
> impression.</span><br /><br /><span style="font-family: times new 
> roman,times; font-size: 10pt;">4. C5-6 level with shallow central and right 
> posterior lateral disc herniation, right lateral recess stenosis and 
> bilateral neural foraminal stenosis, moderate on the left greater than right. 
>  This is line 4 of the impression.</span></span></p>
> :BREAK!
> This produces decoded output that breaks the <P> paragraphs on each line 
> instead of keeping each paragraph intact and preserving the line breaks.
> :SUBPART!
> INFO  <p>TECHNIQUE: Axial CT images obtained of the spine with sagittal and 
> coronal reconstructions.  This is the technique 
> INFO  </p>
> INFO  <p>section.  This is the technique section.  This is the technique 
> section.  This is the technique section.  This is the technique 
> INFO  </p>
> INFO  <p>section.  This is the technique section.  This is the technique 
> section.  This is the technique section.  This is the technique 
> INFO  </p>
> INFO  <p>section.  This is the technique section.  This is the technique 
> section. 
> INFO  </p>
> INFO  <p>FINDINGS: No acute fracture, dislocation or abnormal lesion is 
> shown. There is loss of the cervical lordosis. Osteophyte is 
> INFO  </p>
> INFO  <p>noted at C4-5, and C5-6 levels with moderate disc space narrowing. 
> The spinal cord is normal. No Chiari malformation. No 
> INFO  </p>
> INFO  <p>extradural soft tissue masses or paraspinal soft tissue masses. 
> Upper thoracic spine is normal.  This is the findings 
> INFO  </p>
> INFO  <p>section.  This is the findings section.  This is the findings 
> section.  This is the findings section.  This is the findings 
> INFO  </p>
> INFO  <p>section.  This is the findings section.  This is the findings 
> section.  This is the findings section.  This is the findings 
> INFO  </p>
> INFO  <p>section.  This is the findings section.  The previous lines are all 
> one paragraph and should not have any breaks.  The next 
> INFO  </p>
> INFO  <p>several lines are one line (one paragraph) per spine section. 
> INFO  </p>
> INFO  <p>C2-3: No disc herniation, central stenosis or neural foraminal 
> stenosis. 
> INFO  </p>
>  
> The original <P> is broken up line by line and not represented as a true 
> paragraph. Am I doing something wrong, or is it the conversion?
> Any help appreciated!
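
One possible workaround on the receiving side, shown only as a rough sketch: re-join the per-line <p> fragments after extraction. This does not call PDFBox at all and is not how PDFBox detects paragraphs; `ParagraphJoiner` and `join` are hypothetical names invented for this illustration. It assumes a fragment that does not end in sentence-final punctuation is a wrapped line that continues in the next fragment, which holds for the output quoted above but would wrongly split a paragraph whenever a wrapped line happens to end exactly at a sentence boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: merges <p> fragments that were
// split at visual line ends back into whole paragraphs.
final class ParagraphJoiner {

    // Assumption: a fragment ending in '.', '!' or '?' closes a paragraph.
    static boolean endsSentence(String fragment) {
        String t = fragment.trim();
        if (t.isEmpty()) {
            return true;
        }
        char last = t.charAt(t.length() - 1);
        return last == '.' || last == '!' || last == '?';
    }

    // Concatenates consecutive fragments until one ends a sentence.
    static List<String> join(List<String> fragments) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String fragment : fragments) {
            String text = fragment.trim();
            if (text.isEmpty()) {
                continue; // skip empty fragments
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(text);
            if (endsSentence(text)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing text that never reached a sentence end.
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
            "several lines are one line (one paragraph) per spine section.",
            "C2-3: No disc herniation, central stenosis or neural foraminal",
            "stenosis.");
        for (String p : join(fragments)) {
            System.out.println("<p>" + p + "</p>");
        }
    }
}
```

A line-length test (treat a fragment as wrapped only if it is close to the full line width) might be more robust than the punctuation test, but it needs a width that depends on the font and page, so it is left out here.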


