Stev Dempsey created PDFBOX-4482:
------------------------------------

             Summary: True Type vs Embedded Text Output
                 Key: PDFBOX-4482
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4482
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 2.0.14
         Environment: Windows or Linux
            Reporter: Stev Dempsey
             Fix For: 2.0.14


This is somewhat difficult to describe, but here goes:

We use the TinyMCE editor and then process the resulting HTML document through 
MS Word to create a PDF. That part works fine. Once we have the PDF, we need to 
send the content to another system that only recognizes text, and we need to 
preserve the vertical spacing between parts of the document.

If Arial font is used all works well.

If Times font is used, the <P> paragraphs are messed up.

Source HTML for Times is here:

<p><span style="font-family: times new roman,times; font-size: 
10pt;">COMPARISON: None</span></p>
<p><span style="font-family: times new roman,times; font-size: 
10pt;">TECHNIQUE: Axial CT images obtained of the spine with sagittal and 
coronal reconstructions.  This is the technique section.  This is the technique 
section.  This is the technique section.  This is the technique section.  This 
is the technique section.  This is the technique section.  This is the 
technique section.  This is the technique section.  This is the technique 
section.  This is the technique section.  This is the technique 
section.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;">FINDINGS: 
No acute fracture, dislocation or abnormal lesion is shown. There is loss of 
the cervical lordosis. Osteophyte is noted at C4-5, and C5-6 levels with 
moderate disc space narrowing. The spinal cord is normal. No Chiari 
malformation. No extradural soft tissue masses or paraspinal soft tissue 
masses. Upper thoracic spine is normal.  This is the findings section.  This is 
the findings section.  This is the findings section.  This is the findings 
section.  This is the findings section.  This is the findings section.  This is 
the findings section.  This is the findings section.  This is the findings 
section.  This is the findings section.  The previous lines are all one 
paragraph and should not have any breaks.  The next several lines are one line 
(one paragraph) per spine section.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;">C2-3: No 
disc herniation, central stenosis or neural foraminal stenosis.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;">C3-4: 
Shallow central and right paracentral disc herniation. No central stenosis or 
neural foraminal stenosis.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;">C4-5: 
Diffuse disc bulge and right posterior lateral osteophyte with uncovertebral 
joint hypertrophy and moderate bilateral neural foraminal stenosis. No central 
stenosis.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;">C5-6: 
Shallow central and right posterior lateral disc herniation with mass-effect 
along the cervical cord, right lateral recess stenosis and bilateral neural 
foraminal stenosis, moderate on the left and mild on the right. No central 
stenosis.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;">C6-7: No 
disc herniation, central stenosis or neural foraminal stenosis.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;">C7-T1: No 
disc herniation, central stenosis or neural foraminal stenosis.</span></p>
<p><span style="font-family: times new roman,times; font-size: 10pt;"><span 
style="font-family: times new roman,times; font-size: 10pt;">IMPRESSION: 
</span><br /><span style="font-family: times new roman,times; font-size: 
10pt;">1. Moderate degenerative disc disease with loss of cervical lordosis.  
This is line 1 of the impression.</span><br /><br /><span style="font-family: 
times new roman,times; font-size: 10pt;">2. C3-4 level with shallow central and 
right paracentral disc herniation but no stenosis.  This is line 2 of the 
impression.</span><br /><br /><span style="font-family: times new roman,times; 
font-size: 10pt;">3. C4-5 level with diffuse disc bulge, right posterior 
lateral osteophyte and uncovertebral joint hypertrophy with moderate bilateral 
neural foraminal stenosis.  This is line 3 of the impression.</span><br /><br 
/><span style="font-family: times new roman,times; font-size: 10pt;">4. C5-6 
level with shallow central and right posterior lateral disc herniation, right 
lateral recess stenosis and bilateral neural foraminal stenosis, moderate on 
the left greater than right.  This is line 4 of the 
impression.</span></span></p>

:BREAK!

This produces decoded output in which each <P> paragraph is broken at every 
line, instead of the whole paragraph being kept intact with only the intended 
line breaks preserved.

:SUBPART!

INFO  <p>TECHNIQUE: Axial CT images obtained of the spine with sagittal and 
coronal reconstructions.  This is the technique 
INFO  </p>
INFO  <p>section.  This is the technique section.  This is the technique 
section.  This is the technique section.  This is the technique 
INFO  </p>
INFO  <p>section.  This is the technique section.  This is the technique 
section.  This is the technique section.  This is the technique 
INFO  </p>
INFO  <p>section.  This is the technique section.  This is the technique 
section. 
INFO  </p>
INFO  <p>FINDINGS: No acute fracture, dislocation or abnormal lesion is shown. 
There is loss of the cervical lordosis. Osteophyte is 
INFO  </p>
INFO  <p>noted at C4-5, and C5-6 levels with moderate disc space narrowing. The 
spinal cord is normal. No Chiari malformation. No 
INFO  </p>
INFO  <p>extradural soft tissue masses or paraspinal soft tissue masses. Upper 
thoracic spine is normal.  This is the findings 
INFO  </p>
INFO  <p>section.  This is the findings section.  This is the findings section. 
 This is the findings section.  This is the findings 
INFO  </p>
INFO  <p>section.  This is the findings section.  This is the findings section. 
 This is the findings section.  This is the findings 
INFO  </p>
INFO  <p>section.  This is the findings section.  The previous lines are all 
one paragraph and should not have any breaks.  The next 
INFO  </p>
INFO  <p>several lines are one line (one paragraph) per spine section. 
INFO  </p>
INFO  <p>C2-3: No disc herniation, central stenosis or neural foraminal 
stenosis. 
INFO  </p>

 

The original <P> is broken up line by line and not represented as a true 
paragraph. Am I doing something wrong, or is it the conversion?
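
To make the expected behaviour concrete, here is a minimal sketch of the kind of grouping I have in mind: consecutive lines are joined into one paragraph unless the vertical gap between their baselines is clearly larger than normal line spacing. This is not PDFBox's implementation; the ParagraphMerger class, its Line type, and the threshold value are hypothetical names and numbers chosen only for illustration, and it assumes each extracted line comes with a baseline position and a font height.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: groups extracted lines into
// paragraphs based on the vertical gap between consecutive baselines.
public class ParagraphMerger {

    /** One extracted line: its text, baseline y (growing downward), and font height. */
    public static final class Line {
        final String text;
        final float y;
        final float height;

        public Line(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Joins lines into paragraphs. A new paragraph starts when the distance
     * from the previous baseline exceeds dropThreshold times the previous
     * line's height (assumed heuristic; the right value depends on the font).
     */
    public static List<String> merge(List<Line> lines, float dropThreshold) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean newParagraph = previous != null
                    && (line.y - previous.y) > dropThreshold * previous.height;
            if (newParagraph) {
                // Close the paragraph collected so far and start a fresh one.
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text.trim());
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

With 10pt lines spaced 12 units apart inside a paragraph and a larger gap between paragraphs, a threshold of 1.5 would keep the TECHNIQUE lines together and start a new paragraph at FINDINGS. Different fonts have different line heights, which may be why Arial and Times behave differently with a fixed threshold, but that is only a guess on my part.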

Any help appreciated!



