[
https://issues.apache.org/jira/browse/PDFBOX-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788721#comment-16788721
]
Tilman Hausherr commented on PDFBOX-4482:
-----------------------------------------
I'll keep this issue open for now, but may work on it again later to solve the
mystery of why these two almost identical files are treated differently. Maybe
it is a bug, maybe not. Maybe it is a documentation issue.
Yes, the defines should work in Tika.
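
For anyone following along, here is a rough sketch of how a threshold-based paragraph heuristic can treat two nearly identical files differently. It is not PDFBox's actual implementation. The class `ParagraphSketch`, the `Line` record, the threshold value 1.2 and the sample coordinates are all hypothetical and exist only for illustration. The assumption is that a new paragraph starts whenever the vertical gap to the previous line exceeds a multiple of the line height, so a small change in line spacing (for example from a different font) can flip the decision for every line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphSketch {
    // Hypothetical line model: text, baseline y (growing downward), glyph height.
    static final class Line {
        final String text;
        final float y;
        final float height;

        Line(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    // Groups lines into paragraphs. A new paragraph starts when the gap to the
    // previous baseline is larger than dropThreshold * height of the current line
    // (an assumed rule, not taken from the PDFBox sources).
    static List<String> group(List<Line> lines, float dropThreshold) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            if (previous != null) {
                float gap = line.y - previous.y;
                if (gap > dropThreshold * line.height) {
                    // Gap is too large: close the running paragraph.
                    paragraphs.add(current.toString());
                    current.setLength(0);
                } else {
                    // Same paragraph: join the wrapped line with a space.
                    current.append(' ');
                }
            }
            current.append(line.text);
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Two made-up layouts of the same text; only the line spacing differs.
        List<Line> tight = Arrays.asList(
                new Line("This is the technique", 100f, 10f),
                new Line("section. This is the", 111.5f, 10f),
                new Line("technique section.", 123f, 10f));
        List<Line> loose = Arrays.asList(
                new Line("This is the technique", 100f, 10f),
                new Line("section. This is the", 112.5f, 10f),
                new Line("technique section.", 125f, 10f));
        System.out.println(group(tight, 1.2f).size() + " paragraph(s) for tight spacing");
        System.out.println(group(loose, 1.2f).size() + " paragraph(s) for loose spacing");
    }
}
```

With a rule of this kind, one extra unit of leading is enough to turn a single paragraph into one paragraph per line. Whether that is what happens with the attached files is not established here.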
> True Type vs Embedded Text Output
> ---------------------------------
>
> Key: PDFBOX-4482
> URL: https://issues.apache.org/jira/browse/PDFBOX-4482
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.14
> Environment: Windows or Linux
> Reporter: Stev Dempsey
> Priority: Trivial
> Attachments: 581.pdf, 582.pdf, 583.pdf, 584.pdf, 585.pdf
>
>
> Kinda difficult to describe, but here goes.
> We use the TinyMCE editor and then process that HTML document through MS Word
> to create a PDF. All is good there. Once we have the PDF we need to send the
> info to another system that only recognizes text. We need to preserve the
> vertical spacing between parts of the document.
> If the Arial font is used, all works well.
> If the Times font is used, the <P> paragraphs are messed up.
> Source HTML for Times is here:
> <p><span style="font-family: times new roman,times; font-size:
> 10pt;">COMPARISON: None</span></p>
> <p><span style="font-family: times new roman,times; font-size:
> 10pt;">TECHNIQUE: Axial CT images obtained of the spine with sagittal and
> coronal reconstructions. This is the technique section. This is the
> technique section. This is the technique section. This is the technique
> section. This is the technique section. This is the technique section.
> This is the technique section. This is the technique section. This is the
> technique section. This is the technique section. This is the technique
> section.</span></p>
> <p><span style="font-family: times new roman,times; font-size:
> 10pt;">FINDINGS: No acute fracture, dislocation or abnormal lesion is shown.
> There is loss of the cervical lordosis. Osteophyte is noted at C4-5, and C5-6
> levels with moderate disc space narrowing. The spinal cord is normal. No
> Chiari malformation. No extradural soft tissue masses or paraspinal soft
> tissue masses. Upper thoracic spine is normal. This is the findings section.
> This is the findings section. This is the findings section. This is the
> findings section. This is the findings section. This is the findings
> section. This is the findings section. This is the findings section. This
> is the findings section. This is the findings section. The previous lines
> are all one paragraph and should not have any breaks. The next several lines
> are one line (one paragraph) per spine section.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C2-3:
> No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C3-4:
> Shallow central and right paracentral disc herniation. No central stenosis or
> neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C4-5:
> Diffuse disc bulge and right posterior lateral osteophyte with uncovertebral
> joint hypertrophy and moderate bilateral neural foraminal stenosis. No
> central stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C5-6:
> Shallow central and right posterior lateral disc herniation with mass-effect
> along the cervical cord, right lateral recess stenosis and bilateral neural
> foraminal stenosis, moderate on the left and mild on the right. No central
> stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C6-7:
> No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;">C7-T1:
> No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
> <p><span style="font-family: times new roman,times; font-size: 10pt;"><span
> style="font-family: times new roman,times; font-size: 10pt;">IMPRESSION:
> </span><br /><span style="font-family: times new roman,times; font-size:
> 10pt;">1. Moderate degenerative disc disease with loss of cervical lordosis.
> This is line 1 of the impression.</span><br /><br /><span style="font-family:
> times new roman,times; font-size: 10pt;">2. C3-4 level with shallow central
> and right paracentral disc herniation but no stenosis. This is line 2 of the
> impression.</span><br /><br /><span style="font-family: times new
> roman,times; font-size: 10pt;">3. C4-5 level with diffuse disc bulge, right
> posterior lateral osteophyte and uncovertebral joint hypertrophy with
> moderate bilateral neural foraminal stenosis. This is line 3 of the
> impression.</span><br /><br /><span style="font-family: times new
> roman,times; font-size: 10pt;">4. C5-6 level with shallow central and right
> posterior lateral disc herniation, right lateral recess stenosis and
> bilateral neural foraminal stenosis, moderate on the left greater than right.
> This is line 4 of the impression.</span></span></p>
> :BREAK!
> This results in a decoded result that breaks the <P> paragraphs on each line
> instead of keeping the whole paragraph intact and keeping the line breaks.
> :SUBPART!
> INFO <p>TECHNIQUE: Axial CT images obtained of the spine with sagittal and
> coronal reconstructions. This is the technique
> INFO </p>
> INFO <p>section. This is the technique section. This is the technique
> section. This is the technique section. This is the technique
> INFO </p>
> INFO <p>section. This is the technique section. This is the technique
> section. This is the technique section. This is the technique
> INFO </p>
> INFO <p>section. This is the technique section. This is the technique
> section.
> INFO </p>
> INFO <p>FINDINGS: No acute fracture, dislocation or abnormal lesion is
> shown. There is loss of the cervical lordosis. Osteophyte is
> INFO </p>
> INFO <p>noted at C4-5, and C5-6 levels with moderate disc space narrowing.
> The spinal cord is normal. No Chiari malformation. No
> INFO </p>
> INFO <p>extradural soft tissue masses or paraspinal soft tissue masses.
> Upper thoracic spine is normal. This is the findings
> INFO </p>
> INFO <p>section. This is the findings section. This is the findings
> section. This is the findings section. This is the findings
> INFO </p>
> INFO <p>section. This is the findings section. This is the findings
> section. This is the findings section. This is the findings
> INFO </p>
> INFO <p>section. This is the findings section. The previous lines are all
> one paragraph and should not have any breaks. The next
> INFO </p>
> INFO <p>several lines are one line (one paragraph) per spine section.
> INFO </p>
> INFO <p>C2-3: No disc herniation, central stenosis or neural foraminal
> stenosis.
> INFO </p>
>
> The original <P> is broken up line by line and not represented as a true
> paragraph. Am I doing something wrong, or is it the conversion?
> Any help appreciated!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)