[jira] [Created] (TIKA-3682) PDFParser is extracting each char of a word in a new line

Sree Harsha (Jira) Mon, 21 Feb 2022 23:50:08 -0800

Sree Harsha created TIKA-3682:
---------------------------------

             Summary: PDFParser is extracting each char of a word in a new line
                 Key: TIKA-3682
                 URL: https://issues.apache.org/jira/browse/TIKA-3682
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.3.0, 1.26
            Reporter: Sree Harsha
         Attachments: image-2022-02-22-13-14-14-067.png


When PDFParser extracts text from a PDF document that contains text in a 
different orientation, each character of a word is extracted onto a new 
line.

For example, the text is extracted like this:

TO
 P

LA
C

E
A

N
 O

R
D

E
R

whereas the original text looks like this:

!image-2022-02-22-13-14-14-067.png!

The options currently set are:

setExtractBookmarksText(false);
getPDFParserConfig().setEnableAutoSpace(true);

 

After adding the following options:

setSortByPosition(true);
setSuppressDuplicateOverlappingText(true);
setOcrStrategy(OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);

 

The text is extracted like:

TO PLACE xxxxxxx
yyyyyyy AN ORDER

 

where xxxxxxx and yyyyyyy stand for other text at the same level in the PDF document.

If I search for TO PLACE AN ORDER in Acrobat Reader, the phrase is found, but 
the same search against the extracted text content finds nothing.
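
To make the search problem concrete, here is a minimal sketch of a whitespace-tolerant search over the extracted text. It is not a Tika feature and not a fix for the extraction itself; the class name WhitespaceTolerantSearch and its methods are made up for illustration, and it uses only the Java standard library. It matches the first example above, where only stray whitespace separates the characters, but it does not help with the sort-by-position output, where other text is interleaved with the phrase.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
public class WhitespaceTolerantSearch {

    // Builds a regex that matches the phrase's non-whitespace characters in
    // order, allowing any run of ASCII whitespace (spaces, tabs, line breaks)
    // between consecutive characters.
    static Pattern toPattern(String phrase) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < phrase.length(); ) {
            int cp = phrase.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace in the phrase itself is ignored
            }
            if (regex.length() > 0) {
                regex.append("\\s*");
            }
            // Quote each character so regex metacharacters are matched literally.
            regex.append(Pattern.quote(new String(Character.toChars(cp))));
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true if the phrase occurs in the extracted text when whitespace
    // differences are ignored.
    static boolean containsIgnoringWhitespace(String extracted, String phrase) {
        return toPattern(phrase).matcher(extracted).find();
    }

    public static void main(String[] args) {
        // The per-character output from the first example in this report.
        String broken = "TO\n P\n\nLA\nC\n\nE\nA\n\nN\n O\n\nR\nD\n\nE\nR";
        System.out.println(containsIgnoringWhitespace(broken, "TO PLACE AN ORDER"));

        // The sort-by-position output: other text sits inside the phrase.
        String interleaved = "TO PLACE xxxxxxx\nyyyyyyy AN ORDER";
        System.out.println(containsIgnoringWhitespace(interleaved, "TO PLACE AN ORDER"));
    }
}
```

Because word boundaries are ignored, this sketch can also report matches that span unrelated words (for example, TOP LACE would match TO PLACE), so it is only a stopgap until the extraction order itself is correct.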

Is there an option to suppress the unnecessary newline characters shown in the 
first example, and also to avoid the side effect of sorting by position?

The output should look like:

TO PLACE AN ORDER

xxxxxxx yyyyyyy



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
