Sree Harsha created TIKA-3682:
---------------------------------
Summary: PDFParser is extracting each char of a word in a new line
Key: TIKA-3682
URL: https://issues.apache.org/jira/browse/TIKA-3682
Project: Tika
Issue Type: Bug
Affects Versions: 2.3.0, 1.26
Reporter: Sree Harsha
Attachments: image-2022-02-22-13-14-14-067.png
When the PDF parser extracts text from a PDF document that contains text in a
different orientation, each character of a word is extracted onto a new
line.
For example, the text is extracted as follows:
TO
P
LA
C
E
A
N
O
R
D
E
R
whereas the original text looks like this:
!image-2022-02-22-13-14-14-067.png!
The output above was produced with these options:
setExtractBookmarksText(false);
getPDFParserConfig().setEnableAutoSpace(true);
After adding the following options:
setSortByPosition(true);
setSuppressDuplicateOverlappingText(true);
setOcrStrategy(OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
the text is extracted as:
TO PLACE xxxxxxx
yyyyyyy AN ORDER
where xxxxxxx and yyyyyyy stand for other text at the same level in the PDF document.
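For anyone trying to reproduce this, the options listed above could be applied roughly as in the sketch below. It assumes that the bare setter calls in the report are made on Tika's PDFParserConfig (the report does not show the surrounding code), that the tika-parsers PDF module is on the classpath, and that the PDF path is passed as the first argument. The class name PdfExtractSketch is made up for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical reproduction harness; not taken from the reporter's code.
public class PdfExtractSketch {
    public static void main(String[] args) throws Exception {
        PDFParserConfig config = new PDFParserConfig();

        // Options from the first run in the report.
        config.setExtractBookmarksText(false);
        config.setEnableAutoSpace(true);

        // Options added for the second run in the report.
        config.setSortByPosition(true);
        config.setSuppressDuplicateOverlappingText(true);
        config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);

        // Hand the config to the parser through the ParseContext.
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);

        // -1 disables the handler's default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```

Commenting out the three "second run" setters should give the first output shown in this report, if the assumption about the configuration holds.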
If I search for TO PLACE AN ORDER in Acrobat Reader, the text is found, but if I
search for the same text in the extracted content, the search fails.
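To make the search failure concrete, here is a minimal sketch of a whitespace-insensitive search over the extracted text. It is only a client-side workaround idea, not a Tika feature, and the class and method names are made up for illustration. It uses the two outputs quoted in this report as input.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Tika.
public class WhitespaceInsensitiveSearch {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes every whitespace character (spaces, tabs, line breaks). */
    public static String stripWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll("");
    }

    /** True if phrase occurs in text once whitespace is ignored on both sides. */
    public static boolean containsIgnoringWhitespace(String text, String phrase) {
        return stripWhitespace(text).contains(stripWhitespace(phrase));
    }

    public static void main(String[] args) {
        // First output from the report: one fragment per line.
        String firstOutput = "TO\nP\nLA\nC\nE\nA\nN\nO\nR\nD\nE\nR";
        // Second output from the report: phrase interleaved with other text.
        String secondOutput = "TO PLACE xxxxxxx\nyyyyyyy AN ORDER";

        System.out.println(containsIgnoringWhitespace(firstOutput, "TO PLACE AN ORDER"));
        System.out.println(containsIgnoringWhitespace(secondOutput, "TO PLACE AN ORDER"));
    }
}
```

This prints true for the first output and false for the second, so ignoring whitespace only masks the extra line breaks. It does not help once sorting by position interleaves the phrase with neighbouring text, and it can also match across word boundaries, which is why a fix on the extraction side is being asked for.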
Is there an option to suppress the unnecessary newline characters shown in the
first example, and also to avoid the side effect of sorting by position?
The output should look like:
TO PLACE AN ORDER
xxxxxx yyyyyyyy