[ https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766980#comment-15766980 ]
Bipul Kumar commented on TIKA-2190:
-----------------------------------
Thanks, Tim. I know that option and am using it, but the issue with hOCR is that
sometimes the y coordinates do not match for words on the same line. So the
TXT format can be used as extra information instead of writing code to predict
which words are on the same line.
Moreover, many users can simply use the TXT format with space info for simple,
straightforward use cases instead of writing code to parse the hOCR output. It is
simple and user-friendly.
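
As a rough illustration of the "simple use case" I mean, here is a minimal
sketch, assuming the TXT output keeps runs of spaces between visually separated
fields. The function name, the two-space threshold, and the sample line are my
own assumptions for illustration, not anything from Tika or tesseract.

```python
import re


def split_columns(line, min_gap=2):
    """Split one line of OCR text on runs of at least `min_gap` spaces.

    Hypothetical helper: a single space is treated as an ordinary word
    separator, a longer run as a gap between columns or fields.
    """
    pattern = r" {%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


# Made-up sample line, as it might look with inter-word spaces preserved.
sample = "Invoice No      12345        Date   01/02/2017"
print(split_columns(sample))
# prints ['Invoice No', '12345', 'Date', '01/02/2017']
```

The threshold would probably need tuning per document, since the number of
spaces emitted for a given gap depends on the font and layout.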
> Add "preserve_interword_spaces" option of tesseract
> ---------------------------------------------------
>
> Key: TIKA-2190
> URL: https://issues.apache.org/jira/browse/TIKA-2190
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Bipul Kumar
> Assignee: Tim Allison
> Fix For: 2.0, 1.15
>
>
> This option preserves inter-word spaces in the TXT output type so that the
> layout or context can be inferred during further parsing.
> To enable: -c preserve_interword_spaces=1
> To disable: -c preserve_interword_spaces=0, or simply omit the option
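
For reference, a minimal sketch of how the flag above fits into a plain
tesseract command line (image, output base, then options). This only builds the
argument list and does not show how Tika would expose the setting; the helper
name and file names are hypothetical.

```python
def tesseract_command(image, output_base, preserve_spaces=True):
    """Build a tesseract argument list, optionally preserving inter-word spaces.

    Hypothetical helper for illustration; it does not run tesseract.
    """
    cmd = ["tesseract", image, output_base]
    if preserve_spaces:
        # -c VAR=VALUE sets a tesseract config variable for this run.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


# File names are placeholders.
print(" ".join(tesseract_command("page.png", "page")))
# prints tesseract page.png page -c preserve_interword_spaces=1
```

The resulting list could be handed to something like subprocess.run on a
machine where tesseract is installed.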