[jira] [Comment Edited] (TIKA-1794) TXTParser removes form feed characters

Olivier M (JIRA) Mon, 16 Nov 2015 02:01:25 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006441#comment-15006441
 ]


Olivier M edited comment on TIKA-1794 at 11/16/15 10:00 AM:
------------------------------------------------------------

Txt file with form feed character attached.


was (Author: maol):
Txt file with form feed character.

> TXTParser removes form feed characters
> --------------------------------------
>
>                 Key: TIKA-1794
>                 URL: https://issues.apache.org/jira/browse/TIKA-1794
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>            Reporter: Olivier M
>            Priority: Minor
>              Labels: parser, txt
>         Attachments: form_feed.txt
>
>
> I just noticed that Apache Tika does not preserve form feed characters 
> (U+000C, byte 0C in UTF-8) when parsing a text file.
> Comparing the hex bytes of the original file with the hex bytes of the 
> extracted text shows that the 0C byte is replaced by EF BF BD, which is 
> the UTF-8 encoding of the replacement character (U+FFFD).
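
The byte comparison described in the report can be sketched with the JDK alone. This is only an illustration, not code from Tika or from the reporter: the class name FormFeedCheck, both helper methods, and the sample strings are hypothetical, and the "extracted" string is hard-coded to mimic the reported output rather than produced by actually running TXTParser.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FormFeedCheck {

    /** Returns the UTF-8 bytes of s as space-separated uppercase hex. */
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /**
     * Returns the char indexes where the original has a form feed (U+000C)
     * and the extracted text has U+FFFD at the same position.
     * Assumption: both strings are aligned character by character.
     */
    public static List<Integer> replacedFormFeeds(String original, String extracted) {
        List<Integer> hits = new ArrayList<>();
        int n = Math.min(original.length(), extracted.length());
        for (int i = 0; i < n; i++) {
            if (original.charAt(i) == '\f' && extracted.charAt(i) == '\uFFFD') {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for form_feed.txt and the parser output.
        String original = "page one\fpage two";
        String extracted = "page one\uFFFDpage two";

        System.out.println(hex("\f"));      // form feed as UTF-8 bytes
        System.out.println(hex("\uFFFD"));  // replacement character as UTF-8 bytes
        System.out.println(replacedFormFeeds(original, extracted));
    }
}
```

To check a real file, the hard-coded strings would be replaced by the contents of the attachment and by whatever text the parser returns for it.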



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
