[
https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier M updated TIKA-1794:
----------------------------
Description:
I just noticed that Apache Tika does not preserve form feed characters (byte
0C in UTF-8) when parsing a text file.
Comparing the hex bytes of the original file with the hex bytes of the
extracted text shows that each 0C byte is replaced by EF BF BD, which is the
UTF-8 encoding of the replacement character (U+FFFD).
was:
I just noticed that Apache Tika does not preserve form feed characters (byte
0C in UTF-8) when parsing a text file.
Comparing the hex bytes of the original file with the hex bytes of the
extracted text shows that each 0C byte is replaced by EF BF BD, which is the
UTF-8 encoding of the replacement character (U+FFFD).
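import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
// Hex and IOUtils are assumed to be the Commons Codec and Commons IO classes.
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.io.IOUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class FormFeedTest { // class name is illustrative
    public static void main(String[] args) {
        InputStream is = null;
        try {
            is = new FileInputStream("form_feed.txt");
            AutoDetectParser parser = new AutoDetectParser();
            Writer stringWriter = new StringWriter();
            ContentHandler handler = new BodyContentHandler(stringWriter);
            Metadata metadata = new Metadata();
            // Parse the file and collect the extracted body text.
            parser.parse(is, handler, metadata);
            String extractedText = stringWriter.toString();
            System.out.println(extractedText);
            // Hex-dump the extracted text as UTF-8 bytes.
            String hex = Hex.encodeHexString(extractedText.getBytes("UTF-8"));
            System.out.println(hex); // 0C is replaced by EFBFBD
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(is);
        }
    }
}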
> TXTParser removes form feed characters
> --------------------------------------
>
> Key: TIKA-1794
> URL: https://issues.apache.org/jira/browse/TIKA-1794
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Reporter: Olivier M
> Priority: Minor
> Labels: parser, txt
> Attachments: form_feed.txt
>
>
> I just noticed that Apache Tika does not preserve form feed characters (byte
> 0C in UTF-8) when parsing a text file.
> Comparing the hex bytes of the original file with the hex bytes of the
> extracted text shows that each 0C byte is replaced by EF BF BD, which is the
> UTF-8 encoding of the replacement character (U+FFFD).
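
The hex comparison described in the report can be sketched with the JDK alone. This is only an illustration, not Tika code: the class name FormFeedCheck, its helper methods, and the sample strings are hypothetical, and the "extracted" string is a hand-written stand-in for parser output rather than a real Tika result. It shows that a form feed encodes to the single byte 0C, that U+FFFD encodes to EF BF BD, and one possible way to flag form feeds that appear to have been swapped for replacement characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Tika.
public class FormFeedCheck {

    /** Hex-encodes the UTF-8 bytes of a string (lowercase, no separators). */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Counts occurrences of a single character in a string. */
    public static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    /**
     * Returns true if the extracted text has fewer form feeds than the
     * original and has gained at least that many U+FFFD characters.
     */
    public static boolean formFeedsReplaced(String original, String extracted) {
        int lostFormFeeds = count(original, '\f') - count(extracted, '\f');
        int gainedReplacements =
                count(extracted, '\uFFFD') - count(original, '\uFFFD');
        return lostFormFeeds > 0 && gainedReplacements >= lostFormFeeds;
    }

    public static void main(String[] args) {
        String original = "page one\fpage two";
        // Stand-in for parser output; assumed for illustration only.
        String extracted = "page one\uFFFDpage two";
        System.out.println(utf8Hex("\f"));      // form feed as UTF-8
        System.out.println(utf8Hex("\uFFFD"));  // replacement character as UTF-8
        System.out.println(formFeedsReplaced(original, extracted));
    }
}
```

In a real check, the original string would come from reading form_feed.txt as UTF-8 and the extracted string from the parser output shown in the earlier description.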