[
https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006797#comment-15006797
]
Ken Krugler commented on TIKA-1794:
-----------------------------------
Tika uses XHTML 1.0, which doesn't allow the form-feed character. See section
C.15 of http://www.w3.org/TR/xhtml1/#C_15:
bq. Some characters that are legal in HTML documents, are illegal in XML
document. For example, in HTML, the Formfeed character (U+000C) is treated as
white space, in XHTML, due to XML's definition of characters, it is illegal.
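To make the rule concrete, here is a minimal sketch of the XML 1.0 "Char" production as a Java check, with an illustrative substitution step. The class name XmlCharCheck and the helpers isXmlChar, replaceIllegalXmlChars and toHex are hypothetical names chosen for this example. This is not Tika's actual code; it only reproduces the byte pattern described in the report, on the assumption that an illegal character is swapped for U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class XmlCharCheck {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: replace every code point that XML 1.0 cannot
    // carry with U+FFFD (the Unicode replacement character).
    static String replaceIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> out.appendCodePoint(isXmlChar(cp) ? cp : 0xFFFD));
        return out.toString();
    }

    // Hex dump of the UTF-8 bytes of a string, lowercase, no separators.
    static String toHex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String input = "a\fb"; // 'a', form feed (U+000C), 'b'
        System.out.println(isXmlChar('\f'));                      // false
        System.out.println(toHex(input));                         // 610c62
        System.out.println(toHex(replaceIllegalXmlChars(input))); // 61efbfbd62
    }
}
```

The form feed fails the check because U+000C lies below U+0020 and is not one of the three permitted control characters (tab, line feed, carriage return). U+FFFD encodes to the bytes EF BF BD in UTF-8, which matches what the reporter saw in the hex dump.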
> TXTParser removes form feed characters
> --------------------------------------
>
> Key: TIKA-1794
> URL: https://issues.apache.org/jira/browse/TIKA-1794
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Reporter: Olivier M
> Priority: Minor
> Labels: parser, txt
> Attachments: form_feed.txt
>
>
> Just noticed that Apache Tika removes form feed characters (U+000C, byte 0C
> in UTF-8) when parsing a text file.
> If I compare the hex bytes of the original file with the hex bytes of the
> extracted text, I can see that the 0C byte is replaced by EF BF BD, which
> is the UTF-8 encoding of the Unicode replacement character (U+FFFD).
> {code:title=Test.java|borderStyle=solid}
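> import java.io.FileInputStream;
> import java.io.InputStream;
> import java.io.StringWriter;
> import java.io.Writer;
>
> // Hex and IOUtils are assumed to come from Commons Codec and Commons IO.
> import org.apache.commons.codec.binary.Hex;
> import org.apache.commons.io.IOUtils;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
>
> public class Test {
>     public static void main(String[] args) {
>         InputStream is = null;
>
>         try {
>             is = new FileInputStream("form_feed.txt");
>
>             // Parse the file and collect the body text in a StringWriter.
>             AutoDetectParser parser = new AutoDetectParser();
>             Writer stringWriter = new StringWriter();
>             ContentHandler handler = new BodyContentHandler(stringWriter);
>             Metadata metadata = new Metadata();
>             parser.parse(is, handler, metadata);
>
>             String extractedText = stringWriter.toString();
>             System.out.println(extractedText);
>
>             // Dump the extracted text as hex to inspect the raw UTF-8 bytes.
>             String hex = Hex.encodeHexString(extractedText.getBytes("UTF-8"));
>
>             System.out.println(hex); // 0C replaced by EFBFBD
>         } catch (Exception e) {
>             e.printStackTrace();
>         } finally {
>             IOUtils.closeQuietly(is);
>         }
>     }
> }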
> {code}