[ 
https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier M closed TIKA-1794.
---------------------------
    Resolution: Won't Fix

Marked as Won't Fix because the form feed character (U+000C) is not allowed in XHTML 1.0.
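
For readers wondering why the character cannot simply be passed through: XHTML 1.0 is an XML 1.0 application, and the XML 1.0 "Char" production excludes most C0 control characters, U+000C among them. The following is a minimal sketch of that rule, not Tika's actual implementation; the class and method names (XmlCharSketch, isXml10Char, replaceInvalid) are hypothetical and chosen only for illustration.

```java
final class XmlCharSketch {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Replace every code point that XML 1.0 forbids with U+FFFD.
    // This mirrors the substitution observed in the report; it is an
    // assumption that a sanitizer would work this way, not a copy of Tika code.
    static String replaceInvalid(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(isXml10Char(cp) ? cp : 0xFFFD));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x0C)); // form feed
        System.out.println(isXml10Char(0x0A)); // line feed
        System.out.println(replaceInvalid("page1\fpage2").indexOf('\uFFFD'));
    }
}
```

Tab, line feed and carriage return are the only C0 controls that pass the check, which is why those survive extraction while the form feed does not.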

> TXTParser removes form feed characters
> --------------------------------------
>
>                 Key: TIKA-1794
>                 URL: https://issues.apache.org/jira/browse/TIKA-1794
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>            Reporter: Olivier M
>            Priority: Minor
>              Labels: parser, txt
>         Attachments: form_feed.txt
>
>
> Just noticed that Apache Tika does not preserve form feed characters (byte 0C
> in UTF-8) when parsing a text file.
> If I compare the hex bytes of the original file with those of the extracted
> text, I can see that the 0C byte is replaced by EF BF BD, which is the UTF-8
> encoding of the replacement character (U+FFFD).
> {code:title=Test.java|borderStyle=solid}
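> import java.io.FileInputStream;
> import java.io.InputStream;
> import java.io.StringWriter;
> import java.io.Writer;
>
> import org.apache.commons.codec.binary.Hex;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
>
> public class Test {
>     public static void main(String[] args) {
>         // try-with-resources closes the stream, replacing the finally block
>         try (InputStream is = new FileInputStream("form_feed.txt")) {
>             AutoDetectParser parser = new AutoDetectParser();
>             Writer stringWriter = new StringWriter();
>             ContentHandler handler = new BodyContentHandler(stringWriter);
>             Metadata metadata = new Metadata();
>             parser.parse(is, handler, metadata);
>
>             String extractedText = stringWriter.toString();
>             System.out.println(extractedText);
>
>             // Hex-encode the extracted text to inspect its raw UTF-8 bytes
>             String hex = Hex.encodeHexString(extractedText.getBytes("UTF-8"));
>             System.out.println(hex); // 0C replaced by EFBFBD
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }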
> {code}
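
The byte-level comparison described in the report can also be reproduced with the JDK alone. The sketch below is illustrative only: it uses two in-memory strings as stand-ins for the original file and the extracted text, and the names ByteDiffSketch, toHex and firstMismatch are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ByteDiffSketch {

    // Render bytes as lowercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Index of the first differing byte, or -1 if the arrays are identical.
    // If one array is a strict prefix of the other, the shorter length is returned.
    static int firstMismatch(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    public static void main(String[] args) {
        // Stand-ins for the original file content and the extracted text.
        byte[] original = "a\fb".getBytes(StandardCharsets.UTF_8);
        byte[] extracted = "a\uFFFDb".getBytes(StandardCharsets.UTF_8);

        System.out.println(toHex(original));
        System.out.println(toHex(extracted));
        System.out.println(firstMismatch(original, extracted));
    }
}
```

In real use, the two arrays would come from reading form_feed.txt and from encoding the extracted string as UTF-8.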



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
