[ 
https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier M updated TIKA-1794:
----------------------------
    Description: 
Just noticed that Apache Tika removes form feed characters (byte 0C in UTF-8)
when parsing a text file.

Comparing the hex bytes of the original file with those of the extracted text
shows that the 0C byte is replaced by EF BF BD, which is the UTF-8 encoding of
the replacement character (U+FFFD).
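
For reference, a minimal sketch that is independent of Tika and prints the
UTF-8 bytes of a form feed and of the replacement character, so the two hex
values mentioned above are easy to recognize in a dump. The class and method
names (FormFeedBytes, toHex, utf8Hex) are made up for illustration only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika.
public class FormFeedBytes {

    // Hex-encode bytes as lowercase pairs, e.g. {0x0C} -> "0c".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Hex of the UTF-8 encoding of a string.
    static String utf8Hex(String s) {
        return toHex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\f"));      // form feed, U+000C
        System.out.println(utf8Hex("\uFFFD"));  // replacement character, U+FFFD
    }
}
```

Running utf8Hex on the extracted text from the parser should make any
EF BF BD sequence stand out where a 0C byte was expected.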


  was:
Just noticed that Apache Tika removes form feed characters (byte 0C in UTF-8)
when parsing a text file.

Comparing the hex bytes of the original file with those of the extracted text
shows that the 0C byte is replaced by EF BF BD, which is the UTF-8 encoding of
the replacement character (U+FFFD).

        public static void main(String[] args) {
                InputStream is = null;
                
                try {
                        is = new FileInputStream("form_feed.txt");
                        
                        AutoDetectParser parser = new AutoDetectParser();
                        Writer stringWriter = new StringWriter();
                        ContentHandler handler = new BodyContentHandler(stringWriter);
                        Metadata metadata = new Metadata();
                        parser.parse(is, handler, metadata);
                        
                        String extractedText = stringWriter.toString();
                        System.out.println(extractedText);
                        
                        String hex = Hex.encodeHexString(extractedText.getBytes("UTF-8"));
                        
                        System.out.println(hex); //0C is replaced by EFBFBD

                } catch (Exception e) {
                        e.printStackTrace();
                } finally {
                        IOUtils.closeQuietly(is);
                }
        }



> TXTParser removes form feed characters
> --------------------------------------
>
>                 Key: TIKA-1794
>                 URL: https://issues.apache.org/jira/browse/TIKA-1794
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>            Reporter: Olivier M
>            Priority: Minor
>              Labels: parser, txt
>         Attachments: form_feed.txt
>
>
> Just noticed that Apache Tika removes form feed characters (byte 0C in UTF-8)
> when parsing a text file.
> Comparing the hex bytes of the original file with those of the extracted text
> shows that the 0C byte is replaced by EF BF BD, which is the UTF-8 encoding of
> the replacement character (U+FFFD).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
