[ https://issues.apache.org/jira/browse/TIKA-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Meier updated TIKA-2484:
--------------------------------
    Description: 
I would like to help improve the recognition accuracy of the CharsetDetector.

To that end, I created a testset of text/plain files to check the quality of 
org.apache.tika.parser.txt.CharsetDetector: charset.tar.gz
(testset created from 
http://source.icu-project.org/repos/icu/icu4j/tags/release-4-8/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml)

The testset was processed with Tika 1.17 (commit 877d621, HEAD from 26.10.2017), 
with and without a custom UTF-7 recognizer, and with the ICU4J 59.1 
CharsetDetector plus the same UTF-7 recognizer. Here are the results:

{noformat}
TIKA-1.17
charset.tar.gz
Correct recognitions: 165/341
{noformat}

{noformat}
TIKA-1.17 + UTF-7 recognizer:
charset.tar.gz
Correct recognitions: 213/341
{noformat}

{noformat}
ICU4J 59.1 + UTF-7 recognizer:
charset.tar.gz
Correct recognitions: 333/341
{noformat}
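
For reference, here is a minimal sketch of how the "correct recognitions" count could be computed. This harness is hypothetical; it assumes the expected charset is embedded in each file name (e.g. "IUC10-ar.UTF-16BE.with-BOM"), which matches the attached files but is not part of any Tika API:

{code:java}
package test;

import java.io.File;
import java.nio.file.Files;

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class DetectionHarness {

    public static void main(String[] args) throws Exception {
        File dir = new File(args[0]); // directory with the extracted testset
        int correct = 0;
        int total = 0;
        for (File file : dir.listFiles()) {
            byte[] data = Files.readAllBytes(file.toPath());
            CharsetDetector detector = new CharsetDetector();
            detector.setText(data);
            CharsetMatch match = detector.detect();
            // expected charset is the second dot-separated token of the name,
            // e.g. "IUC10-ar.UTF-16BE.with-BOM" -> "UTF-16BE"
            String expected = file.getName().split("\\.")[1];
            total++;
            if (match != null && match.getName().equalsIgnoreCase(expected)) {
                correct++;
            }
        }
        System.out.println("Correct recognitions: " + correct + "/" + total);
    }
}
{code}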




As the UTF-7 recognizer, I used these two simple classes:

{code:java}
package test.utils;

import java.util.Arrays;

/**
 * Pattern state container for the Boyer-Moore algorithm
 */
public final class BoyerMoorePattern
{

    private final byte[] pattern;

    private final int[] skipArray;

    public BoyerMoorePattern(byte[] pattern)
    {
        this.pattern = pattern;
        skipArray = new int[256];
        Arrays.fill(skipArray, -1);
        // Record the last index at which each byte value occurs
        // (the bad-character rule of Boyer-Moore)
        for (int i = 0; i < pattern.length; i++)
        {
            skipArray[pattern[i] & 0xFF] = i;
        }
    }

    /**
     * Get the pattern length
     * 
     * @return length as int
     */
    public int getLength()
    {
        return pattern.length;
    }

    /**
     * Searches for the first occurrence of the pattern in the input byte array.
     * 
     * @param data - The data we want to search in
     * @param startIdx - The start index
     * @param endIdx - The end index
     * @return offset as int or -1 if not found at all
     */
    public final int searchPattern(byte[] data, int startIdx, int endIdx)
    {
        int patternLength = pattern.length;
        int skip = 0;
        for (int i = startIdx; i <= endIdx - patternLength; i += skip)
        {
            skip = 0;
            for (int j = patternLength - 1; j >= 0; j--)
            {
                if (pattern[j] != data[i + j])
                {
                    skip = Math.max(1, j - skipArray[data[i + j] & 0xFF]);
                    break;
                }
            }
            if (skip == 0)
            {
                return i;
            }
        }

        return -1;
    }

    /**
     * Searches for the first occurrence of the pattern in the input byte array.
     * 
     * @param data - The data we want to search in
     * @param startIdx - The start index
     * @return offset as int or -1 if not found at all
     */
    public final int searchPattern(byte[] data, int startIdx)
    {
        return searchPattern(data, startIdx, data.length);
    }

    /**
     * Searches for the first occurrence of the pattern in the input byte array.
     * 
     * @param data - The data we want to search in
     * @return offset as int or -1 if not found at all
     */
    public final int searchPattern(byte[] data)
    {
        return searchPattern(data, 0, data.length);
    }
}
{code}
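
For illustration, a minimal usage sketch of the class above (the sample data is made up):

{code:java}
byte[] data = "Hello +SGVsbG8- world".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
BoyerMoorePattern pattern = new BoyerMoorePattern("+".getBytes(java.nio.charset.StandardCharsets.US_ASCII));
int offset = pattern.searchPattern(data); // 6, the index of the first '+'
System.out.println("First match at offset " + offset);
{code}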



{code:java}
package test;

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

import test.utils.BoyerMoorePattern;


public class MyEncodingDetector implements EncodingDetector {

    @Override
    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {

        CharsetDetector detector = new CharsetDetector();

        // setText(InputStream) marks and resets the stream, so the stream
        // must support mark/reset; it can then be re-read below
        detector.setText(input);
        CharsetMatch match = detector.detect();

        // determine whether the data is actually UTF-7
        String charsetName = isItUtf7(match, IOUtils.toByteArray(input));

        if (charsetName != null) {
            return Charset.forName(charsetName);
        }
        return null;
    }

    /**
     * Checks for a UTF-7 BOM and determines whether the data is UTF-7 or not.
     *
     * @param match - The default match we expect, if it is not UTF-7
     * @param data - The byte array we want to check
     *
     * @return the charset name, or null if nothing matched at all
     */
    private String isItUtf7(CharsetMatch match, byte[] data) {
        if (isUTF7withBOM(data) || isUTF7withoutBOM(data)) {
            return "UTF-7";
        }
        if (match != null) {
            return match.getName();
        }
        return null;
    }

    private boolean isUTF7withBOM(byte[] data) {
        // Check the byte array for one of the UTF-7 "byte order marks" (BOM):
        // 43 47 118 56  ("+/v8")
        // 43 47 118 57  ("+/v9")
        // 43 47 118 43  ("+/v+")
        // 43 47 118 47  ("+/v/")
        // (">= 4" rather than "> 4", which would skip 4-byte inputs)
        return data.length >= 4
                && data[0] == 43 && data[1] == 47 && data[2] == 118
                && (data[3] == 56 || data[3] == 57 || data[3] == 43 || data[3] == 47);
    }

    private boolean isUTF7withoutBOM(byte[] data) {
        // create Boyer-Moore patterns for the '+' and '-' shift markers
        BoyerMoorePattern startPattern = new BoyerMoorePattern("+".getBytes(StandardCharsets.US_ASCII));
        int startPosSP = startPattern.searchPattern(data);

        BoyerMoorePattern endPattern = new BoyerMoorePattern("-".getBytes(StandardCharsets.US_ASCII));
        int startPosEP = endPattern.searchPattern(data);

        if (startPosSP != -1 && startPosEP != -1) {
            // both markers were found, so check for the basic "+xxx-" pattern
            // with a regular expression: a word of at least 3 characters
            Pattern p = Pattern.compile("\\+[a-zA-Z]\\w{2,}-");
            // the Base64 alphabet of UTF-7 is ASCII, so decode as US-ASCII
            Matcher m = p.matcher(new String(data, StandardCharsets.US_ASCII));

            int numberMatches = 0;
            while (m.find()) {
                numberMatches++;
            }

            System.out.println("Number of possible UTF-7 regex matches: " + numberMatches);

            // if there are more than minimumMatches "+xxx-" words,
            // the expected encoding shall be UTF-7
            int minimumMatches = 3;
            if (numberMatches > minimumMatches) {
                return true;
            }
        }

        return false;
    }
}
{code}
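
A minimal sketch of how the detector could be invoked (the file name is just an example; a BufferedInputStream provides the mark/reset support that setText(InputStream) relies on):

{code:java}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.nio.charset.Charset;

import org.apache.tika.metadata.Metadata;

// ...
try (BufferedInputStream in = new BufferedInputStream(
        new FileInputStream("IUC10-ar.UTF-7.without-BOM"))) {
    Charset charset = new MyEncodingDetector().detect(in, new Metadata());
    System.out.println("Detected: " + charset);
}
{code}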



The current regex and the fixed match count may produce some false positive 
(FP) recognitions.
A better approach might be to set minimumMatches depending on the amount of 
text given to the detector, as sketched below.
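
For example (the divisor 512 is an untuned guess, not a measured value):

{code:java}
// Sketch: scale the threshold with the input size instead of using a fixed 3.
// Roughly one "+xxx-" run per 512 bytes is required before calling it UTF-7.
int minimumMatches = Math.max(3, data.length / 512);
{code}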

This is just a simple first attempt, not production-ready code. It does not 
even cover all possible UTF-7 strings.



By the way:

I am perfectly aware that the current testset only covers a few encodings. 
However, the attached files address the main weakness of the current 
CharsetDetector.

I don't know the history that led to the creation of the CharsetDetector in 
Tika, or why ICU4J was rebuilt with extensions like the cp866 ngram detection 
instead of contributing to ICU4J development.
Wouldn't it be better to forward the CharsetDetector changes to the ICU4J 
developers so they can implement the missing encodings?

Is it planned to include the newest version of ICU4J in future releases of TIKA?

What about using neural networks to determine some or all charsets (given that 
there are enough test files)?



> Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 
> with/without BOMs correctly
> --------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2484
>                 URL: https://issues.apache.org/jira/browse/TIKA-2484
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.16, 1.17
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: IUC10-ar.UTF-16BE.with-BOM, 
> IUC10-ar.UTF-16BE.without-BOM, IUC10-ar.UTF-16LE.with-BOM, 
> IUC10-ar.UTF-16LE.without-BOM, IUC10-ar.UTF-32BE.with-BOM, 
> IUC10-ar.UTF-32BE.without-BOM, IUC10-ar.UTF-32LE.with-BOM, 
> IUC10-ar.UTF-32LE.without-BOM, IUC10-ar.UTF-7.with-BOM, 
> IUC10-ar.UTF-7.without-BOM, IUC10-fr.UTF-16BE.with-BOM, 
> IUC10-fr.UTF-16BE.without-BOM, IUC10-fr.UTF-16LE.with-BOM, 
> IUC10-fr.UTF-16LE.without-BOM, IUC10-fr.UTF-32BE.with-BOM, 
> IUC10-fr.UTF-32BE.without-BOM, IUC10-fr.UTF-32LE.with-BOM, 
> IUC10-fr.UTF-32LE.without-BOM, IUC10-fr.UTF-7.with-BOM, 
> IUC10-fr.UTF-7.without-BOM
>


