[jira] [Created] (PDFBOX-4539) Cache CharsetDecoder

Jonathan (JIRA) Thu, 09 May 2019 04:32:08 -0700

Jonathan created PDFBOX-4539:
--------------------------------

             Summary: Cache CharsetDecoder
                 Key: PDFBOX-4539
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4539
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 2.0.14
            Reporter: Jonathan
             Fix For: 2.0.16



We are using PDFBox to parse and process a large number of PDFs, which can 
contain thousands of pages in total, so performance matters to us.

We would therefore like to suggest caching the CharsetDecoder, which is currently 
instantiated on every call to `isValidUTF8(byte[])`.

Our suggestion for BaseParser.java:
{code:java}
private static final CharsetDecoder csUTF_8 = Charsets.UTF_8.newDecoder();

/**
 * Returns true if a byte sequence is valid UTF-8.
 */
private boolean isValidUTF8(byte[] input)
{
    try
    {
        // Reuse the cached decoder instead of creating a new one on each call.
        csUTF_8.decode(ByteBuffer.wrap(input));
        return true;
    }
    catch (CharacterCodingException e)
    {
        return false;
    }
}
{code}
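
One caveat with a shared static decoder: `CharsetDecoder` instances are not safe for use by multiple concurrent threads. If several documents might be parsed in parallel, a per-thread cache could be one way to keep the benefit. The following is only a sketch under assumptions, not the actual PDFBox change: the class name `Utf8Check` and the field `UTF8_DECODER` are hypothetical, it uses the JDK's `StandardCharsets` rather than PDFBox's `Charsets` helper, and `ThreadLocal.withInitial` requires Java 8 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone holder for the check; not part of PDFBox.
final class Utf8Check
{
    // One cached decoder per thread, because a CharsetDecoder must not be
    // shared between threads that use it concurrently.
    private static final ThreadLocal<CharsetDecoder> UTF8_DECODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newDecoder());

    private Utf8Check()
    {
    }

    /**
     * Returns true if a byte sequence is valid UTF-8.
     */
    static boolean isValidUTF8(byte[] input)
    {
        try
        {
            // decode(ByteBuffer) resets the decoder before decoding,
            // so the cached instance can be reused across calls.
            UTF8_DECODER.get().decode(ByteBuffer.wrap(input));
            return true;
        }
        catch (CharacterCodingException e)
        {
            // Malformed input is reported as an exception by default.
            return false;
        }
    }
}
```

The trade-off is one decoder per parsing thread instead of one per call, which keeps the saving on allocation while avoiding shared mutable state.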
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
