[ 
https://issues.apache.org/jira/browse/PDFBOX-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836297#comment-16836297
 ] 

Tilman Hausherr edited comment on PDFBOX-4539 at 5/9/19 11:43 AM:
------------------------------------------------------------------

Your code does not reset the decoder (what if an earlier call left it holding 
an incomplete UTF-8 sequence?), and it uses two decoders. Why?

And if you do the reset, you should check whether the alleged optimization is 
still there. Did you verify this optimization with a benchmark, or is it 
something that you noticed in code review?
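
For illustration only, here is a minimal sketch of what a cached decoder with an explicit reset could look like. It is not the PDFBox code and not the reporter's patch. The class name `Utf8Validator`, the `ThreadLocal` holder and the use of `StandardCharsets` (Java 7+) and `ThreadLocal.withInitial` (Java 8+) are assumptions made for this example. A per-thread decoder is used because `CharsetDecoder` instances are documented as not safe for concurrent use, so a single shared static instance could misbehave if parsers run in parallel. Note also that the Javadoc of the convenience method `CharsetDecoder.decode(ByteBuffer)` says it resets the decoder itself, so the explicit `reset()` is defensive. Whether this caching is actually faster would still need a benchmark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
final class Utf8Validator
{
    // One cached decoder per thread (assumption: callers may run concurrently).
    // A new decoder reports malformed input by default, which is what we rely on.
    private static final ThreadLocal<CharsetDecoder> DECODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newDecoder());

    private Utf8Validator()
    {
    }

    /**
     * Returns true if a byte sequence is valid UTF-8.
     */
    static boolean isValidUTF8(byte[] input)
    {
        CharsetDecoder decoder = DECODER.get();
        // Clear any state left over from a previous, possibly failed, call.
        decoder.reset();
        try
        {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        }
        catch (CharacterCodingException e)
        {
            return false;
        }
    }
}
```

With this shape only one decoder is used per call, and a truncated sequence in one call does not affect the result of the next call.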



> Cache CharsetDecoder
> --------------------
>
>                 Key: PDFBOX-4539
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4539
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Priority: Major
>              Labels: performance
>             Fix For: 2.0.16
>
>
> We were using PDFBox to parse and process a large number of PDFs, which could 
> potentially contain thousands of pages in total, so performance mattered to 
> us.
> Thus, we'd like to suggest caching the CharsetDecoder, which is currently 
> instantiated on each call of `isValidUTF8(byte[])`.
> Our suggestion in BaseParser.java:
> {code:java}
> private static final CharsetDecoder csUTF_8 = Charsets.UTF_8.newDecoder();
> /**
>  * Returns true if a byte sequence is valid UTF-8.
>  */
> private boolean isValidUTF8(byte[] input)
> {
>     CharsetDecoder cs = Charsets.UTF_8.newDecoder();
>     try
>     {
>         cs.decode(ByteBuffer.wrap(input));
>         csUTF_8.decode(ByteBuffer.wrap(input));
>         return true;
>     }
>     catch (CharacterCodingException e)
>     {
>         return false;
>     }
> }
> {code}
>  



