[
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797058#comment-17797058
]
Axel Howind commented on PDFBOX-5660:
-------------------------------------
[~tilman] Some info about the charset part:
PDFBox has to decode the byte buffer into a string. Until PDFBOX-3347 it always
used UTF-8 to do that. The problem there seems to have been that the character
data was not encoded as UTF-8. So the code was changed like this:
# Test if buffer uses UTF-8 encoding
# If yes, decode using UTF-8, otherwise decode using Windows-1252
Now the test in step 1 was: try to decode using UTF-8. If that succeeds, return
true, otherwise false.
That means the buffer was always decoded twice, whether or not it was UTF-8.
The patch changes this: it tries to decode using UTF-8 and, if that succeeds,
returns the text directly. Otherwise it uses the alternative encoding.
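A minimal sketch of that single-pass idea, not the actual patch: the class name Utf8WithFallback and the method decode are hypothetical, and the fallback charset is passed in as a parameter. The strict decoder reports malformed input instead of silently replacing it, so a failed UTF-8 attempt is detected and the fallback is only used then.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only, not PDFBox code.
class Utf8WithFallback {
    // Try a strict UTF-8 decode and return its result if it succeeds;
    // only on malformed input decode again with the fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // bytes 0xC3 0xA9
        byte[] notUtf8 = { (byte) 0xE9 };                         // a lone 0xE9 is malformed UTF-8
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
        System.out.println(decode(notUtf8, StandardCharsets.ISO_8859_1));
    }
}
```

With this shape, valid UTF-8 input is decoded exactly once, and invalid input costs one failed attempt plus the fallback decode.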
The alternative decoding using Windows-1252 was introduced by the fix for
PDFBOX-3347, which contained a file whose characters were not properly encoded
as UTF-8.
Now the alternative encoding before the patch has always been Windows-1252. But
that codepage is not guaranteed to be provided by the JDK. Because of that, I
added the static field that is initialised to either Windows-1252 if present,
or ISO-8859-1 if not. ISO-8859-1 is guaranteed to be present because it is one
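of the standard charsets defined by the JDK. As far as printable characters go,
Windows-1252 is a superset of ISO-8859-1: it assigns additional characters to
the byte range 0x80-0x9F, which ISO-8859-1 reserves for control codes.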
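For illustration, such a field could be initialised roughly as follows. This is a sketch under assumptions, not the code from the patch: the names AlternativeCharset and ALTERNATIVE are hypothetical. Charset.isSupported and Charset.forName are standard java.nio.charset calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder class for illustration only.
class AlternativeCharset {
    // Windows-1252 if this runtime provides it, otherwise ISO-8859-1,
    // which every Java platform is required to support.
    static final Charset ALTERNATIVE = Charset.isSupported("windows-1252")
            ? Charset.forName("windows-1252")
            : StandardCharsets.ISO_8859_1;

    public static void main(String[] args) {
        System.out.println("Alternative charset: " + ALTERNATIVE.name());
    }
}
```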
The following characters are in both Windows-1252 and ISO-8859-1:
{noformat}
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ{noformat}
And these are in Windows-1252 but not in ISO-8859-1:
{noformat}
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ{noformat}
So even if Windows-1252 is not present on the system, parsing using
ISO-8859-1 should work in most cases (i.e. when none of the characters from the
second list are used).
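To make that condition concrete: the two charsets only disagree on bytes in the range 0x80-0x9F, so a buffer without such bytes decodes to the same text with either. The following is a small sketch with a hypothetical class and method name, not something from the patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class CharsetRangeCheck {
    // True if no byte lies in 0x80-0x9F, the only range where
    // Windows-1252 and ISO-8859-1 map bytes differently.
    static boolean decodesSameInBoth(byte[] bytes) {
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = { 'a', (byte) 0xE9, (byte) 0x80 };
        System.out.println(decodesSameInBoth(data));
        // ISO-8859-1 maps byte 0x80 to the control character U+0080,
        // not to the euro sign that Windows-1252 puts there.
        String s = new String(new byte[] { (byte) 0x80 }, StandardCharsets.ISO_8859_1);
        System.out.println(Integer.toHexString(s.charAt(0)));
    }
}
```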
The comment about the file from PDFBOX-3347 is because I tried to load that
file in the debugger and my breakpoint on the alternative encoding was not hit
(neither before nor after applying the patch). Maybe I did something wrong with
the debugger or something else changed and the alternative encoding is not
necessary at all anymore.
> Improve code quality (5)
> ------------------------
>
> Key: PDFBOX-5660
> URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: AnnotationSample.Standard.pdf,
> DRY_refactoring_Typ2CharStringParser.patch,
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
> Simplify_string_conversion_in_PDFHighlighter.patch,
> avoid_multiple_unboxing.patch, code_cleanup.patch,
> do_not_create_temporary_File_instance.patch,
> extract_common_code,_move_toUpperCase()_out_of_loop.patch,
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch,
> make_inner_class_static.patch, refactor_isEndOfName.patch,
> remove_code_duplication_in_Type2CharStringParser.patch,
> remove_obsolete_class_NullOutputStream.patch,
> remove_unnecessary_calls_to_toString()_String_valueOf().patch,
> replace_System_getProperty()_calls.patch, screenshot-1.png,
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
> simplify_stream_operations.patch, use_Map_ofEntries().patch,
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch,
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
> use_String_join().patch, use_switch_for_readability.patch,
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a longterm issue for the task to improve code quality, by using the
> SonarQube report, hints in different IDEs, the FindBugs tool and other code
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.