[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797058#comment-17797058 ]
Axel Howind commented on PDFBOX-5660:
-------------------------------------

[~tilman] Some info about the charset part:

PDFBox has to decode the byte buffer into a string. Until PDFBOX-3347 it always used UTF-8 to do that. The problem there seems to have been that the character data was not encoded as UTF-8, so the code was changed like this:
# Test whether the buffer uses UTF-8 encoding.
# If yes, decode using UTF-8; otherwise decode using Windows-1252.

The test in step 1 was: try to decode using UTF-8; if that succeeds, return true, otherwise false. That means the buffer was decoded twice, regardless of whether it was UTF-8. The patch changes this: try to decode using UTF-8 and, if that succeeds, return the text; otherwise use the alternative encoding.

The alternative decoding using Windows-1252 was introduced by the fix for PDFBOX-3347, which contained a file whose characters were not properly UTF-8 encoded. Before the patch, the alternative encoding was always Windows-1252, but that code page is not guaranteed to be provided by the JDK. Because of that, I added the static field, which is initialised to Windows-1252 if present, or to ISO-8859-1 if not. ISO-8859-1 is guaranteed to be present because it is one of the standard charsets defined by the JDK.

For printable characters, Windows-1252 is a superset of ISO-8859-1: it adds some characters in the range 0x80 to 0x9F, which ISO-8859-1 reserves for control codes. The following characters are in both Windows-1252 and ISO-8859-1:
{noformat}
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ{noformat}
And these are in Windows-1252 but not in ISO-8859-1:
{noformat}
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ{noformat}
So even if Windows-1252 is not present on the system, parsing using ISO-8859-1 should work in most cases (i.e. when none of the characters from the second table are used).

I commented on the file from PDFBOX-3347 because I tried to load that file in the debugger and my breakpoint on the alternative encoding was not hit (neither before nor after applying the patch).
Maybe I did something wrong with the debugger, or something else has changed and the alternative encoding is no longer necessary at all.

> Improve code quality (5)
> ------------------------
>
>                 Key: PDFBOX-5660
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5660
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Tilman Hausherr
>            Priority: Minor
>         Attachments: AnnotationSample.Standard.pdf,
> DRY_refactoring_Typ2CharStringParser.patch,
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
> Simplify_string_conversion_in_PDFHighlighter.patch,
> avoid_multiple_unboxing.patch, code_cleanup.patch,
> do_not_create_temporary_File_instance.patch,
> extract_common_code,_move_toUpperCase()_out_of_loop.patch,
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch,
> make_inner_class_static.patch, refactor_isEndOfName.patch,
> remove_code_duplication_in_Type2CharStringParser.patch,
> remove_obsolete_class_NullOutputStream.patch,
> remove_unnecessary_calls_to_toString()_String_valueOf().patch,
> replace_System_getProperty()_calls.patch, screenshot-1.png,
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
> simplify_stream_operations.patch, use_Map_ofEntries().patch,
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch,
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
> use_String_join().patch, use_switch_for_readability.patch,
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a longterm issue for the task to improve code quality, by using the
> SonarQube report, hints in different IDEs, the FindBugs tool and other code
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.
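
For reference, the decoding approach described in the comment (try UTF-8 once, fall back to an alternative charset only if that fails) might look roughly like the sketch below. This is not the actual PDFBox code or the attached patch: the class name {{FallbackDecoder}}, the field name {{ALTERNATIVE_CHARSET}} and the method name {{decode}} are hypothetical and chosen only for illustration. Only standard JDK calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the PDFBox implementation.
final class FallbackDecoder {

    // Windows-1252 if the runtime provides it, otherwise ISO-8859-1,
    // which every Java platform is required to support.
    static final Charset ALTERNATIVE_CHARSET =
            Charset.isSupported("windows-1252")
                    ? Charset.forName("windows-1252")
                    : StandardCharsets.ISO_8859_1;

    private FallbackDecoder() {
    }

    // Decodes the bytes as strict UTF-8; if they are not valid UTF-8,
    // decodes them with the alternative charset instead.
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to the single-byte charset.
            return new String(bytes, ALTERNATIVE_CHARSET);
        }
    }
}
```

With this shape, valid UTF-8 input is decoded exactly once. Input such as the single byte 0xE9, which is not valid UTF-8 on its own, goes through the fallback and yields "é" with either Windows-1252 or ISO-8859-1, because both map that byte to the same character.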
--
This message was sent by Atlassian Jira
(v8.20.10#820010)