[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797058#comment-17797058
 ] 

Axel Howind commented on PDFBOX-5660:
-------------------------------------

[~tilman] Some info about the charset part:

PDFBox has to decode the byte buffer into a string. It used UTF-8 for that 
until PDFBOX-3347. The problem there seems to have been that the character 
data was not encoded as UTF-8, so the code was changed to work like this:
 # Test if buffer uses UTF-8 encoding
 # If yes, decode using UTF-8, otherwise decode using Windows-1252

The test in step 1 was: try to decode using UTF-8. If that succeeds, return 
true, otherwise false.

That means the buffer was always decoded twice, whether or not it was UTF-8. 
The patch changes this: try to decode using UTF-8 and, if that succeeds, 
return the text. Otherwise decode using the alternative encoding.
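
A minimal sketch of that single-pass idea, not the actual patch: the class 
name, the method name and the fallback parameter are hypothetical and only 
chosen for illustration. It relies on a strict UTF-8 decoder that reports 
malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not the PDFBox code).
final class SinglePassDecodeSketch
{
    private SinglePassDecodeSketch()
    {
    }

    // Tries strict UTF-8 first and returns the decoded text directly.
    // Only when the bytes are not valid UTF-8 is the fallback charset used.
    static String decode(byte[] bytes, Charset fallback)
    {
        try
        {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        }
        catch (CharacterCodingException e)
        {
            // Not valid UTF-8: decode with the alternative encoding instead.
            return new String(bytes, fallback);
        }
    }
}
```

With this shape, valid UTF-8 input is decoded exactly once, and the second 
decoding pass only happens for input that fails the strict UTF-8 decode.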

The alternative decoding using Windows-1252 was introduced by the fix for 
PDFBOX-3347, which involved a file whose characters were not properly UTF-8 
encoded.

Before the patch, the alternative encoding was always Windows-1252. But that 
codepage is not guaranteed to be provided by the JDK. Because of that, I added 
a static field that is initialised to Windows-1252 if present, or to 
ISO-8859-1 if not. ISO-8859-1 is guaranteed to be present because it is one of 
the standard charsets defined by the JDK. In terms of printable characters, 
Windows-1252 is a superset of ISO-8859-1: it adds printable characters in the 
byte range 0x80-0x9F, where ISO-8859-1 only has control codes.
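
Roughly, such a static field could be initialised as in the following sketch. 
The class, field and method names are hypothetical and not taken from the 
patch. Only Charset.isSupported(), Charset.forName() and 
StandardCharsets.ISO_8859_1 are real JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder class, for illustration only (not the PDFBox code).
final class FallbackCharsetSketch
{
    // Windows-1252 if this JDK provides it, otherwise ISO-8859-1.
    static final Charset ALTERNATIVE_CHARSET = pickAlternativeCharset();

    private FallbackCharsetSketch()
    {
    }

    static Charset pickAlternativeCharset()
    {
        String preferred = "windows-1252";
        if (Charset.isSupported(preferred))
        {
            return Charset.forName(preferred);
        }
        // ISO-8859-1 is one of the standard charsets every JDK must provide.
        return StandardCharsets.ISO_8859_1;
    }
}
```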

The following characters are in both Windows-1252 and ISO-8859-1: 
{noformat}
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ{noformat}
And these are in Windows-1252 but not in ISO-8859-1:
{noformat}
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ{noformat}
So even if Windows-1252 is not present on the system, parsing using 
ISO-8859-1 should work in most cases (i.e. when none of the characters from 
the second list are used).
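
To illustrate: the two charsets only differ for bytes in the range 0x80-0x9F, 
so a check like the following sketch would tell whether the ISO-8859-1 
fallback might give a different result than Windows-1252 for a given buffer. 
The class and method names are hypothetical and nothing like this is part of 
the patch.

```java
// Hypothetical helper, for illustration only (not the PDFBox code).
final class FallbackRangeSketch
{
    private FallbackRangeSketch()
    {
    }

    // Returns true if the buffer contains a byte in 0x80-0x9F, the only
    // range where Windows-1252 and ISO-8859-1 assign different characters.
    static boolean mayDecodeDifferently(byte[] bytes)
    {
        for (byte b : bytes)
        {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value >= 0x80 && value <= 0x9F)
            {
                return true;
            }
        }
        return false;
    }
}
```

For example, the byte 0x80 is the euro sign in Windows-1252, but ISO-8859-1 
decodes it to the control character U+0080.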

The comment about the file from PDFBOX-3347 is there because I tried to load 
that file in the debugger and my breakpoint on the alternative encoding was 
not hit (neither before nor after applying the patch). Maybe I did something 
wrong with the debugger, or something else changed and the alternative 
encoding is not necessary at all anymore. 

> Improve code quality (5)
> ------------------------
>
>                 Key: PDFBOX-5660
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5660
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Tilman Hausherr
>            Priority: Minor
>         Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a longterm issue for the task to improve code quality, by using the 
> SonarQube report, hints in different IDEs, the FindBugs tool and other code 
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
