[PR] refactor TextToPDF.call method [pdfbox]

via GitHub Thu, 01 Jan 2026 07:42:46 -0800


valerybokov opened a new pull request, #388:
URL: https://github.com/apache/pdfbox/pull/388


   The current algorithm is:
   1 If the charset is UTF-8, read the first 3 bytes.
   2 If these 3 bytes have the expected values, set the hasUtf8BOM variable to true.
   3 Close the stream.
   4 Open a new stream on the same file.
   5 If hasUtf8BOM is true, skip 3 bytes.
   6 If the 3 bytes could not be skipped, throw an exception.
   7 Once the bytes are skipped, or if there is no need to skip them, read the rest of the file.
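
The steps above can be sketched roughly as follows. This is an illustration written from the description only, not the actual PDFBox source: the class name TwoPassBomReader and the method readTwoPass are hypothetical, and the sketch returns the decoded text as a String for simplicity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the two-stream approach described above.
public class TwoPassBomReader {

    static String readTwoPass(Path file, Charset charset) throws IOException {
        boolean hasUtf8BOM = false;

        // Steps 1-3: open the file, inspect the first 3 bytes, close the stream.
        if (StandardCharsets.UTF_8.equals(charset)) {
            try (InputStream in = Files.newInputStream(file)) {
                byte[] head = in.readNBytes(3);
                hasUtf8BOM = head.length == 3
                        && (head[0] & 0xFF) == 0xEF
                        && (head[1] & 0xFF) == 0xBB
                        && (head[2] & 0xFF) == 0xBF;
            }
        }

        // Steps 4-7: open the same file again, skip the BOM if one was seen,
        // then read the rest.
        try (InputStream in = Files.newInputStream(file)) {
            if (hasUtf8BOM && in.skip(3) != 3) {
                throw new IOException("Could not skip the UTF-8 BOM");
            }
            return new String(in.readAllBytes(), charset);
        }
    }
}
```

Note that the file is opened twice, so it could change or become unreadable between the two passes.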
   
   These questions are about the algorithm, not aimed at anyone personally:
   1 Why do we need to open the file twice, when doing so increases the likelihood that the second attempt fails?
   2 Opening a stream twice is slower than opening it once.
   3 According to the code, the second read of the file may fail. Why, then, is there no check for whether the file is corrupted? That is, the encoding is UTF-8, but what if one or two of these three bytes differ? Perhaps the format allows such combinations, so this can't be verified.
   
   At this point, I propose using one stream instead of two. There are no other changes.
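
A minimal sketch of what a single-stream variant could look like, assuming a PushbackInputStream is acceptable here. This is not the code in the pull request: the class name SingleStreamBomReader and the method readSkippingBom are hypothetical. The first 3 bytes are read once, and pushed back if they turn out not to be a BOM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of BOM handling on a single stream.
public class SingleStreamBomReader {

    static String readSkippingBom(InputStream raw, Charset charset) throws IOException {
        // A pushback buffer of 3 bytes is enough to "un-read" a non-BOM prefix.
        PushbackInputStream in = new PushbackInputStream(raw, 3);

        if (StandardCharsets.UTF_8.equals(charset)) {
            byte[] head = new byte[3];
            int n = in.readNBytes(head, 0, 3);
            boolean hasUtf8BOM = n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            // Not a BOM: return the bytes to the stream so they are read as content.
            if (!hasUtf8BOM && n > 0) {
                in.unread(head, 0, n);
            }
        }

        // Read the remainder from the same stream; no second open is needed.
        return new String(in.readAllBytes(), charset);
    }
}
```

With this shape there is no separate skip step that can fail, so the exception in step 6 would no longer be needed.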


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


