Hi Craig, On 30/03/12 01:23, Craig Ringer wrote: > On 03/30/2012 05:09 AM, J.Pietschmann wrote: >> Am 29.03.2012 01:24, schrieb Craig Ringer: >>> I'd also like to have getEncodedName() return a byte[] not a >>> String, since an encoded PDF name isn't actually text data. >> Sounds like a reasonable idea. >> >>> BTW, is there any reason Fop's PDF library uses java.lang.String when >>> working with sequences of PDF data bytes? >> I'd chalk this up to "historical reasons", as usual. Fell free to >> provide a patch which cleans this up. >> >> J.Pietschmann > > Here's how I'd like to rewrite PDFName; untested code as an example of what > I'm getting at. This is just a standalone file; a patch that incorporates it > into the main sources will be a lot more work that I'm holding off on until I > know folks here agree with the approach.
>From a quick look that sounds about right. Are you developing with a 1.7 JDK? You would have to make your code 1.5-compatible. Also, it would be good if you could back your optimizations with profiling data. If code safety/readability have to be compromised there has to be a good reason for it. > In any case, after reading more of the PDF library I'm rethinking the wisdom > of trying to make this change. The change its self is correct, but it'll be > really hard to safely integrate into the rest of the PDF library because of > the difficulty of auditing every site to ensure nothing breaks. Java likes to > call `toString' automatically in places, meaning that anywhere that doesn't > use the proper PDFWritable output methods PDFName inherits will break by > producing bad PDF data that might be quite hard to spot. I'd start by making > PDFName.toString() throw (for testing), but that'd only catch issues in code > that test paths actually hit. > > Given the number of these kinds of issues in fop's pdf library I'm more and > more inclined to wonder if it should just be replaced with PDFBox. It's *full* > of text encoding issues, it crams 8-bit binary data into the lower 8 bits of > Unicode strings, etc. Most of the classes that extend basics like > PDFDictionary act like the base class isn't public API and break if anyone > else changes the dictionary in ways they don't expect, too; they should > "have-a" PDFDictionary not "be-a" PDFDictionary really. > > PDFBox is far from perfect, but it has a clean separation between the model > classes (PDxxxx) and the basic PDF data types (COSxxx); it has a clean > PDFName, PDFString, etc; it has a good PDF parser already, etc. Maybe it'd be > easier for me to whip up a port of FOP's PDF output code to PDFBox? I suspect > I'm insane to mention the possibility of doing that without evaluating the > amount of work involved first, so I'm not promising anything, but by the looks > it might be easier than doing the cleanups I'd like to do in fop. We’ve already talked about that before: http://markmail.org/message/r6pjwwmxgaawhzcp There is a /lot/ of work to do on the PDF package to turn it into a proper library, still it may be more practical to progressively refactor it rather than starting from scratch a new renderer based on PDFBox. That said, please do feel free to give it a try if that route appeals to you. We would certainly consider a switch if it looks more promising in the long term. > Thoughts? > > -- > Craig Ringer Thanks, Vincent
