Hi,

On 04.07.2013 15:18, Hannes Erven wrote:
Hi,


I'd like to use PDFBOX to remove possibly confidential metadata (like author,
keywords, comments, ...) from a document.

From http://pdfbox.apache.org/cookbook/workingwithmetadata.html , I see I can
easily use the PDDocumentInformation.setXXX() methods to void that data; okay,
that was easy.

But what about XML metadata attached to some PDModel structure? Can this also be
safely removed?
The same example shows how to add and extract XML-based metadata from a PDF.
Given that, it should be possible to remove all XML metadata by setting the
metadata value of the document catalog to null.
If you want to remove only parts of the metadata, you have to read and parse the
existing data, remove the unwanted entries, and write the result back to the
stream of the PDMetadata class.
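
The read/parse/remove/write-back step for partial removal could look roughly
like the sketch below. It uses only the JDK's XML classes and treats the XMP
packet as a plain string; the class name XmpFilter and the method
removeProperties are made up for illustration and are not part of PDFBox. It
assumes the unwanted properties can be identified by their local name, and it
handles both the element form and the attribute form of an XMP property.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a PDFBox class.
public class XmpFilter {

    private static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    /** Returns the XMP XML with all properties named in 'unwanted' removed. */
    public static String removeProperties(String xmp, Set<String> unwanted) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getLocalName()
        // The input comes from an untrusted PDF, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

        // The NodeList and NamedNodeMap are live, so collect first, remove afterwards.
        List<Element> elementsToRemove = new ArrayList<>();
        List<Attr> attributesToRemove = new ArrayList<>();
        NodeList all = doc.getElementsByTagNameNS("*", "*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (unwanted.contains(element.getLocalName())) {
                elementsToRemove.add(element);
                continue;
            }
            NamedNodeMap attributes = element.getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                Attr attribute = (Attr) attributes.item(j);
                // Leave namespace declarations (xmlns:...) alone.
                if (!XMLNS_NS.equals(attribute.getNamespaceURI())
                        && unwanted.contains(attribute.getLocalName())) {
                    attributesToRemove.add(attribute);
                }
            }
        }
        for (Attr attribute : attributesToRemove) {
            attribute.getOwnerElement().removeAttributeNode(attribute);
        }
        for (Element element : elementsToRemove) {
            if (element.getParentNode() != null) {
                element.getParentNode().removeChild(element);
            }
        }

        // Serialize the filtered tree back to a string.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

In a PDFBox program, the input would be the content read from the catalog's
PDMetadata stream, and the returned string would be written back to that
stream. Matching on the local name alone ignores the namespace, so a real
implementation may want to compare namespace URIs as well.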

What about included files, how can I detect and remove them?
There is an example [1] on how to extract embedded files.
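
For detecting and dropping embedded files, a minimal sketch might look like
this. It assumes the PDFBox 2.x API (PDDocument.load, save throwing
IOException); other versions load and save documents differently. The class
name RemoveEmbeddedFiles and the command-line paths are placeholders. It only
covers the EmbeddedFiles name tree of the document catalog, which is what the
example in [1] reads; file attachment annotations on individual pages are not
touched here.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;

// Hypothetical helper class; assumes PDFBox 2.x on the classpath.
public class RemoveEmbeddedFiles {

    /** Returns true if an embedded-files name tree was found and removed. */
    public static boolean removeEmbeddedFiles(PDDocument doc) {
        PDDocumentNameDictionary names = doc.getDocumentCatalog().getNames();
        if (names == null || names.getEmbeddedFiles() == null) {
            return false; // nothing attached at catalog level
        }
        // Passing null drops the EmbeddedFiles entry from the name dictionary.
        names.setEmbeddedFiles(null);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: input PDF, args[1]: output PDF (placeholder paths)
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            boolean removed = removeEmbeddedFiles(doc);
            System.out.println(removed ? "Embedded files removed" : "No embedded files found");
            doc.save(new File(args[1]));
        }
    }
}
```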

Is there perhaps a toolkit solution to remove all non-display-related data from
a document?
I'm afraid there isn't any, but patches are welcome ;-)

Thanks for your comments,
best regards

     -hannes

BR
Andreas Lehmkühler

[1] http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
