[
https://issues.apache.org/jira/browse/PDFBOX-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812346#comment-13812346
]
Michael Kuß commented on PDFBOX-1618:
-------------------------------------
I should add that splitting can produce large split files when the resource
catalog is global to all pages. Adobe will therefore produce large files too. This
is true of all splitting programs I know of.
To work around this, you have to create a local resource catalog for every page.
For example, to drop the globally referenced images from each page and keep
only the images actually used on the current page, you can do something like the
following:
{code}
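// Imports, using PDFBox 1.8.x package names
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
import org.apache.pdfbox.util.PDFCloneUtility;
import org.apache.pdfbox.util.PDFOperator;

// `page` (PDPage), `doc` (PDDocument) and the boolean `dofonts`
// are expected to be defined by the surrounding code.

// names of the XObjects that the page's content stream draws with "Do"
List<String> doTokens = new ArrayList<String>();
PDStream pdstream = page.getContents();
if (pdstream != null) {
    // find images in the content stream of the page
    PDFStreamParser parser = new PDFStreamParser(
            pdstream.getStream(), true);
    Iterator<Object> iter = parser.getTokenIterator();
    COSName name = null;
    while (iter.hasNext()) {
        Object o = iter.next();
        if (o instanceof COSName) {
            // remember the most recent name operand
            name = (COSName) o;
        }
        if (o instanceof PDFOperator) {
            PDFOperator operator = (PDFOperator) o;
            if (operator.getOperation().equals("Do")
                    && name != null) {
                doTokens.add(name.getName());
            }
        }
    }
}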
// collect the images in the global resource catalog that this page never draws
PDResources resources = page.getResources();
Map<String, PDXObject> map = resources.getXObjects();
List<String> deleteKeys = new ArrayList<String>();
for (String key : map.keySet()) {
    PDXObject xobject = map.get(key);
    if (xobject instanceof PDXObjectImage) {
        if (!doTokens.contains(key)) {
            // image is listed in the resources but not used on this page
            deleteKeys.add(key);
        }
    }
}
// make a local resource catalog, starting from a clone of the shared one
PDResources r = new PDResources();
PDFCloneUtility clone = new PDFCloneUtility(doc);
r.getCOSDictionary().mergeInto(
        (COSDictionary) clone.cloneForNewDocument(resources
                .getCOSDictionary()));
if (!dofonts) {
    // reuse the original font dictionary instead of the cloned copy
    r.getCOSDictionary().removeItem(COSName.FONT);
    r.getCOSDictionary().setItem(
            COSName.FONT,
            resources.getCOSDictionary().getDictionaryObject(
                    COSName.FONT));
}
// likewise reuse the original ProcSet and ColorSpace entries
r.getCOSDictionary().removeItem(COSName.PROC_SET);
r.getCOSDictionary().setItem(
        COSName.PROC_SET,
        resources.getCOSDictionary().getDictionaryObject(
                COSName.PROC_SET));
r.getCOSDictionary().removeItem(COSName.COLORSPACE);
r.getCOSDictionary().setItem(
        COSName.COLORSPACE,
        resources.getCOSDictionary().getDictionaryObject(
                COSName.COLORSPACE));
// drop the unused images from the local XObject dictionary
COSDictionary dictResources = (COSDictionary) r.getCOSDictionary()
        .getDictionaryObject(COSName.XOBJECT);
if (dictResources != null) {
    for (String key : deleteKeys) {
        dictResources.removeItem(COSName.getPDFName(key));
    }
}
// assumed final step, not part of the snippet as first posted:
// attach the local catalog to the page so the pruning takes effect
page.setResources(r);
{code}
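To make the core idea easier to follow without a PDF at hand, here is a minimal sketch of the same two steps using plain Java collections instead of PDFBox types. It is only an illustration under assumptions: the class name ResourcePruneSketch, both method names, and the string-token representation of a content stream are hypothetical, and unlike the snippet above it treats every XObject entry as an image and only accepts a name that directly precedes "Do".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ResourcePruneSketch {

    // Collect the name operand that immediately precedes each "Do" operator.
    // Tokens starting with "/" stand in for COSName objects (hypothetical model).
    static Set<String> usedXObjectNames(List<String> tokens) {
        Set<String> used = new LinkedHashSet<String>();
        String lastName = null;
        for (String token : tokens) {
            if (token.startsWith("/")) {
                lastName = token.substring(1);
            } else {
                if ("Do".equals(token) && lastName != null) {
                    used.add(lastName);
                }
                // any other token ends the pending name operand
                lastName = null;
            }
        }
        return used;
    }

    // Build a page-local copy of the shared XObject map that keeps only
    // the entries the page actually draws; the shared map is left untouched.
    static Map<String, String> pruneUnused(Map<String, String> shared,
                                           Set<String> used) {
        Map<String, String> local = new LinkedHashMap<String, String>(shared);
        local.keySet().retainAll(used);
        return local;
    }

    public static void main(String[] args) {
        // made-up content stream of one page: it draws only Im2
        List<String> tokens = Arrays.asList(
                "q", "100", "0", "0", "100", "0", "0", "cm", "/Im2", "Do", "Q");

        // made-up global resource catalog shared by three pages
        Map<String, String> shared = new LinkedHashMap<String, String>();
        shared.put("Im1", "image used on page 1");
        shared.put("Im2", "image used on page 2");
        shared.put("Im3", "image used on page 3");

        Map<String, String> local = pruneUnused(shared, usedXObjectNames(tokens));
        System.out.println(local.keySet()); // prints [Im2]
    }
}
```

With the shared catalog, a split file for this page would carry all three images; with the local catalog it carries only the one it draws.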
> Split PDF file to single page files, some files are inflated in size
> --------------------------------------------------------------------
>
> Key: PDFBOX-1618
> URL: https://issues.apache.org/jira/browse/PDFBOX-1618
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.1
> Environment: Windows 7, JVM 1.6.0_29
> Reporter: Tom Taylor
> Attachments: 112080-TECHNICAL MANUAL FOR GENERATOR NIR 7194 A-10LW OF
> 4038 KVA.pdf, Test_PDFs.zip, internalstructure.png
>
>
> A PDF file is split into single pages for inclusion within another document
> (pdfbox.utils.Splitter within our code but same phenomenon observed when
> splitting using command line PDFSplit tool). Some of the pages are almost as
> large as the original file which causes performance problems for our
> customers.
> Again, I have a sample pdf to attach.
--
This message was sent by Atlassian JIRA
(v6.1#6144)