[
https://issues.apache.org/jira/browse/PDFBOX-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027972#comment-14027972
]
Ludovic Davoine commented on PDFBOX-2128:
-----------------------------------------
The workaround works for CMYK images from Photoshop, but fails for other
images.
In this PDF (http://ludoda.free.fr/CAPITAL.pdf), the images are in CMYK but
don't come from Photoshop, so the result is wrong with the workaround
(http://ludoda.free.fr/CAPITAL_IMAGE_KO.jpg) but correct with the native method
"PDJpeg.write2file(File f)".
As you explained, the problem is knowing the image type in advance (in order to
choose the right way to extract the image). I could use Apache Imaging to read
the header, but it needs a File object or a byte array:
{code}JPEGImageDecoder decoder = JPEGCodec.createJPEGDecoder(new
FileInputStream( new File(pFilename) ) );{code}
{code}
JpegImageParser parser = new JpegImageParser();
ByteSource byteSource = new ByteSourceArray(bytes);
{code}
When I try to get this kind of object from my PDJpeg object, I get an
"Unsupported Image Type" exception, because ImageIO does not support CMYK
files:
{code}
byte[] bytes=myPDJpeg.getPDStream().getByteArray();
{code}
{code}
javax.imageio.IIOException: Unsupported Image Type
at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(Unknown Source)
at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(Unknown Source)
at javax.imageio.ImageIO.read(Unknown Source)
at javax.imageio.ImageIO.read(Unknown Source)
at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:56)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:342)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:254)
at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:188)
at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:232)
at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:510)
{code}
So I am going in circles...
One solution could be to save the image a first time with the native method
"PDJpeg.write2file(File f)", then check the file with Apache Imaging, and if the
image turns out to come from Photoshop, save it a second time with the
workaround. But the processing time would be too long... Is there another way?
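One possible way to avoid the double save is to look at the still-encoded JPEG
bytes for the Adobe APP14 marker segment, which Adobe applications such as
Photoshop typically write, before choosing an extraction path. The sketch below
is only an illustration. The class name AdobeMarkerProbe and its method are
hypothetical. It assumes the raw DCT-encoded bytes can be obtained without going
through ImageIO (possibly via PDStream.getPartiallyFilteredStream in PDFBox 1.8,
which has not been verified against this PDF), and it assumes that "from
Photoshop" correlates with the presence of that marker.

```java
public class AdobeMarkerProbe {

    /**
     * Scans the JPEG header segments for an Adobe APP14 marker.
     * Returns the Adobe "transform" byte (0, 1 or 2), or -1 if no
     * Adobe APP14 segment is found before the scan data starts.
     */
    public static int adobeTransform(byte[] jpeg) {
        if (jpeg == null || jpeg.length < 4
                || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("not a JPEG stream (missing SOI)");
        }
        int pos = 2;
        while (pos + 4 <= jpeg.length) {
            if ((jpeg[pos] & 0xFF) != 0xFF) {
                return -1; // lost marker sync, give up
            }
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xFF) {
                pos++; // fill byte before a marker
                continue;
            }
            if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) {
                pos += 2; // standalone markers carry no length field
                continue;
            }
            if (marker == 0xDA || marker == 0xD9) {
                return -1; // start of scan or end of image: header is over
            }
            int len = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (len < 2) {
                return -1; // corrupt segment length
            }
            int data = pos + 4;
            // APP14 payload: "Adobe", version(2), flags0(2), flags1(2), transform(1)
            if (marker == 0xEE && len >= 14 && pos + 2 + len <= jpeg.length
                    && jpeg[data] == 'A' && jpeg[data + 1] == 'd'
                    && jpeg[data + 2] == 'o' && jpeg[data + 3] == 'b'
                    && jpeg[data + 4] == 'e') {
                return jpeg[data + 11] & 0xFF;
            }
            pos += 2 + len; // skip to the next segment
        }
        return -1;
    }
}
```

A return value of 2 would indicate YCCK data, 0 an untransformed image (plain
CMYK for a four-component JPEG), and -1 no Adobe marker at all. The caller could
then pick the workaround or "write2file" accordingly, without decoding the
pixels first.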
> CMYK images are not supported correctly
> ---------------------------------------
>
> Key: PDFBOX-2128
> URL: https://issues.apache.org/jira/browse/PDFBOX-2128
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.5, 1.8.6, 2.0.0
> Environment: Windows 7 Professional
> Running jvm: Java HotSpot(TM) 64-Bit Server VM - 1.6.0_26-b03 - 20.1-b02 -
> Sun Microsystems Inc
> Reporter: Ludovic Davoine
> Labels: PDJpeg, cmyk, images
> Attachments: porsche_cmyk.pdf-2.png
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> I have a PDF with CMYK images inside and I need to extract the images in
> RGB format. But the PDJpeg class does not seem to work correctly; the colors
> are wrong. Example:
> - Original image in the PDF: http://ludoda.free.fr/IMAGE_IN_PDF.jpg
> - Extracted image: http://ludoda.free.fr/IMAGE_EXTRACTED.jpg
> You can download the PDF: http://ludoda.free.fr/PORSCHE_CMYK.PDF
> and try my simple test case (I'm using PDFBox 1.8.5):
> {code}
> import java.awt.image.BufferedImage;
> import java.io.File;
> import java.io.IOException;
> import java.util.Iterator;
> import java.util.List;
> import java.util.Map;
> import javax.imageio.ImageIO;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.PDResources;
> import org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg;
> import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
> import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
> public class TestCase {
>
> public static void main(String[] args)
> {
> try
> {
> System.out.println("START EXTRACTING IMAGES...");
> read_pdf();
> System.out.println("COMPLETE");
> }
> catch (IOException ex)
> {
> System.out.println("" + ex);
> }
> }
> public static void read_pdf() throws IOException
> {
> PDDocument document = null;
> document = PDDocument.load("C:\\temp\\PORSCHE_CMYK.pdf");
> @SuppressWarnings("unchecked")
> List<PDPage> pages =
> document.getDocumentCatalog().getAllPages();
> Iterator<PDPage> iter = pages.iterator();
> int i =1;
> while (iter.hasNext())
> {
> PDPage page = (PDPage) iter.next();
> PDResources resources = page.getResources();
> Map<String, PDXObject> pageImages =
> resources.getXObjects();
> if (pageImages != null)
> {
> Iterator<String> imageIter =
> pageImages.keySet().iterator();
> while (imageIter.hasNext())
> {
> String key = (String) imageIter.next();
> // check for PDJpeg so the cast below cannot fail
> if(pageImages.get(key) instanceof
> PDJpeg)
> {
> PDJpeg image = (PDJpeg)
> pageImages.get(key);
>
> // Test 1 : write2file
>
> image.write2file("C:\\workspace\\JAVA_PDFTools\\temp\\image" + i);
>
> // Test 2: getRGBImage
> BufferedImage
> bimage=image.getRGBImage();
> File outputfile = new
> File("C:\\workspace\\JAVA_PDFTools\\temp\\image" + i+"_buffered.jpg");
> ImageIO.write(bimage, "jpg",
> outputfile);
> i ++;
> }
> }
> }
> }
> }
> }
> {code}