[ https://issues.apache.org/jira/browse/TIKA-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Caruana Galizia updated TIKA-2444:
------------------------------------------
Attachment: balloon.j2c
Example JP2K codestream file attached.
> JP2 codestream files not parsed
> -------------------------------
>
> Key: TIKA-2444
> URL: https://issues.apache.org/jira/browse/TIKA-2444
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Labels: imageio, images, ocr
> Attachments: balloon.j2c
>
>
> We've come across some embedded files in the wild that are detected by Tika
> as {{image/x-jp2-codestream}}. The identification is correct according to a
> description of the format [1].
> However, no Parser implementation declares support for this format.
> It would make sense to declare support for this format in the Tesseract OCR
> parser. However, the parser would need to contain functionality that either:
> 1) wraps the codestream in a JP2 container;
> 2) or transcodes the image to PNG.
> This is because while Tesseract supports JP2 (via Leptonica), it doesn't
> support the raw codestream as a file.
> [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream
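For illustration, option 1 from the description could look roughly like the sketch below. It is not Tika code: the class name {{Jp2Wrapper}} and its {{wrap}} method are hypothetical. It reads the image size from the SIZ marker at the start of the codestream and prepends the minimal JP2 boxes (signature, ftyp, jp2h with ihdr and colr) before a jp2c box. It assumes that one component means greyscale and three mean sRGB, and that all components share one bit depth; anything else is rejected rather than guessed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical helper, not part of Tika: wraps a raw codestream in a minimal JP2 file. */
final class Jp2Wrapper {

    static byte[] wrap(byte[] cs) throws IOException {
        ByteBuffer in = ByteBuffer.wrap(cs); // big-endian by default, as in JPEG 2000
        // A codestream starts with SOC (0xFF4F) immediately followed by SIZ (0xFF51).
        if (cs.length < 45 || (in.getShort(0) & 0xFFFF) != 0xFF4F
                || (in.getShort(2) & 0xFFFF) != 0xFF51) {
            throw new IOException("Not a JPEG 2000 codestream");
        }
        int width = in.getInt(8) - in.getInt(16);   // Xsiz - XOsiz
        int height = in.getInt(12) - in.getInt(20); // Ysiz - YOsiz
        int components = in.getShort(40) & 0xFFFF;  // Csiz
        if (cs.length < 42 + 3 * components) {
            throw new IOException("Truncated SIZ segment");
        }
        int depth = cs[42] & 0xFF;                  // Ssiz of component 0
        for (int i = 1; i < components; i++) {
            if ((cs[42 + 3 * i] & 0xFF) != depth) {
                throw new IOException("Mixed bit depths would need a bpcc box");
            }
        }
        // Assumption: the codestream does not say what the colourspace is,
        // so guess from the component count (17 = greyscale, 16 = sRGB).
        int colourSpace;
        if (components == 1) {
            colourSpace = 17;
        } else if (components == 3) {
            colourSpace = 16;
        } else {
            throw new IOException("Unsupported component count: " + components);
        }

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // Signature box (12 bytes).
        out.writeInt(12); out.writeBytes("jP  "); out.writeInt(0x0D0A870A);
        // File type box (20 bytes): brand, minor version, one compatible brand.
        out.writeInt(20); out.writeBytes("ftyp");
        out.writeBytes("jp2 "); out.writeInt(0); out.writeBytes("jp2 ");
        // JP2 header superbox: 8 + ihdr (22) + colr (15) = 45 bytes.
        out.writeInt(45); out.writeBytes("jp2h");
        out.writeInt(22); out.writeBytes("ihdr");
        out.writeInt(height); out.writeInt(width);
        out.writeShort(components);
        out.writeByte(depth); // BPC uses the same encoding as Ssiz
        out.writeByte(7);     // compression type, always 7
        out.writeByte(0);     // UnkC: colourspace declared as known (an assumption here)
        out.writeByte(0);     // IPR: no intellectual property box
        out.writeInt(15); out.writeBytes("colr");
        out.writeByte(1);     // METH: enumerated colourspace
        out.writeByte(0);     // PREC
        out.writeByte(0);     // APPROX
        out.writeInt(colourSpace);
        // Contiguous codestream box holding the original bytes unchanged.
        out.writeInt(8 + cs.length); out.writeBytes("jp2c");
        out.write(cs);
        return bytes.toByteArray();
    }
}
```

The result is always 85 bytes longer than the input, and the codestream itself is not re-encoded. Whether Leptonica accepts the output for a given file, for example the attached balloon.j2c, would still need to be checked.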
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)