[jira] [Updated] (TIKA-2444) JP2 codestream files not parsed

Matthew Caruana Galizia (JIRA) Tue, 22 Aug 2017 09:04:23 -0700

     [ https://issues.apache.org/jira/browse/TIKA-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Matthew Caruana Galizia updated TIKA-2444:
------------------------------------------
    Attachment: balloon.j2c

Example JP2K codestream file attached.

> JP2 codestream files not parsed
> -------------------------------
>
>                 Key: TIKA-2444
>                 URL: https://issues.apache.org/jira/browse/TIKA-2444
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>              Labels: imageio, images, ocr
>         Attachments: balloon.j2c
>
>
> We've come across some embedded files in the wild that are detected by Tika 
> as {{image/x-jp2-codestream}}. The identification is correct according to a 
> description of the format [1].
> However, no Parser implementation declares support for this format.
> It would make sense to declare support for this format in the Tesseract OCR 
> parser. However, the parser would need to contain functionality that either:
> 1) wraps the codestream in a JP2 container; or
> 2) transcodes the image to PNG.
> This is because while Tesseract supports JP2 (via Leptonica), it doesn't 
> support the raw codestream as a file.
> [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream
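
Option 1 above could be sketched roughly as follows. This is only an illustration, not code from Tika: the class name Jp2Wrapper is hypothetical, and the sketch assumes that every component has the same bit depth as the first one and that the colourspace can be guessed from the component count (three or more components treated as sRGB, otherwise greyscale). A real implementation would need to handle the cases where those assumptions do not hold.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): wraps a raw JPEG 2000 codestream in a minimal JP2 container. */
final class Jp2Wrapper {

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    static byte[] wrap(byte[] cs) {
        // A raw codestream starts with the SOC marker (FF 4F) followed by the SIZ marker (FF 51).
        if (cs.length < 45 || (cs[0] & 0xFF) != 0xFF || (cs[1] & 0xFF) != 0x4F
                || (cs[2] & 0xFF) != 0xFF || (cs[3] & 0xFF) != 0x51) {
            throw new IllegalArgumentException("Not a JPEG 2000 codestream");
        }

        // Read the image geometry from the SIZ segment (big-endian, ByteBuffer's default).
        ByteBuffer siz = ByteBuffer.wrap(cs);
        int width = siz.getInt(8) - siz.getInt(16);    // Xsiz - XOsiz
        int height = siz.getInt(12) - siz.getInt(20);  // Ysiz - YOsiz
        int components = siz.getShort(40) & 0xFFFF;    // Csiz
        // Ssiz of component 0 uses the same encoding as the ihdr BPC field.
        // ASSUMPTION: all components share this depth (otherwise BPC must be 255 plus a bpcc box).
        int bpc = cs[42] & 0xFF;

        // Box sizes: signature 12 + ftyp 20 + jp2h 45 + jp2c header 8 = 85 bytes of overhead.
        ByteBuffer out = ByteBuffer.allocate(85 + cs.length);

        // JPEG 2000 signature box.
        out.putInt(12).put(ascii("jP  ")).putInt(0x0D0A870A);

        // File type box: brand "jp2 ", minor version 0, one compatibility entry "jp2 ".
        out.putInt(20).put(ascii("ftyp")).put(ascii("jp2 ")).putInt(0).put(ascii("jp2 "));

        // JP2 header superbox (8) containing ihdr (22) and colr (15).
        out.putInt(45).put(ascii("jp2h"));
        out.putInt(22).put(ascii("ihdr"));
        out.putInt(height).putInt(width).putShort((short) components);
        out.put((byte) bpc);
        out.put((byte) 7);  // compression type: always 7 for JPEG 2000
        out.put((byte) 1);  // UnkC = 1: colourspace not known for certain (it is guessed below)
        out.put((byte) 0);  // IPR = 0: no intellectual property box
        out.putInt(15).put(ascii("colr"));
        out.put((byte) 1).put((byte) 0).put((byte) 0);  // METH = enumerated, PREC = 0, APPROX = 0
        // ASSUMPTION: 3+ components means sRGB (16), otherwise greyscale (17).
        out.putInt(components >= 3 ? 16 : 17);

        // Contiguous codestream box holding the original bytes unchanged.
        out.putInt(8 + cs.length).put(ascii("jp2c")).put(cs);

        return out.array();
    }
}
```

With something like this, the bytes of a file such as the attached balloon.j2c could be passed to Jp2Wrapper.wrap(...) and the result written to a temporary .jp2 file before handing it to Tesseract. Whether Leptonica accepts the guessed colourspace for every input would still need testing.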



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
