[ https://issues.apache.org/jira/browse/TIKA-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Caruana Galizia updated TIKA-2444:
------------------------------------------
    Attachment: balloon.j2c

Example JP2K codestream file attached.

> JP2 codestream files not parsed
> -------------------------------
>
>                 Key: TIKA-2444
>                 URL: https://issues.apache.org/jira/browse/TIKA-2444
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>              Labels: imageio, images, ocr
>         Attachments: balloon.j2c
>
>
> We've come across some embedded files in the wild that are detected by Tika
> as {{image/x-jp2-codestream}}. The identification is correct according to a
> description of the format [1].
> However, no Parser implementation declares support for this format.
> It would make sense to declare support for this format in the Tesseract OCR
> parser. However, the parser would need to contain functionality that either:
> 1) wraps the codestream in a JP2 container; or
> 2) transcodes the image to PNG.
> This is because while Tesseract supports JP2 (via Leptonica), it doesn't
> support the raw codestream as a file.
> [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
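
Option 1 from the issue description (wrapping the raw codestream in a JP2 container) could be sketched roughly as follows. This is a minimal illustration, not Tika code: the class `Jp2Wrapper` and the method `wrapCodestream` are hypothetical names. It reads the image size and bit depth from the codestream's SIZ marker segment and prepends the signature, `ftyp`, `jp2h` (`ihdr` + `colr`) and `jp2c` boxes. A raw codestream carries no colourspace information, so the sketch assumes greyscale for one component and sRGB otherwise, and flags the colourspace as unknown in `ihdr`. It also assumes all components share one bit depth. Whether Leptonica accepts the resulting file for a given input has not been verified here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical helper: wraps a raw JPEG 2000 codestream in a minimal JP2 container. */
public class Jp2Wrapper {

    public static byte[] wrapCodestream(byte[] cs) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(cs); // big-endian by default, as in JPEG 2000

        // A codestream starts with SOC (0xFF4F) followed directly by SIZ (0xFF51).
        if (cs.length < 45
                || (buf.getShort(0) & 0xFFFF) != 0xFF4F
                || (buf.getShort(2) & 0xFFFF) != 0xFF51) {
            throw new IOException("Not a JPEG 2000 codestream (missing SOC/SIZ markers)");
        }

        // SIZ fields: Xsiz@8, Ysiz@12, XOsiz@16, YOsiz@20, Csiz@40, then 3 bytes per component.
        long width = (buf.getInt(8) & 0xFFFFFFFFL) - (buf.getInt(16) & 0xFFFFFFFFL);
        long height = (buf.getInt(12) & 0xFFFFFFFFL) - (buf.getInt(20) & 0xFFFFFFFFL);
        int components = buf.getShort(40) & 0xFFFF;
        if (components < 1 || cs.length < 42 + 3 * components) {
            throw new IOException("Truncated or invalid SIZ marker segment");
        }

        // Ssiz (depth - 1, high bit = signed) uses the same encoding as the ihdr BPC field.
        int ssiz = cs[42] & 0xFF;
        for (int i = 1; i < components; i++) {
            if ((cs[42 + 3 * i] & 0xFF) != ssiz) {
                // Mixed depths would need BPC = 255 plus a bpcc box; left out of this sketch.
                throw new IOException("Components with differing bit depths are not handled");
            }
        }

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // JPEG 2000 signature box (12 bytes).
        out.writeInt(12);
        out.writeBytes("jP  ");
        out.writeInt(0x0D0A870A);

        // File type box: brand "jp2 ", minor version 0, one compatible brand "jp2 ".
        out.writeInt(20);
        out.writeBytes("ftyp");
        out.writeBytes("jp2 ");
        out.writeInt(0);
        out.writeBytes("jp2 ");

        // JP2 header superbox: 8-byte header + ihdr (22 bytes) + colr (15 bytes).
        out.writeInt(8 + 22 + 15);
        out.writeBytes("jp2h");

        // Image header box.
        out.writeInt(22);
        out.writeBytes("ihdr");
        out.writeInt((int) height);
        out.writeInt((int) width);
        out.writeShort(components);
        out.writeByte(ssiz); // BPC
        out.writeByte(7);    // C: compression type, always 7
        out.writeByte(1);    // UnkC: colourspace unknown (it is guessed below)
        out.writeByte(0);    // IPR: no intellectual property box

        // Colour specification box with an enumerated colourspace.
        // ASSUMPTION: 1 component = greyscale (17), anything else = sRGB (16).
        out.writeInt(15);
        out.writeBytes("colr");
        out.writeByte(1);    // METH: enumerated colourspace
        out.writeByte(0);    // PREC
        out.writeByte(0);    // APPROX
        out.writeInt(components == 1 ? 17 : 16);

        // Contiguous codestream box holding the original bytes unchanged.
        out.writeInt(8 + cs.length);
        out.writeBytes("jp2c");
        out.write(cs);

        out.flush();
        return bytes.toByteArray();
    }
}
```

A caller would pass the embedded stream's bytes to `Jp2Wrapper.wrapCodestream(...)` and hand the result to the OCR step as a `.jp2` file. Wrapping leaves the compressed data untouched, so it avoids the decode and re-encode cost of option 2, at the price of guessing the colourspace.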