[jira] [Comment Edited] (TIKA-2473) PCX and DCX image support
[ https://issues.apache.org/jira/browse/TIKA-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194435#comment-16194435 ]
Matthew Caruana Galizia edited comment on TIKA-2473 at 10/6/17 10:42 AM:
-------------------------------------------------------------------------
Magic for PCX:
byte 0: 0x0A
byte 1: one of 0x00, 0x02, 0x03, 0x04 or 0x05
MIME type: image/vnd.zbrush.pcx
via: https://www.iana.org/assignments/media-types/image/vnd.zbrush.pcx
was (Author: mcaruanagalizia):
Magic:
byte 0: 0x0A
byte 1: one of 0x00, 0x02, 0x03, 0x04 or 0x05
MIME type: image/vnd.zbrush.pcx
via: https://www.iana.org/assignments/media-types/image/vnd.zbrush.pcx
> PCX and DCX image support
> -------------------------
>
> Key: TIKA-2473
> URL: https://issues.apache.org/jira/browse/TIKA-2473
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
>
> It's straightforward in theory to implement support for PCX and DCX. There's
> support for both in Commons Imaging as well as in ImageIO via TwelveMonkeys.
> In practice, however, I'm not really sure how to implement support. We obviously
> want to OCR the images, but Tesseract has no support for the format. So where
> do we do the conversion to a BufferedImage? I tried to look for what is done
> to handle JBIG2 files but I can't find that anywhere.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
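The two-byte magic listed in the comment above can be checked with a few lines of plain Java. This is only a minimal sketch: the class and method names (`PcxMagic`, `looksLikePcx`) are hypothetical and not part of Tika, where such a rule would presumably be expressed as a magic entry in the MIME types XML rather than in code.

```java
// Hypothetical helper, not Tika API: tests the PCX magic quoted in the comment.
final class PcxMagic {
    private PcxMagic() {}

    /** Returns true when byte 0 is 0x0A and byte 1 is one of 0x00, 0x02, 0x03, 0x04, 0x05. */
    static boolean looksLikePcx(byte[] header) {
        if (header == null || header.length < 2) {
            return false; // not enough data to decide
        }
        if ((header[0] & 0xFF) != 0x0A) {
            return false; // byte 0 must be 0x0A
        }
        switch (header[1] & 0xFF) {
            case 0x00:
            case 0x02:
            case 0x03:
            case 0x04:
            case 0x05:
                return true; // byte 1 is one of the listed values
            default:
                return false;
        }
    }
}
```

A two-byte signature is weak on its own, so a real detector would probably want a low priority for this rule or an additional check on later header fields.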
[jira] [Commented] (TIKA-2473) PCX and DCX image support
[ https://issues.apache.org/jira/browse/TIKA-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194435#comment-16194435 ]
Matthew Caruana Galizia commented on TIKA-2473:
-----------------------------------------------
Magic:
byte 0: 0x0A
byte 1: one of 0x00, 0x02, 0x03, 0x04 or 0x05
MIME type: image/vnd.zbrush.pcx
via: https://www.iana.org/assignments/media-types/image/vnd.zbrush.pcx
> PCX and DCX image support
> -------------------------
>
> Key: TIKA-2473
> URL: https://issues.apache.org/jira/browse/TIKA-2473
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
>
> It's straightforward in theory to implement support for PCX and DCX. There's
> support for both in Commons Imaging as well as in ImageIO via TwelveMonkeys.
> In practice, however, I'm not really sure how to implement support. We obviously
> want to OCR the images, but Tesseract has no support for the format. So where
> do we do the conversion to a BufferedImage? I tried to look for what is done
> to handle JBIG2 files but I can't find that anywhere.
[jira] [Created] (TIKA-2473) PCX and DCX image support
Matthew Caruana Galizia created TIKA-2473:
------------------------------------------
Summary: PCX and DCX image support
Key: TIKA-2473
URL: https://issues.apache.org/jira/browse/TIKA-2473
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia

It's straightforward in theory to implement support for PCX and DCX. There's support for both in Commons Imaging as well as in ImageIO via TwelveMonkeys.
In practice, however, I'm not really sure how to implement support. We obviously want to OCR the images, but Tesseract has no support for the format. So where do we do the conversion to a BufferedImage? I tried to look for what is done to handle JBIG2 files but I can't find that anywhere.
[jira] [Updated] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers
[ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Caruana Galizia updated TIKA-2471:
------------------------------------------
Attachment: mbox
Reduced test case attached. The result of parsing this file will include metadata keys with names like {{MboxParser-class=3dmsonormal>sincerely,
> Tab-prefixed message body lines in Mbox interpreted as headers
> --------------------------------------------------------------
>
> Key: TIKA-2471
> URL: https://issues.apache.org/jira/browse/TIKA-2471
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Labels: message, rfc822
> Attachments: mbox
>
> The mbox parser code is overly optimistic. It parses the entire message
> looking for anything that matches a header pattern, wherever it occurs in a
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But
> more to the point, what is the idea behind setting the headers in the
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
[jira] [Created] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers
Matthew Caruana Galizia created TIKA-2471: - Summary: Tab-prefixed message body lines in Mbox interpreted as headers Key: TIKA-2471 URL: https://issues.apache.org/jira/browse/TIKA-2471 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.16 Reporter: Matthew Caruana Galizia The mbox parser code is overly optimistic. It parses the entire message looking for anything that matches a header pattern, wherever it occurs in a line! It looks to me like the parsing logic is in desperate need of a refactor. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case? Also, out of curiosity, why does the parser force Windows-1252 as the charset? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
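For comparison with the behaviour described in this report, here is a minimal sketch of a stricter header reader. It relies on two rules from the Internet message format: the header section ends at the first empty line, and a line starting with a space or tab continues the previous header. The class and method names (`HeaderBlock`, `unfoldHeaders`) are hypothetical and this is not the MboxParser's actual code. It also assumes the caller has already removed the mbox "From " separator line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Tika code: reads only the header section of one message.
final class HeaderBlock {
    private HeaderBlock() {}

    /**
     * Returns one string per header. Reading stops at the first empty line,
     * so nothing in the message body can ever be mistaken for a header.
     */
    static List<String> unfoldHeaders(List<String> messageLines) {
        List<String> headers = new ArrayList<>();
        for (String line : messageLines) {
            if (line.isEmpty()) {
                break; // the first empty line ends the header section
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation) {
                if (!headers.isEmpty()) {
                    // Folded line: append to the previous header, joined by a single space.
                    int last = headers.size() - 1;
                    headers.set(last, headers.get(last) + " " + line.trim());
                }
                // A continuation with no preceding header is ignored.
            } else {
                headers.add(line);
            }
        }
        return headers;
    }
}
```

With this approach a tab-prefixed line in the body is never inspected at all, and a tab-prefixed line in the header section can only extend an existing header.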
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150718#comment-16150718 ]
Matthew Caruana Galizia commented on TIKA-2219:
-----------------------------------------------
Thanks for getting back. Shouldn't the RFC822Parser attempt to detect the body encoding when none is declared?
> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor
> Fix For: 2.0, 1.15
> Attachments: test.txt
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is
> always detected instead. While not tested, this likely affects other
> windows-125* encodings as well.
> I tracked it down to a change in the
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always
> returns "ISO-8859-1" whereas before it was:
> {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}}
> method so that the returned CharsetMatch has the proper name. The problem
> with that is that the {{CharsetDetector#detectAll()}} method overwrites the correct
> match with a new one that will return the value of {{#getName()}} from the
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in the
> {{detectAll()}} method are replaced with new ones, but changing this code in
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);
[jira] [Updated] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Caruana Galizia updated TIKA-2219:
------------------------------------------
Attachment: test.txt
This file contains 0x92 characters, which should force detection to Windows-1252.
> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor
> Fix For: 2.0, 1.15
> Attachments: test.txt
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is
> always detected instead. While not tested, this likely affects other
> windows-125* encodings as well.
> I tracked it down to a change in the
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always
> returns "ISO-8859-1" whereas before it was:
> {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}}
> method so that the returned CharsetMatch has the proper name. The problem
> with that is that the {{CharsetDetector#detectAll()}} method overwrites the correct
> match with a new one that will return the value of {{#getName()}} from the
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in the
> {{detectAll()}} method are replaced with new ones, but changing this code in
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);
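The `haveC1Bytes` condition quoted in the issue comes down to checking for bytes in the range 0x80 to 0x9F. ISO-8859-1 assigns that range to C1 control codes that almost never occur in real text, while windows-1252 uses most of it for printable characters (0x92 is the right single quotation mark). The sketch below illustrates only that idea; the class and method names are hypothetical and this is not the code of `CharsetRecog_sbcs`.

```java
// Hypothetical illustration of the haveC1Bytes idea, not Tika/ICU code.
final class C1ByteCheck {
    private C1ByteCheck() {}

    /** True when any byte falls in 0x80..0x9F, the C1 range of ISO-8859-1. */
    static boolean haveC1Bytes(byte[] input) {
        for (byte b : input) {
            int value = b & 0xFF; // bytes are signed in Java; mask to 0..255
            if (value >= 0x80 && value <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Chooses between the two names the same way the quoted ternary does. */
    static String latin1FamilyName(byte[] input) {
        return haveC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }
}
```

This only decides between the two names once a Latin-1-like encoding has already been matched; it says nothing about how confident that match is.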
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149673#comment-16149673 ]
Matthew Caruana Galizia commented on TIKA-2219:
-----------------------------------------------
[~talli...@mitre.org] I think this issue has regressed. Please take a look at the attached file. It's parsed as an email but the body text is detected as US-ASCII instead of Windows-1252 (note the 0x92 characters).
> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor
> Fix For: 2.0, 1.15
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is
> always detected instead. While not tested, this likely affects other
> windows-125* encodings as well.
> I tracked it down to a change in the
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always
> returns "ISO-8859-1" whereas before it was:
> {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}}
> method so that the returned CharsetMatch has the proper name. The problem
> with that is that the {{CharsetDetector#detectAll()}} method overwrites the correct
> match with a new one that will return the value of {{#getName()}} from the
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in the
> {{detectAll()}} method are replaced with new ones, but changing this code in
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);
[jira] [Created] (TIKA-2455) Flag in metadata for alternative email bodies
Matthew Caruana Galizia created TIKA-2455: - Summary: Flag in metadata for alternative email bodies Key: TIKA-2455 URL: https://issues.apache.org/jira/browse/TIKA-2455 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.16 Reporter: Matthew Caruana Galizia Priority: Minor When multipart RFC822 emails are being parsed, there's no way to distinguish between alternative versions of the body and attachments. It would be ideal if some kind of flag were set in the metadata passed to the {{EmbeddedDocumentExtractor}} that indicates that the stream is an alternative. In GUIs that present the data extracted from the email, alternative bodies can be distinguished from attachments and presented separately. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2454) Emails extracted from PSTs detected as unexpected file types
[ https://issues.apache.org/jira/browse/TIKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148117#comment-16148117 ]
Matthew Caruana Galizia commented on TIKA-2454:
-----------------------------------------------
I don't know if the same thing can be done wholesale for mbox files. There are four variants of emails in mbox files: http://www.forensicswiki.org/wiki/MBox#MBOX_File_Variants
> Emails extracted from PSTs detected as unexpected file types
> ------------------------------------------------------------
>
> Key: TIKA-2454
> URL: https://issues.apache.org/jira/browse/TIKA-2454
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Fix For: 1.17
>
> This issue is severe. The Outlook PST parser extracts a string for the body
> of every email and passes that string to the {{EmbeddedDocumentExtractor}}.
> However, no content type is set on the {{Metadata}} object passed to the
> extractor. Therefore if, for example, the body of the email starts with the
> string "From John Smith." (as can happen when an email is forwarded), then
> the body of the email is detected as {{application/mbox}} and parsed as though it
> were an mbox file.
> I think the immediate fix for this issue is to force the type of the email to
> {{text/plain}} and for it to be parsed as such.
[jira] [Commented] (TIKA-2454) Emails extracted from PSTs detected as unexpected file types
[ https://issues.apache.org/jira/browse/TIKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147907#comment-16147907 ]
Matthew Caruana Galizia commented on TIKA-2454:
-----------------------------------------------
I agree with you. The fact that you can't force the type means that PST parsing is completely broken.
> Emails extracted from PSTs detected as unexpected file types
> ------------------------------------------------------------
>
> Key: TIKA-2454
> URL: https://issues.apache.org/jira/browse/TIKA-2454
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
>
> This issue is severe. The Outlook PST parser extracts a string for the body
> of every email and passes that string to the {{EmbeddedDocumentExtractor}}.
> However, no content type is set on the {{Metadata}} object passed to the
> extractor. Therefore if, for example, the body of the email starts with the
> string "From John Smith." (as can happen when an email is forwarded), then
> the body of the email is detected as {{application/mbox}} and parsed as though it
> were an mbox file.
> I think the immediate fix for this issue is to force the type of the email to
> {{text/plain}} and for it to be parsed as such.
[jira] [Commented] (TIKA-2444) JP2 codestream files not parsed
[ https://issues.apache.org/jira/browse/TIKA-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147657#comment-16147657 ]
Matthew Caruana Galizia commented on TIKA-2444:
-----------------------------------------------
I have no idea. I'm trying to solve a similar problem with raw G4 bytestreams that are not contained in a TIFF container. Do you know anyone who has experience with image parsing in Java?
> JP2 codestream files not parsed
> -------------------------------
>
> Key: TIKA-2444
> URL: https://issues.apache.org/jira/browse/TIKA-2444
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Labels: imageio, images, ocr
> Attachments: balloon.j2c
>
> We've come across some embedded files in the wild that are detected by Tika
> as {{image/x-jp2-codestream}}. The identification is correct according to a
> description of the format [1].
> However, no Parser implementation declares support for this format.
> It would make sense to declare support for this format in the Tesseract OCR
> parser. However, the parser would need to contain functionality that either:
> 1) wraps the codestream in a JP2 container;
> 2) or transcodes the image to PNG.
> This is because while Tesseract supports JP2 (via Leptonica), it doesn't
> support the raw codestream as a file.
> [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream
[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension
[ https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147632#comment-16147632 ] Matthew Caruana Galizia commented on TIKA-2450: --- Thank you, that looks like a good solution! > OfficeParser.parse called for zero-byte file with .doc extension > > > Key: TIKA-2450 > URL: https://issues.apache.org/jira/browse/TIKA-2450 > Project: Tika > Issue Type: Bug > Components: detector, parser >Affects Versions: 1.16 >Reporter: Matthew Caruana Galizia >Priority: Minor > Fix For: 1.17 > > > A zero-byte (empty) file with a .doc extension is detected as a Word Document > and the {{OfficeParser.parse}} method is called for this file. > We then get a {{TikaException}}, with the cause given as an > {{org.apache.poi.EmptyFileException}}. > I think it would be more useful if the file were NOT detected as a Word > Document, meaning that the {{AutoDetectParser}} would then fall back to > whatever is set as the fallback parser in the parse context. > This is more useful because the user can then trigger some special logic for > handling empty files. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension
[ https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147347#comment-16147347 ] Matthew Caruana Galizia commented on TIKA-2450: --- When you put it that way, then I'll say yes. There is value in knowing what it might have been. > OfficeParser.parse called for zero-byte file with .doc extension > > > Key: TIKA-2450 > URL: https://issues.apache.org/jira/browse/TIKA-2450 > Project: Tika > Issue Type: Bug > Components: detector, parser >Affects Versions: 1.16 >Reporter: Matthew Caruana Galizia >Priority: Minor > > A zero-byte (empty) file with a .doc extension is detected as a Word Document > and the {{OfficeParser.parse}} method is called for this file. > We then get a {{TikaException}}, with the cause given as an > {{org.apache.poi.EmptyFileException}}. > I think it would be more useful if the file were NOT detected as a Word > Document, meaning that the {{AutoDetectParser}} would then fall back to > whatever is set as the fallback parser in the parse context. > This is more useful because the user can then trigger some special logic for > handling empty files. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension
[ https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147331#comment-16147331 ]
Matthew Caruana Galizia commented on TIKA-2450:
-----------------------------------------------
OK, with that in mind I will agree with you. In the same way that all the various encrypted document exceptions are normalised to a {{tika.exception.EncryptedDocumentException}}, I think it would be useful to, as Tim suggests, normalise empty file exceptions to a {{tika.exception.ZeroByteFileException}} (that extends {{TikaException}}).
> OfficeParser.parse called for zero-byte file with .doc extension
> ----------------------------------------------------------------
>
> Key: TIKA-2450
> URL: https://issues.apache.org/jira/browse/TIKA-2450
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Priority: Minor
>
> A zero-byte (empty) file with a .doc extension is detected as a Word Document
> and the {{OfficeParser.parse}} method is called for this file.
> We then get a {{TikaException}}, with the cause given as an
> {{org.apache.poi.EmptyFileException}}.
> I think it would be more useful if the file were NOT detected as a Word
> Document, meaning that the {{AutoDetectParser}} would then fall back to
> whatever is set as the fallback parser in the parse context.
> This is more useful because the user can then trigger some special logic for
> handling empty files.
[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension
[ https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147303#comment-16147303 ]
Matthew Caruana Galizia commented on TIKA-2450:
-----------------------------------------------
I would argue that the raison d'être of tika-detect is not to provide extension-based detection, but to provide detection. A zero-byte file can never be a Word Document, so assuming my first statement is true then logically it should not be detected as a Word Document.
> OfficeParser.parse called for zero-byte file with .doc extension
> ----------------------------------------------------------------
>
> Key: TIKA-2450
> URL: https://issues.apache.org/jira/browse/TIKA-2450
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Priority: Minor
>
> A zero-byte (empty) file with a .doc extension is detected as a Word Document
> and the {{OfficeParser.parse}} method is called for this file.
> We then get a {{TikaException}}, with the cause given as an
> {{org.apache.poi.EmptyFileException}}.
> I think it would be more useful if the file were NOT detected as a Word
> Document, meaning that the {{AutoDetectParser}} would then fall back to
> whatever is set as the fallback parser in the parse context.
> This is more useful because the user can then trigger some special logic for
> handling empty files.
[jira] [Created] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension
Matthew Caruana Galizia created TIKA-2450: - Summary: OfficeParser.parse called for zero-byte file with .doc extension Key: TIKA-2450 URL: https://issues.apache.org/jira/browse/TIKA-2450 Project: Tika Issue Type: Bug Components: detector, parser Affects Versions: 1.16 Reporter: Matthew Caruana Galizia Priority: Minor A zero-byte (empty) file with a .doc extension is detected as a Word Document and the {{OfficeParser.parse}} method is called for this file. We then get a {{TikaException}}, with the cause given as an {{org.apache.poi.EmptyFileException}}. I think it would be more useful if the file were NOT detected as a Word Document, meaning that the {{AutoDetectParser}} would then fall back to whatever is set as the fallback parser in the parse context. This is more useful because the user can then trigger some special logic for handling empty files. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
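Whichever way detection ends up behaving, the empty-file case can be recognised before any parser is chosen by peeking at the stream. The following is a minimal sketch using only `java.io`; the class and method names (`EmptyStreamCheck`, `isEmpty`) are hypothetical and not part of Tika, and the stream is assumed to support mark/reset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not Tika API: peeks at a stream without consuming it.
final class EmptyStreamCheck {
    private EmptyStreamCheck() {}

    /** Returns true when the stream has no bytes left; the read position is restored. */
    static boolean isEmpty(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(1); // remember the current position
            boolean empty = in.read() == -1;
            in.reset(); // put back the byte we may have read
            return empty;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller could use the result to route empty files to its own handling instead of waiting for the parser to fail.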
[jira] [Updated] (TIKA-2444) JP2 codestream files not parsed
[ https://issues.apache.org/jira/browse/TIKA-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Caruana Galizia updated TIKA-2444:
------------------------------------------
Attachment: balloon.j2c
Example JP2K codestream file attached.
> JP2 codestream files not parsed
> -------------------------------
>
> Key: TIKA-2444
> URL: https://issues.apache.org/jira/browse/TIKA-2444
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Labels: imageio, images, ocr
> Attachments: balloon.j2c
>
> We've come across some embedded files in the wild that are detected by Tika
> as {{image/x-jp2-codestream}}. The identification is correct according to a
> description of the format [1].
> However, no Parser implementation declares support for this format.
> It would make sense to declare support for this format in the Tesseract OCR
> parser. However, the parser would need to contain functionality that either:
> 1) wraps the codestream in a JP2 container;
> 2) or transcodes the image to PNG.
> This is because while Tesseract supports JP2 (via Leptonica), it doesn't
> support the raw codestream as a file.
> [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream
[jira] [Created] (TIKA-2444) JP2 codestream files not parsed
Matthew Caruana Galizia created TIKA-2444:
------------------------------------------
Summary: JP2 codestream files not parsed
Key: TIKA-2444
URL: https://issues.apache.org/jira/browse/TIKA-2444
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia

We've come across some embedded files in the wild that are detected by Tika as {{image/x-jp2-codestream}}. The identification is correct according to a description of the format [1].
However, no Parser implementation declares support for this format.
It would make sense to declare support for this format in the Tesseract OCR parser. However, the parser would need to contain functionality that either:
1) wraps the codestream in a JP2 container;
2) or transcodes the image to PNG.
This is because while Tesseract supports JP2 (via Leptonica), it doesn't support the raw codestream as a file.
[1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream
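Option 2 above (transcoding to PNG) can be sketched with the standard `javax.imageio` API. The class name `PngTranscoder` is hypothetical and this is not how the Tesseract OCR parser is implemented. It also rests on an assumption: the stock JDK ImageIO readers are not expected to decode JPEG 2000, so a reader plugin for the source format would have to be on the classpath, otherwise `ImageIO.read` returns `null`.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not Tika code: re-encodes any ImageIO-readable image as PNG.
final class PngTranscoder {
    private PngTranscoder() {}

    static byte[] toPng(byte[] imageBytes) {
        try {
            // ImageIO picks a reader by sniffing the content; null means no reader matched.
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (image == null) {
                throw new IllegalArgumentException("no ImageIO reader available for this input");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(image, "png", out)) {
                throw new IllegalStateException("no PNG writer available");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting PNG bytes could then be handed to an OCR step that only understands common container formats. The cost is a full decode and re-encode of every image, which option 1 would avoid.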
[jira] [Commented] (TIKA-2436) Support for GZIP-compressed EMF files
[ https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106136#comment-16106136 ]
Matthew Caruana Galizia commented on TIKA-2436:
-----------------------------------------------
To give you an example of why this is a problem in an actual use case, we are ingesting the text extracted from files into Solr. The way the files are stored in the index represents the same hierarchy that you have on disk: files extracted from container files are stored in the index as child documents of the container document.
Therefore, for an EMZ file within a DOCX file, we end up with three documents:
DOCX -> EMZ -> EMF
Whereas we expect:
DOCX -> EMZ
> Support for GZIP-compressed EMF files
> -------------------------------------
>
> Key: TIKA-2436
> URL: https://issues.apache.org/jira/browse/TIKA-2436
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Affects Versions: 1.15
> Reporter: Matthew Caruana Galizia
> Attachments: image004.emz
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files.
> These files should instead be detected as EMF files and the EMFParser should
> perform decompression transparently.
[jira] [Commented] (TIKA-2436) Support for GZIP-compressed EMF files
[ https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106135#comment-16106135 ]
Matthew Caruana Galizia commented on TIKA-2436:
-----------------------------------------------
The difference is that the file is treated as a package or container format, when I don't think it should be. It is a distinct file format that happens to be compressed. Instead of treating it as a container and relying on the CompressorParser to call the ParsingEmbeddedDocumentExtractor, the EMFParser should have native support for the compression and unwrap it itself. The same should be true for SVGZ and WMZ.
To draw a parallel, DOCX is also a compressed format, but Tika does not treat it as a package. It understands that the compression is an artefact of the format rather than an explicit container.
> Support for GZIP-compressed EMF files
> -------------------------------------
>
> Key: TIKA-2436
> URL: https://issues.apache.org/jira/browse/TIKA-2436
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Affects Versions: 1.15
> Reporter: Matthew Caruana Galizia
> Attachments: image004.emz
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files.
> These files should instead be detected as EMF files and the EMFParser should
> perform decompression transparently.
[jira] [Updated] (TIKA-2436) Support for GZIP-compressed EMF files
[ https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Caruana Galizia updated TIKA-2436:
------------------------------------------
Attachment: image004.emz
Example EMZ file attached. Commons Compress will yield an EMF from this (see COMPRESS-68).
> Support for GZIP-compressed EMF files
> -------------------------------------
>
> Key: TIKA-2436
> URL: https://issues.apache.org/jira/browse/TIKA-2436
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Affects Versions: 1.15
> Reporter: Matthew Caruana Galizia
> Attachments: image004.emz
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files.
> These files should instead be detected as EMF files and the EMFParser should
> perform decompression transparently.
[jira] [Created] (TIKA-2436) Support for GZIP-compressed EMF files
Matthew Caruana Galizia created TIKA-2436: - Summary: Support for GZIP-compressed EMF files Key: TIKA-2436 URL: https://issues.apache.org/jira/browse/TIKA-2436 Project: Tika Issue Type: Improvement Components: mime, parser Affects Versions: 1.15 Reporter: Matthew Caruana Galizia Tika is currently detecting EMZ (compressed EMF) files as simple gzip files. These files should instead be detected as EMF files and the EMFParser should perform decompression transparently. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
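"Transparent" decompression here could amount to checking for the gzip magic bytes (0x1F 0x8B) and unwrapping before the EMF records are read. The sketch below shows that step with `java.util.zip` only. The class and method names (`GzipUnwrap`, `isGzip`, `unwrap`) are hypothetical, it works on byte arrays for brevity, and it is not the EMFParser's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not Tika code: unwraps gzip if present, otherwise passes data through.
final class GzipUnwrap {
    private GzipUnwrap() {}

    /** True when the data starts with the gzip magic bytes 0x1F 0x8B. */
    static boolean isGzip(byte[] data) {
        return data.length >= 2 && (data[0] & 0xFF) == 0x1F && (data[1] & 0xFF) == 0x8B;
    }

    /** Returns the decompressed payload for gzip input, or the same array otherwise. */
    static byte[] unwrap(byte[] data) {
        if (!isGzip(data)) {
            return data; // already uncompressed, e.g. a plain EMF
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A parser built this way would report a single document for an EMZ file rather than a gzip container with an embedded EMF child.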
[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Caruana Galizia updated TIKA-879:
-----------------------------------------
Attachment: mbox_email_section.txt
As described in TIKA-2042, the attached file [^mbox_email_section.txt] contains a section of an MBOX file, itself containing a message stream which is detected as text/html instead of message/rfc822, even though the correct mimetype is set on the Metadata object by the MBOXParser.
> Detection problem: message/rfc822 file is detected as text/plain.
> -----------------------------------------------------------------
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
> Issue Type: Bug
> Components: metadata, mime
> Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
> Reporter: Konstantin Gribov
> Labels: new-parser
> Attachments: mbox_email_section.txt, mime_diffs_A_to_B.html,
> TIKA-879-thunderbird.eml
>
> When using {{DefaultDetector}}, the mime type for {{.eml}} files is different (you
> can test it on {{testRFC822}} and {{testRFC822_base64}} in
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really works for
> such files, even if you set {{CONTENT_TYPE}} in metadata or some {{.eml}}
> file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822",
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}}
> works only by magic.
[jira] [Comment Edited] (TIKA-2042) MBOX file detected wrongly as text/html
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085847#comment-16085847 ] Matthew Caruana Galizia edited comment on TIKA-2042 at 7/13/17 3:13 PM: I've attached a sample of one of the message sections from the MBOX. Detected as text/html instead of message/rfc822. was (Author: mcaruanagalizia): Sample of one of the message sections from the MBOX. Detected as text/html instead of message/rfc822. > MBOX file detected wrongly as text/html > --- > > Key: TIKA-2042 > URL: https://issues.apache.org/jira/browse/TIKA-2042 > Project: Tika > Issue Type: Bug >Affects Versions: 1.13 > Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the > time of this writing >Reporter: Vjeran Marcinko > Fix For: 1.14 > > Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt > > > MBOX file doesn't get recognized via "magic detection" mechanism as > "application/mbox", but wrongly as "text/html". > Workaround for this in Tika 1.13 is achieved by placing following in > custom-mimetypes.xml, as suggested on mailing list (priority has to be larger > than message/rfc822): > > > > > > > Sample MBOX file is attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TIKA-2042) MBOX file detected wrongly as text/html
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2042: -- Attachment: mbox_email_section.txt Sample of one of the message sections from the MBOX. Detected as text/html instead of message/rfc822. > MBOX file detected wrongly as text/html > --- > > Key: TIKA-2042 > URL: https://issues.apache.org/jira/browse/TIKA-2042 > Project: Tika > Issue Type: Bug >Affects Versions: 1.13 > Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the > time of this writing >Reporter: Vjeran Marcinko > Fix For: 1.14 > > Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt > > > MBOX file doesn't get recognized via "magic detection" mechanism as > "application/mbox", but wrongly as "text/html". > Workaround for this in Tika 1.13 is achieved by placing following in > custom-mimetypes.xml, as suggested on mailing list (priority has to be larger > than message/rfc822): > > > > > > > Sample MBOX file is attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085842#comment-16085842 ] Matthew Caruana Galizia commented on TIKA-2042: --- [~gagravarr] thank you - that fixes the detection of at least one of the MBOX files. Now the problem is that when the email streams get passed to the delegate parser by the ParsingEmbeddedDocumentExtractor implementation, they're detected as text/html instead of message/rfc822. > MBOX file detected wrongly as text/html > --- > > Key: TIKA-2042 > URL: https://issues.apache.org/jira/browse/TIKA-2042 > Project: Tika > Issue Type: Bug >Affects Versions: 1.13 > Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the > time of this writing >Reporter: Vjeran Marcinko > Fix For: 1.14 > > Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt > > > MBOX file doesn't get recognized via "magic detection" mechanism as > "application/mbox", but wrongly as "text/html". > Workaround for this in Tika 1.13 is achieved by placing following in > custom-mimetypes.xml, as suggested on mailing list (priority has to be larger > than message/rfc822): > > > > > > > Sample MBOX file is attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (TIKA-2042) MBOX file detected wrongly as text/html
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085709#comment-16085709 ] Matthew Caruana Galizia edited comment on TIKA-2042 at 7/13/17 2:22 PM: I'd like to ask for this issue to be reopened. Around half the MBOX files in our corpus are being detected as text/html. My guess is that there are two reasons for this: 1) the files have no extension - the filenames are literally "mbox" rather than "*.mbox" (I think this is the way they're generated or used to be generated on Macs - they're in an *.mbox container directory, but the meat is within an mbox file contained within that directory); 2) the headers don't fall within the 256 byte offset specified by the matcher in the mimetypes XML file. was (Author: mcaruanagalizia): I'd like to ask for this issue to be reopened. Around half the MBOX files in our corpus are being detected as text/html. My guess is that there are two reasons for this: 1) the files have no extension - the filenames are literally "mbox" rather than "*.mbox" (I think this is the way they're generate or used to be generate on Macs - they're in a *.mbox container directory, but the meat is within an mbox file contained within that directory); 2) the headers don't fall within the 256 byte offset specified by the matcher in the mimetypes XML file. > MBOX file detected wrongly as text/html > --- > > Key: TIKA-2042 > URL: https://issues.apache.org/jira/browse/TIKA-2042 > Project: Tika > Issue Type: Bug >Affects Versions: 1.13 > Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the > time of this writing >Reporter: Vjeran Marcinko > Fix For: 1.14 > > Attachments: clojure.mbox, mbox_header.txt > > > MBOX file doesn't get recognized via "magic detection" mechanism as > "application/mbox", but wrongly as "text/html". 
> Workaround for this in Tika 1.13 is achieved by placing following in > custom-mimetypes.xml, as suggested on mailing list (priority has to be larger > than message/rfc822): > > > > > > > Sample MBOX file is attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TIKA-2042) MBOX file detected wrongly as text/html
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2042: -- Attachment: mbox_header.txt Header attached with identifying information stripped out. This file is detected as text/html instead of application/mbox. > MBOX file detected wrongly as text/html > --- > > Key: TIKA-2042 > URL: https://issues.apache.org/jira/browse/TIKA-2042 > Project: Tika > Issue Type: Bug >Affects Versions: 1.13 > Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the > time of this writing >Reporter: Vjeran Marcinko > Fix For: 1.14 > > Attachments: clojure.mbox, mbox_header.txt > > > MBOX file doesn't get recognized via "magic detection" mechanism as > "application/mbox", but wrongly as "text/html". > Workaround for this in Tika 1.13 is achieved by placing following in > custom-mimetypes.xml, as suggested on mailing list (priority has to be larger > than message/rfc822): > > > > > > > Sample MBOX file is attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085709#comment-16085709 ] Matthew Caruana Galizia commented on TIKA-2042: --- I'd like to ask for this issue to be reopened. Around half the MBOX files in our corpus are being detected as text/html. My guess is that there are two reasons for this: 1) the files have no extension - the filenames are literally "mbox" rather than "*.mbox" (I think this is the way they're generated or used to be generated on Macs - they're in an *.mbox container directory, but the meat is within an mbox file contained within that directory); 2) the headers don't fall within the 256 byte offset specified by the matcher in the mimetypes XML file. > MBOX file detected wrongly as text/html > --- > > Key: TIKA-2042 > URL: https://issues.apache.org/jira/browse/TIKA-2042 > Project: Tika > Issue Type: Bug >Affects Versions: 1.13 > Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the > time of this writing >Reporter: Vjeran Marcinko > Fix For: 1.14 > > Attachments: clojure.mbox > > > MBOX file doesn't get recognized via "magic detection" mechanism as > "application/mbox", but wrongly as "text/html". > Workaround for this in Tika 1.13 is achieved by placing following in > custom-mimetypes.xml, as suggested on mailing list (priority has to be larger > than message/rfc822): > > > > > > > Sample MBOX file is attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
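The second guess in the comment above (headers falling outside the 256 byte window) can be illustrated with a small sketch. This is not Tika's detector and does not reproduce the matcher in the mimetypes XML file; the class name MboxSniffer, the choice of Return-Path as the header to look for, and the sample addresses are all assumptions made for illustration. It only shows how a fixed window can miss a header when the leading "From " separator line is unusually long.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of window-limited magic matching; not Tika's implementation. */
final class MboxSniffer {

    /** Window size taken from the comment above (assumed to be in bytes). */
    static final int DEFAULT_WINDOW = 256;

    /**
     * Returns true if the data starts with an mbox "From " separator and the
     * given header name starts a line within the first {@code window} bytes.
     */
    static boolean looksLikeMbox(byte[] data, String headerName, int window) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("From ")) {
            return false;
        }
        int newline = text.indexOf("\n" + headerName + ":");
        // The header itself begins one position after the newline.
        return newline >= 0 && newline + 1 < window;
    }

    /** Builds a string of n filler characters (avoids String.repeat for older JDKs). */
    static String filler(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] shortSep = ("From alice@example.org Thu Jul 13 14:22:00 2017\n"
                + "Return-Path: <alice@example.org>\n").getBytes(StandardCharsets.ISO_8859_1);
        byte[] longSep = ("From " + filler(300) + "\n"
                + "Return-Path: <alice@example.org>\n").getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(looksLikeMbox(shortSep, "Return-Path", DEFAULT_WINDOW)); // true
        System.out.println(looksLikeMbox(longSep, "Return-Path", DEFAULT_WINDOW));  // false
        System.out.println(looksLikeMbox(longSep, "Return-Path", 1024));            // true
    }
}
```

If the real files behave like the long-separator case, widening the window in a custom mimetypes entry would be one thing to try, at the cost of reading more of each stream during detection.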
[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078012#comment-16078012 ] Matthew Caruana Galizia commented on TIKA-2399: --- OK. I can't think of any other option for now. For the future, does Apache have a legal team that lobbies for companies to change the licenses they use? > Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000 > -- > > Key: TIKA-2399 > URL: https://issues.apache.org/jira/browse/TIKA-2399 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Tim Allison > > For users who want to extract jp2000 from PDFs for inline-image OCR, they > have to add non- ASL 2.0 compatible: > {noformat} > > com.github.jai-imageio > jai-imageio-jpeg2000 > 1.3.0 > > {noformat} > However, this creates a conflict with GRIB's jj2000: > {noformat} > > edu.ucar > jj2000 > 5.2 > > {noformat} > [~mcaruanagalizia] (I'm guessing?) identified this conflict > [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by > upgrading jj2000 to 5.3. However, that doesn't exist in maven central, but > only in [Boundless|http://example.com]. > What do we do? > # We could exclude the jj2000 dependency from GRIB, and that functionality > won't work for GRIB folks > # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the > classpath to instruct users to exclude jj2000. > # Other options? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076092#comment-16076092 ] Matthew Caruana Galizia commented on TIKA-2399: --- Their response: bq. I wouldn't mind if you forked the repo and published your own artifact, but please use a different group ID. bq. However, I am not a lawyer, and I don't know what, if any, legal ramifications there are for doing so. We (Unidata) do not own JJ2000. Furthermore, its license is uncertain. The Google Code page (1) claims "GNU Lesser GPL", but see also this issue (2). bq. (1) https://code.google.com/archive/p/jj2000/ bq. (2) https://code.google.com/archive/p/jj2000/issues/3 > Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000 > -- > > Key: TIKA-2399 > URL: https://issues.apache.org/jira/browse/TIKA-2399 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Tim Allison > > For users who want to extract jp2000 from PDFs for inline-image OCR, they > have to add non- ASL 2.0 compatible: > {noformat} > > com.github.jai-imageio > jai-imageio-jpeg2000 > 1.3.0 > > {noformat} > However, this creates a conflict with GRIB's jj2000: > {noformat} > > edu.ucar > jj2000 > 5.2 > > {noformat} > [~mcaruanagalizia] (I'm guessing?) identified this conflict > [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by > upgrading jj2000 to 5.3. However, that doesn't exist in maven central, but > only in [Boundless|http://example.com]. > What do we do? > # We could exclude the jj2000 dependency from GRIB, and that functionality > won't work for GRIB folks > # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the > classpath to instruct users to exclude jj2000. > # Other options? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074918#comment-16074918 ] Matthew Caruana Galizia commented on TIKA-2399: --- I've emailed Unidata to ask about publishing with your key. > Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000 > -- > > Key: TIKA-2399 > URL: https://issues.apache.org/jira/browse/TIKA-2399 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Tim Allison > > For users who want to extract jp2000 from PDFs for inline-image OCR, they > have to add non- ASL 2.0 compatible: > {noformat} > > com.github.jai-imageio > jai-imageio-jpeg2000 > 1.3.0 > > {noformat} > However, this creates a conflict with GRIB's jj2000: > {noformat} > > edu.ucar > jj2000 > 5.2 > > {noformat} > [~mcaruanagalizia] (I'm guessing?) identified this conflict > [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by > upgrading jj2000 to 5.3. However, that doesn't exist in maven central, but > only in [Boundless|http://example.com]. > What do we do? > # We could exclude the jj2000 dependency from GRIB, and that functionality > won't work for GRIB folks > # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the > classpath to instruct users to exclude jj2000. > # Other options? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074881#comment-16074881 ] Matthew Caruana Galizia commented on TIKA-2399: --- Tim, see https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/8. Once jj2000 is in central then it will also be used by jai-imageio-jpeg2000 to get images from PDFs for OCR under the hood. > Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000 > -- > > Key: TIKA-2399 > URL: https://issues.apache.org/jira/browse/TIKA-2399 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Tim Allison > > For users who want to extract jp2000 from PDFs for inline-image OCR, they > have to add non- ASL 2.0 compatible: > {noformat} > > com.github.jai-imageio > jai-imageio-jpeg2000 > 1.3.0 > > {noformat} > However, this creates a conflict with GRIB's jj2000: > {noformat} > > edu.ucar > jj2000 > 5.2 > > {noformat} > [~mcaruanagalizia] (I'm guessing?) identified this conflict > [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by > upgrading jj2000 to 5.3. However, that doesn't exist in maven central, but > only in [Boundless|http://example.com]. > What do we do? > # We could exclude the jj2000 dependency from GRIB, and that functionality > won't work for GRIB folks > # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the > classpath to instruct users to exclude jj2000. > # Other options? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073660#comment-16073660 ] Matthew Caruana Galizia commented on TIKA-2399: --- Wouldn't it be better to warn? (Option 2 in your description.) > Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000 > -- > > Key: TIKA-2399 > URL: https://issues.apache.org/jira/browse/TIKA-2399 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Tim Allison > > For users who want to extract jp2000 from PDFs for inline-image OCR, they > have to add non- ASL 2.0 compatible: > {noformat} > > com.github.jai-imageio > jai-imageio-jpeg2000 > 1.3.0 > > {noformat} > However, this creates a conflict with GRIB's jj2000: > {noformat} > > edu.ucar > jj2000 > 5.2 > > {noformat} > [~mcaruanagalizia] (I'm guessing?) identified this conflict > [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by > upgrading jj2000 to 5.3. However, that doesn't exist in maven central, but > only in [Boundless|http://example.com]. > What do we do? > # We could exclude the jj2000 dependency from GRIB, and that functionality > won't work for GRIB folks > # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the > classpath to instruct users to exclude jj2000. > # Other options? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
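A rough sketch of what option 2 (warn when jai-imageio-jpeg2000 is on the classpath) could look like, using only the JDK. The class ClasspathConflictCheck and its method names are hypothetical, and the marker class name used in main is an assumption about what the jai-imageio-jpeg2000 jar contains; it would need checking against the actual artifact.

```java
import java.util.logging.Logger;

/** Hypothetical helper; a sketch of the "warn" option, not code from Tika. */
final class ClasspathConflictCheck {

    private static final Logger LOG =
            Logger.getLogger(ClasspathConflictCheck.class.getName());

    /** Returns true if the named class can be located without initialising it. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class only, do not run static initialisers.
            Class.forName(className, false, ClasspathConflictCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Logs a warning and returns true when the marker class is present. */
    static boolean warnIfPresent(String markerClass, String advice) {
        boolean present = isOnClasspath(markerClass);
        if (present) {
            LOG.warning(markerClass + " found on the classpath: " + advice);
        }
        return present;
    }

    public static void main(String[] args) {
        // Assumed marker class for jai-imageio-jpeg2000; verify against the jar.
        warnIfPresent("com.github.jaiimageio.jpeg2000.impl.J2KImageReader",
                "consider excluding edu.ucar:jj2000 to avoid a version conflict");
    }
}
```

Checking by class name keeps the check cheap and avoids a compile-time dependency on the non-ASL artifact.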
[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055541#comment-16055541 ] Matthew Caruana Galizia commented on TIKA-2399: --- I had emailed Unidata in February about publishing to Central and got this reply: bq. Publishing to Maven Central is on our TODO list, not only for JJ2000, but for all of our Java products. However, the process is tedious, especially since some of our products are built using Gradle. So, we've been reluctant to tackle this task. We intend to do it some time this year, but I can't be any more specific than that. > Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000 > -- > > Key: TIKA-2399 > URL: https://issues.apache.org/jira/browse/TIKA-2399 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Tim Allison > > For users who want to extract jp2000 from PDFs for inline-image OCR, they > have to add non- ASL 2.0 compatible: > {noformat} > > com.github.jai-imageio > jai-imageio-jpeg2000 > 1.3.0 > > {noformat} > However, this creates a conflict with GRIB's jj2000: > {noformat} > > edu.ucar > jj2000 > 5.2 > > {noformat} > [~mcaruanagalizia] (I'm guessing?) identified this conflict > [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by > upgrading jj2000 to 5.3. However, that doesn't exist in maven central, but > only in [Boundless|http://example.com]. > What do we do? > # We could exclude the jj2000 dependency from GRIB, and that functionality > won't work for GRIB folks > # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the > classpath to instruct users to exclude jj2000. > # Other options? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2394) Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
[ https://issues.apache.org/jira/browse/TIKA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050226#comment-16050226 ] Matthew Caruana Galizia commented on TIKA-2394: --- I remember seeing how to override a provided jar version in another issue but I can't seem to find it. How do you do that? > Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft > -- > > Key: TIKA-2394 > URL: https://issues.apache.org/jira/browse/TIKA-2394 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: container, email, pst > > When parsing a PST, I get this message logged to stderr multiple times: > Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft > Unfortunately I cannot supply the PST, as its contents is confidential. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (TIKA-2394) Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
[ https://issues.apache.org/jira/browse/TIKA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050226#comment-16050226 ] Matthew Caruana Galizia edited comment on TIKA-2394 at 6/15/17 9:28 AM: I remember seeing how to override a provided jar version in another issue but I can't seem to find it. How do you do that? (With Maven). was (Author: mcaruanagalizia): I remember seeing how to override a provided jar version in another issue but I can't seem to find it. How do you do that? > Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft > -- > > Key: TIKA-2394 > URL: https://issues.apache.org/jira/browse/TIKA-2394 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: container, email, pst > > When parsing a PST, I get this message logged to stderr multiple times: > Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft > Unfortunately I cannot supply the PST, as its contents is confidential. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TIKA-2394) Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
[ https://issues.apache.org/jira/browse/TIKA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2394: -- Affects Version/s: 1.15 Labels: container email pst (was: ) Priority: Minor (was: Major) Description: When parsing a PST, I get this message logged to stderr multiple times: Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft Unfortunately I cannot supply the PST, as its contents is confidential. Component/s: parser Summary: Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft (was: "Unknown message type") > Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft > -- > > Key: TIKA-2394 > URL: https://issues.apache.org/jira/browse/TIKA-2394 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.15 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: container, email, pst > > When parsing a PST, I get this message logged to stderr multiple times: > Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft > Unfortunately I cannot supply the PST, as its contents is confidential. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (TIKA-2394) "Unknown message type"
Matthew Caruana Galizia created TIKA-2394: - Summary: "Unknown message type" Key: TIKA-2394 URL: https://issues.apache.org/jira/browse/TIKA-2394 Project: Tika Issue Type: Bug Reporter: Matthew Caruana Galizia -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-2389) Warn log level is pretty strong for missing JBIG2ImageReader
[ https://issues.apache.org/jira/browse/TIKA-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044297#comment-16044297 ] Matthew Caruana Galizia commented on TIKA-2389: --- Please don't move this to info. Before seeing this warning, I didn't even know that the JBIG2 format existed. And then yes, maybe who knows, we would have never found things that we found after adding support. For want of a nail... the battle was lost. > Warn log level is pretty strong for missing JBIG2ImageReader > > > Key: TIKA-2389 > URL: https://issues.apache.org/jira/browse/TIKA-2389 > Project: Tika > Issue Type: Wish > Components: parser >Affects Versions: 1.15 >Reporter: Thomas Mortagne > > Given the license of jbig2-imageio many projects (Apache or LGPL projects for > example) won't include it and will always end up with a warning because of it > while they probably don't really care that much about this image format. > Ideally ImageParser should probably be made more extensible and jbig2 part > moved in an optional module but in the meantime is this warning that > necessary ? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-1195) XLSB support
[ https://issues.apache.org/jira/browse/TIKA-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929105#comment-15929105 ] Matthew Caruana Galizia commented on TIKA-1195: --- [~talli...@mitre.org] d'you reckon that will be out with Tika 1.15? > XLSB support > > > Key: TIKA-1195 > URL: https://issues.apache.org/jira/browse/TIKA-1195 > Project: Tika > Issue Type: Improvement > Components: general >Affects Versions: 1.4 > Environment: W2008R2 >Reporter: Frederic Ronny > Labels: new-parser > > We use Manifoldcf 1.3 and Solr 4.4 to index a shared network drive, works > fine for most of our Office filetypes ( docx, xlsx, ) but we also have a > lot of files with filetype xlsb which are not in the supported filetypes. > In order to keep using this solution it is essential to us that there will be > a solution provided in the future -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (TIKA-2280) message_from not extracted from Outlook emails
[ https://issues.apache.org/jira/browse/TIKA-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia closed TIKA-2280. - Resolution: Duplicate > message_from not extracted from Outlook emails > -- > > Key: TIKA-2280 > URL: https://issues.apache.org/jira/browse/TIKA-2280 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: email, outlook, poi > > While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook > emails, it doesn't include the address for Outlook emails. > For example, if the raw from field is "John Doe", the > Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email > parser sets it to "John Doe ". > Currently I'm getting the from address from the RAW_HEADER_FROM field for > Outlook emails, but it would be nice to be able to use a standard across > email formats. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890800#comment-15890800 ] Matthew Caruana Galizia commented on TIKA-1865: --- Thank you, this is a big improvement. > Save sender email address in Outlook MSG metadata > - > > Key: TIKA-1865 > URL: https://issues.apache.org/jira/browse/TIKA-1865 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 > Environment: Windows 7 x64, jre 1.8.0_60 x64 >Reporter: Luis Filipe Nassif > Attachments: report.xlsx > > > Sender email address is lost when extracting metadata from Outlook msg files. > Currently only sender name is extracted. That is an important information to > be extracted for search engines. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images
[ https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890324#comment-15890324 ] Matthew Caruana Galizia commented on TIKA-2235: --- Ah, good catch. OCR'ing inline. > Use Tesseract's recommended DPI for PDF images > -- > > Key: TIKA-2235 > URL: https://issues.apache.org/jira/browse/TIKA-2235 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: ocr, pdf > Fix For: 2.0, 1.15 > > > From the [Tesseract > wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]: > {quote} > Tesseract works best on images which have a DPI of at least 300 dpi > {quote} > PDFParserConfig is currently initialised with a value of 200 for ocrDPI. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images
[ https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890106#comment-15890106 ] Matthew Caruana Galizia commented on TIKA-2235: --- In the majority of cases, JPEG, JBIG2 (embedded in PDFs) and TIFF. > Use Tesseract's recommended DPI for PDF images > -- > > Key: TIKA-2235 > URL: https://issues.apache.org/jira/browse/TIKA-2235 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: ocr, pdf > Fix For: 2.0, 1.15 > > > From the [Tesseract > wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]: > {quote} > Tesseract works best on images which have a DPI of at least 300 dpi > {quote} > PDFParserConfig is currently initialised with a value of 200 for ocrDPI. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
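To give a feel for what moving ocrDPI from 200 to 300 means in practice, here is a small worked calculation. It assumes a US Letter page (612 x 792 PDF points, 72 points per inch) purely as an example; the class OcrDpiMath is hypothetical and does not touch PDFParserConfig.

```java
/** Hypothetical arithmetic sketch: rendered image size at a given DPI. */
final class OcrDpiMath {

    /** PDF user space uses 72 points per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Converts a length in PDF points to pixels at the given resolution. */
    static int pixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        double widthPt = 612;   // US Letter width in points (example page size)
        double heightPt = 792;  // US Letter height in points

        for (int dpi : new int[] {200, 300}) {
            int w = pixels(widthPt, dpi);
            int h = pixels(heightPt, dpi);
            System.out.println(dpi + " dpi -> " + w + " x " + h + " px");
        }
        // 200 dpi -> 1700 x 2200 px
        // 300 dpi -> 2550 x 3300 px
    }
}
```

The 300 dpi rendering has 2.25 times as many pixels as the 200 dpi one, so the better OCR input comes with a corresponding increase in memory and processing per page.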
[jira] [Commented] (TIKA-2280) message_from not extracted from Outlook emails
[ https://issues.apache.org/jira/browse/TIKA-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889192#comment-15889192 ] Matthew Caruana Galizia commented on TIKA-2280: --- OK, so this is a duplicate then. Your proposed strategy in your penultimate comment makes sense to me. We have a larger corpus of about 1 million MSG files. I could test your fix on that and report the results. > message_from not extracted from Outlook emails > -- > > Key: TIKA-2280 > URL: https://issues.apache.org/jira/browse/TIKA-2280 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: email, outlook, poi > > While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook > emails, it doesn't include the address for Outlook emails. > For example, if the raw from field is "John Doe", the > Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email > parser sets it to "John Doe ". > Currently I'm getting the from address from the RAW_HEADER_FROM field for > Outlook emails, but it would be nice to be able to use a standard across > email formats. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (TIKA-2280) message_from not extracted from Outlook emails
[ https://issues.apache.org/jira/browse/TIKA-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2280: -- Description: While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook emails, it doesn't include the address for Outlook emails. For example, if the raw from field is "John Doe", the Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email parser sets it to "John Doe ". Currently I'm getting the from address from the RAW_HEADER_FROM field for Outlook emails, but it would be nice to be able to use a standard across email formats. was: While the MESSAGE_FROM metadata field is extracted for RFC emails, it isn't for Outlook emails. The closest thing we have for Outlook emails is the creator field, which only includes the name (but not the email address). Currently I'm getting the from address from the RAW_HEADER_FROM field, but it would be nice to be able to use a standard across email formats. > message_from not extracted from Outlook emails > -- > > Key: TIKA-2280 > URL: https://issues.apache.org/jira/browse/TIKA-2280 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: email, outlook, poi > > While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook > emails, it doesn't include the address for Outlook emails. > For example, if the raw from field is "John Doe ", the > Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email > parser sets it to "John Doe ". > Currently I'm getting the from address from the RAW_HEADER_FROM field for > Outlook emails, but it would be nice to be able to use a standard across > email formats. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
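The interim workaround described above (taking the address from the raw From header) might look roughly like the following. The class FromField is hypothetical, the address john.doe@example.com is a made-up placeholder, and the regular expression is deliberately much simpler than full RFC 5322 address parsing, so treat this as a sketch rather than a robust parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: split a raw From header into display name and address. */
final class FromField {

    // Matches: optional quoted or bare display name, then <address>.
    private static final Pattern NAME_ADDR =
            Pattern.compile("^\\s*\"?([^\"<]*?)\"?\\s*<([^<>\\s]+)>\\s*$");

    /** Returns the display name, or the trimmed input if no <address> part is found. */
    static String name(String raw) {
        Matcher m = NAME_ADDR.matcher(raw);
        return m.matches() ? m.group(1).trim() : raw.trim();
    }

    /** Returns the address inside angle brackets, or null if there is none. */
    static String address(String raw) {
        Matcher m = NAME_ADDR.matcher(raw);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String raw = "John Doe <john.doe@example.com>"; // placeholder address
        System.out.println(name(raw));    // John Doe
        System.out.println(address(raw)); // john.doe@example.com
    }
}
```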
[jira] [Created] (TIKA-2280) message_from not extracted from Outlook emails
Matthew Caruana Galizia created TIKA-2280: - Summary: message_from not extracted from Outlook emails Key: TIKA-2280 URL: https://issues.apache.org/jira/browse/TIKA-2280 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Reporter: Matthew Caruana Galizia Priority: Minor While the MESSAGE_FROM metadata field is extracted for RFC emails, it isn't for Outlook emails. The closest thing we have for Outlook emails is the creator field, which only includes the name (but not the email address). Currently I'm getting the from address from the RAW_HEADER_FROM field, but it would be nice to be able to use a standard across email formats. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2274) and metadata collision
[ https://issues.apache.org/jira/browse/TIKA-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887559#comment-15887559 ] Matthew Caruana Galizia commented on TIKA-2274: --- Thanks for checking up on this. Try Metadata.TITLE rather than TikaCoreProperties.TITLE. My suggestion for a namespace is "meta", so in this example the resulting name would be "metatitle". > and metadata collision > -- > > Key: TIKA-2274 > URL: https://issues.apache.org/jira/browse/TIKA-2274 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: html > > In several different corpuses I've found HTML files which look like the > following: > {code} > > >Some title > > >... > > {code} > This causes the "title" property in the metadata to have two values set, when > one would expect that this field is not multivalued. > Perhaps some fields from tags, like this one, should be namespaced. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
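The namespacing suggestion can be sketched with a plain multivalued map. This does not use Tika's Metadata class; NamespacedMetadata and its methods are hypothetical, and the "meta" prefix producing "metatitle" simply follows the naming proposed in the comment above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multivalued metadata store. */
final class NamespacedMetadata {

    /** Prefix proposed in the comment above for values that come from meta tags. */
    static final String META_PREFIX = "meta";

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Adds a value under the plain property name (e.g. from the document title). */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Adds a value that came from a meta tag, under the prefixed name. */
    void addFromMetaTag(String name, String value) {
        add(META_PREFIX + name, value);
    }

    /** Returns all values stored under the name, or an empty list. */
    List<String> get(String name) {
        return values.getOrDefault(name, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        NamespacedMetadata md = new NamespacedMetadata();
        md.add("title", "Some title");              // from the title element
        md.addFromMetaTag("title", "Other title");  // from a meta tag named "title"

        System.out.println(md.get("title"));     // [Some title]
        System.out.println(md.get("metatitle")); // [Other title]
    }
}
```

With the prefix applied, "title" stays single-valued while the meta tag value remains available under its own key.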
[jira] [Created] (TIKA-2274) and metadata collision
Matthew Caruana Galizia created TIKA-2274: - Summary: and metadata collision Key: TIKA-2274 URL: https://issues.apache.org/jira/browse/TIKA-2274 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Reporter: Matthew Caruana Galizia Priority: Minor In several different corpuses I've found HTML files which look like the following: {code} Some title ... {code} This causes the "title" property in the metadata to have two values set, when one would expect that this field is not multivalued. Perhaps some fields from tags, like this one, should be namespaced. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2245) Standardise logging
[ https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829856#comment-15829856 ] Matthew Caruana Galizia commented on TIKA-2245: --- So should we agree that parsers should use ONLY JUL and rid it of slf4j and log4j? Or should we standardise on slf4j? > Standardise logging > --- > > Key: TIKA-2245 > URL: https://issues.apache.org/jira/browse/TIKA-2245 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.14, 1.15 >Reporter: Matthew Caruana Galizia > Labels: logging > > Tika parsers sometimes use Log4j's Logger, sometimes the JUL > (java.util.logging) Logger and sometimes SLF4j. > It would be better to standardise on a single facade, for the sake of not > having to configure multiple loggers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2245) Standardise logging
[ https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2245: -- Description: Tika parsers sometimes use Log4j's Logger, sometimes the JUL (java.util.logging) Logger and sometimes SLF4j. It would be better to standardise on a single facade, for the sake of not having to configure multiple loggers. was: Tika parsers sometimes use Log4j's Logger and the JUL (java.util.logging) Logger. I will happily make a pull request to standardise on the latter, as I believe users shouldn't be forced to use a third-party library. It would be better to standardise on the lowest common denominator and leave users free to use their own bridge, for example JUL-to-log4j or whatever they want. Summary: Standardise logging (was: Standardise on java.util.Logging) > Standardise logging > --- > > Key: TIKA-2245 > URL: https://issues.apache.org/jira/browse/TIKA-2245 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.14, 1.15 >Reporter: Matthew Caruana Galizia > Labels: logging > > Tika parsers sometimes use Log4j's Logger, sometimes the JUL > (java.util.logging) Logger and sometimes SLF4j. > It would be better to standardise on a single facade, for the sake of not > having to configure multiple loggers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2245) Standardise on java.util.Logging
Matthew Caruana Galizia created TIKA-2245: - Summary: Standardise on java.util.Logging Key: TIKA-2245 URL: https://issues.apache.org/jira/browse/TIKA-2245 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.14, 1.15 Reporter: Matthew Caruana Galizia Tika parsers sometimes use Log4j's Logger and the JUL (java.util.logging) Logger. I will happily make a pull request to standardise on the latter, as I believe users shouldn't be forced to use a third-party library. It would be better to standardise on the lowest common denominator and leave users free to use their own bridge, for example JUL-to-log4j or whatever they want. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
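The idea of logging through JUL and leaving the bridge to the user can be sketched as below. This is a minimal illustration using only java.util.logging: the CollectingHandler class stands in for a real bridge handler (for example a JUL-to-log4j one), and the logger name "example.parser" is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch: library code logs through JUL, and the application
// decides where the records go by installing its own Handler.
final class JulBridgeSketch {
    // Stand-in for a bridge handler; it only collects formatted messages.
    static final class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            messages.add(record.getLevel() + ": " + record.getMessage());
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    static List<String> demo() {
        Logger logger = Logger.getLogger("example.parser"); // made-up logger name
        CollectingHandler handler = new CollectingHandler();
        logger.setUseParentHandlers(false); // do not also print via the root console handler
        logger.addHandler(handler);
        logger.warning("could not read embedded image");
        logger.removeHandler(handler);
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The library side only ever touches Logger; swapping CollectingHandler for another handler changes the destination without touching the code that logs.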
[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support
[ https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823691#comment-15823691 ] Matthew Caruana Galizia commented on TIKA-2232: --- Could we at least log a warning once when the ClassNotFoundException is thrown? Otherwise I feel like we're sweeping the problem under the rug. In the meantime I've asked one of the Levigo developers if they'd consider switching to a license which is compatible with the ASL v2. > Add JBIG2 image parsing support > --- > > Key: TIKA-2232 > URL: https://issues.apache.org/jira/browse/TIKA-2232 > Project: Tika > Issue Type: New Feature > Components: parser >Affects Versions: 1.14 > Environment: Any >Reporter: Pascal Essiembre >Assignee: Tim Allison >Priority: Minor > Fix For: 2.0, 1.15 > > > If you are interested, I would like to add support for JBIG2 image files > (.jb2, or .jbig2). I have encountered them in PDFs. > I will make a pull-request shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
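A warn-once check for a missing optional class, as suggested in the comment above, might look something like this. It is a sketch under assumptions, not the change that was made in Tika: the OptionalDependency class and the class name com.example.MissingImageReader are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

// Hypothetical sketch: probes for an optional class and logs a single
// warning the first time it turns out to be missing.
final class OptionalDependency {
    private static final Logger LOG = Logger.getLogger(OptionalDependency.class.getName());
    private static final AtomicBoolean WARNED = new AtomicBoolean(false);

    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            // compareAndSet succeeds only for the first caller, so the
            // warning is emitted once even under concurrent use.
            if (WARNED.compareAndSet(false, true)) {
                LOG.warning("Optional class not found, skipping: " + className);
            }
            return false;
        }
    }

    static boolean hasWarned() {
        return WARNED.get();
    }

    public static void main(String[] args) {
        // Made-up class name; the second call returns false without logging again.
        System.out.println(isAvailable("com.example.MissingImageReader"));
        System.out.println(isAvailable("com.example.MissingImageReader"));
    }
}
```

Note that the flag here is shared across all class names; a variant that warns once per missing class would need a concurrent set of names instead of a single boolean.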
[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images
[ https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818176#comment-15818176 ] Matthew Caruana Galizia commented on TIKA-2235: --- Yes, I am already! Thanks for linking me to that. It's good that that pull request adds metadata support for JBIG2, but would it not be better to wait for the PDFBox 2.0.5 release (which I'm assuming is soon) instead of adding todos? > Use Tesseract's recommended DPI for PDF images > -- > > Key: TIKA-2235 > URL: https://issues.apache.org/jira/browse/TIKA-2235 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: ocr, pdf > Fix For: 2.0, 1.15 > > > From the [Tesseract > wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]: > {quote} > Tesseract works best on images which have a DPI of at least 300 dpi > {quote} > PDFParserConfig is currently initialised with a value of 200 for ocrDPI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2235) Use Tesseract's recommended DPI for PDF images
Matthew Caruana Galizia created TIKA-2235: - Summary: Use Tesseract's recommended DPI for PDF images Key: TIKA-2235 URL: https://issues.apache.org/jira/browse/TIKA-2235 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.14 Reporter: Matthew Caruana Galizia Priority: Minor From the [Tesseract wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]: {quote} Tesseract works best on images which have a DPI of at least 300 dpi {quote} PDFParserConfig is currently initialised with a value of 200 for ocrDPI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
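To make the difference between 200 and 300 DPI concrete, the arithmetic below computes the rendered size of a page at each setting. It is only an illustration: the OcrDpiSketch class is hypothetical, and it assumes the default PDF user-space unit of 1/72 inch and a US Letter page (612 x 792 points).

```java
// Hypothetical worked example: rendered pixel dimensions of a PDF page
// at a given DPI, assuming 72 points per inch.
final class OcrDpiSketch {
    static final double POINTS_PER_INCH = 72.0;

    // Converts a length in points to pixels at the given resolution.
    static long pixels(double points, int dpi) {
        return Math.round(points / POINTS_PER_INCH * dpi);
    }

    public static void main(String[] args) {
        // US Letter: 612 x 792 points, i.e. 8.5 x 11 inches.
        for (int dpi : new int[] {200, 300}) {
            System.out.println(dpi + " dpi -> "
                    + pixels(612, dpi) + " x " + pixels(792, dpi) + " px");
        }
        // prints "200 dpi -> 1700 x 2200 px"
        // prints "300 dpi -> 2550 x 3300 px"
    }
}
```

Going from 200 to 300 DPI therefore renders 2.25 times as many pixels per page, which is worth keeping in mind for memory use and OCR time.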
[jira] [Created] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException
Matthew Caruana Galizia created TIKA-2221: - Summary: poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException Key: TIKA-2221 URL: https://issues.apache.org/jira/browse/TIKA-2221 Project: Tika Issue Type: Bug Affects Versions: 1.14 Reporter: Matthew Caruana Galizia Priority: Minor Fix For: 1.15 When parsing an encrypted Word document, an org.apache.poi.EncryptedDocumentException is thrown at WordExtractor.java#151. Tika catches this too far up the stack and incorrectly wraps it in a plain TikaException instead of an org.apache.tika.exception.EncryptedDocumentException. The fix would be to catch and wrap the exception correctly, for example:
{noformat}
try {
    document = new HWPFDocument(root);
} catch (org.apache.poi.EncryptedDocumentException e) {
    // Rethrow as Tika's own exception type so callers can detect encryption.
    throw new EncryptedDocumentException(e);
} catch (OldWordFileFormatException e) {
    parseWord6(root, xhtml);
    return;
}
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF
[ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15701919#comment-15701919 ] Matthew Caruana Galizia commented on TIKA-2175: --- The problem was OpenCL support in Tesseract. Once I rebuilt Tesseract without OpenCL support, I got the same results as you above, but using setExtractInlineImages(true) instead of setOcrStrategy(...). Thank you for testing. > Enable extraction of inlined jp2/jpx from PDF > - > > Key: TIKA-2175 > URL: https://issues.apache.org/jira/browse/TIKA-2175 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison > Attachments: pdf-with-jp2-images.pdf > > > On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were > not being OCR'd. TIKA-2174 added that file type to our tesseract parser, but > our code in the PDFParser wasn't extracting the inline images as well. > Let's fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF
[ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696377#comment-15696377 ] Matthew Caruana Galizia commented on TIKA-2175: --- Still no joy, both with my bridge classes and with tika-app from trunk. It seems the images in the PDF are skipped over entirely. I don't think that the embedded document parsing handler is ever even invoked. I've attached the PDF in question. If you open it in a hex editor, you can see that the files are declared to be "jp2" format. > Enable extraction of inlined jp2/jpx from PDF > - > > Key: TIKA-2175 > URL: https://issues.apache.org/jira/browse/TIKA-2175 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison > Attachments: pdf-with-jp2-images.pdf > > > On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were > not being OCR'd. TIKA-2174 added that file type to our tesseract parser, but > our code in the PDFParser wasn't extracting the inline images as well. > Let's fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser
[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663830#comment-15663830 ] Matthew Caruana Galizia commented on TIKA-1896: --- Perhaps we should push ahead with Jsoup integration instead of trying to hack Tagsoup? > Invalid closing script tag not handled gracefully by HtmlParser > --- > > Key: TIKA-1896 > URL: https://issues.apache.org/jira/browse/TIKA-1896 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.12 >Reporter: Matthew Caruana Galizia >Priority: Minor > Attachments: reports.tar.bz2, test.html > > > When an HTML file contains an invalid closing script tag, all content after > that tag is interpreted as script data and therefore ignored. > Reduced test case file attached. > To reproduce: > 1) create a file with the following HTML > {code:html} > "http://www.w3.org/TR/html4/loose.dtd;> > > >
[jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF
[ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653441#comment-15653441 ] Matthew Caruana Galizia commented on TIKA-2175: --- I've filed [an issue|https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/8] with the jpeg2000 imageio project to declare jpx support. The decoders/encoders support that format - the issue is simply that it's not declared so PDFBox doesn't find them. As a temporary workaround and proof of concept I've added these two bridge Spi classes: https://github.com/ICIJ/extract/tree/master/src/main/java/org/icij/imageio/jpx > Enable extraction of inlined jp2/jpx from PDF > - > > Key: TIKA-2175 > URL: https://issues.apache.org/jira/browse/TIKA-2175 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison > > On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were > not being OCR'd. TIKA-2174 added that file type to our tesseract parser, but > our code in the PDFParser wasn't extracting the inline images as well. > Let's fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2174) Too few formats in support declared by TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653430#comment-15653430 ] Matthew Caruana Galizia commented on TIKA-2174: --- Thank you! I've also confirmed that Tesseract can handle image/x-portable-pixmap (PPM) files, so perhaps we could add that too? > Too few formats in support declared by TesseractOCRParser > - > > Key: TIKA-2174 > URL: https://issues.apache.org/jira/browse/TIKA-2174 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia > > A complete install of Leptonica with Tesseract will add support for formats > that are not declared by TesseractOCRParser. These include JP2, JPX and PPM. > Tesseract produces OCR output fine for JPX images as of this version: > {noformat} > $ tesseract -v > tesseract 3.04.01 >leptonica-1.73 > libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5}} > {noformat} > However, these types are not declared by getSupportTypes so no output is > produced for PDFs which contained JPX images of scanned documents, for > example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2174) Too few formats in support declared by TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651347#comment-15651347 ] Matthew Caruana Galizia commented on TIKA-2174: --- That issue went away once I added 'jp2' and 'jpx' to the list of supported types in TesseractOCRParser via a new proxy parser that declares support for these types. It seems the embedded images are then handed off to Tesseract but nothing is OCRed, although that seems to be a separate issue arising from PDFBox. > Too few formats in support declared by TesseractOCRParser > - > > Key: TIKA-2174 > URL: https://issues.apache.org/jira/browse/TIKA-2174 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia > > A complete install of Leptonica with Tesseract will add support for formats > that are not declared by TesseractOCRParser. These include JP2, JPX and PPM. > Tesseract produces OCR output fine for JPX images as of this version: > {noformat} > $ tesseract -v > tesseract 3.04.01 >leptonica-1.73 > libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5}} > {noformat} > However, these types are not declared by getSupportTypes so no output is > produced for PDFs which contained JPX images of scanned documents, for > example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2174) Too few formats in support declared by TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650892#comment-15650892 ] Matthew Caruana Galizia commented on TIKA-2174: --- Both on inline and independent files. I've renamed the issue and added PPM (image/x-portable-pixmap) to the list of formats that could be supported. > Too few formats in support declared by TesseractOCRParser > - > > Key: TIKA-2174 > URL: https://issues.apache.org/jira/browse/TIKA-2174 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia > > A complete install of Leptonica with Tesseract will add support for formats > that are not declared by TesseractOCRParser. These include JP2, JPX and PPM. > Tesseract produces OCR output fine for JPX images as of this version: > {noformat} > $ tesseract -v > tesseract 3.04.01 >leptonica-1.73 > libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5}} > {noformat} > However, these types are not declared by getSupportTypes so no output is > produced for PDFs which contained JPX images of scanned documents, for > example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2174) Too few formats in support declared by TesseractOCRParser
[ https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2174: -- Description: A complete install of Leptonica with Tesseract will add support for formats that are not declared by TesseractOCRParser. These include JP2, JPX and PPM. Tesseract produces OCR output fine for JPX images as of this version: {noformat} $ tesseract -v tesseract 3.04.01 leptonica-1.73 libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5}} {noformat} However, these types are not declared by getSupportTypes so no output is produced for PDFs which contained JPX images of scanned documents, for example. was: Tesseract produces OCR output fine for JPX images as of this version: {noformat} $ tesseract -v tesseract 3.04.01 leptonica-1.73 libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5}} {noformat} However, these types are not declared by getSupportTypes so no output is produced for PDFs which contained JPX images of scanned documents, for example. Summary: Too few formats in support declared by TesseractOCRParser (was: JP2 and JPX (JPEG 2000) support not declared by TesseractOCRParser) > Too few formats in support declared by TesseractOCRParser > - > > Key: TIKA-2174 > URL: https://issues.apache.org/jira/browse/TIKA-2174 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia > > A complete install of Leptonica with Tesseract will add support for formats > that are not declared by TesseractOCRParser. These include JP2, JPX and PPM. > Tesseract produces OCR output fine for JPX images as of this version: > {noformat} > $ tesseract -v > tesseract 3.04.01 >leptonica-1.73 > libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5}} > {noformat} > However, these types are not declared by getSupportTypes so no output is > produced for PDFs which contained JPX images of scanned documents, for > example. 
[jira] [Created] (TIKA-2174) JP2 and JPX (JPEG 2000) support not declared by TesseractOCRParser
Matthew Caruana Galizia created TIKA-2174: - Summary: JP2 and JPX (JPEG 2000) support not declared by TesseractOCRParser Key: TIKA-2174 URL: https://issues.apache.org/jira/browse/TIKA-2174 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Reporter: Matthew Caruana Galizia Tesseract produces OCR output fine for JPX images as of this version: {noformat} $ tesseract -v tesseract 3.04.01 leptonica-1.73 libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5}} {noformat} However, these types are not declared by getSupportTypes so no output is produced for PDFs which contained JPX images of scanned documents, for example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
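Why an undeclared type produces no OCR output can be illustrated with a small stand-in for the supported-types check. Everything here is an assumption made for illustration: the DeclaredTypesSketch class, the baseline set of types and the use of plain strings instead of Tika's media type objects. Only the three added type names come from the discussion (JP2, JPX and PPM).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a parser's supported-types declaration.
final class DeclaredTypesSketch {
    // Baseline set, assumed for illustration only.
    static final Set<String> DECLARED = new HashSet<>(Arrays.asList(
            "image/png", "image/jpeg", "image/tiff"));

    // The same set extended with the formats named in the issue.
    static final Set<String> EXTENDED = new HashSet<>(Arrays.asList(
            "image/png", "image/jpeg", "image/tiff",
            "image/jp2", "image/jpx", "image/x-portable-pixmap"));

    // A file is only handed to the OCR parser if its type is declared,
    // regardless of what the underlying tool could actually read.
    static boolean wouldBeOcred(Set<String> declared, String mediaType) {
        return declared.contains(mediaType);
    }

    public static void main(String[] args) {
        System.out.println(wouldBeOcred(DECLARED, "image/jpx")); // prints "false"
        System.out.println(wouldBeOcred(EXTENDED, "image/jpx")); // prints "true"
    }
}
```

The point of the sketch is that the declaration, not the capability of the installed Tesseract and Leptonica build, decides whether an image ever reaches OCR.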
[jira] [Commented] (TIKA-2167) Image processing causes OCR to fail
[ https://issues.apache.org/jira/browse/TIKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644455#comment-15644455 ] Matthew Caruana Galizia commented on TIKA-2167: --- [~talli...@mitre.org] to replicate the issue: 1) build tika-app from master 2) java -jar target/tika-app-1.15-SNAPSHOT.jar 3) drag simple.tiff onto the window 4) select View > Plain text Result: the only output is a series of newlines. > Image processing causes OCR to fail > --- > > Key: TIKA-2167 > URL: https://issues.apache.org/jira/browse/TIKA-2167 > Project: Tika > Issue Type: Bug > Components: ocr >Affects Versions: 1.14 > Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; > ImageMagick 6.9.6-2 >Reporter: Matthew Caruana Galizia >Priority: Critical > Labels: convert, image, ocr, tiff > Attachments: simple.tiff > > > Image processing before OCR is enabled by default in the OCR configuration > properties file. Unless this is disabled, running Tika on a simple TIFF image > (attached) with two clear words fails. When image processing is disabled, it > succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2167) Image processing causes OCR to fail
[ https://issues.apache.org/jira/browse/TIKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-2167: -- Attachment: simple.tiff > Image processing causes OCR to fail > --- > > Key: TIKA-2167 > URL: https://issues.apache.org/jira/browse/TIKA-2167 > Project: Tika > Issue Type: Bug > Components: ocr >Affects Versions: 1.14 > Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; > ImageMagick 6.9.6-2 >Reporter: Matthew Caruana Galizia >Priority: Critical > Labels: convert, image, ocr, tiff > Attachments: simple.tiff > > > Image processing before OCR is enabled by default in the OCR configuration > properties file. Unless this is disabled, running Tika on a simple TIFF image > (attached) with two clear words fails. When image processing is disabled, it > succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2167) Image processing causes OCR to fail
Matthew Caruana Galizia created TIKA-2167: - Summary: Image processing causes OCR to fail Key: TIKA-2167 URL: https://issues.apache.org/jira/browse/TIKA-2167 Project: Tika Issue Type: Bug Components: ocr Affects Versions: 1.14 Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; ImageMagick 6.9.6-2 Reporter: Matthew Caruana Galizia Priority: Critical Image processing before OCR is enabled by default in the OCR configuration properties file. Unless this is disabled, running Tika on a simple TIFF image (attached) with two clear words fails. When image processing is disabled, it succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser
[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638978#comment-15638978 ] Matthew Caruana Galizia commented on TIKA-1896: --- [~talli...@mitre.org] did you ever run your workaround against the corpus? Although this might not seem a major bug at face value, it's preventing us from extracting text from hundreds of thousands of HTML files without a lot of manual manipulation of the files first. > Invalid closing script tag not handled gracefully by HtmlParser > --- > > Key: TIKA-1896 > URL: https://issues.apache.org/jira/browse/TIKA-1896 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.12 >Reporter: Matthew Caruana Galizia >Priority: Minor > Attachments: test.html > > > When an HTML file contains an invalid closing script tag, all content after > that tag is interpreted as script data and therefore ignored. > Reduced test case file attached. > To reproduce: > 1) create a file with the following HTML > {code:html} > "http://www.w3.org/TR/html4/loose.dtd;> > > >
[jira] [Comment Edited] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser
[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184969#comment-15184969 ] Matthew Caruana Galizia edited comment on TIKA-1896 at 3/8/16 2:33 PM: --- TagSoup handles this well in HTML mode: {{java -jar tagsoup-1.2.1.jar --method=html test.html}} outputs a closing tag correctly. Without HTML mode, all HTML after the invalid closing tag is interpreted as part of the script CDATA section. was (Author: mcaruanagalizia): TagSoup handles this well in HTML mode: {{java -jar tagsoup-1.2.1.jar --method=html ~/Downloads/test.html}} outputs a closing tag correctly. Without HTML mode, all HTML after the invalid closing tag is interpreted as part of the script CDATA section. > Invalid closing script tag not handled gracefully by HtmlParser > --- > > Key: TIKA-1896 > URL: https://issues.apache.org/jira/browse/TIKA-1896 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.12 >Reporter: Matthew Caruana Galizia > Attachments: test.html > > > When an HTML file contains an invalid closing script tag, all content after > that tag is interpreted as script data and therefore ignored. > Reduced test case file attached. > To reproduce: > 1) create a file with the following HTML > {code:html} > "http://www.w3.org/TR/html4/loose.dtd;> > > >
[jira] [Updated] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser
[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Caruana Galizia updated TIKA-1896: -- Attachment: test.html > Invalid closing script tag not handled gracefully by HtmlParser > --- > > Key: TIKA-1896 > URL: https://issues.apache.org/jira/browse/TIKA-1896 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.12 >Reporter: Matthew Caruana Galizia > Attachments: test.html > > > When an HTML file contains an invalid closing script tag, all content after > that tag is interpreted as script data and therefore ignored. > Reduced test case file attached. > To reproduce: > 1) create a file with the following HTML > {code:html} > "http://www.w3.org/TR/html4/loose.dtd;> > > >