[jira] [Comment Edited] (TIKA-2473) PCX and DCX image support

2017-10-06 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194435#comment-16194435
 ] 

Matthew Caruana Galizia edited comment on TIKA-2473 at 10/6/17 10:42 AM:
-

Magic for PCX:

byte 0: 0x0A
byte 1: one of 0x00, 0x02, 0x03, 0x04 or 0x05

MIME type: image/vnd.zbrush.pcx

via: https://www.iana.org/assignments/media-types/image/vnd.zbrush.pcx
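
A minimal sketch of that two-byte check, in case it helps when writing the magic definition. The class and method names are hypothetical and this is not existing Tika code; it only tests the two bytes listed above.

```java
final class PcxMagic {
    private PcxMagic() {}

    /** Returns true if the first two bytes match the PCX magic listed above. */
    static boolean looksLikePcx(byte[] header) {
        if (header == null || header.length < 2) {
            return false;
        }
        // Byte 0 must be 0x0A.
        if ((header[0] & 0xFF) != 0x0A) {
            return false;
        }
        // Byte 1 must be 0x00 or in the range 0x02..0x05 (0x01 is not listed).
        int second = header[1] & 0xFF;
        return second == 0x00 || (second >= 0x02 && second <= 0x05);
    }
}
```

Two bytes make for a weak signature, so a real detector would probably want to combine this with the file extension or further header checks.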


was (Author: mcaruanagalizia):
Magic:

byte 0: 0x0A
byte 1: one of 0x00, 0x02, 0x03, 0x04 or 0x05

MIME type: image/vnd.zbrush.pcx

via: https://www.iana.org/assignments/media-types/image/vnd.zbrush.pcx

> PCX and DCX image support
> -
>
> Key: TIKA-2473
> URL: https://issues.apache.org/jira/browse/TIKA-2473
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>
> It's straightforward in theory to implement support for PCX and DCX. There's 
> support for it in Commons Imaging as well as in ImageIO via TwelveMonkeys.
> In practice, however, I'm not really sure how to implement support. We obviously 
> want to OCR the images, but Tesseract has no support for the format. So where 
> do we do the conversion to a BufferedImage? I tried to look for what is done 
> to handle JBIG2 files but I can't find that anywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2473) PCX and DCX image support

2017-10-06 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194435#comment-16194435
 ] 

Matthew Caruana Galizia commented on TIKA-2473:
---

Magic:

byte 0: 0x0A
byte 1: one of 0x00, 0x02, 0x03, 0x04 or 0x05

MIME type: image/vnd.zbrush.pcx

via: https://www.iana.org/assignments/media-types/image/vnd.zbrush.pcx

> PCX and DCX image support
> -
>
> Key: TIKA-2473
> URL: https://issues.apache.org/jira/browse/TIKA-2473
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>
> It's straightforward in theory to implement support for PCX and DCX. There's 
> support for it in Commons Imaging as well as in ImageIO via TwelveMonkeys.
> In practice, however, I'm not really sure how to implement support. We obviously 
> want to OCR the images, but Tesseract has no support for the format. So where 
> do we do the conversion to a BufferedImage? I tried to look for what is done 
> to handle JBIG2 files but I can't find that anywhere.





[jira] [Created] (TIKA-2473) PCX and DCX image support

2017-10-06 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2473:
-

 Summary: PCX and DCX image support
 Key: TIKA-2473
 URL: https://issues.apache.org/jira/browse/TIKA-2473
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia


It's straightforward in theory to implement support for PCX and DCX. There's 
support for it in Commons Imaging as well as in ImageIO via TwelveMonkeys.

In practice, however, I'm not really sure how to implement support. We obviously 
want to OCR the images, but Tesseract has no support for the format. So where 
do we do the conversion to a BufferedImage? I tried to look for what is done to 
handle JBIG2 files but I can't find that anywhere.
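
To make the question concrete, here is a rough sketch of the conversion step I have in mind. It assumes an ImageIO plugin that can read PCX (for example TwelveMonkeys) is on the classpath. The class name PngTranscoder is hypothetical, and where this would live in Tika is exactly the open question.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class PngTranscoder {
    private PngTranscoder() {}

    /** Decodes the image with whatever ImageIO readers are registered and re-encodes it as PNG. */
    static byte[] toPng(byte[] imageBytes) {
        try {
            // ImageIO.read returns null when no registered reader claims the data.
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            if (image == null) {
                throw new IllegalArgumentException("No ImageIO reader available for this data");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // ImageIO.write returns false when no suitable writer is found.
            if (!ImageIO.write(image, "png", out)) {
                throw new IllegalStateException("No PNG writer available");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting PNG bytes could then be handed to Tesseract, which does support PNG.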





[jira] [Updated] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-09-29 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2471:
--
Attachment: mbox

Reduced test case attached. The result of parsing this file will include 
metadata keys with names like 
{{MboxParser-class=3dmsonormal>sincerely,

> Tab-prefixed message body lines in Mbox interpreted as headers
> --
>
> Key: TIKA-2471
> URL: https://issues.apache.org/jira/browse/TIKA-2471
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>  Labels: message, rfc822
> Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message 
> looking for anything that matches a header pattern, wherever it occurs in a 
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But 
> more to the point, what is the idea behind setting the headers in the 
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?





[jira] [Created] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-09-29 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2471:
-

 Summary: Tab-prefixed message body lines in Mbox interpreted as 
headers
 Key: TIKA-2471
 URL: https://issues.apache.org/jira/browse/TIKA-2471
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia


The mbox parser code is overly optimistic. It parses the entire message looking 
for anything that matches a header pattern, wherever it occurs in a line!

It looks to me like the parsing logic is in desperate need of a refactor. But 
more to the point, what is the idea behind setting the headers in the 
MboxParser if they're going to be set by the RFC822Parser in any case?

Also, out of curiosity, why does the parser force Windows-1252 as the charset?
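
As an illustration of the stricter behaviour I would expect, here is a minimal sketch that only reads headers up to the first blank line and treats space- or tab-prefixed lines as continuations of the previous header. This is not the MboxParser code. The class name is hypothetical, and repeated headers simply overwrite each other to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderBlockParser {
    private HeaderBlockParser() {}

    /** Parses only the header block: everything before the first empty line. */
    static Map<String, String> parseHeaders(List<String> messageLines) {
        Map<String, String> headers = new LinkedHashMap<>();
        String currentName = null;
        for (String line : messageLines) {
            if (line.isEmpty()) {
                break; // a blank line ends the headers; the rest is body
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded line: append to the previous header, never start a new one.
                if (currentName != null) {
                    headers.merge(currentName, line.trim(), (a, b) -> a + " " + b);
                }
                continue;
            }
            int colon = line.indexOf(':');
            String name = colon > 0 ? line.substring(0, colon) : "";
            if (!isFieldName(name)) {
                // For example the mbox "From " separator line: skip it.
                currentName = null;
                continue;
            }
            currentName = name;
            headers.put(name, line.substring(colon + 1).trim());
        }
        return headers;
    }

    /** A field name is one or more printable ASCII characters, excluding space and colon. */
    private static boolean isFieldName(String name) {
        if (name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < 33 || c > 126 || c == ':') {
                return false;
            }
        }
        return true;
    }
}
```

With this approach a tab-prefixed body line can never produce a metadata key, because the loop stops at the blank line that separates headers from body.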





[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2017-09-01 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150718#comment-16150718
 ] 

Matthew Caruana Galizia commented on TIKA-2219:
---

Thanks for getting back. Shouldn't the RFC822Parser attempt to detect the body 
encoding when none is declared?

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
> Attachments: test.txt
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);





[jira] [Updated] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2017-08-31 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2219:
--
Attachment: test.txt

This file contains 0x92 characters which should force detection to Windows-1252.
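
For reference, a minimal sketch of the kind of test that the {{haveC1Bytes}} condition quoted below implies. Bytes in the range 0x80 to 0x9F are C1 control codes in ISO-8859-1 but mostly printable characters in windows-1252 (0x92 is the right single quotation mark). The class and method names are hypothetical; this is not the detector's actual code.

```java
final class C1ByteCheck {
    private C1ByteCheck() {}

    /** Returns true if any byte falls in the C1 range 0x80..0x9F. */
    static boolean hasC1Bytes(byte[] data) {
        for (byte b : data) {
            int value = b & 0xFF;
            if (value >= 0x80 && value <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the quoted condition: C1 bytes suggest windows-1252 rather than ISO-8859-1. */
    static String latinCharsetName(byte[] data) {
        return hasC1Bytes(data) ? "windows-1252" : "ISO-8859-1";
    }
}
```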

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
> Attachments: test.txt
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);





[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2017-08-31 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149673#comment-16149673
 ] 

Matthew Caruana Galizia commented on TIKA-2219:
---

[~talli...@mitre.org] I think this issue has regressed. Please take a look at 
the attached file. It's parsed as an email but the body text is detected as 
US-ASCII instead of Windows-1252 (note the 0x92 characters).

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);





[jira] [Created] (TIKA-2455) Flag in metadata for alternative email bodies

2017-08-31 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2455:
-

 Summary: Flag in metadata for alternative email bodies
 Key: TIKA-2455
 URL: https://issues.apache.org/jira/browse/TIKA-2455
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia
Priority: Minor


When multipart RFC822 emails are being parsed, there's no way to distinguish 
between alternative versions of the body and attachments.

It would be ideal if some kind of flag were set in the metadata passed to the 
{{EmbeddedDocumentExtractor}} that indicates that the stream is an alternative.

In GUIs that present the data extracted from the email, alternative bodies can 
be distinguished from attachments and presented separately.





[jira] [Commented] (TIKA-2454) Emails extracted from PSTs detected as unexpected file types

2017-08-30 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148117#comment-16148117
 ] 

Matthew Caruana Galizia commented on TIKA-2454:
---

I don't know if the same thing can be done wholesale for mbox files. There are 
four variants of emails in mbox files: 
http://www.forensicswiki.org/wiki/MBox#MBOX_File_Variants

> Emails extracted from PSTs detected as unexpected file types
> 
>
> Key: TIKA-2454
> URL: https://issues.apache.org/jira/browse/TIKA-2454
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
> Fix For: 1.17
>
>
> This issue is severe. The Outlook PST parser extracts a string for the body 
> of every email and passes that string to the {{EmbeddedDocumentExtractor}}.
> However, no content type is set on the {{Metadata}} object passed to the 
> extractor. Therefore, if for example, the body of the email starts with the 
> string "From John Smith." (for example, when an email was forwarded), then 
> body of the email is detected as {{application/mbox}} and parsed as though it 
> were an mbox file.
> I think the immediate fix for this issue is to force the type of the email to 
> {{text/plain}} and for it to be parsed as such.





[jira] [Commented] (TIKA-2454) Emails extracted from PSTs detected as unexpected file types

2017-08-30 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147907#comment-16147907
 ] 

Matthew Caruana Galizia commented on TIKA-2454:
---

I agree with you. The fact that you can't force the type means that PST parsing 
is completely broken.

> Emails extracted from PSTs detected as unexpected file types
> 
>
> Key: TIKA-2454
> URL: https://issues.apache.org/jira/browse/TIKA-2454
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>
> This issue is severe. The Outlook PST parser extracts a string for the body 
> of every email and passes that string to the {{EmbeddedDocumentExtractor}}.
> However, no content type is set on the {{Metadata}} object passed to the 
> extractor. Therefore, if for example, the body of the email starts with the 
> string "From John Smith." (for example, when an email was forwarded), then 
> body of the email is detected as {{application/mbox}} and parsed as though it 
> were an mbox file.
> I think the immediate fix for this issue is to force the type of the email to 
> {{text/plain}} and for it to be parsed as such.





[jira] [Commented] (TIKA-2444) JP2 codestream files not parsed

2017-08-30 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147657#comment-16147657
 ] 

Matthew Caruana Galizia commented on TIKA-2444:
---

I have no idea. I'm trying to solve a similar problem with raw G4 bytestreams 
that are not contained in a TIFF container.

Do you know anyone who has experience with image parsing in Java?

> JP2 codestream files not parsed
> ---
>
> Key: TIKA-2444
> URL: https://issues.apache.org/jira/browse/TIKA-2444
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>  Labels: imageio, images, ocr
> Attachments: balloon.j2c
>
>
> We've come across some embedded files in the wild that are detected by Tika 
> as {{image/x-jp2-codestream}}. The identification is correct according to a 
> description of the format [1].
> However, no Parser implementation declares support for this format.
> It would make sense to declare support for this format in the Tesseract OCR 
> parser. However, the parser would need to contain functionality that either:
> 1) wraps the codestream in a JP2 container;
> 2) or transcodes the image to PNG.
> This is because while Tesseract supports JP2 (via Leptonica), it doesn't 
> support the raw codestream as a file.
> [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream





[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension

2017-08-30 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147632#comment-16147632
 ] 

Matthew Caruana Galizia commented on TIKA-2450:
---

Thank you, that looks like a good solution!

> OfficeParser.parse called for zero-byte file with .doc extension
> 
>
> Key: TIKA-2450
> URL: https://issues.apache.org/jira/browse/TIKA-2450
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>Priority: Minor
> Fix For: 1.17
>
>
> A zero-byte (empty) file with a .doc extension is detected as a Word Document 
> and the {{OfficeParser.parse}} method is called for this file.
> We then get a {{TikaException}}, with the cause given as an 
> {{org.apache.poi.EmptyFileException}}.
> I think it would be more useful if the file were NOT detected as a Word 
> Document, meaning that the {{AutoDetectParser}} would then fall back to 
> whatever is set as the fallback parser in the parse context.
> This is more useful because the user can then trigger some special logic for 
> handling empty files.





[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension

2017-08-30 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147347#comment-16147347
 ] 

Matthew Caruana Galizia commented on TIKA-2450:
---

When you put it that way, then I'll say yes. There is value in knowing what it 
might have been.

> OfficeParser.parse called for zero-byte file with .doc extension
> 
>
> Key: TIKA-2450
> URL: https://issues.apache.org/jira/browse/TIKA-2450
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>
> A zero-byte (empty) file with a .doc extension is detected as a Word Document 
> and the {{OfficeParser.parse}} method is called for this file.
> We then get a {{TikaException}}, with the cause given as an 
> {{org.apache.poi.EmptyFileException}}.
> I think it would be more useful if the file were NOT detected as a Word 
> Document, meaning that the {{AutoDetectParser}} would then fall back to 
> whatever is set as the fallback parser in the parse context.
> This is more useful because the user can then trigger some special logic for 
> handling empty files.





[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension

2017-08-30 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147331#comment-16147331
 ] 

Matthew Caruana Galizia commented on TIKA-2450:
---

OK, with that in mind I will agree with you.

In the same way that all the various encrypted document exceptions are 
normalised to a {{tika.exception.EncryptedDocumentException}}, I think it 
would be useful, as Tim suggests, to normalise empty file exceptions to a 
{{tika.exception.ZeroByteFileException}} (that extends {{TikaException}}).

> OfficeParser.parse called for zero-byte file with .doc extension
> 
>
> Key: TIKA-2450
> URL: https://issues.apache.org/jira/browse/TIKA-2450
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>
> A zero-byte (empty) file with a .doc extension is detected as a Word Document 
> and the {{OfficeParser.parse}} method is called for this file.
> We then get a {{TikaException}}, with the cause given as an 
> {{org.apache.poi.EmptyFileException}}.
> I think it would be more useful if the file were NOT detected as a Word 
> Document, meaning that the {{AutoDetectParser}} would then fall back to 
> whatever is set as the fallback parser in the parse context.
> This is more useful because the user can then trigger some special logic for 
> handling empty files.





[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension

2017-08-30 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147303#comment-16147303
 ] 

Matthew Caruana Galizia commented on TIKA-2450:
---

I would argue that the raison d'etre of tika-detect is not to provide 
extension-based detection, but to provide detection. A zero-byte file can never 
be a Word Document, so assuming my first statement is true then logically it 
should not be detected as a Word Document.

> OfficeParser.parse called for zero-byte file with .doc extension
> 
>
> Key: TIKA-2450
> URL: https://issues.apache.org/jira/browse/TIKA-2450
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>
> A zero-byte (empty) file with a .doc extension is detected as a Word Document 
> and the {{OfficeParser.parse}} method is called for this file.
> We then get a {{TikaException}}, with the cause given as an 
> {{org.apache.poi.EmptyFileException}}.
> I think it would be more useful if the file were NOT detected as a Word 
> Document, meaning that the {{AutoDetectParser}} would then fall back to 
> whatever is set as the fallback parser in the parse context.
> This is more useful because the user can then trigger some special logic for 
> handling empty files.





[jira] [Created] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension

2017-08-30 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2450:
-

 Summary: OfficeParser.parse called for zero-byte file with .doc 
extension
 Key: TIKA-2450
 URL: https://issues.apache.org/jira/browse/TIKA-2450
 Project: Tika
  Issue Type: Bug
  Components: detector, parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia
Priority: Minor


A zero-byte (empty) file with a .doc extension is detected as a Word Document 
and the {{OfficeParser.parse}} method is called for this file.

We then get a {{TikaException}}, with the cause given as an 
{{org.apache.poi.EmptyFileException}}.

I think it would be more useful if the file were NOT detected as a Word 
Document, meaning that the {{AutoDetectParser}} would then fall back to 
whatever is set as the fallback parser in the parse context.

This is more useful because the user can then trigger some special logic for 
handling empty files.
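
In the meantime, a caller can check for an empty stream before handing it to the parser. Below is a minimal sketch, assuming the stream supports mark/reset (wrap it in a BufferedInputStream otherwise). The class name is hypothetical and this is not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class EmptyStreamCheck {
    private EmptyStreamCheck() {}

    /** Peeks at the first byte without consuming it and reports whether the stream is empty. */
    static boolean isEmpty(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Stream must support mark/reset");
        }
        try {
            in.mark(1);
            int first = in.read();
            in.reset(); // rewind so a parser still sees the whole stream
            return first == -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```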





[jira] [Updated] (TIKA-2444) JP2 codestream files not parsed

2017-08-22 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2444:
--
Attachment: balloon.j2c

Example JP2K codestream file attached.

> JP2 codestream files not parsed
> ---
>
> Key: TIKA-2444
> URL: https://issues.apache.org/jira/browse/TIKA-2444
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>  Labels: imageio, images, ocr
> Attachments: balloon.j2c
>
>
> We've come across some embedded files in the wild that are detected by Tika 
> as {{image/x-jp2-codestream}}. The identification is correct according to a 
> description of the format [1].
> However, no Parser implementation declares support for this format.
> It would make sense to declare support for this format in the Tesseract OCR 
> parser. However, the parser would need to contain functionality that either:
> 1) wraps the codestream in a JP2 container;
> 2) or transcodes the image to PNG.
> This is because while Tesseract supports JP2 (via Leptonica), it doesn't 
> support the raw codestream as a file.
> [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream





[jira] [Created] (TIKA-2444) JP2 codestream files not parsed

2017-08-22 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2444:
-

 Summary: JP2 codestream files not parsed
 Key: TIKA-2444
 URL: https://issues.apache.org/jira/browse/TIKA-2444
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia


We've come across some embedded files in the wild that are detected by Tika as 
{{image/x-jp2-codestream}}. The identification is correct according to a 
description of the format [1].

However, no Parser implementation declares support for this format.

It would make sense to declare support for this format in the Tesseract OCR parser. 
However, the parser would need to contain functionality that either:

1) wraps the codestream in a JP2 container;
2) or transcodes the image to PNG.

This is because while Tesseract supports JP2 (via Leptonica), it doesn't 
support the raw codestream as a file.

[1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream
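
To show the distinction the parser would have to make, here is a minimal sketch that tells a raw codestream apart from a JP2 container by their leading bytes: a codestream starts with the SOC marker (FF 4F) followed by the SIZ marker (FF 51), while a JP2 file starts with the 12-byte signature box. The class and method names are hypothetical.

```java
final class Jpeg2000Magic {
    private Jpeg2000Magic() {}

    // JP2 signature box: length 12, type "jP  ", payload 0D 0A 87 0A.
    private static final byte[] JP2_SIGNATURE = {
        0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, (byte) 0x87, 0x0A
    };

    /** True if the data starts with SOC (FF 4F) immediately followed by SIZ (FF 51). */
    static boolean isRawCodestream(byte[] header) {
        return header.length >= 4
                && (header[0] & 0xFF) == 0xFF && (header[1] & 0xFF) == 0x4F
                && (header[2] & 0xFF) == 0xFF && (header[3] & 0xFF) == 0x51;
    }

    /** True if the data starts with the JP2 signature box. */
    static boolean isJp2Container(byte[] header) {
        if (header.length < JP2_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < JP2_SIGNATURE.length; i++) {
            if (header[i] != JP2_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Only data for which isRawCodestream returns true would need wrapping or transcoding before being passed to Tesseract.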





[jira] [Commented] (TIKA-2436) Support for GZIP-compressed EMF files

2017-07-29 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106136#comment-16106136
 ] 

Matthew Caruana Galizia commented on TIKA-2436:
---

To give you an example of why this is a problem in an actual use case, we are 
ingesting the text extracted from files into Solr. The way the files are 
stored in the index represents the same hierarchy that you have on disk: files 
extracted from container files are stored in the index as child documents of 
the container document.

Therefore, for an EMZ file within a DOCX file, we end up with three documents:

DOCX -> EMZ -> EMF

Whereas we expect:

DOCX -> EMZ

> Support for GZIP-compressed EMF files
> -
>
> Key: TIKA-2436
> URL: https://issues.apache.org/jira/browse/TIKA-2436
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, parser
>Affects Versions: 1.15
>Reporter: Matthew Caruana Galizia
> Attachments: image004.emz
>
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files. 
> These files should instead be detected as EMF files and the EMFParser should 
> perform decompression transparently.





[jira] [Commented] (TIKA-2436) Support for GZIP-compressed EMF files

2017-07-29 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106135#comment-16106135
 ] 

Matthew Caruana Galizia commented on TIKA-2436:
---

The difference is that the file is treated as a package or container format, 
when I don't think it should be. It is a distinct file format that happens to 
be compressed.

Instead of treating it as a container and relying on the CompressorParser to 
call the ParsingEmbeddedDocumentExtractor, the EMFParser should instead have 
native support for the compression, unwrapping the compression itself.

The same should be true for SVGZ and WMZ.

To draw a parallel, DOCX is also a compressed format, but Tika does not treat 
it as a package. It understands that the compression is an artefact of the 
format rather than an explicit container.

> Support for GZIP-compressed EMF files
> -
>
> Key: TIKA-2436
> URL: https://issues.apache.org/jira/browse/TIKA-2436
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, parser
>Affects Versions: 1.15
>Reporter: Matthew Caruana Galizia
> Attachments: image004.emz
>
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files. 
> These files should instead be detected as EMF files and the EMFParser should 
> perform decompression transparently.





[jira] [Updated] (TIKA-2436) Support for GZIP-compressed EMF files

2017-07-28 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2436:
--
Attachment: image004.emz

Example EMZ file attached. Commons Compress will yield an EMF from this (see 
COMPRESS-68).

> Support for GZIP-compressed EMF files
> -
>
> Key: TIKA-2436
> URL: https://issues.apache.org/jira/browse/TIKA-2436
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, parser
>Affects Versions: 1.15
>Reporter: Matthew Caruana Galizia
> Attachments: image004.emz
>
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files. 
> These files should instead be detected as EMF files and the EMFParser should 
> perform decompression transparently.





[jira] [Created] (TIKA-2436) Support for GZIP-compressed EMF files

2017-07-28 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2436:
-

 Summary: Support for GZIP-compressed EMF files
 Key: TIKA-2436
 URL: https://issues.apache.org/jira/browse/TIKA-2436
 Project: Tika
  Issue Type: Improvement
  Components: mime, parser
Affects Versions: 1.15
Reporter: Matthew Caruana Galizia


Tika is currently detecting EMZ (compressed EMF) files as simple gzip files. 
These files should instead be detected as EMF files and the EMFParser should 
perform decompression transparently.
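
A rough sketch of what such detection could look like: check for the gzip magic (1F 8B), decompress just the first 44 bytes, and look for the EMF header (record type 1 at offset 0 and the " EMF" signature at offset 40, per the EMF header layout as I understand it). The class name EmzSniffer is hypothetical and this is not existing Tika code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

final class EmzSniffer {
    private EmzSniffer() {}

    private static final int EMF_HEADER_PREFIX = 44;

    /** True if the data starts with the gzip magic bytes 1F 8B. */
    static boolean isGzip(byte[] data) {
        return data.length >= 2 && (data[0] & 0xFF) == 0x1F && (data[1] & 0xFF) == 0x8B;
    }

    /** True if the data is gzip-compressed and the decompressed stream starts like an EMF. */
    static boolean looksLikeEmz(byte[] data) {
        if (!isGzip(data)) {
            return false;
        }
        byte[] header = new byte[EMF_HEADER_PREFIX];
        int total = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            // Read only as much as is needed to see the EMF signature.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        } catch (IOException e) {
            return false; // corrupt or truncated gzip data
        }
        return total == header.length
                && header[0] == 1 && header[1] == 0 && header[2] == 0 && header[3] == 0
                && header[40] == ' ' && header[41] == 'E' && header[42] == 'M' && header[43] == 'F';
    }
}
```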





[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-879:
-
Attachment: mbox_email_section.txt

As described in TIKA-2042, the attached file [^mbox_email_section.txt] contains 
a section of an MBOX file, itself containing a message stream which is detected 
as text/html instead of message/rfc822, even though the correct mimetype is set 
on the Metadata object by the MBOXParser.

> Detection problem: message/rfc822 file is detected as text/plain.
> -
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>Reporter: Konstantin Gribov
>  Labels: new-parser
> Attachments: mbox_email_section.txt, mime_diffs_A_to_B.html, 
> TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the MIME type detected for {{.eml}} files is different (you 
> can test it on {{testRFC822}} and {{testRFC822_base64}} in 
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really works for 
> such files, even if you set {{CONTENT_TYPE}} in the metadata or some {{.eml}} 
> file name in {{RESOURCE_NAME_KEY}}.
> I found that {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.
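
The behaviour described above can be modelled with a toy registry. This is a sketch, not Tika's MediaTypeRegistry: HintModel, addSuperType and applyHint are hypothetical names, and it assumes that a hint (such as a declared content type or a file name) is accepted only when it is a specialization of the type found by magic.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of type-hint handling; an illustration, not Tika's implementation. */
final class HintModel {
    // Maps each registered type to its direct supertype.
    private final Map<String, String> superTypes = new HashMap<>();

    void addSuperType(String type, String superType) {
        superTypes.put(type, superType);
    }

    /** True if ancestor is reachable from type by following supertype links. */
    boolean isSpecializationOf(String type, String ancestor) {
        for (String t = superTypes.get(type); t != null; t = superTypes.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** Accepts the hint only when it narrows the type found by magic. */
    String applyHint(String detectedByMagic, String hint) {
        return isSpecializationOf(hint, detectedByMagic) ? hint : detectedByMagic;
    }
}
```

In this model, a message/rfc822 hint is ignored for a file whose magic says text/plain until message/rfc822 is registered as a subtype of text/plain.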





[jira] [Comment Edited] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085847#comment-16085847
 ] 

Matthew Caruana Galizia edited comment on TIKA-2042 at 7/13/17 3:13 PM:


I've attached a sample of one of the message sections from the MBOX. Detected 
as text/html instead of message/rfc822.


was (Author: mcaruanagalizia):
Sample of one of the message sections from the MBOX. Detected as text/html 
instead of message/rfc822.

> MBOX file detected wrongly as text/html
> ---
>
> Key: TIKA-2042
> URL: https://issues.apache.org/jira/browse/TIKA-2042
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the 
> time of this writing
>Reporter: Vjeran Marcinko
> Fix For: 1.14
>
> Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt
>
>
> An MBOX file doesn't get recognized via the "magic detection" mechanism as 
> "application/mbox", but wrongly as "text/html".
> The workaround for this in Tika 1.13 is to place the following in 
> custom-mimetypes.xml, as suggested on the mailing list (priority has to be larger 
> than message/rfc822):
> 
> 
> 
> 
> 
> 
> Sample MBOX file is attached.





[jira] [Updated] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2042:
--
Attachment: mbox_email_section.txt

Sample of one of the message sections from the MBOX. Detected as text/html 
instead of message/rfc822.

> MBOX file detected wrongly as text/html
> ---
>
> Key: TIKA-2042
> URL: https://issues.apache.org/jira/browse/TIKA-2042
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the 
> time of this writing
>Reporter: Vjeran Marcinko
> Fix For: 1.14
>
> Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt
>
>
> An MBOX file doesn't get recognized via the "magic detection" mechanism as 
> "application/mbox", but wrongly as "text/html".
> The workaround for this in Tika 1.13 is to place the following in 
> custom-mimetypes.xml, as suggested on the mailing list (priority has to be larger 
> than message/rfc822):
> 
> 
> 
> 
> 
> 
> Sample MBOX file is attached.





[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085842#comment-16085842
 ] 

Matthew Caruana Galizia commented on TIKA-2042:
---

[~gagravarr] thank you - that fixes the detection of at least one of the MBOX 
files. Now the problem is that when the email streams get passed to the 
delegate parser by the ParsingEmbeddedDocumentExtractor implementation, they're 
detected as text/html instead of message/rfc822.

> MBOX file detected wrongly as text/html
> ---
>
> Key: TIKA-2042
> URL: https://issues.apache.org/jira/browse/TIKA-2042
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the 
> time of this writing
>Reporter: Vjeran Marcinko
> Fix For: 1.14
>
> Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt
>
>
> An MBOX file doesn't get recognized via the "magic detection" mechanism as 
> "application/mbox", but wrongly as "text/html".
> The workaround for this in Tika 1.13 is to place the following in 
> custom-mimetypes.xml, as suggested on the mailing list (priority has to be larger 
> than message/rfc822):
> 
> 
> 
> 
> 
> 
> Sample MBOX file is attached.





[jira] [Comment Edited] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085709#comment-16085709
 ] 

Matthew Caruana Galizia edited comment on TIKA-2042 at 7/13/17 2:22 PM:


I'd like to ask for this issue to be reopened. Around half the MBOX files in 
our corpus are being detected as text/html. My guess is that there are two 
reasons for this:

1) the files have no extension - the filenames are literally "mbox" rather than 
"*.mbox" (I think this is the way they're generated or used to be generated on 
Macs - they're in an *.mbox container directory, but the meat is within an mbox 
file contained within that directory);

2) the headers don't fall within the 256 byte offset specified by the matcher 
in the mimetypes XML file.


was (Author: mcaruanagalizia):
I'd like to ask for this issue to be reopened. Around half the MBOX files in 
our corpus are being detected as text/html. My guess is that there are two 
reasons for this:

1) the files have no extension - the filenames are literally "mbox" rather than 
"*.mbox" (I think this is the way they're generate or used to be generate on 
Macs - they're in a *.mbox container directory, but the meat is within an mbox 
file contained within that directory);

2) the headers don't fall within the 256 byte offset specified by the matcher 
in the mimetypes XML file.

> MBOX file detected wrongly as text/html
> ---
>
> Key: TIKA-2042
> URL: https://issues.apache.org/jira/browse/TIKA-2042
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the 
> time of this writing
>Reporter: Vjeran Marcinko
> Fix For: 1.14
>
> Attachments: clojure.mbox, mbox_header.txt
>
>
> An MBOX file doesn't get recognized via the "magic detection" mechanism as 
> "application/mbox", but wrongly as "text/html".
> The workaround for this in Tika 1.13 is to place the following in 
> custom-mimetypes.xml, as suggested on the mailing list (priority has to be larger 
> than message/rfc822):
> 
> 
> 
> 
> 
> 
> Sample MBOX file is attached.





[jira] [Updated] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2042:
--
Attachment: mbox_header.txt

Header attached with identifying information stripped out. This file is 
detected as text/html instead of application/mbox.

> MBOX file detected wrongly as text/html
> ---
>
> Key: TIKA-2042
> URL: https://issues.apache.org/jira/browse/TIKA-2042
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the 
> time of this writing
>Reporter: Vjeran Marcinko
> Fix For: 1.14
>
> Attachments: clojure.mbox, mbox_header.txt
>
>
> An MBOX file doesn't get recognized via the "magic detection" mechanism as 
> "application/mbox", but wrongly as "text/html".
> The workaround for this in Tika 1.13 is to place the following in 
> custom-mimetypes.xml, as suggested on the mailing list (priority has to be larger 
> than message/rfc822):
> 
> 
> 
> 
> 
> 
> Sample MBOX file is attached.





[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085709#comment-16085709
 ] 

Matthew Caruana Galizia commented on TIKA-2042:
---

I'd like to ask for this issue to be reopened. Around half the MBOX files in 
our corpus are being detected as text/html. My guess is that there are two 
reasons for this:

1) the files have no extension - the filenames are literally "mbox" rather than 
"*.mbox" (I think this is the way they're generated or used to be generated on 
Macs - they're in an *.mbox container directory, but the meat is within an mbox 
file contained within that directory);

2) the headers don't fall within the 256 byte offset specified by the matcher 
in the mimetypes XML file.
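
The second guess can be illustrated with a small sketch of windowed matching. WindowMatcher is a hypothetical helper, not Tika's matcher, and the Return-Path header and the 32 to 256 window used in the note below are assumptions for illustration; only the 256-byte limit comes from the comment above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of windowed magic matching; not Tika's matcher. */
final class WindowMatcher {
    /**
     * Returns true if pattern starts at some offset between from and to
     * (both inclusive) within data.
     */
    static boolean matchesWithin(byte[] data, String pattern, int from, int to) {
        byte[] p = pattern.getBytes(StandardCharsets.US_ASCII);
        // The pattern must fit entirely inside the data.
        int last = Math.min(to, data.length - p.length);
        for (int start = Math.max(0, from); start <= last; start++) {
            int i = 0;
            while (i < p.length && data[start + i] == p[i]) {
                i++;
            }
            if (i == p.length) {
                return true;
            }
        }
        return false;
    }
}
```

If long leading headers push "\nReturn-Path:" out to offset 339, for example, a call with the window 32 to 256 finds nothing, while a wider window such as 32 to 1024 matches. A file like that would fall through to whatever other magic happens to match.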

> MBOX file detected wrongly as text/html
> ---
>
> Key: TIKA-2042
> URL: https://issues.apache.org/jira/browse/TIKA-2042
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the 
> time of this writing
>Reporter: Vjeran Marcinko
> Fix For: 1.14
>
> Attachments: clojure.mbox
>
>
> An MBOX file doesn't get recognized via the "magic detection" mechanism as 
> "application/mbox", but wrongly as "text/html".
> The workaround for this in Tika 1.13 is to place the following in 
> custom-mimetypes.xml, as suggested on the mailing list (priority has to be larger 
> than message/rfc822):
> 
> 
> 
> 
> 
> 
> Sample MBOX file is attached.





[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-07-07 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078012#comment-16078012
 ] 

Matthew Caruana Galizia commented on TIKA-2399:
---

OK. I can't think of any other option for now. For the future, does Apache have 
a legal team that lobbies for companies to change the licenses they use?

> Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
> --
>
> Key: TIKA-2399
> URL: https://issues.apache.org/jira/browse/TIKA-2399
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Tim Allison
>
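> Users who want to extract JPEG 2000 images from PDFs for inline-image OCR 
> have to add the following dependency, which is not ASL 2.0 compatible:
> {noformat}
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> {noformat}
> However, this creates a conflict with GRIB's jj2000:
> {noformat}
> <dependency>
>   <groupId>edu.ucar</groupId>
>   <artifactId>jj2000</artifactId>
>   <version>5.2</version>
> </dependency>
> {noformat}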
> [~mcaruanagalizia] (I'm guessing?) identified this conflict 
> [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by 
> upgrading jj2000 to 5.3.  However, that doesn't exist in maven central, but 
> only in [Boundless|http://example.com].
> What do we do?
> # We could exclude the jj2000 dependency from GRIB, and that functionality 
> won't work for GRIB folks
> # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the 
> classpath to instruct users to exclude jj2000.
> # Other options?
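
Option 2 could be sketched as a startup check along these lines. ClasspathCheck and jj2000Warning are hypothetical names, and the reader class name passed in is the caller's assumption; a name such as com.github.jaiimageio.jpeg2000.impl.J2KImageReader should be verified against the jai-imageio-jpeg2000 jar before use.

```java
/** Hypothetical startup check; class and method names here are illustrative. */
final class ClasspathCheck {
    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns a warning if the given reader class is present, or null if it is not. */
    static String jj2000Warning(String readerClassName) {
        if (!isOnClasspath(readerClassName)) {
            return null;
        }
        return readerClassName + " found on the classpath: consider excluding the"
                + " edu.ucar:jj2000 dependency to avoid a version conflict.";
    }
}
```

Loading the class with initialization disabled keeps the check cheap and avoids triggering static initializers in the image reader.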





[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-07-06 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076092#comment-16076092
 ] 

Matthew Caruana Galizia commented on TIKA-2399:
---

Their response:

bq. I wouldn't mind if you forked the repo and published your own artifact, but 
please use a different group ID. 
bq. However, I am not a lawyer, and I don't know what, if any, legal 
ramifications there are for doing so. We (Unidata) do not own JJ2000. 
Furthermore, its license is uncertain. The Google Code page (1) claims "GNU 
Lesser GPL", but see also this issue (2).
bq. (1) https://code.google.com/archive/p/jj2000/
bq. (2) https://code.google.com/archive/p/jj2000/issues/3

> Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
> --
>
> Key: TIKA-2399
> URL: https://issues.apache.org/jira/browse/TIKA-2399
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Tim Allison
>
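> Users who want to extract JPEG 2000 images from PDFs for inline-image OCR 
> have to add the following dependency, which is not ASL 2.0 compatible:
> {noformat}
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> {noformat}
> However, this creates a conflict with GRIB's jj2000:
> {noformat}
> <dependency>
>   <groupId>edu.ucar</groupId>
>   <artifactId>jj2000</artifactId>
>   <version>5.2</version>
> </dependency>
> {noformat}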
> [~mcaruanagalizia] (I'm guessing?) identified this conflict 
> [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by 
> upgrading jj2000 to 5.3.  However, that doesn't exist in maven central, but 
> only in [Boundless|http://example.com].
> What do we do?
> # We could exclude the jj2000 dependency from GRIB, and that functionality 
> won't work for GRIB folks
> # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the 
> classpath to instruct users to exclude jj2000.
> # Other options?





[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-07-05 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074918#comment-16074918
 ] 

Matthew Caruana Galizia commented on TIKA-2399:
---

I've emailed Unidata to ask about publishing with your key.

> Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
> --
>
> Key: TIKA-2399
> URL: https://issues.apache.org/jira/browse/TIKA-2399
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Tim Allison
>
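> Users who want to extract JPEG 2000 images from PDFs for inline-image OCR 
> have to add the following dependency, which is not ASL 2.0 compatible:
> {noformat}
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> {noformat}
> However, this creates a conflict with GRIB's jj2000:
> {noformat}
> <dependency>
>   <groupId>edu.ucar</groupId>
>   <artifactId>jj2000</artifactId>
>   <version>5.2</version>
> </dependency>
> {noformat}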
> [~mcaruanagalizia] (I'm guessing?) identified this conflict 
> [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by 
> upgrading jj2000 to 5.3.  However, that doesn't exist in maven central, but 
> only in [Boundless|http://example.com].
> What do we do?
> # We could exclude the jj2000 dependency from GRIB, and that functionality 
> won't work for GRIB folks
> # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the 
> classpath to instruct users to exclude jj2000.
> # Other options?





[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-07-05 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074881#comment-16074881
 ] 

Matthew Caruana Galizia commented on TIKA-2399:
---

Tim, see https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/8.

Once jj2000 is in central then it will also be used by jai-imageio-jpeg2000 to 
get images from PDFs for OCR under the hood.

> Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
> --
>
> Key: TIKA-2399
> URL: https://issues.apache.org/jira/browse/TIKA-2399
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Tim Allison
>
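> Users who want to extract JPEG 2000 images from PDFs for inline-image OCR 
> have to add the following dependency, which is not ASL 2.0 compatible:
> {noformat}
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> {noformat}
> However, this creates a conflict with GRIB's jj2000:
> {noformat}
> <dependency>
>   <groupId>edu.ucar</groupId>
>   <artifactId>jj2000</artifactId>
>   <version>5.2</version>
> </dependency>
> {noformat}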
> [~mcaruanagalizia] (I'm guessing?) identified this conflict 
> [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by 
> upgrading jj2000 to 5.3.  However, that doesn't exist in maven central, but 
> only in [Boundless|http://example.com].
> What do we do?
> # We could exclude the jj2000 dependency from GRIB, and that functionality 
> won't work for GRIB folks
> # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the 
> classpath to instruct users to exclude jj2000.
> # Other options?





[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-07-04 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073660#comment-16073660
 ] 

Matthew Caruana Galizia commented on TIKA-2399:
---

Wouldn't it be better to warn? (Option 2 in your description.)

> Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
> --
>
> Key: TIKA-2399
> URL: https://issues.apache.org/jira/browse/TIKA-2399
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Tim Allison
>
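> Users who want to extract JPEG 2000 images from PDFs for inline-image OCR 
> have to add the following dependency, which is not ASL 2.0 compatible:
> {noformat}
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> {noformat}
> However, this creates a conflict with GRIB's jj2000:
> {noformat}
> <dependency>
>   <groupId>edu.ucar</groupId>
>   <artifactId>jj2000</artifactId>
>   <version>5.2</version>
> </dependency>
> {noformat}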
> [~mcaruanagalizia] (I'm guessing?) identified this conflict 
> [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by 
> upgrading jj2000 to 5.3.  However, that doesn't exist in maven central, but 
> only in [Boundless|http://example.com].
> What do we do?
> # We could exclude the jj2000 dependency from GRIB, and that functionality 
> won't work for GRIB folks
> # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the 
> classpath to instruct users to exclude jj2000.
> # Other options?





[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-06-20 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055541#comment-16055541
 ] 

Matthew Caruana Galizia commented on TIKA-2399:
---

I had emailed Unidata in February about publishing to Central and got this 
reply:

bq. Publishing to Maven Central is on our TODO list, not only for JJ2000, but 
for all of our Java products. However, the process is tedious, especially since 
some of our products are built using Gradle. So, we've been reluctant to tackle 
this task. We intend to do it some time this year, but I can't be any more 
specific than that.

> Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000
> --
>
> Key: TIKA-2399
> URL: https://issues.apache.org/jira/browse/TIKA-2399
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Tim Allison
>
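> Users who want to extract JPEG 2000 images from PDFs for inline-image OCR 
> have to add the following dependency, which is not ASL 2.0 compatible:
> {noformat}
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> {noformat}
> However, this creates a conflict with GRIB's jj2000:
> {noformat}
> <dependency>
>   <groupId>edu.ucar</groupId>
>   <artifactId>jj2000</artifactId>
>   <version>5.2</version>
> </dependency>
> {noformat}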
> [~mcaruanagalizia] (I'm guessing?) identified this conflict 
> [here|https://github.com/ICIJ/extract/blob/master/pom.xml] and fixes it by 
> upgrading jj2000 to 5.3.  However, that doesn't exist in maven central, but 
> only in [Boundless|http://example.com].
> What do we do?
> # We could exclude the jj2000 dependency from GRIB, and that functionality 
> won't work for GRIB folks
> # We could add a warning if we see {{jai-imageio-jpeg2000}} is on the 
> classpath to instruct users to exclude jj2000.
> # Other options?





[jira] [Commented] (TIKA-2394) Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft

2017-06-15 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050226#comment-16050226
 ] 

Matthew Caruana Galizia commented on TIKA-2394:
---

I remember seeing how to override a provided jar version in another issue but I 
can't seem to find it. How do you do that?

> Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
> --
>
> Key: TIKA-2394
> URL: https://issues.apache.org/jira/browse/TIKA-2394
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: container, email, pst
>
> When parsing a PST, I get this message logged to stderr multiple times:
> Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
> Unfortunately I cannot supply the PST, as its contents are confidential.





[jira] [Comment Edited] (TIKA-2394) Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft

2017-06-15 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050226#comment-16050226
 ] 

Matthew Caruana Galizia edited comment on TIKA-2394 at 6/15/17 9:28 AM:


I remember seeing how to override a provided jar version in another issue but I 
can't seem to find it. How do you do that? (With Maven).


was (Author: mcaruanagalizia):
I remember seeing how to override a provided jar version in another issue but I 
can't seem to find it. How do you do that?

> Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
> --
>
> Key: TIKA-2394
> URL: https://issues.apache.org/jira/browse/TIKA-2394
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: container, email, pst
>
> When parsing a PST, I get this message logged to stderr multiple times:
> Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
> Unfortunately I cannot supply the PST, as its contents are confidential.





[jira] [Updated] (TIKA-2394) Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft

2017-06-15 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2394:
--
Affects Version/s: 1.15
   Labels: container email pst  (was: )
 Priority: Minor  (was: Major)
  Description: 
When parsing a PST, I get this message logged to stderr multiple times:

Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft

Unfortunately I cannot supply the PST, as its contents are confidential.
  Component/s: parser
  Summary: Unknown message type: 
IPM.Note.Rules.OofTemplate.Microsoft  (was: "Unknown message type")

> Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
> --
>
> Key: TIKA-2394
> URL: https://issues.apache.org/jira/browse/TIKA-2394
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: container, email, pst
>
> When parsing a PST, I get this message logged to stderr multiple times:
> Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft
> Unfortunately I cannot supply the PST, as its contents are confidential.





[jira] [Created] (TIKA-2394) "Unknown message type"

2017-06-15 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2394:
-

 Summary: "Unknown message type"
 Key: TIKA-2394
 URL: https://issues.apache.org/jira/browse/TIKA-2394
 Project: Tika
  Issue Type: Bug
Reporter: Matthew Caruana Galizia








[jira] [Commented] (TIKA-2389) Warn log level is pretty strong for missing JBIG2ImageReader

2017-06-09 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044297#comment-16044297
 ] 

Matthew Caruana Galizia commented on TIKA-2389:
---

Please don't move this to info.

Before seeing this warning, I didn't even know that the JBIG2 format existed. 
Without it, who knows, we might never have found the things that we found 
after adding support.

For want of a nail... the battle was lost.

> Warn log level is pretty strong for missing JBIG2ImageReader
> 
>
> Key: TIKA-2389
> URL: https://issues.apache.org/jira/browse/TIKA-2389
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 1.15
>Reporter: Thomas Mortagne
>
> Given the license of jbig2-imageio many projects (Apache or LGPL projects for 
> example) won't include it and will always end up with a warning because of it 
> while they probably don't really care that much about this image format.
> Ideally ImageParser should probably be made more extensible and the jbig2 part 
> moved to an optional module, but in the meantime is this warning really 
> necessary?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-1195) XLSB support

2017-03-16 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929105#comment-15929105
 ] 

Matthew Caruana Galizia commented on TIKA-1195:
---

[~talli...@mitre.org] d'you reckon that will be out with Tika 1.15?

> XLSB support
> 
>
> Key: TIKA-1195
> URL: https://issues.apache.org/jira/browse/TIKA-1195
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 1.4
> Environment: W2008R2
>Reporter: Frederic Ronny
>  Labels: new-parser
>
> We use Manifoldcf 1.3 and Solr 4.4 to index a shared network drive, works 
> fine for most of our Office filetypes ( docx, xlsx, ) but we also have a 
> lot of files with filetype xlsb which are not in the supported filetypes. 
> In order to keep using this solution it is essential to us that there will be 
> a solution provided in the future





[jira] [Closed] (TIKA-2280) message_from not extracted from Outlook emails

2017-03-01 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia closed TIKA-2280.
-
Resolution: Duplicate

> message_from not extracted from Outlook emails
> --
>
> Key: TIKA-2280
> URL: https://issues.apache.org/jira/browse/TIKA-2280
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: email, outlook, poi
>
> While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook 
> emails, it doesn't include the address for Outlook emails.
> For example, if the raw from field is "John Doe ", the 
> Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email 
> parser sets it to "John Doe ".
> Currently I'm getting the from address from the RAW_HEADER_FROM field for 
> Outlook emails, but it would be nice to be able to use a standard across 
> email formats.
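
The workaround mentioned above, reading the address from the raw From header, might look like the following sketch. FromHeader is a hypothetical helper, and the regular expression covers only the simple "Name <address>" form, not the full RFC 5322 grammar; any address used with it here is a placeholder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for splitting a raw From header; not Tika code. */
final class FromHeader {
    // An optional (possibly quoted) display name followed by an address in angle brackets.
    private static final Pattern NAME_ADDR =
            Pattern.compile("^\\s*\"?([^\"<]*?)\"?\\s*<([^<>\\s]+)>\\s*$");

    final String name;
    final String address;

    private FromHeader(String name, String address) {
        this.name = name;
        this.address = address;
    }

    /** Splits a raw header value into display name and address; either may be empty. */
    static FromHeader parse(String raw) {
        Matcher m = NAME_ADDR.matcher(raw);
        if (m.matches()) {
            return new FromHeader(m.group(1).trim(), m.group(2));
        }
        // No angle brackets: a value containing '@' is treated as a bare address,
        // anything else as a display name with no address.
        String value = raw.trim();
        return value.contains("@") ? new FromHeader("", value) : new FromHeader(value, "");
    }
}
```

Keeping the name and the address as separate fields would let both email parsers populate the same metadata shape, whichever one a given format provides.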





[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2017-03-01 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890800#comment-15890800
 ] 

Matthew Caruana Galizia commented on TIKA-1865:
---

Thank you, this is a big improvement.

> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
> Attachments: report.xlsx
>
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only the sender name is extracted. The address is important information 
> to extract for search engines.





[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-03-01 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890324#comment-15890324
 ] 

Matthew Caruana Galizia commented on TIKA-2235:
---

Ah, good catch. OCR'ing inline.

> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.
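
As a rough illustration of what the change means for rendering, the sketch below computes the raster size of a page side at a given DPI. RenderSize is a hypothetical name and the US Letter page size used in the note is an assumption.

```java
/** Illustrative arithmetic only: raster size of a page rendered at a given DPI. */
final class RenderSize {
    /** Pixels along one side of a page that is the given number of inches long. */
    static int pixels(double inches, int dpi) {
        return (int) Math.round(inches * dpi);
    }
}
```

A US Letter page (8.5 x 11 in) comes to 1700 x 2200 pixels at 200 DPI and 2550 x 3300 pixels at 300 DPI, which is 2.25 times as many pixels to render and OCR.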





[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-03-01 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890106#comment-15890106
 ] 

Matthew Caruana Galizia commented on TIKA-2235:
---

In the majority of cases, JPEG, JBIG2 (embedded in PDFs) and TIFF.

> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.





[jira] [Commented] (TIKA-2280) message_from not extracted from Outlook emails

2017-02-28 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889192#comment-15889192
 ] 

Matthew Caruana Galizia commented on TIKA-2280:
---

OK, so this is a duplicate then. Your proposed strategy in your penultimate 
comment makes sense to me. We have a larger corpus of about 1 million MSG 
files. I could test your fix on that and report the results.

> message_from not extracted from Outlook emails
> --
>
> Key: TIKA-2280
> URL: https://issues.apache.org/jira/browse/TIKA-2280
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: email, outlook, poi
>
> While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook 
> emails, it doesn't include the address for Outlook emails.
> For example, if the raw from field is "John Doe ", the 
> Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email 
> parser sets it to "John Doe ".
> Currently I'm getting the from address from the RAW_HEADER_FROM field for 
> Outlook emails, but it would be nice to be able to use a standard across 
> email formats.





[jira] [Updated] (TIKA-2280) message_from not extracted from Outlook emails

2017-02-28 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2280:
--
Description: 
While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook 
emails, it doesn't include the address for Outlook emails.

For example, if the raw from field is "John Doe ", the 
Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email parser 
sets it to "John Doe ".

Currently I'm getting the from address from the RAW_HEADER_FROM field for 
Outlook emails, but it would be nice to be able to use a standard across email 
formats.

  was:
While the MESSAGE_FROM metadata field is extracted for RFC emails, it isn't for 
Outlook emails. The closest thing we have for Outlook emails is the creator 
field, which only includes the name (but not the email address).

Currently I'm getting the from address from the RAW_HEADER_FROM field, but it 
would be nice to be able to use a standard across email formats.


> message_from not extracted from Outlook emails
> --
>
> Key: TIKA-2280
> URL: https://issues.apache.org/jira/browse/TIKA-2280
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: email, outlook, poi
>
> While the MESSAGE_FROM metadata field is extracted for both RFC and Outlook 
> emails, it doesn't include the address for Outlook emails.
> For example, if the raw from field is "John Doe ", the 
> Outlook email parser sets MESSAGE_FROM to "John Doe" while the RFC email 
> parser sets it to "John Doe ".
> Currently I'm getting the from address from the RAW_HEADER_FROM field for 
> Outlook emails, but it would be nice to be able to use a standard across 
> email formats.





[jira] [Created] (TIKA-2280) message_from not extracted from Outlook emails

2017-02-28 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2280:
-

 Summary: message_from not extracted from Outlook emails
 Key: TIKA-2280
 URL: https://issues.apache.org/jira/browse/TIKA-2280
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
Reporter: Matthew Caruana Galizia
Priority: Minor


While the MESSAGE_FROM metadata field is extracted for RFC emails, it isn't for 
Outlook emails. The closest thing we have for Outlook emails is the creator 
field, which only includes the name (but not the email address).

Currently I'm getting the from address from the RAW_HEADER_FROM field, but it 
would be nice to be able to use a standard across email formats.
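As an illustration of the current workaround, here is a minimal sketch of splitting a raw From value into a display name and an address. It assumes the raw value has the common "Name <address>" shape; the class name, the regular expression and the example address are hypothetical and this is not Tika's implementation, nor a full RFC 5322 parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FromHeaderSketch {
    // Optional (possibly quoted) display name followed by an address in angle brackets.
    private static final Pattern NAME_ADDR =
            Pattern.compile("^\\s*\"?([^\"<]*?)\"?\\s*<([^<>]+)>\\s*$");

    /** Returns {displayName, address}; address is null when no bracketed address is found. */
    static String[] split(String rawFrom) {
        Matcher m = NAME_ADDR.matcher(rawFrom);
        if (m.matches()) {
            return new String[] { m.group(1).trim(), m.group(2).trim() };
        }
        return new String[] { rawFrom.trim(), null };
    }

    public static void main(String[] args) {
        // Hypothetical raw header value, for illustration only.
        String[] parts = split("John Doe <john.doe@example.com>");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

A standard field across email formats would make this kind of per-format post-processing unnecessary.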





[jira] [Commented] (TIKA-2274) and metadata collision

2017-02-28 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887559#comment-15887559
 ] 

Matthew Caruana Galizia commented on TIKA-2274:
---

Thanks for checking up on this. Try Metadata.TITLE rather than 
TikaCoreProperties.TITLE.

My suggestion for a namespace is "meta", so in this example the resulting name 
would be "metatitle".
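To make the namespacing idea concrete, here is a rough sketch using a plain multi-valued map rather than Tika's Metadata class. The class and method names are hypothetical; only the "meta" prefix and the resulting "metatitle" key come from the suggestion above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MetaNamespaceSketch {
    // Prefix proposed in the comment above; everything else here is illustrative.
    static final String META_PREFIX = "meta";

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Adds a value under the given name, allowing multiple values per name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Stores a value that came from a meta tag under a prefixed name. */
    void addFromMetaTag(String name, String value) {
        add(META_PREFIX + name, value);
    }

    List<String> get(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        MetaNamespaceSketch metadata = new MetaNamespaceSketch();
        metadata.add("title", "Some title");               // from the title element
        metadata.addFromMetaTag("title", "Another title"); // from a meta tag
        System.out.println("title     = " + metadata.get("title"));
        System.out.println("metatitle = " + metadata.get("metatitle"));
    }
}
```

With the prefix applied, "title" keeps a single value and the meta tag value no longer collides with it.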

>  and  metadata collision
> --
>
> Key: TIKA-2274
> URL: https://issues.apache.org/jira/browse/TIKA-2274
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: html
>
> In several different corpuses I've found HTML files which look like the 
> following:
> {code}
>  
>   
>Some title
>
>   
>...
> 
> {code}
> This causes the "title" property in the metadata to have two values set, when 
> one would expect that this field is not multivalued.
> Perhaps some fields from  tags, like this one, should be namespaced.





[jira] [Created] (TIKA-2274) and metadata collision

2017-02-23 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2274:
-

 Summary:  and  metadata collision
 Key: TIKA-2274
 URL: https://issues.apache.org/jira/browse/TIKA-2274
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
Reporter: Matthew Caruana Galizia
Priority: Minor


In several different corpuses I've found HTML files which look like the 
following:

{code}
 
  
   Some title
   
  
   ...

{code}

This causes the "title" property in the metadata to have two values set, when 
one would expect that this field is not multivalued.

Perhaps some fields from  tags, like this one, should be namespaced.





[jira] [Commented] (TIKA-2245) Standardise logging

2017-01-19 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829856#comment-15829856
 ] 

Matthew Caruana Galizia commented on TIKA-2245:
---

So should we agree that parsers should use ONLY JUL and rid it of slf4j and 
log4j? Or should we standardise on slf4j?

> Standardise logging
> ---
>
> Key: TIKA-2245
> URL: https://issues.apache.org/jira/browse/TIKA-2245
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14, 1.15
>Reporter: Matthew Caruana Galizia
>  Labels: logging
>
> Tika parsers sometimes use Log4j's Logger, sometimes the JUL 
> (java.util.logging) Logger and sometimes SLF4j.
> It would be better to standardise on a single facade, for the sake of not 
> having to configure multiple loggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2245) Standardise logging

2017-01-19 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2245:
--
Description: 
Tika parsers sometimes use Log4j's Logger, sometimes the JUL 
(java.util.logging) Logger and sometimes SLF4j.

It would be better to standardise on a single facade, for the sake of not 
having to configure multiple loggers.

  was:
Tika parsers sometimes use Log4j's Logger and the JUL (java.util.logging) 
Logger. I will happily make a pull request to standardise on the latter, as I 
believe users shouldn't be forced to use a third-party library.

It would be better to standardise on the lowest common denominator and leave 
users free to use their own bridge, for example JUL-to-log4j or whatever they 
want.

Summary: Standardise logging  (was: Standardise on java.util.Logging)

> Standardise logging
> ---
>
> Key: TIKA-2245
> URL: https://issues.apache.org/jira/browse/TIKA-2245
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14, 1.15
>Reporter: Matthew Caruana Galizia
>  Labels: logging
>
> Tika parsers sometimes use Log4j's Logger, sometimes the JUL 
> (java.util.logging) Logger and sometimes SLF4j.
> It would be better to standardise on a single facade, for the sake of not 
> having to configure multiple loggers.





[jira] [Created] (TIKA-2245) Standardise on java.util.Logging

2017-01-19 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2245:
-

 Summary: Standardise on java.util.Logging
 Key: TIKA-2245
 URL: https://issues.apache.org/jira/browse/TIKA-2245
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.14, 1.15
Reporter: Matthew Caruana Galizia


Tika parsers sometimes use Log4j's Logger and the JUL (java.util.logging) 
Logger. I will happily make a pull request to standardise on the latter, as I 
believe users shouldn't be forced to use a third-party library.

It would be better to standardise on the lowest common denominator and leave 
users free to use their own bridge, for example JUL-to-log4j or whatever they 
want.
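For reference, a minimal sketch of what a parser-like class looks like when it logs through JUL only. The class and its parse method are hypothetical stand-ins, not Tika code; the point is that nothing outside the JDK is needed, and users could still install a bridge of their choice.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class JulParserSketch {
    // One logger per class, named after the class, using only java.util.logging.
    private static final Logger LOG = Logger.getLogger(JulParserSketch.class.getName());

    /** Hypothetical "parse" step: returns 0 and logs a warning on bad input. */
    static int parse(String input) {
        try {
            return Integer.parseInt(input.trim());
        } catch (NumberFormatException e) {
            LOG.log(Level.WARNING, "Unable to parse input, falling back to 0", e);
            return 0;
        }
    }

    static String loggerName() {
        return LOG.getName();
    }

    public static void main(String[] args) {
        System.out.println(parse("42"));
        System.out.println(parse("not a number"));
    }
}
```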





[jira] [Commented] (TIKA-2232) Add JBIG2 image parsing support

2017-01-16 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823691#comment-15823691
 ] 

Matthew Caruana Galizia commented on TIKA-2232:
---

Could we at least log a warning once when the ClassNotFoundException is thrown? 
Otherwise I feel like we're sweeping the problem under the rug.
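Something along these lines is what I have in mind for warning once. It is only a sketch: the class name, the message text and the dependency name in main are hypothetical, and it uses JUL purely to stay dependency-free.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

final class WarnOnceSketch {
    private static final Logger LOG = Logger.getLogger(WarnOnceSketch.class.getName());

    // Flips to true the first time a warning is emitted, so later misses stay quiet.
    private final AtomicBoolean warned = new AtomicBoolean(false);
    // Counts emitted warnings; only here so the behaviour can be checked.
    private final AtomicInteger warningCount = new AtomicInteger();

    /** Returns true if the class can be loaded; logs a single warning otherwise. */
    boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            if (warned.compareAndSet(false, true)) {
                LOG.warning("Optional dependency not found: " + className
                        + "; affected files will be skipped");
                warningCount.incrementAndGet();
            }
            return false;
        }
    }

    int warningCount() {
        return warningCount.get();
    }

    public static void main(String[] args) {
        WarnOnceSketch check = new WarnOnceSketch();
        // Hypothetical class name standing in for the optional image plugin.
        check.isAvailable("com.example.hypothetical.MissingPlugin");
        check.isAvailable("com.example.hypothetical.MissingPlugin");
        System.out.println("warnings logged: " + check.warningCount());
    }
}
```

The compareAndSet guard keeps the log quiet on large batches while still leaving one visible trace of the missing dependency.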

In the meantime I've asked one of the Levigo developers if they'd consider 
switching to a license which is compatible with the ASL v2.

> Add JBIG2 image parsing support
> ---
>
> Key: TIKA-2232
> URL: https://issues.apache.org/jira/browse/TIKA-2232
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.14
> Environment: Any
>Reporter: Pascal Essiembre
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> If you are interested, I would like to add support for JBIG2 image files 
> (.jb2, or .jbig2).  I have encountered them in PDFs.
> I will make a pull-request shortly.





[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818176#comment-15818176
 ] 

Matthew Caruana Galizia commented on TIKA-2235:
---

Yes, I am already! Thanks for linking me to that. It's good that that pull 
request adds metadata support for JBIG2, but would it not be better to wait for 
the PDFBox 2.0.5 release (which I'm assuming is soon) instead of adding todos?

> Use Tesseract's recommended DPI for PDF images
> --
>
> Key: TIKA-2235
> URL: https://issues.apache.org/jira/browse/TIKA-2235
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: ocr, pdf
> Fix For: 2.0, 1.15
>
>
> From the [Tesseract 
> wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.





[jira] [Created] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

2017-01-11 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2235:
-

 Summary: Use Tesseract's recommended DPI for PDF images
 Key: TIKA-2235
 URL: https://issues.apache.org/jira/browse/TIKA-2235
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.14
Reporter: Matthew Caruana Galizia
Priority: Minor


From the [Tesseract 
wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:

{quote}
Tesseract works best on images which have a DPI of at least 300 dpi
{quote}

PDFParserConfig is currently initialised with a value of 200 for ocrDPI.
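To give a sense of scale, here is a small worked example of what the two DPI values mean in pixels for a rendered page. The US Letter page size (8.5 x 11 inches) is an assumption chosen for illustration, and the class is hypothetical.

```java
final class OcrDpiSketch {
    /** Pixels along one edge of a page rendered at the given resolution. */
    static int pixels(double inches, int dpi) {
        return (int) Math.round(inches * dpi);
    }

    public static void main(String[] args) {
        double widthIn = 8.5;   // assumed US Letter width in inches
        double heightIn = 11.0; // assumed US Letter height in inches
        for (int dpi : new int[] { 200, 300 }) {
            System.out.println(dpi + " dpi: "
                    + pixels(widthIn, dpi) + " x " + pixels(heightIn, dpi) + " px");
        }
    }
}
```

Going from 200 to 300 dpi multiplies the pixel count by (300/200)^2 = 2.25, so the extra rendering and OCR cost is worth bearing in mind when changing the default.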





[jira] [Created] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException

2016-12-20 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2221:
-

 Summary: poi.EncryptedDocumentException not wrapped in 
tika.exception.EncryptedDocumentException
 Key: TIKA-2221
 URL: https://issues.apache.org/jira/browse/TIKA-2221
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.14
Reporter: Matthew Caruana Galizia
Priority: Minor
 Fix For: 1.15


When parsing an encrypted Word document, an 
org.apache.poi.EncryptedDocumentException is thrown at WordExtractor.java#151. 
Tika catches this too far up the stack and incorrectly wraps it in a plain 
TikaException instead of an org.apache.tika.exception.EncryptedDocumentException.

The fix would be to catch and wrap the exception correctly, for example:

{noformat}
try {
    document = new HWPFDocument(root);
} catch (org.apache.poi.EncryptedDocumentException e) {
    throw new EncryptedDocumentException(e);
} catch (OldWordFileFormatException e) {
    parseWord6(root, xhtml);
    return;
}
{noformat}





[jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF

2016-11-28 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15701919#comment-15701919
 ] 

Matthew Caruana Galizia commented on TIKA-2175:
---

The problem was OpenCL support in Tesseract. Once I rebuilt Tesseract without 
OpenCL support, I got the same results as you above, but using 
setExtractInlineImages(true) instead of setOcrStrategy(...). Thank you for 
testing.

> Enable extraction of inlined jp2/jpx from PDF
> -
>
> Key: TIKA-2175
> URL: https://issues.apache.org/jira/browse/TIKA-2175
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
> Attachments: pdf-with-jp2-images.pdf
>
>
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were 
> not being OCR'd.  TIKA-2174 added that file type to our tesseract parser, but 
> we our code in the PDFParser wasn't extracting the inline images as well.  
> Let's fix that. 





[jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF

2016-11-25 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696377#comment-15696377
 ] 

Matthew Caruana Galizia commented on TIKA-2175:
---

Still no joy, both with my bridge classes and with tika-app from trunk. It 
seems the images in the PDF are skipped over entirely. I don't think that the 
embedded document parsing handler is ever even invoked. I've attached the PDF 
in question. If you open it in a hex editor, you can see that the files are 
declared to be "jp2" format.

> Enable extraction of inlined jp2/jpx from PDF
> -
>
> Key: TIKA-2175
> URL: https://issues.apache.org/jira/browse/TIKA-2175
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
> Attachments: pdf-with-jp2-images.pdf
>
>
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were 
> not being OCR'd.  TIKA-2174 added that file type to our tesseract parser, but 
> we our code in the PDFParser wasn't extracting the inline images as well.  
> Let's fix that. 





[jira] [Commented] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

2016-11-14 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663830#comment-15663830
 ] 

Matthew Caruana Galizia commented on TIKA-1896:
---

Perhaps we should push ahead with Jsoup integration instead of trying to hack 
Tagsoup?

> Invalid closing script tag not handled gracefully by HtmlParser
> ---
>
> Key: TIKA-1896
> URL: https://issues.apache.org/jira/browse/TIKA-1896
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.12
>Reporter: Matthew Caruana Galizia
>Priority: Minor
> Attachments: reports.tar.bz2, test.html
>
>
> When an HTML file contains an invalid closing script tag, all content after 
> that tag is interpreted as script data and therefore ignored.
> Reduced test case file attached.
> To reproduce:
> 1) create a file with the following HTML
> {code:html}
>  "http://www.w3.org/TR/html4/loose.dtd;>
> 
>   
>   

[jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF

2016-11-10 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653441#comment-15653441
 ] 

Matthew Caruana Galizia commented on TIKA-2175:
---

I've filed [an 
issue|https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/8] with the 
jpeg2000 imageio project to declare jpx support. The decoders/encoders support 
that format; the issue is simply that it's not declared, so PDFBox doesn't find 
them.

As a temporary workaround and proof of concept I've added these two bridge Spi 
classes: 
https://github.com/ICIJ/extract/tree/master/src/main/java/org/icij/imageio/jpx
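A quick way to see whether a reader is declared for a given format name is to ask ImageIO directly. This sketch assumes the lookup is done by format name, which is my understanding of why the undeclared name matters; the class name is hypothetical and the result for "jpx" depends on which plugins are on the classpath.

```java
import javax.imageio.ImageIO;

final class ImageReaderLookupSketch {
    /** True if at least one registered ImageIO reader declares the format name. */
    static boolean hasReaderFor(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // "png" ships with the JDK; "jpx" is only found if a plugin declares that name.
        for (String name : new String[] { "png", "jpeg2000", "jpx" }) {
            System.out.println(name + " -> " + hasReaderFor(name));
        }
    }
}
```

A reader that can decode a format but does not list its name is invisible to this kind of lookup, which is the gap the bridge classes are meant to paper over.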

> Enable extraction of inlined jp2/jpx from PDF
> -
>
> Key: TIKA-2175
> URL: https://issues.apache.org/jira/browse/TIKA-2175
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were 
> not being OCR'd.  TIKA-2174 added that file type to our tesseract parser, but 
> we our code in the PDFParser wasn't extracting the inline images as well.  
> Let's fix that. 





[jira] [Commented] (TIKA-2174) Too few formats in support declared by TesseractOCRParser

2016-11-10 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653430#comment-15653430
 ] 

Matthew Caruana Galizia commented on TIKA-2174:
---

Thank you! I've also confirmed that Tesseract can handle 
image/x-portable-pixmap (PPM) files, so perhaps we could add that too?

> Too few formats in support declared by TesseractOCRParser
> -
>
> Key: TIKA-2174
> URL: https://issues.apache.org/jira/browse/TIKA-2174
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>
> A complete install of Leptonica with Tesseract will add support for formats 
> that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.
> Tesseract produces OCR output fine for JPX images as of this version:
> {noformat}
> $ tesseract -v
> tesseract 3.04.01
>  leptonica-1.73
>   libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
> {noformat}
> However, these types are not declared by getSupportedTypes, so no output is 
> produced for PDFs which contain JPX images of scanned documents, for 
> example.





[jira] [Commented] (TIKA-2174) Too few formats in support declared by TesseractOCRParser

2016-11-09 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651347#comment-15651347
 ] 

Matthew Caruana Galizia commented on TIKA-2174:
---

That issue went away once I added 'jp2' and 'jpx' to the list of supported 
types in TesseractOCRParser via a new proxy parser that declares support for 
these types. It seems the embedded images are then handed off to Tesseract but 
nothing is OCRed, although that seems to be a separate issue arising from 
PDFBox.
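The proxy approach can be sketched without any Tika classes as a wrapper that reports the delegate's types plus a few extra ones. The interface, the method names and the use of plain strings for media types are hypothetical simplifications, not the Tika Parser API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

final class SupportedTypesProxySketch {
    /** Hypothetical stand-in for a parser that declares the media types it handles. */
    interface TypeAwareParser {
        Set<String> supportedTypes();
    }

    /** Wraps a delegate so that it also declares the given extra types. */
    static TypeAwareParser withExtraTypes(TypeAwareParser delegate, String... extra) {
        Set<String> all = new LinkedHashSet<>(delegate.supportedTypes());
        Collections.addAll(all, extra);
        Set<String> frozen = Collections.unmodifiableSet(all);
        return () -> frozen;
    }

    public static void main(String[] args) {
        // Illustrative base set; the real parser declares its own types.
        TypeAwareParser base =
                () -> new LinkedHashSet<>(Arrays.asList("image/tiff", "image/png"));
        TypeAwareParser proxy = withExtraTypes(base, "image/jp2", "image/jpx");
        System.out.println(proxy.supportedTypes());
    }
}
```

In the real proxy the parse call would simply be forwarded to the wrapped parser; only the declared types change.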

> Too few formats in support declared by TesseractOCRParser
> -
>
> Key: TIKA-2174
> URL: https://issues.apache.org/jira/browse/TIKA-2174
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>
> A complete install of Leptonica with Tesseract will add support for formats 
> that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.
> Tesseract produces OCR output fine for JPX images as of this version:
> {noformat}
> $ tesseract -v
> tesseract 3.04.01
>  leptonica-1.73
>   libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
> {noformat}
> However, these types are not declared by getSupportedTypes, so no output is 
> produced for PDFs which contain JPX images of scanned documents, for 
> example.





[jira] [Commented] (TIKA-2174) Too few formats in support declared by TesseractOCRParser

2016-11-09 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650892#comment-15650892
 ] 

Matthew Caruana Galizia commented on TIKA-2174:
---

Both on inline and independent files. I've renamed the issue and added PPM 
(image/x-portable-pixmap) to the list of formats that could be supported.

> Too few formats in support declared by TesseractOCRParser
> -
>
> Key: TIKA-2174
> URL: https://issues.apache.org/jira/browse/TIKA-2174
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>
> A complete install of Leptonica with Tesseract will add support for formats 
> that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.
> Tesseract produces OCR output fine for JPX images as of this version:
> {noformat}
> $ tesseract -v
> tesseract 3.04.01
>  leptonica-1.73
>   libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
> {noformat}
> However, these types are not declared by getSupportedTypes, so no output is 
> produced for PDFs which contain JPX images of scanned documents, for 
> example.





[jira] [Updated] (TIKA-2174) Too few formats in support declared by TesseractOCRParser

2016-11-09 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2174:
--
Description: 
A complete install of Leptonica with Tesseract will add support for formats 
that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.

Tesseract produces OCR output fine for JPX images as of this version:

{noformat}
$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
{noformat}

However, these types are not declared by getSupportedTypes, so no output is 
produced for PDFs which contain JPX images of scanned documents, for example.

  was:
Tesseract produces OCR output fine for JPX images as of this version:

{noformat}
$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
{noformat}

However, these types are not declared by getSupportedTypes, so no output is 
produced for PDFs which contain JPX images of scanned documents, for example.

Summary: Too few formats in support declared by TesseractOCRParser  
(was: JP2 and JPX (JPEG 2000) support not declared by TesseractOCRParser)

> Too few formats in support declared by TesseractOCRParser
> -
>
> Key: TIKA-2174
> URL: https://issues.apache.org/jira/browse/TIKA-2174
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>
> A complete install of Leptonica with Tesseract will add support for formats 
> that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.
> Tesseract produces OCR output fine for JPX images as of this version:
> {noformat}
> $ tesseract -v
> tesseract 3.04.01
>  leptonica-1.73
>   libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
> {noformat}
> However, these types are not declared by getSupportedTypes, so no output is 
> produced for PDFs which contain JPX images of scanned documents, for 
> example.





[jira] [Created] (TIKA-2174) JP2 and JPX (JPEG 2000) support not declared by TesseractOCRParser

2016-11-09 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2174:
-

 Summary: JP2 and JPX (JPEG 2000) support not declared by 
TesseractOCRParser
 Key: TIKA-2174
 URL: https://issues.apache.org/jira/browse/TIKA-2174
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
Reporter: Matthew Caruana Galizia


Tesseract produces OCR output fine for JPX images as of this version:

{noformat}
$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
{noformat}

However, these types are not declared by getSupportedTypes, so no output is 
produced for PDFs which contain JPX images of scanned documents, for example.





[jira] [Commented] (TIKA-2167) Image processing causes OCR to fail

2016-11-07 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644455#comment-15644455
 ] 

Matthew Caruana Galizia commented on TIKA-2167:
---

[~talli...@mitre.org] to replicate the issue:

1) build tika-app from master
2) java -jar target/tika-app-1.15-SNAPSHOT.jar
3) drag simple.tiff onto the window
4) select View > Plain text

Result: the only output is a series of newlines.

> Image processing causes OCR to fail
> ---
>
> Key: TIKA-2167
> URL: https://issues.apache.org/jira/browse/TIKA-2167
> Project: Tika
>  Issue Type: Bug
>  Components: ocr
>Affects Versions: 1.14
> Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; 
> ImageMagick 6.9.6-2
>Reporter: Matthew Caruana Galizia
>Priority: Critical
>  Labels: convert, image, ocr, tiff
> Attachments: simple.tiff
>
>
> Image processing before OCR is enabled by default in the OCR configuration 
> properties file. Unless this is disabled, running Tika on a simple TIFF image 
> (attached) with two clear words fails. When image processing is disabled, it 
> succeeds.





[jira] [Updated] (TIKA-2167) Image processing causes OCR to fail

2016-11-06 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2167:
--
Attachment: simple.tiff

> Image processing causes OCR to fail
> ---
>
> Key: TIKA-2167
> URL: https://issues.apache.org/jira/browse/TIKA-2167
> Project: Tika
>  Issue Type: Bug
>  Components: ocr
>Affects Versions: 1.14
> Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; 
> ImageMagick 6.9.6-2
>Reporter: Matthew Caruana Galizia
>Priority: Critical
>  Labels: convert, image, ocr, tiff
> Attachments: simple.tiff
>
>
> Image processing before OCR is enabled by default in the OCR configuration 
> properties file. Unless this is disabled, running Tika on a simple TIFF image 
> (attached) with two clear words fails. When image processing is disabled, it 
> succeeds.





[jira] [Created] (TIKA-2167) Image processing causes OCR to fail

2016-11-06 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2167:
-

 Summary: Image processing causes OCR to fail
 Key: TIKA-2167
 URL: https://issues.apache.org/jira/browse/TIKA-2167
 Project: Tika
  Issue Type: Bug
  Components: ocr
Affects Versions: 1.14
 Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; 
ImageMagick 6.9.6-2
Reporter: Matthew Caruana Galizia
Priority: Critical


Image processing before OCR is enabled by default in the OCR configuration 
properties file. Unless this is disabled, running Tika on a simple TIFF image 
(attached) with two clear words fails. When image processing is disabled, it 
succeeds.





[jira] [Commented] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

2016-11-05 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638978#comment-15638978
 ] 

Matthew Caruana Galizia commented on TIKA-1896:
---

[~talli...@mitre.org] did you ever run your workaround against the corpus?

Although this might not seem a major bug at face value, it's preventing us from 
extracting text from hundreds of thousands of HTML files without a lot of 
manual manipulation of the files first.

> Invalid closing script tag not handled gracefully by HtmlParser
> ---
>
> Key: TIKA-1896
> URL: https://issues.apache.org/jira/browse/TIKA-1896
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.12
>Reporter: Matthew Caruana Galizia
>Priority: Minor
> Attachments: test.html
>
>
> When an HTML file contains an invalid closing script tag, all content after 
> that tag is interpreted as script data and therefore ignored.
> Reduced test case file attached.
> To reproduce:
> 1) create a file with the following HTML
> {code:html}
>  "http://www.w3.org/TR/html4/loose.dtd;>
> 
>   
>   

[jira] [Comment Edited] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

2016-03-08 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184969#comment-15184969
 ] 

Matthew Caruana Galizia edited comment on TIKA-1896 at 3/8/16 2:33 PM:
---

TagSoup handles this well in HTML mode:

{{java -jar tagsoup-1.2.1.jar --method=html test.html}}

outputs a closing tag correctly. Without HTML mode, all HTML after the invalid 
closing tag is interpreted as part of the script CDATA section.


was (Author: mcaruanagalizia):
TagSoup handles this well in HTML mode:

{{java -jar tagsoup-1.2.1.jar --method=html ~/Downloads/test.html}}

outputs a closing tag correctly. Without HTML mode, all HTML after the invalid 
closing tag is interpreted as part of the script CDATA section.

> Invalid closing script tag not handled gracefully by HtmlParser
> ---
>
> Key: TIKA-1896
> URL: https://issues.apache.org/jira/browse/TIKA-1896
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Matthew Caruana Galizia
> Attachments: test.html
>
>
> When an HTML file contains an invalid closing script tag, all content after 
> that tag is interpreted as script data and therefore ignored.
> Reduced test case file attached.
> To reproduce:
> 1) create a file with the following HTML
> {code:html}
>  "http://www.w3.org/TR/html4/loose.dtd;>
> 
>   
>   

[jira] [Updated] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

2016-03-08 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-1896:
--
Attachment: test.html

> Invalid closing script tag not handled gracefully by HtmlParser
> ---
>
> Key: TIKA-1896
> URL: https://issues.apache.org/jira/browse/TIKA-1896
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Matthew Caruana Galizia
> Attachments: test.html
>
>
> When an HTML file contains an invalid closing script tag, all content after 
> that tag is interpreted as script data and therefore ignored.
> Reduced test case file attached.
> To reproduce:
> 1) create a file with the following HTML
> {code:html}
>  "http://www.w3.org/TR/html4/loose.dtd;>
> 
>   
>