[jira] [Commented] (TIKA-2451) Detect image frame counts for tiff files

2017-08-31 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149909#comment-16149909
 ] 

Tim Allison commented on TIKA-2451:
---

Turns out we can get this info from the current version.  I'll look into 
upgrading in another issue.

Fellow devs, any preference for using {{Office.PAGE_COUNT}} or creating a new 
{{TIFF.PAGE_COUNT}} as the metadata key? 

> Detect image frame counts for tiff files
> 
>
> Key: TIKA-2451
> URL: https://issues.apache.org/jira/browse/TIKA-2451
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Mike Cantrell
>Priority: Minor
> Attachments: multipage_tiff_example.tif
>
>
> It would be useful to know the number of frames in a multi-page tiff image. 
> My apologies if this already exists but I could not locate it in any of the 
> existing metadata output. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


RE: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-08-31 Thread Tyler Bui-Palsulich
Welcome, Madhav!

Tyler

On Aug 31, 2017 1:22 PM, "Allison, Timothy B."  wrote:

> W00t!  Welcome, Madhav!
>
> -Original Message-
> From: Chris Mattmann [mailto:mattm...@apache.org]
> Sent: Thursday, August 31, 2017 3:52 PM
> To: dev@tika.apache.org
> Subject: Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC
> Member
>
> Welcome Madhav!
>
> Cheers,
> Chris
>
>
>
>
> On 8/31/17, 12:29 PM, "loo...@gmail.com on behalf of Dave Meikle" <
> loo...@gmail.com on behalf of dmei...@apache.org> wrote:
>
> Hello Everyone,
>
> Please join me in welcoming Madhav Sharan as a PMC Member and Committer to
> the project!
>
> Welcome to the team, Madhav. Feel free to say a bit about yourself and
> how you got involved in Tika.
>
> Cheers,
> Dave
>
>
>
>
>


[jira] [Commented] (TIKA-2451) Detect image frame counts for tiff files

2017-08-31 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149879#comment-16149879
 ] 

Tim Allison commented on TIKA-2451:
---

It looks like this was (fairly) recently added to drewnoakes' 
metadata-extractor: [version 
2.10.0|https://github.com/drewnoakes/metadata-extractor/releases/tag/2.10.0] 
included [support for multipage 
tiffs|https://github.com/drewnoakes/metadata-extractor/pull/228].

When I bumped the version up to the latest, I got the following for your file: 
{{Page Number : 9 10}}.

My guess from 
[this|http://www.awaresystems.be/imaging/tiff/tifftags/pagenumber.html] is that 
9 (zero-based) is the number of the last page and 10 is the total number of pages.  
Should we normalize (split on " " and take the second value)?
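
A minimal sketch of that normalization in Java, assuming the raw value arrives as a space-separated string such as "9 10". The class and method names (TiffPageNumber, totalPages) are hypothetical and are not part of Tika or metadata-extractor. Per the PageNumber tag description linked above, a total of 0 means the page count is not available, so the sketch returns an empty result in that case.

```java
import java.util.OptionalInt;

// Hypothetical helper, not an existing Tika or metadata-extractor class.
final class TiffPageNumber {

    // Parses a raw "Page Number" value of the form "<page> <total>" and
    // returns the total, or empty if the value is malformed or the total is 0.
    static OptionalInt totalPages(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        String[] parts = raw.trim().split("\\s+");
        if (parts.length != 2) {
            return OptionalInt.empty();
        }
        try {
            int total = Integer.parseInt(parts[1]);
            // A total of 0 means "not available" for the PageNumber tag.
            return total > 0 ? OptionalInt.of(total) : OptionalInt.empty();
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(totalPages("9 10"));
    }
}
```

Returning an OptionalInt rather than a sentinel value keeps "unknown" distinct from a real count, which may matter whichever metadata key ends up holding the result.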

As a side note, I confirmed that tesseract is pulling text out of all the 
pages. W00t!

{noformat}

TIFF
Example
Page 1

Multipage
TIFF
Example
Page 2

Multipage
TIFF
Example
Page 3

Multipage
TIFF
Example
Page4

Multipage
TIFF
Example
Page 5

Multipage
TIFF
Example
Page 6

Multipage
TIFF
Example
Page 7

Multipage
TIFF
Example
Page 8

Multipage
TIFF
Example
Page 9

Multipage
TIFF

Example

Page 10
{noformat}



> Detect image frame counts for tiff files
> 
>
> Key: TIKA-2451
> URL: https://issues.apache.org/jira/browse/TIKA-2451
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Mike Cantrell
>Priority: Minor
> Attachments: multipage_tiff_example.tif
>
>
> It would be useful to know the number of frames in a multi-page tiff image. 
> My apologies if this already exists but I could not locate it in any of the 
> existing metadata output. 





[jira] [Updated] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2017-08-31 Thread Matthew Caruana Galizia (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Caruana Galizia updated TIKA-2219:
--
Attachment: test.txt

This file contains 0x92 bytes, which should force detection as windows-1252.

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
> Attachments: test.txt
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);





[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2017-08-31 Thread Matthew Caruana Galizia (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149673#comment-16149673
 ] 

Matthew Caruana Galizia commented on TIKA-2219:
---

[~talli...@mitre.org] I think this issue has regressed. Please take a look at 
the attached file. It's parsed as an email but the body text is detected as 
US-ASCII instead of windows-1252 (note the 0x92 bytes).
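
For illustration only, here is a small Java sketch of why a 0x92 byte points at windows-1252. It mirrors the {{haveC1Bytes}} condition quoted in the issue description below and is not how Tika's CharsetDetector actually scores candidates. The class and method names (C1ByteCheck, hasC1Bytes, guessLatinCharset) are hypothetical. Bytes in the range 0x80 to 0x9F are control codes in ISO-8859-1 but mostly printable characters in windows-1252 (0x92 is the right single quotation mark).

```java
import java.nio.charset.Charset;

// Hypothetical illustration, not Tika code.
final class C1ByteCheck {

    // True if any byte falls in the C1 range 0x80..0x9F.
    static boolean hasC1Bytes(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    // Same two-way choice as the ternary quoted in the issue description.
    static String guessLatinCharset(byte[] data) {
        return hasC1Bytes(data) ? "windows-1252" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        byte[] sample = {'i', 't', (byte) 0x92, 's'};
        String name = guessLatinCharset(sample);
        System.out.println(name);
        System.out.println(new String(sample, Charset.forName(name)));
    }
}
```

Note that this check alone cannot tell pure ASCII input apart from ISO-8859-1; it only separates the two Latin candidates once a high byte is present.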

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);





RE: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-08-31 Thread Allison, Timothy B.
W00t!  Welcome, Madhav!

-Original Message-
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Thursday, August 31, 2017 3:52 PM
To: dev@tika.apache.org
Subject: Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

Welcome Madhav!

Cheers,
Chris




On 8/31/17, 12:29 PM, "loo...@gmail.com on behalf of Dave Meikle" 
 wrote:

Hello Everyone,

Please join me in welcoming Madhav Sharan as a PMC Member and Committer to
the project!

Welcome to the team, Madhav. Feel free to say a bit about yourself and
how you got involved in Tika.

Cheers,
Dave






Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-08-31 Thread Chris Mattmann
Welcome Madhav!

Cheers,
Chris




On 8/31/17, 12:29 PM, "loo...@gmail.com on behalf of Dave Meikle" 
 wrote:

Hello Everyone,

Please join me in welcoming Madhav Sharan as a PMC Member and Committer to
the project!

Welcome to the team, Madhav. Feel free to say a bit about yourself and
how you got involved in Tika.

Cheers,
Dave





[ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-08-31 Thread Dave Meikle
Hello Everyone,

Please join me in welcoming Madhav Sharan as a PMC Member and Committer to
the project!

Welcome to the team, Madhav. Feel free to say a bit about yourself and
how you got involved in Tika.

Cheers,
Dave


[jira] [Created] (TIKA-2457) Update MboxParser to more recent handling of embedded docs

2017-08-31 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2457:
-

 Summary: Update MboxParser to more recent handling of embedded docs
 Key: TIKA-2457
 URL: https://issues.apache.org/jira/browse/TIKA-2457
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor


Let's make the MboxParser treat embedded docs similarly to the 
OutlookPSTParser.  The RecursiveParserWrapper allows uniform access to embedded 
docs' metadata across the parsers.





[jira] [Commented] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149271#comment-16149271
 ] 

Hudson commented on TIKA-2456:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1355 (See 
[https://builds.apache.org/job/Tika-trunk/1355/])
TIKA-2456: fix detection of emails inside mbox (lfcnassif: 
[https://github.com/apache/tika/commit/560e91a176ca5ff1adfc3ff1c1f63e32ec4e928a])
* (edit) CHANGES.txt
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/single_mail.mbox


> Emails extracted from MBOX not detected as rfc822
> -
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.16
>Reporter: Luis Filipe Nassif
> Fix For: 1.17
>
> Attachments: single_mail.mbox
>
>
> Similar to TIKA-2454: because of recurrent detection issues with 
> message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may 
> not be detected as rfc822, even though they always are. The solution is to set 
> Content-Type-Override inside MboxParser. Fix being prepared...





[jira] [Comment Edited] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149250#comment-16149250
 ] 

Tim Allison edited comment on TIKA-2456 at 8/31/17 4:54 PM:


W00t!  Welcome aboard, [~lfcnassif]! :D


was (Author: talli...@mitre.org):
W00t!  Welcome aboard! :D

> Emails extracted from MBOX not detected as rfc822
> -
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.16
>Reporter: Luis Filipe Nassif
> Fix For: 1.17
>
> Attachments: single_mail.mbox
>
>
> Similar to TIKA-2454: because of recurrent detection issues with 
> message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may 
> not be detected as rfc822, even though they always are. The solution is to set 
> Content-Type-Override inside MboxParser. Fix being prepared...





[jira] [Commented] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149250#comment-16149250
 ] 

Tim Allison commented on TIKA-2456:
---

W00t!  Welcome aboard! :D

> Emails extracted from MBOX not detected as rfc822
> -
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.16
>Reporter: Luis Filipe Nassif
> Fix For: 1.17
>
> Attachments: single_mail.mbox
>
>
> Similar to TIKA-2454: because of recurrent detection issues with 
> message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may 
> not be detected as rfc822, even though they always are. The solution is to set 
> Content-Type-Override inside MboxParser. Fix being prepared...





[jira] [Resolved] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif resolved TIKA-2456.
--
Resolution: Fixed

Fixed in r560e91a

> Emails extracted from MBOX not detected as rfc822
> -
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.16
>Reporter: Luis Filipe Nassif
> Fix For: 1.17
>
> Attachments: single_mail.mbox
>
>
> Similar to TIKA-2454: because of recurrent detection issues with 
> message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may 
> not be detected as rfc822, even though they always are. The solution is to set 
> Content-Type-Override inside MboxParser. Fix being prepared...





[jira] [Issue Comment Deleted] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-2456:
-
Comment: was deleted

(was: Fixed in r560e91a)

> Emails extracted from MBOX not detected as rfc822
> -
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.16
>Reporter: Luis Filipe Nassif
> Fix For: 1.17
>
> Attachments: single_mail.mbox
>
>
> Similar to TIKA-2454: because of recurrent detection issues with 
> message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may 
> not be detected as rfc822, even though they always are. The solution is to set 
> Content-Type-Override inside MboxParser. Fix being prepared...





[jira] [Commented] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149223#comment-16149223
 ] 

Luis Filipe Nassif commented on TIKA-2456:
--

Fixed in r560e91a

> Emails extracted from MBOX not detected as rfc822
> -
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.16
>Reporter: Luis Filipe Nassif
> Fix For: 1.17
>
> Attachments: single_mail.mbox
>
>
> Similar to TIKA-2454: because of recurrent detection issues with 
> message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may 
> not be detected as rfc822, even though they always are. The solution is to set 
> Content-Type-Override inside MboxParser. Fix being prepared...





[jira] [Commented] (TIKA-2451) Detect image frame counts for tiff files

2017-08-31 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149221#comment-16149221
 ] 

Tim Allison commented on TIKA-2451:
---

Thank you.  Will take a look...on a related note: 
https://github.com/tesseract-ocr/tesseract/issues/743 :P

> Detect image frame counts for tiff files
> 
>
> Key: TIKA-2451
> URL: https://issues.apache.org/jira/browse/TIKA-2451
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Mike Cantrell
>Priority: Minor
> Attachments: multipage_tiff_example.tif
>
>
> It would be useful to know the number of frames in a multi-page tiff image. 
> My apologies if this already exists but I could not locate it in any of the 
> existing metadata output. 





[jira] [Updated] (TIKA-2451) Detect image frame counts for tiff files

2017-08-31 Thread Mike Cantrell (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Cantrell updated TIKA-2451:

Attachment: multipage_tiff_example.tif

No problem. I'm attaching an example file. We're currently using the 
[TwelveMonkeys ImageIO TIFF plugin|https://github.com/haraldk/TwelveMonkeys] and 
{{ImageReader.getNumImages(true)}} to count the frames. I'm assuming that the EXIF 
metadata should hold the clue to the number of images though.
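
A rough sketch of that frame-counting approach using only the standard javax.imageio API. It assumes a TIFF-capable ImageReader is registered, either via the TwelveMonkeys plugin on the classpath or the TIFF plugin bundled with Java 9 and later. The class and method names (FrameCounter, countFrames) are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper, not part of Tika or TwelveMonkeys.
final class FrameCounter {

    // Returns the number of images (frames/pages) in the given file.
    static int countFrames(File file) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) {
                throw new IOException("Cannot open " + file);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader found for " + file);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // allowSearch=true lets the reader scan the stream to count images.
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countFrames(new File(args[0])));
    }
}
```

Passing true to getNumImages may require reading through the whole file, so this can be slow for large multi-page images.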



> Detect image frame counts for tiff files
> 
>
> Key: TIKA-2451
> URL: https://issues.apache.org/jira/browse/TIKA-2451
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Mike Cantrell
>Priority: Minor
> Attachments: multipage_tiff_example.tif
>
>
> It would be useful to know the number of frames in a multi-page tiff image. 
> My apologies if this already exists but I could not locate it in any of the 
> existing metadata output. 





[jira] [Updated] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-2456:
-
Attachment: single_mail.mbox

File to unit test

> Emails extracted from MBOX not detected as rfc822
> -
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.16
>Reporter: Luis Filipe Nassif
> Fix For: 1.17
>
> Attachments: single_mail.mbox
>
>
> Similar to TIKA-2454: because of recurrent detection issues with 
> message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may 
> not be detected as rfc822, even though they always are. The solution is to set 
> Content-Type-Override inside MboxParser. Fix being prepared...





[jira] [Created] (TIKA-2456) Emails extracted from MBOX not detected as rfc822

2017-08-31 Thread Luis Filipe Nassif (JIRA)
Luis Filipe Nassif created TIKA-2456:


 Summary: Emails extracted from MBOX not detected as rfc822
 Key: TIKA-2456
 URL: https://issues.apache.org/jira/browse/TIKA-2456
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.16
Reporter: Luis Filipe Nassif
 Fix For: 1.17


Similar to TIKA-2454: because of recurrent detection issues with message/rfc822 
(TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may not be detected 
as rfc822, even though they always are. The solution is to set Content-Type-Override 
inside MboxParser. Fix being prepared...





[jira] [Created] (TIKA-2455) Flag in metadata for alternative email bodies

2017-08-31 Thread Matthew Caruana Galizia (JIRA)
Matthew Caruana Galizia created TIKA-2455:
-

 Summary: Flag in metadata for alternative email bodies
 Key: TIKA-2455
 URL: https://issues.apache.org/jira/browse/TIKA-2455
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia
Priority: Minor


When multipart RFC822 emails are being parsed, there's no way to distinguish 
between alternative versions of the body and attachments.

It would be ideal if some kind of flag were set in the metadata passed to the 
{{EmbeddedDocumentExtractor}} that indicates that the stream is an alternative.

In GUIs that present the data extracted from the email, alternative bodies could 
then be distinguished from attachments and presented separately.


