[jira] [Commented] (TIKA-2451) Detect image frame counts for tiff files
[ https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149909#comment-16149909 ]

Tim Allison commented on TIKA-2451:
-----------------------------------

Turns out we can get this info from the current version. I'll look into upgrading in another issue.

Fellow devs, any preference for using {{Office.PAGE_COUNT}} or creating a new {{TIFF.PAGE_COUNT}} as the metadata key?

> Detect image frame counts for tiff files
> ----------------------------------------
>
> Key: TIKA-2451
> URL: https://issues.apache.org/jira/browse/TIKA-2451
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Reporter: Mike Cantrell
> Priority: Minor
> Attachments: multipage_tiff_example.tif
>
> It would be useful to know the number of frames in a multi-page tiff image. My apologies if this already exists but I could not locate it in any of the existing metadata output.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
RE: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member
Welcome, Madhav!

Tyler

On Aug 31, 2017 1:22 PM, "Allison, Timothy B." wrote:

> W00t! Welcome, Madhav!
>
> -----Original Message-----
> From: Chris Mattmann [mailto:mattm...@apache.org]
> Sent: Thursday, August 31, 2017 3:52 PM
> To: dev@tika.apache.org
> Subject: Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member
>
> Welcome Madhav!
>
> Cheers,
> Chris
>
> On 8/31/17, 12:29 PM, "loo...@gmail.com on behalf of Dave Meikle" <loo...@gmail.com on behalf of dmei...@apache.org> wrote:
>
>     Hello Everyone,
>
>     Please join me in welcoming Madhav Sharan as a PMC Member and Committer to the project!
>
>     Welcome to the team, Madhav. Feel free to say a bit about yourself and how you got involved in Tika.
>
>     Cheers,
>     Dave
[jira] [Commented] (TIKA-2451) Detect image frame counts for tiff files
[ https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149879#comment-16149879 ]

Tim Allison commented on TIKA-2451:
-----------------------------------

It looks like this was (fairly) recently added to drewnoakes' metadata-extractor: [version 2.10.0|https://github.com/drewnoakes/metadata-extractor/releases/tag/2.10.0] included [support for multipage tiffs|https://github.com/drewnoakes/metadata-extractor/pull/228].

When I bumped the version up to the latest, I get the following for your file: {{Page Number : 9 10}}. My guess from [this|http://www.awaresystems.be/imaging/tiff/tifftags/pagenumber.html] is that 9 (0 index) is the last page number and 10 is the total number of pages. Should we normalize (split on " " and take the second)?

As a side note, I confirmed that tesseract is pulling text out of all the pages. W00t!

{noformat}
TIFF Example Page 1
Multipage TIFF Example Page 2
Multipage TIFF Example Page 3
Multipage TIFF Example Page4
Multipage TIFF Example Page 5
Multipage TIFF Example Page 6
Multipage TIFF Example Page 7
Multipage TIFF Example Page 8
Multipage TIFF Example Page 9
Multipage TIFF Example Page 10
{noformat}
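The normalization floated in the comment above (split the {{Page Number}} value on whitespace and take the second field) could look roughly like the sketch below. This is only an illustration, not code from Tika or metadata-extractor: the class name {{TiffPageNumber}} and the method {{totalPages}} are hypothetical, and it assumes the value arrives as a plain string such as "9 10". A total of 0 is treated as unknown, which matches how the TIFF PageNumber tag is usually described.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of Tika or metadata-extractor.
final class TiffPageNumber {

    private TiffPageNumber() {
    }

    /**
     * Parses a PageNumber value such as "9 10" (zero-based page index,
     * then total pages) and returns the total page count if it is usable.
     */
    static OptionalInt totalPages(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        String[] parts = raw.trim().split("\\s+");
        if (parts.length != 2) {
            // Unexpected shape: do not guess.
            return OptionalInt.empty();
        }
        try {
            int total = Integer.parseInt(parts[1]);
            // A total of 0 conventionally means "unknown".
            return total > 0 ? OptionalInt.of(total) : OptionalInt.empty();
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

Returning an {{OptionalInt}} rather than a bare int keeps malformed or missing values from being reported as a page count of zero.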
[jira] [Updated] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Caruana Galizia updated TIKA-2219:
------------------------------------------
    Attachment: test.txt

This file contains x92 characters which should force detection to Windows-1252.

> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor
> Fix For: 2.0, 1.15
> Attachments: test.txt
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is always detected instead. While not tested, this likely affects other windows-125* encodings as well.
> I tracked it down to a change in the {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} method so that the returned CharsetMatch has the proper name. The problem with that is {{CharsetDetector#detectAll()}} method overwrites the correct match with a new one that will return the value of {{#getName()}} from the {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in {{detectAll()}} method are replaced with new ones, but changing this code in that method appears to work for me:
>     // Remove this:
>     //CharsetMatch m = new CharsetMatch(this, csr, confidence);
>     //matches.add(m);
>     // Add this instead:
>     matches.add(charsetMatch);
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149673#comment-16149673 ] Matthew Caruana Galizia commented on TIKA-2219: --- [~talli...@mitre.org] I think this issue has regressed. Please take a look at the attached file. It's parsed as an email but the body text is detected as US-ASCII instead of Windows-1252 (note the x92 characters).
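For readers unfamiliar with why x92 bytes matter here: the TIKA-2219 description quotes the rule {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}. Bytes in the range 0x80 to 0x9F are C1 control codes in ISO-8859-1 but mostly printable characters in windows-1252 (0x92 is the right single quotation mark), so their presence is a strong hint for windows-1252. The sketch below is a simplified, standalone illustration of that rule only. It is not the Tika/ICU detector code, and the class and method names are hypothetical.

```java
// Hypothetical illustration of the C1-byte rule, not the Tika/ICU implementation.
final class Latin1Name {

    private Latin1Name() {
    }

    /** Returns true if any byte falls in the C1 range 0x80..0x9F. */
    static boolean hasC1Bytes(byte[] data) {
        for (byte b : data) {
            int unsigned = b & 0xFF; // Java bytes are signed; mask to 0..255
            if (unsigned >= 0x80 && unsigned <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Picks the charset name the same way the quoted one-liner does. */
    static String choose(byte[] data) {
        return hasC1Bytes(data) ? "windows-1252" : "ISO-8859-1";
    }
}
```

A body containing only 7-bit bytes gives this check no reason to prefer either name, which is one reason a real detector combines it with statistical matching.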
RE: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member
W00t! Welcome, Madhav!
Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member
Welcome Madhav! Cheers, Chris
[ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member
Hello Everyone,

Please join me in welcoming Madhav Sharan as a PMC Member and Committer to the project!

Welcome to the team, Madhav. Feel free to say a bit about yourself and how you got involved in Tika.

Cheers,
Dave
[jira] [Created] (TIKA-2457) Update MboxParser to more recent handling of embedded docs
Tim Allison created TIKA-2457:
---------------------------------

Summary: Update MboxParser to more recent handling of embedded docs
Key: TIKA-2457
URL: https://issues.apache.org/jira/browse/TIKA-2457
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor

Let's make the MboxParser treat embedded docs similarly to the OutlookPSTParser. The RecursiveParserWrapper allows uniform access to embedded docs' metadata across the parsers.
[jira] [Commented] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
[ https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149271#comment-16149271 ]

Hudson commented on TIKA-2456:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1355 (See [https://builds.apache.org/job/Tika-trunk/1355/])
TIKA-2456: fix detection of emails inside mbox (lfcnassif: [https://github.com/apache/tika/commit/560e91a176ca5ff1adfc3ff1c1f63e32ec4e928a])
* (edit) CHANGES.txt
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/single_mail.mbox

> Emails extracted from MBOX not detected as rfc822
> -------------------------------------------------
>
> Key: TIKA-2456
> URL: https://issues.apache.org/jira/browse/TIKA-2456
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.16
> Reporter: Luis Filipe Nassif
> Fix For: 1.17
> Attachments: single_mail.mbox
>
> Similar to TIKA-2454, because of recurrent detection issues with message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may not be detected as rfc822, even though they always are. The solution is to set Content-Type-Override inside MboxParser. Fix being prepared...
[jira] [Comment Edited] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
[ https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149250#comment-16149250 ] Tim Allison edited comment on TIKA-2456 at 8/31/17 4:54 PM: W00t! Welcome aboard, [~lfcnassif]! :D was (Author: talli...@mitre.org): W00t! Welcome aboard! :D
[jira] [Commented] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
[ https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149250#comment-16149250 ] Tim Allison commented on TIKA-2456: --- W00t! Welcome aboard! :D
[jira] [Resolved] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
[ https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Filipe Nassif resolved TIKA-2456. -- Resolution: Fixed Fixed in r560e91a
[jira] [Issue Comment Deleted] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
[ https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Filipe Nassif updated TIKA-2456: - Comment: was deleted (was: Fixed in r560e91a)
[jira] [Commented] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
[ https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149223#comment-16149223 ] Luis Filipe Nassif commented on TIKA-2456: -- Fixed in r560e91a
[jira] [Commented] (TIKA-2451) Detect image frame counts for tiff files
[ https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149221#comment-16149221 ] Tim Allison commented on TIKA-2451: --- Thank you. Will take a look...on a related note: https://github.com/tesseract-ocr/tesseract/issues/743 :P
[jira] [Updated] (TIKA-2451) Detect image frame counts for tiff files
[ https://issues.apache.org/jira/browse/TIKA-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Cantrell updated TIKA-2451: Attachment: multipage_tiff_example.tif No problem. I'm attaching an example file. We're currently using the [TwelveMonkeys ImageIO TIFF plugin|https://github.com/haraldk/TwelveMonkeys] with {{ImageReader.getNumImages(true)}} to count the frames. I'm assuming that the EXIF metadata should hold the clue to the number of images though.
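For context, counting frames with {{ImageReader.getNumImages(true)}} as described above might look like the following minimal sketch using the standard {{javax.imageio}} API. The class name {{FrameCounter}} is hypothetical and this is not the reporter's actual code. It assumes an ImageIO reader for the file's format is registered at runtime; for TIFF that means a plugin such as TwelveMonkeys on older JDKs, or the TIFF reader bundled with newer ones.

```java
import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper showing the standard ImageIO calls involved.
final class FrameCounter {

    private FrameCounter() {
    }

    /** Returns the number of images (frames/pages) in the given file. */
    static int countFrames(File file) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) {
                throw new IOException("Cannot open image stream for " + file);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader registered for " + file);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // allowSearch=true lets the reader scan the stream to count images.
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }
}
```

Passing {{true}} may require the reader to walk the whole file, so this can be slower than reading a count from metadata when one is available.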
[jira] [Updated] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
[ https://issues.apache.org/jira/browse/TIKA-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Filipe Nassif updated TIKA-2456: - Attachment: single_mail.mbox File to unit test
[jira] [Created] (TIKA-2456) Emails extracted from MBOX not detected as rfc822
Luis Filipe Nassif created TIKA-2456:
----------------------------------------

Summary: Emails extracted from MBOX not detected as rfc822
Key: TIKA-2456
URL: https://issues.apache.org/jira/browse/TIKA-2456
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 1.16
Reporter: Luis Filipe Nassif
Fix For: 1.17

Similar to TIKA-2454, because of recurrent detection issues with message/rfc822 (TIKA-2042, TIKA-1602, TIKA-879), children of mbox files may not be detected as rfc822, even though they always are. The solution is to set Content-Type-Override inside MboxParser. Fix being prepared...
[jira] [Created] (TIKA-2455) Flag in metadata for alternative email bodies
Matthew Caruana Galizia created TIKA-2455:
---------------------------------------------

Summary: Flag in metadata for alternative email bodies
Key: TIKA-2455
URL: https://issues.apache.org/jira/browse/TIKA-2455
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.16
Reporter: Matthew Caruana Galizia
Priority: Minor

When multipart RFC822 emails are being parsed, there's no way to distinguish between alternative versions of the body and attachments.

It would be ideal if some kind of flag were set in the metadata passed to the {{EmbeddedDocumentExtractor}} that indicates that the stream is an alternative. In GUIs that present the data extracted from the email, alternative bodies can be distinguished from attachments and presented separately.