[jira] [Commented] (TIKA-4250) Add a libpst-based parser
    [ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845057#comment-17845057 ]

Hudson commented on TIKA-4250:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk11 #1624 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1624/])
TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/readpst (#1751) (github: [https://github.com/apache/tika/commit/32baf2345abe1a04d767ea6641a567d5c924587e])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParserConfig.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-eml-config.xml
* (edit) CHANGES.txt
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/libpst/TestLibPstParser.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-config.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/EmailVisitor.java

> Add a libpst-based parser
> -------------------------
>
>                 Key: TIKA-4250
>                 URL: https://issues.apache.org/jira/browse/TIKA-4250
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be
> useful to add a wrapper for libpst as an optional parser.
> One of the benefits of libpst is that it creates .eml or .msg files from the
> PST records. This is critical for those who want the original bytes from
> embedded files. Obv, PST doesn't store eml or msg, but some users want the
> "original" emails even if they are constructed from PST records.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
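For readers unfamiliar with the "wrapper for libpst/readpst" idea, here is a minimal fork-and-exec sketch. It is not the LibPstParser from the commit above; the class and method names are hypothetical, and the readpst flags (-e for one .eml file per message, -m to also write .msg files, -o for the output directory) are quoted from memory of the readpst man page and should be checked against the installed version.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not Tika's LibPstParser. */
final class ReadPstRunner {

    /** Builds the readpst command line without running anything. */
    static List<String> buildCommand(Path pst, Path outDir, boolean writeMsg) {
        List<String> cmd = new ArrayList<>();
        cmd.add("readpst");
        // Assumed flags: -e = one .eml per message, -m = .eml plus .msg
        cmd.add(writeMsg ? "-m" : "-e");
        cmd.add("-o");
        cmd.add(outDir.toString());
        cmd.add(pst.toString());
        return cmd;
    }

    /** Runs readpst with a timeout; returns true only on exit code 0. */
    static boolean run(Path pst, Path outDir, boolean writeMsg, long timeoutSeconds)
            throws IOException, InterruptedException {
        Files.createDirectories(outDir);
        Process p = new ProcessBuilder(buildCommand(pst, outDir, writeMsg))
                .redirectErrorStream(true)
                // Discard output so a chatty child cannot block on a full pipe
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            return false;
        }
        return p.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ReadPstRunner <file.pst> <outDir>");
            return;
        }
        boolean ok = run(Paths.get(args[0]), Paths.get(args[1]), false, 600);
        System.out.println(ok ? "readpst finished" : "readpst failed or timed out");
    }
}
```

A caller would then walk the output directory and hand each extracted .eml or .msg file to the appropriate parser. Keeping the command construction separate from the process handling makes the flag choice easy to test without readpst installed.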
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845012#comment-17845012 ] ASF GitHub Bot commented on TIKA-4250: -- tballison merged PR #1751: URL: https://github.com/apache/tika/pull/1751
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844997#comment-17844997 ] ASF GitHub Bot commented on TIKA-4250: -- tballison opened a new pull request, #1751: URL: https://github.com/apache/tika/pull/1751 …readpst
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844976#comment-17844976 ] Tim Allison commented on TIKA-4250: --- libpff issue opened: https://github.com/libyal/libpff/issues/128 Note that I found non-deterministic behavior even without debug on -- sometimes I got 7 extracted files, sometimes 8. I noted that in the issue.
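One way to reproduce the "sometimes 7 files, sometimes 8" observation is to export the same PST several times into separate directories and compare the file counts. The sketch below only does the counting and comparison; the class and method names are hypothetical, and it assumes the export runs have already been written to the directories passed on the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper for comparing repeated export runs. */
final class ExtractionStability {

    /** Counts regular files anywhere under dir. */
    static long countFiles(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).count();
        }
    }

    /** True when every run produced the same number of files. */
    static boolean isStable(List<Long> counts) {
        return new HashSet<>(counts).size() <= 1;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the output directory of one export run
        List<Long> counts = new ArrayList<>();
        for (String dir : args) {
            long n = countFiles(Paths.get(dir));
            counts.add(n);
            System.out.println(dir + ": " + n + " files");
        }
        System.out.println(isStable(counts) ? "stable" : "NON-DETERMINISTIC");
    }
}
```

Equal counts do not prove the outputs are identical, so a stricter check would also compare file hashes.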
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844097#comment-17844097 ] Luís Filipe Nassif commented on TIKA-4250: -- Just pushed the quick and dirty java-libpst fork here: https://github.com/sepinf-inc/java-libpst
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843798#comment-17843798 ] Tim Allison commented on TIKA-4250: --- So, I caught an example of libpst not reading an attachment in our unit test file (testPST.pst). The attached msg should contain an embedded msg that includes a docx. Via a hex editor, I can see that there is no embedded msg in 8.msg, whereas the structure is correctly maintained in 8.eml.
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843604#comment-17843604 ] Luís Filipe Nassif commented on TIKA-4250: -- Updating results with Libpff-20231205:

For 258 pst/ost files (93GB):

| |LibPst-0.6.76|LibPff-20231205|Java-libpst-0.9.5*|
|Emails|195698|201792|208373|
|Contacts|19738|19949|24342|
|Attachments|242394|286669|275481|
|Feeds|0|47916|47913|
|Appointments|0|12664|15885|
|Meetings|0|5285|0|
|Activity|0|3457|3457|
|Documents|0|2202|0|
|Tasks|0|578|562|
|Notes|0|391|0|
|Vcalendar|8642|0|0|
|Vjournal|2352|0|0|
|Total|468824|580903|576013|
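The Total row can be re-derived from the per-category counts in the table above. A small sketch, with hypothetical class and method names, that sums each column using the figures exactly as posted:

```java
/** Hypothetical helper: re-adds the per-category counts from the posted table. */
final class TotalsCheck {

    /** Sums any number of category counts. */
    static long sum(long... counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Non-zero rows of each column, top to bottom
        long libpst = sum(195698, 19738, 242394, 8642, 2352);
        long libpff = sum(201792, 19949, 286669, 47916, 12664, 5285, 3457, 2202, 578, 391);
        long javaLibpst = sum(208373, 24342, 275481, 47913, 15885, 3457, 562);
        System.out.println("LibPst-0.6.76:      " + libpst);
        System.out.println("LibPff-20231205:    " + libpff);
        System.out.println("Java-libpst-0.9.5*: " + javaLibpst);
    }
}
```

The three sums agree with the posted totals (468824, 580903 and 576013).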
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843593#comment-17843593 ] Luís Filipe Nassif commented on TIKA-4250: -- I included a patched version of java-libpst-0.9.5 in the test, results below:

For 258 pst/ost files (93GB):

| |LibPst-0.6.76|LibPff-20131028|Java-libpst-0.9.5*|
|Emails|195698|201818|208373|
|Contacts|19738|19949|24342|
|Attachments|242394|286723|275481|
|Feeds|0|47916|47913|
|Appointments|0|12664|15885|
|Meetings|0|5285|0|
|Activity|0|3457|3457|
|Documents|0|2202|0|
|Tasks|0|578|562|
|Notes|0|391|0|
|Vcalendar|8642|0|0|
|Vjournal|2352|0|0|
|Total|468824|580983|576013|

*java-libpst-0.9.5 fork with some fixes

PS: The tested libpff version is pretty old, I should have run with a newer version...
PS2: Libpff recovery of deleted items was not enabled; when enabled, it recovers some thousands of emails and attachments.
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843509#comment-17843509 ] Luís Filipe Nassif commented on TIKA-4250: -- I'm running a comparison between readpst (libpst) and pffexport (libpff) on ~250 real world PST/OST mailboxes (95GB) and will post results when finished.
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843437#comment-17843437 ] Luís Filipe Nassif commented on TIKA-4250: -- PS: I have never used libpst, so a comparison between libpst and libpff results would be interesting...
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843428#comment-17843428 ] Tim Allison commented on TIKA-4250: --- Given your experience, I think it would be valuable to add libpff as an optional PST parser to Tika. Advanced users can use libpff for content+metadata and then libpst to generate msg files -- with the understanding that some msg files can't be generated (e.g. when libpst fails).
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843381#comment-17843381 ] Luís Filipe Nassif commented on TIKA-4250: -- If our wrapper, or part of it, is of interest, I can relicense it as Apache v2 since I wrote it alone long ago, but, as I said, it outputs html views of mails, not eml/msg.
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843364#comment-17843364 ] Luís Filipe Nassif commented on TIKA-4250: -- We can improve our wrapper, for sure. But I would suggest taking a look at the libpff project, which was an awesome reverse engineering effort by Joachim Metz when the PST format was still closed. java-libpst was based on it, although libpff doesn't miss thousands of emails and attachments with some inputs...
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843361#comment-17843361 ] Tim Allison commented on TIKA-4250: --- Hahahahaha. I figured you'd have input on this [~lfcnassif]! Y, libpst is aging but it is slightly fresher than java-libpst. :/ I'll take a look at your wrapper. Thank you!
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843359#comment-17843359 ] Luís Filipe Nassif commented on TIKA-4250: -- One drawback of our libpff usage approach is that it exports the internal PST/OST tree as a file system tree, which sometimes causes issues with forbidden NTFS chars and with long paths in the temp folder that are hard to delete when parsing finishes...
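The forbidden-character and long-path problems mentioned above are usually handled by sanitizing each path segment before it is written to disk. A minimal sketch, assuming the wrapper controls the output names (which is not the case when the external tool picks them itself); the class name and the length limit are hypothetical, and reserved device names such as CON or NUL are not handled.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: makes one path segment safe for NTFS. */
final class PathSanitizer {

    // Characters Windows rejects in file names, plus ASCII control characters
    private static final Pattern FORBIDDEN =
            Pattern.compile("[<>:\"/\\\\|?*\\x00-\\x1F]");

    static String sanitize(String name, int maxLength) {
        String cleaned = FORBIDDEN.matcher(name).replaceAll("_");
        // Truncate first so the trailing-character rule applies to the final name
        // (naive cut: may split a surrogate pair)
        if (cleaned.length() > maxLength) {
            cleaned = cleaned.substring(0, maxLength);
        }
        // Windows does not allow names ending in a dot or a space
        cleaned = cleaned.replaceAll("[. ]+$", "");
        return cleaned.isEmpty() ? "_" : cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("RE: budget?*.msg", 255));
    }
}
```

Capping each segment keeps the full temp path shorter, though deeply nested folder trees can still exceed the classic 260-character Windows limit, so flattening the tree or using short generated names may be needed as well.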
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843355#comment-17843355 ] Luís Filipe Nassif commented on TIKA-4250: -- Hi [~tallison], I would like to add that java-libpst seems abandoned, because old and critical reported issues are not being addressed, like this one: [https://github.com/rjohnsondev/java-libpst/issues/60] We have a libpff ([https://github.com/libyal/libpff]) based parser using fork and exec (https://github.com/sepinf-inc/IPED/blob/master/iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/mail/LibpffPSTParser.java). It works better than java-libpst, but it is GPL licensed and we build custom html mail views, not eml or msg, so I think it wouldn't help here...