[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845057#comment-17845057
 ] 

Hudson commented on TIKA-4250:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk11 #1624 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1624/])
TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/readpst 
(#1751) (github: 
[https://github.com/apache/tika/commit/32baf2345abe1a04d767ea6641a567d5c924587e])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParserConfig.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-eml-config.xml
* (edit) CHANGES.txt
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/libpst/TestLibPstParser.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-config.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/EmailVisitor.java


> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.
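A minimal sketch of what "a wrapper for libpst" could look like: build a `readpst` command line and run it as an external process with a timeout. This is not the code merged for this issue. The class and method names are hypothetical, and the flags (`-e` for one .eml per message, `-m` to also write .msg files, `-o` for the output directory) are assumptions taken from the readpst man page that should be checked against the installed version.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika.
class ReadPstSketch {

    // Builds the argument list for readpst.
    // Assumed flags (verify with `man readpst`):
    //   -e  write each message to its own .eml file
    //   -m  like -e, but also write .msg files
    //   -o  output directory
    static List<String> buildCommand(Path pst, Path outDir, boolean alsoMsg) {
        List<String> cmd = new ArrayList<>();
        cmd.add("readpst");
        cmd.add(alsoMsg ? "-m" : "-e");
        cmd.add("-o");
        cmd.add(outDir.toString());
        cmd.add(pst.toString());
        return cmd;
    }

    // Runs readpst and returns its exit code; kills the process on timeout.
    static int run(Path pst, Path outDir, boolean alsoMsg, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pst, outDir, alsoMsg))
                .redirectErrorStream(true)
                // Discard output so a chatty process cannot block on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            throw new IOException("readpst timed out after " + timeoutSeconds + "s");
        }
        return p.exitValue();
    }
}
```

After a successful run, a caller would walk `outDir` and hand each generated .eml or .msg file to the regular parsers as an embedded document.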



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845012#comment-17845012
 ] 

ASF GitHub Bot commented on TIKA-4250:
--

tballison merged PR #1751:
URL: https://github.com/apache/tika/pull/1751






[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844997#comment-17844997
 ] 

ASF GitHub Bot commented on TIKA-4250:
--

tballison opened a new pull request, #1751:
URL: https://github.com/apache/tika/pull/1751

   …readpst
   
   
   
   






[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844976#comment-17844976
 ] 

Tim Allison commented on TIKA-4250:
---

libpff issue opened: https://github.com/libyal/libpff/issues/128

Note that I found non-deterministic behavior even without debug on -- sometimes 
I got 7 extracted files, sometimes 8. I noted that in the issue. 
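One way to make this kind of run-to-run variation visible is to extract the same input into several separate output directories and compare the number of files produced. The sketch below only does the counting and comparison; the class and method names are hypothetical, and how each directory gets populated is left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical helper for comparing repeated extraction runs.
class ExtractionCountCheck {

    // Counts regular files anywhere below dir.
    static long countFiles(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).count();
        }
    }

    // Returns the distinct file counts seen across the run directories.
    // More than one element means the runs did not agree.
    static Set<Long> distinctCounts(List<Path> runDirs) throws IOException {
        Set<Long> counts = new TreeSet<>();
        for (Path dir : runDirs) {
            counts.add(countFiles(dir));
        }
        return counts;
    }
}
```

If `distinctCounts` returns a single value the runs were consistent in file count; that says nothing about whether the file contents match.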



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844097#comment-17844097
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

Just pushed the quick and dirty java-libpst fork here:

https://github.com/sepinf-inc/java-libpst



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843798#comment-17843798
 ] 

Tim Allison commented on TIKA-4250:
---

So, I caught an example of libpst not reading an attachment in our unit test 
file (testPST.pst). The attached msg should contain an embedded msg that 
includes a docx. Via a hex editor, I can see that there is no embedded msg in 
8.msg, whereas the structure is correctly maintained in 8.eml.



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-05 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843604#comment-17843604
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

Updating results with Libpff-20231205:
For 258 pst/ost files (93GB):
| |LibPst-0.6.76|LibPff-20231205|Java-libpst-0.9.5*|
|Emails|195698|201792|208373|
|Contacts|19738|19949|24342|
|Attachments|242394|286669|275481|
|Feeds|0|47916|47913|
|Appointments|0|12664|15885|
|Meetings|0|5285|0|
|Activity|0|3457|3457|
|Documents|0|2202|0|
|Tasks|0|578|562|
|Notes|0|391|0|
|Vcalendar|8642|0|0|
|Vjournal|2352|0|0|
|Total|468824|580903|576013|

*java-libpst-0.9.5 fork with some fixes
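As a quick arithmetic check, the per-category counts from the Libpff-20231205 table can be summed and compared against the Total row. The sketch below copies the numbers from that table; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical check of the Total row; values copied from the table above.
// Row order: Emails, Contacts, Attachments, Feeds, Appointments, Meetings,
// Activity, Documents, Tasks, Notes, Vcalendar, Vjournal.
class PstCountTotals {
    static final long[] LIBPST =
            {195698, 19738, 242394, 0, 0, 0, 0, 0, 0, 0, 8642, 2352};
    static final long[] LIBPFF =
            {201792, 19949, 286669, 47916, 12664, 5285, 3457, 2202, 578, 391, 0, 0};
    static final long[] JAVA_LIBPST =
            {208373, 24342, 275481, 47913, 15885, 0, 3457, 0, 562, 0, 0, 0};

    static long sum(long[] counts) {
        return Arrays.stream(counts).sum();
    }

    // How many more items tool a reported than tool b, as a percentage of b.
    static double percentMore(long[] a, long[] b) {
        return 100.0 * (sum(a) - sum(b)) / sum(b);
    }

    public static void main(String[] args) {
        System.out.println("libpst total:      " + sum(LIBPST));
        System.out.println("libpff total:      " + sum(LIBPFF));
        System.out.println("java-libpst total: " + sum(JAVA_LIBPST));
        System.out.printf("libpff vs libpst:  %.1f%% more items%n",
                percentMore(LIBPFF, LIBPST));
    }
}
```

The totals are item counts per tool, so a higher number does not by itself show which items are missing or duplicated; the tools also classify some items differently (for example Vcalendar versus Appointments).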



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-05 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843593#comment-17843593
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

I included a patched version of java-libpst-0.9.5 in the test, results below:
For 258 pst/ost files (93GB):
| |LibPst-0.6.76|LibPff-20131028|Java-libpst-0.9.5*|
|Emails|195698|201818|208373|
|Contacts|19738|19949|24342|
|Attachments|242394|286723|275481|
|Feeds|0|47916|47913|
|Appointments|0|12664|15885|
|Meetings|0|5285|0|
|Activity|0|3457|3457|
|Documents|0|2202|0|
|Tasks|0|578|562|
|Notes|0|391|0|
|Vcalendar|8642|0|0|
|Vjournal|2352|0|0|
|Total|468824|580983|576013|

*java-libpst-0.9.5 fork with some fixes

PS: The tested libpff version is pretty old; I should have run with a newer 
version...

PS2: Libpff recovery of deleted items was not enabled. When enabled, it 
recovers a few thousand more emails and attachments.



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-04 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843509#comment-17843509
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

I'm running a comparison between readpst (libpst) and pffexport (libpff) on 
~250 real world PST/OST mailboxes (95GB) and will post results when finished.



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-04 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843437#comment-17843437
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

PS: I have never used libpst, so a comparison between libpst and libpff results 
would be interesting...



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843428#comment-17843428
 ] 

Tim Allison commented on TIKA-4250:
---

Given your experience, I think it would be valuable to add libpff as an 
optional PST parser to Tika. 

Advanced users can use libpff for content+metadata and then libpst to generate 
msg files -- with the understanding that some msg files can't be generated 
(e.g. when libpst fails).



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843381#comment-17843381
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

If our wrapper, or part of it, is of interest, I can relicense it as Apache v2 
since I wrote it on my own long ago, but, as I said, it outputs html views of 
mails, not eml/msg.



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843364#comment-17843364
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

We can improve our wrapper, for sure. But I would suggest taking a look at the 
libpff project. It was an awesome reverse-engineering effort by Joachim Metz 
when the PST format was still closed, and java-libpst was based on it, 
although libpff doesn't miss thousands of emails and attachments with some 
inputs...



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843361#comment-17843361
 ] 

Tim Allison commented on TIKA-4250:
---

Hahahahaha. I figured you'd have input on this [~lfcnassif]! 

Y, libpst is aging but it is slightly fresher than java-libpst. :/

I'll take a look at your wrapper. Thank you!



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843359#comment-17843359
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

One drawback of our libpff approach is that it exports the internal PST/OST 
tree as a file system tree. This sometimes causes issues with forbidden NTFS 
characters, and with long paths in the temp folder that are hard to delete 
after parsing finishes...
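For the cleanup side of this, a common Java approach is to walk the exported tree depth-first and delete files before their parent directories. The sketch below is a generic illustration with hypothetical names, not the code used in the wrapper discussed here, and it does not by itself work around platform limits on path length or file names.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for removing an exported temp tree.
class TempTreeCleanup {

    // Deletes root and everything below it; does nothing if root is absent.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc; // surface traversal errors instead of hiding them
                }
                // All children are gone by now, so the directory can be removed.
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

Calling this from a `finally` block around the export and parse step keeps the temp folder from accumulating leftovers when parsing fails partway through.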



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843355#comment-17843355
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

Hi [~tallison],

I would like to add that java-libpst seems abandoned, because old and critical 
reported issues are not being addressed, like this one:

[https://github.com/rjohnsondev/java-libpst/issues/60]

We have a libpff ([https://github.com/libyal/libpff]) based parser using fork 
and exec 
(https://github.com/sepinf-inc/IPED/blob/master/iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/mail/LibpffPSTParser.java).
It works better than java-libpst, but it is GPL licensed and we build custom 
html mail views, not eml or msg, so I think it wouldn't help here...
