[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844097#comment-17844097
 ] 

Luís Filipe Nassif commented on TIKA-4250:
--

Just pushed the quick and dirty java-libpst fork here:

https://github.com/sepinf-inc/java-libpst

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be
> useful to add a wrapper for libpst as an optional parser.
> One of the benefits of libpst is that it creates .eml or .msg files from the
> PST records. This is critical for those who want the original bytes from
> embedded files. Obviously, PST doesn't store eml or msg, but some users want
> the "original" emails even if they are constructed from PST records.
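
A minimal sketch of what such a wrapper might do, assuming the libpst command-line tool readpst is installed and on the PATH. The class and method names below are hypothetical and not part of Tika. The flags (-e for .eml output, -m to also write .msg files, -o for the output directory) are as I understand them from the readpst man page and should be checked against the installed version.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds and runs a readpst command line.
public class ReadpstCommand {

    // Assumed readpst flags: -e writes one .eml file per message,
    // -m does the same and also writes .msg files, -o sets the output directory.
    public static List<String> build(Path pst, Path outDir, boolean alsoMsg) {
        List<String> cmd = new ArrayList<>();
        cmd.add("readpst");
        cmd.add(alsoMsg ? "-m" : "-e");
        cmd.add("-o");
        cmd.add(outDir.toString());
        cmd.add(pst.toString());
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ReadpstCommand <file.pst> <existing-output-dir>");
            System.exit(2);
        }
        List<String> cmd = build(Paths.get(args[0]), Paths.get(args[1]), true);
        // Run readpst, sharing this process's stdout/stderr, and wait for it to finish.
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```

An optional parser built along these lines could then walk the output directory and hand each generated file to Tika's regular parsing.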



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4251:
--
Description: 
I was recently working a bit on incubator-stormcrawler, and I noticed that they 
are using cosium's git-code-format-maven-plugin: 
https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out what I had to fix to 
make the linter happy, but then I realized there was a magic command, 
{{mvn git-code-format:format-code}}, which just fixed the code so that the 
linter passed. 

The one drawback I found is that it does not fix nor does it alert on wildcard 
imports.  We could still use checkstyle for that but only have one rule for 
checkstyle.

The other drawback is that there is not a lot of room for variation from 
google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: 
https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?

  was:
I was recently working a bit on incubator-stormcrawler, and I noticed that they 
are using cosium's git-code-format-maven-plugin: 
https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out how my code changes 
were causing the build to fail, but then I realized there was a magic command: 
{{mvn git-code-format:format-code}} which just fixed the code so that the 
linter passed. 

The one drawback I found is that it does not fix nor does it alert on wildcard 
imports.  We could still use checkstyle for that but only have one rule for 
checkstyle.

The other drawback is that there is not a lot of room for variation from 
google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: 
https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?


> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command, 
> {{mvn git-code-format:format-code}}, which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4251:
--
Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin with 
google-java-format  (was: [DISCUSS] move to cosium's 
git-code-format-maven-plugin)



[jira] [Created] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin

2024-05-06 Thread Tim Allison (Jira)
Tim Allison created TIKA-4251:
-

 Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin
 Key: TIKA-4251
 URL: https://issues.apache.org/jira/browse/TIKA-4251
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I was recently working a bit on incubator-stormcrawler, and I noticed that they 
are using cosium's git-code-format-maven-plugin: 
https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out how my code changes 
were causing the build to fail, but then I realized there was a magic command: 
{{mvn git-code-format:format-code}} which just fixed the code so that the 
linter passed. 

The one drawback I found is that it does not fix nor does it alert on wildcard 
imports.  We could still use checkstyle for that but only have one rule for 
checkstyle.
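
To make the wildcard-import gap concrete, here is a rough sketch of the kind of check the formatter leaves out. This is not how checkstyle implements it (checkstyle ships an AvoidStarImport check for this purpose); the class name and regex below are illustrative assumptions only, and the regex does not try to skip comments or string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: finds wildcard import lines in Java source text.
public class WildcardImportCheck {

    // Matches lines such as "import java.util.*;" or "import static org.junit.Assert.*;".
    private static final Pattern WILDCARD =
            Pattern.compile("^\\s*import\\s+(static\\s+)?[\\w.]+\\.\\*\\s*;");

    public static List<String> find(String source) {
        List<String> hits = new ArrayList<>();
        // \R matches any line terminator, so both Unix and Windows files work.
        for (String line : source.split("\\R")) {
            if (WILDCARD.matcher(line).find()) {
                hits.add(line.trim());
            }
        }
        return hits;
    }
}
```

Calling WildcardImportCheck.find on the text of a source file returns the offending import lines, which is roughly the signal a single remaining checkstyle rule would provide.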

The other drawback is that there is not a lot of room for variation from 
google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: 
https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?





[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843746#comment-17843746
 ] 

Tim Allison edited comment on TIKA-4250 at 5/6/24 5:03 PM:
---

Wait, so, on licensing: can we include a wrapper for either libpst or libpff, 
given that libpst is GPL 2 and libpff is GPL 3 
(https://www.apache.org/licenses/GPL-compatibility.html)?

[~nick] is the answer obvious or should I open a ticket on LEGAL?


was (Author: talli...@mitre.org):
Wait, so, on licensing can we include a wrapper for either libpst or libpff 
because libpst is GPL 2 and libpff is GPL 3 
(https://www.apache.org/licenses/GPL-compatibility.html)?



[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843798#comment-17843798
 ] 

Tim Allison edited comment on TIKA-4250 at 5/6/24 5:02 PM:
---

So, I caught an example of libpst not exporting an attachment in an msg file 
via our unit test file (testPST.pst). The attached msg should contain an 
embedded msg that includes a docx. Via a hex editor, I can see that there is no 
embedded msg in 8.msg, whereas the structure is correctly maintained in 8.eml.


was (Author: talli...@mitre.org):
So, I caught an example of libpst not reading an attachment in our unit test 
file (testPST.pst). The attached msg should contain an embedded msg that 
includes a docx. Via a hex editor, I can see that there is no embedded msg in 
8.msg, whereas the structure is correctly maintained in 8.eml.



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843798#comment-17843798
 ] 

Tim Allison commented on TIKA-4250:
---

So, I caught an example of libpst not reading an attachment in our unit test 
file (testPST.pst). The attached msg should contain an embedded msg that 
includes a docx. Via a hex editor, I can see that there is no embedded msg in 
8.msg, whereas the structure is correctly maintained in 8.eml.



[jira] [Updated] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4250:
--
Attachment: 8.eml



[jira] [Updated] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4250:
--
Attachment: 8.msg



[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843740#comment-17843740
 ] 

Tim Allison edited comment on TIKA-4250 at 5/6/24 1:02 PM:
---

Wow. This is super helpful. I guess the answer is to run all three? 

But seriously, should we fork java-libpst and add your extra fixes? Or, better, 
try to push them into the actual java-libpst? Longer term, we could see about 
adding meetings, Documents, Notes, vCalendars and vJournals into that fork?

This gives some confidence that we were doing well with java-libpst.

In my own, much more modest testing (one large pst), I noticed that libpst had 
fewer emails and fewer attachments. What was weird, though, was that the number 
of emails was equal or closer to equal when I turned debug mode on in libpst. 
It was much, much slower, but it got the same number of emails as java-libpst.

Again, thank you!


was (Author: talli...@mitre.org):
Wow. This is super helpful. I guess the answer is to run all three? 

But seriously, should we fork java-libpst and add your extra fixes? Or, better, 
try to push them into the actual java-libpst? Longer term, we could see about 
adding meetings, Documents, Notes and Vjournals into that fork?

This gives some confidence that we were doing well with java-libpst.

In my own, much more modest testing (one large pst), I noticed that libpst had 
fewer emails and fewer attachments. What was weird, though, was that the number 
of emails was equal or closer to equal when I turned debug mode on in libpst. 
It was much, much slower, but it got the same number of emails as java-libpst.

Again, thank you!
