[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844097#comment-17844097 ]

Luís Filipe Nassif commented on TIKA-4250:
------------------------------------------

Just pushed the quick and dirty java-libpst fork here: https://github.com/sepinf-inc/java-libpst

> Add a libpst-based parser
> -------------------------
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be
> useful to add a wrapper for libpst as an optional parser.
> One of the benefits of libpst is that it creates .eml or .msg files from the
> PST records. This is critical for those who want the original bytes from
> embedded files. Obv, PST doesn't store eml or msg, but some users want the
> "original" emails even if they are constructed from PST records.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
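For readers wondering what "a wrapper for libpst" might look like in practice, here is a minimal sketch, not the parser proposed in the ticket. It assumes the libpst command-line tool {{readpst}} is installed and on the PATH, and that its {{-m}} option (write .eml and .msg files) and {{-o}} option (output directory) behave as described in the readpst man page; please verify both against the installed version. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of shelling out to libpst's readpst tool.
// Not Tika code; names and structure are assumptions for illustration.
class ReadpstCommandSketch {

    // Builds the command line without running it, so it can be inspected or tested.
    static List<String> buildCommand(Path pstFile, Path outputDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("readpst");            // assumed to be on the PATH
        cmd.add("-m");                 // assumed: one file per message, .eml plus .msg
        cmd.add("-o");                 // assumed: output directory follows
        cmd.add(outputDir.toString());
        cmd.add(pstFile.toString());
        return cmd;
    }

    // Runs readpst and returns its exit code; the caller would then
    // walk outputDir and hand the extracted files to other parsers.
    static int run(Path pstFile, Path outputDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pstFile, outputDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; nothing is executed here.
        System.out.println(String.join(" ",
                buildCommand(Paths.get("testPST.pst"), Paths.get("out"))));
    }
}
```

Keeping command construction separate from execution means the wrapper can be unit tested on machines where readpst is not installed, which matters for an optional parser.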
[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-4251:
------------------------------
    Description:
I was recently working a bit on incubator-stormcrawler, and I noticed that they are using cosium's git-code-format-maven-plugin: https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out what I had to fix to make the linter happy, but then I realized there was a magic command: {{mvn git-code-format:format-code}} which just fixed the code so that the linter passed.

The one drawback I found is that it does not fix nor does it alert on wildcard imports. We could still use checkstyle for that but only have one rule for checkstyle.

The other drawback is that there is not a lot of room for variation from google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?

  was:
I was recently working a bit on incubator-stormcrawler, and I noticed that they are using cosium's git-code-format-maven-plugin: https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out how my code changes were causing the build to fail, but then I realized there was a magic command: {{mvn git-code-format:format-code}} which just fixed the code so that the linter passed.

The one drawback I found is that it does not fix nor does it alert on wildcard imports. We could still use checkstyle for that but only have one rule for checkstyle.

The other drawback is that there is not a lot of room for variation from google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?
> [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
> -------------------------------------------------------------------------------
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
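To make the wildcard-import gap concrete, here is a small sketch of the kind of check that would still need to come from somewhere other than the formatter. It is only an illustration of what such a rule looks for, not how checkstyle implements it, and the class and method names are hypothetical. It scans source text line by line, so it would also flag a matching line inside a block comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of a wildcard-import check.
// Not checkstyle's implementation; a line-based regex scan only.
class WildcardImportSketch {

    // Matches "import a.b.*;" and "import static a.b.C.*;" at the start of a line.
    private static final Pattern WILDCARD_IMPORT =
            Pattern.compile("^\\s*import\\s+(static\\s+)?[\\w.]+\\.\\*\\s*;");

    // Returns the 1-based line numbers that contain a wildcard import.
    static List<Integer> findWildcardImports(String source) {
        List<Integer> hits = new ArrayList<>();
        String[] lines = source.split("\\R", -1);
        for (int i = 0; i < lines.length; i++) {
            if (WILDCARD_IMPORT.matcher(lines[i]).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "package example;",
                "import java.util.*;",
                "import java.util.List;",
                "import static org.junit.Assert.*;",
                "class Example {}");
        System.out.println(findWildcardImports(source)); // prints [2, 4]
    }
}
```

In a real build, a single checkstyle rule for star imports would likely be the simpler route, as the description suggests.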
[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-4251:
------------------------------
    Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format  (was: [DISCUSS] move to cosium's git-code-format-maven-plugin)
[jira] [Created] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin
Tim Allison created TIKA-4251:
---------------------------------

             Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin
                 Key: TIKA-4251
                 URL: https://issues.apache.org/jira/browse/TIKA-4251
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison

I was recently working a bit on incubator-stormcrawler, and I noticed that they are using cosium's git-code-format-maven-plugin: https://github.com/Cosium/git-code-format-maven-plugin

I was initially annoyed that I couldn't quickly figure out how my code changes were causing the build to fail, but then I realized there was a magic command: {{mvn git-code-format:format-code}} which just fixed the code so that the linter passed.

The one drawback I found is that it does not fix nor does it alert on wildcard imports. We could still use checkstyle for that but only have one rule for checkstyle.

The other drawback is that there is not a lot of room for variation from google's style. This may actually be a benefit, too, of course.

I just ran this on {{tika-core}} here: https://github.com/apache/tika/tree/google-java-format

What would you think about making this change for 3.x?
[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843746#comment-17843746 ]

Tim Allison edited comment on TIKA-4250 at 5/6/24 5:03 PM:
-----------------------------------------------------------

Wait, so, on licensing, can we include a wrapper for either libpst or libpff, given that libpst is GPL 2 and libpff is GPL 3 (https://www.apache.org/licenses/GPL-compatibility.html)?

[~nick] is the answer obvious or should I open a ticket on LEGAL?

was (Author: talli...@mitre.org):
Wait, so, on licensing, can we include a wrapper for either libpst or libpff, given that libpst is GPL 2 and libpff is GPL 3 (https://www.apache.org/licenses/GPL-compatibility.html)?
[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843798#comment-17843798 ]

Tim Allison edited comment on TIKA-4250 at 5/6/24 5:02 PM:
-----------------------------------------------------------

So, I caught an example of libpst not exporting an attachment in an msg file via our unit test file (testPST.pst). The attached msg should contain an embedded msg that includes a docx. Via a hex editor, I can see that there is no embedded msg in 8.msg, whereas the structure is correctly maintained in 8.eml.

was (Author: talli...@mitre.org):
So, I caught an example of libpst not reading an attachment in our unit test file (testPST.pst). The attached msg should contain an embedded msg that includes a docx. Via a hex editor, I can see that there is no embedded msg in 8.msg, whereas the structure is correctly maintained in 8.eml.
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843798#comment-17843798 ]

Tim Allison commented on TIKA-4250:
-----------------------------------

So, I caught an example of libpst not reading an attachment in our unit test file (testPST.pst). The attached msg should contain an embedded msg that includes a docx. Via a hex editor, I can see that there is no embedded msg in 8.msg, whereas the structure is correctly maintained in 8.eml.
[jira] [Updated] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-4250:
------------------------------
    Attachment: 8.eml
[jira] [Updated] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-4250:
------------------------------
    Attachment: 8.msg
[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843740#comment-17843740 ]

Tim Allison edited comment on TIKA-4250 at 5/6/24 1:02 PM:
-----------------------------------------------------------

Wow. This is super helpful. I guess the answer is to run all three?

But seriously, should we fork java-libpst and add your extra fixes? Or, better, try to push them into the actual java-libpst? Longer term, we could see about adding meetings, Documents, Notes, vCalendars and vJournals into that fork?

This gives some confidence that we were doing well with java-libpst.

In my own, much more modest testing (one large pst), I noticed that libpst had fewer emails and fewer attachments. What was weird, though, was that the number of emails was equal or closer to equal when I turned debug mode on in libpst. It was much, much slower, but it got the same number of emails as java-libpst.

Again, thank you!

was (Author: talli...@mitre.org):
Wow. This is super helpful. I guess the answer is to run all three?

But seriously, should we fork java-libpst and add your extra fixes? Or, better, try to push them into the actual java-libpst? Longer term, we could see about adding meetings, Documents, Notes and Vjournals into that fork?

This gives some confidence that we were doing well with java-libpst.

In my own, much more modest testing (one large pst), I noticed that libpst had fewer emails and fewer attachments. What was weird, though, was that the number of emails was equal or closer to equal when I turned debug mode on in libpst. It was much, much slower, but it got the same number of emails as java-libpst.

Again, thank you!