[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637592#comment-16637592 ]

Hudson commented on TIKA-2478:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #110 (See [https://builds.apache.org/job/tika-branch-1x/110/])
TIKA-2478 -- maxFiles should take an argument...duh (tallison: [https://github.com/apache/tika/commit/4bb4ad60c2d6f968b55829d941d5fdb67917cdda])
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
TIKA-2478 -- add preliminary pseudo test for -maxFiles (tallison: [https://github.com/apache/tika/commit/ad0f41cfc38f970658444d95db3700c145a2590a])
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java

> RFC822 includes redundant copies of the text
> --------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.17
>
>         Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email -- "/embedded-1"
> c. The utf-8 text content of the email "/embedded-1/embedded-2"
> d. The utf-8 html content of the email "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the
> first non-null email body and then skips the rest. Please modify MBOX to
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
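The issue description says the MSG parser takes the first non-null email body and skips the rest. A minimal Java sketch of that selection rule follows, purely as an illustration. It is not Tika's code: the class `AlternativeBodySketch`, the `BodyPart` holder, and the method `firstNonNull` are hypothetical names introduced here, and the parts are assumed to arrive as an ordered list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class AlternativeBodySketch {

    /** Hypothetical holder for one alternative body of an email (not a Tika class). */
    static final class BodyPart {
        final String mimeType;
        final String content; // null when this alternative is absent

        BodyPart(String mimeType, String content) {
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /**
     * Returns the first alternative whose content is non-null, in list order.
     * Every later alternative is ignored, so it never becomes a separate
     * "embedded" document.
     */
    static Optional<BodyPart> firstNonNull(List<BodyPart> alternatives) {
        for (BodyPart part : alternatives) {
            if (part.content != null) {
                return Optional.of(part);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed input: an absent html body followed by a plain-text body.
        List<BodyPart> alternatives = Arrays.asList(
                new BodyPart("text/html", null),
                new BodyPart("text/plain", "hello"));

        firstNonNull(alternatives)
                .ifPresent(p -> System.out.println(p.mimeType + ": " + p.content));
    }
}
```

Applied to the MBOX case above, a rule like this would leave only "/" and "/embedded-1", with one body folded into the email itself. Which alternative should win (text or html) is a design choice the issue leaves open.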
[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637513#comment-16637513 ]

Hudson commented on TIKA-2478:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1572 (See [https://builds.apache.org/job/Tika-trunk/1572/])
TIKA-2478 -- maxFiles should take an argument...duh (tallison: [https://github.com/apache/tika/commit/c068479d7ed1734f75e24fad24572b99c9c3a4c6])
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
TIKA-2478 -- add preliminary pseudo test for -maxFiles (tallison: [https://github.com/apache/tika/commit/7e798ef8603bf40ea4a17a125aaff36677478353])
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637482#comment-16637482 ]

Hudson commented on TIKA-2478:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #326 (See [https://builds.apache.org/job/tika-2.x-windows/326/])
TIKA-2478 -- maxFiles should take an argument...duh (tallison: rev c068479d7ed1734f75e24fad24572b99c9c3a4c6)
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
TIKA-2478 -- add preliminary pseudo test for -maxFiles (tallison: rev 7e798ef8603bf40ea4a17a125aaff36677478353)
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236049#comment-16236049 ]

Hudson commented on TIKA-2478:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1382 (See [https://builds.apache.org/job/Tika-trunk/1382/])
TIKA-2478 -- rfc822 parser should handle alternative parts as the (tallison: [https://github.com/apache/tika/commit/ff481b25dd7f141f55907ce194b9bc2c77fc7069])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (add) tika-parsers/src/test/resources/test-documents/testRFC822-mixed-with-pdf-inline
* (edit) CHANGES.txt
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (add) tika-parsers/src/test/resources/test-documents/testRFC822-mixed-simple
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/mail/tika-config-extract-all-alternatives.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractOfficeParser.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/microsoft/tika-config-extract-all-alternatives-msg.xml
* (add) tika-parsers/src/test/resources/test-documents/testMBOX_complex.mbox

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235659#comment-16235659 ]

Robert Letzler commented on TIKA-2478:
--------------------------------------

I am at a conference. I will respond to your message when I return.
Thanks!
-Rob