[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215011#comment-16215011
 ] 

Tim Allison commented on TIKA-2471:
---

[~kkrugler], got it.  Thank you.

> Tab-prefixed message body lines in Mbox interpreted as headers
> --
>
> Key: TIKA-2471
> URL: https://issues.apache.org/jira/browse/TIKA-2471
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Labels: message, rfc822
> Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message 
> looking for anything that matches a header pattern, wherever it occurs in a 
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But 
> more to the point, what is the idea behind setting the headers in the 
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
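
The failure mode described in the issue can be illustrated with a small sketch. The pattern below is hypothetical, since the actual expression used by MboxParser is not quoted in this thread, and the class and method names are made up for the example. It only shows how an unanchored search accepts a header-like string anywhere in a line, while an anchored match rejects a tab-prefixed body line.

```java
import java.util.regex.Pattern;

/** Illustrative only; HEADER is a hypothetical pattern, not Tika's. */
class HeaderPatternSketch {

    // A name made of letters, digits and hyphens, then a colon and a value.
    private static final Pattern HEADER =
            Pattern.compile("([A-Za-z][A-Za-z0-9-]*):\\s*(.*)");

    /** Unanchored search: true if a header-like string occurs anywhere in the line. */
    static boolean looseMatch(String line) {
        return HEADER.matcher(line).find();
    }

    /** Anchored match: true only if the whole line, from its first character, is a header. */
    static boolean strictMatch(String line) {
        return HEADER.matcher(line).matches();
    }
}
```

With this pattern, a body line such as a tab followed by `Note: see attached` passes `looseMatch` but fails `strictMatch`, whereas `Subject: hi` passes both.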



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213150#comment-16213150
 ] 

Ken Krugler commented on TIKA-2471:
---

Hi [~talli...@apache.org] - I don't think using MboxIterator is the issue. The 
problem is the regex logic used to find headers in the text that's inside of 
one email message.

I think we first need to hear back from [~thaichat04] about why headers are 
being extracted in the mbox parser code, versus just relying on the RFC822 parser.
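
For comparison, here is a minimal sketch of header extraction that looks only at the header block of a single message instead of scanning the whole text. It is not the code in MboxParser or RFC822Parser; the class and method names are hypothetical. It assumes the usual RFC 822 layout: header fields come first, a line starting with a space or tab continues the previous field, and the first empty line ends the headers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not Tika's MboxParser or RFC822Parser. */
class HeaderBlockSketch {

    /** Parses header fields from the lines of one message, stopping at the first empty line. */
    static Map<String, List<String>> parseHeaders(List<String> lines) {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        String name = null;
        StringBuilder value = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // end of the header block; the rest is body text
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation of the previous field, never a new header.
                if (name != null) {
                    value.append(' ').append(line.trim());
                }
                continue;
            }
            if (name != null) {
                headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value.toString());
                name = null;
            }
            int colon = line.indexOf(':');
            if (colon > 0 && isFieldName(line, colon)) {
                name = line.substring(0, colon);
                value = new StringBuilder(line.substring(colon + 1).trim());
            }
            // Anything else (for example the mbox "From " separator line) is skipped.
        }
        if (name != null) {
            headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value.toString());
        }
        return headers;
    }

    /** A field name consists of printable ASCII characters with no whitespace. */
    private static boolean isFieldName(String line, int end) {
        for (int i = 0; i < end; i++) {
            char c = line.charAt(i);
            if (c <= 32 || c >= 127) {
                return false;
            }
        }
        return true;
    }
}
```

Because parsing stops at the first empty line, a tab-prefixed body line that happens to contain `Name: value` is never reached, which is the behaviour this issue asks for.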




[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212929#comment-16212929
 ] 

Tim Allison commented on TIKA-2471:
---

It looks like [~kkrugler]'s original mbox contribution on TIKA-295 predates 
mime4j's MboxIterator by three years.

Are there any reasons not to trust mime4j's MboxIterator?

Looks like it hasn't had much activity in the last few years.

Should we try to integrate it?



[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-17 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208775#comment-16208775
 ] 

Luis Filipe Nassif commented on TIKA-2471:
--

Also, the tracking metadata feature was added before the addition of the 
RecursiveParserWrapper. I think it could be deprecated.



[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-17 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208769#comment-16208769
 ] 

Luis Filipe Nassif commented on TIKA-2471:
--

Hi Matthew,

If I remember correctly, some headers were not being extracted by the 
RFC822Parser after the refactoring of MboxParser, so that logic was added to 
get the missed headers back, right [~thaichat04]? I think it may be better to 
fix the RFC822Parser instead.

The windows-1252 charset was used initially to facilitate locating newlines and 
the "From" delimiter. I think it shouldn't be added to the contentType metadata 
of the mbox container.
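
The idea of decoding with a single-byte charset just to find line breaks and the "From " delimiter can be sketched as follows. This is not the MboxParser implementation; the class and method names are hypothetical. It also swaps in ISO-8859-1 for windows-1252, because ISO-8859-1 maps every byte to exactly one char, so each message can be turned back into its original bytes and handed to a parser that does its own charset detection.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of splitting an mbox into raw messages. */
class MboxSplitSketch {

    /**
     * Returns the bytes of each message, without its "From " separator line.
     * Simplification: any line starting with "From " counts as a separator,
     * so body lines must already be quoted (">From ") by the mbox writer.
     */
    static List<byte[]> split(byte[] mbox) {
        // One byte becomes one char, so char offsets equal byte offsets.
        String text = new String(mbox, StandardCharsets.ISO_8859_1);
        List<byte[]> messages = new ArrayList<>();
        int bodyStart = -1; // offset just past the current separator line, or -1
        int pos = 0;
        while (pos < text.length()) {
            int eol = text.indexOf('\n', pos);
            int next = (eol < 0) ? text.length() : eol + 1;
            if (text.startsWith("From ", pos)) {
                if (bodyStart >= 0) {
                    messages.add(bytes(text, bodyStart, pos));
                }
                bodyStart = next;
            }
            pos = next;
        }
        if (bodyStart >= 0) {
            messages.add(bytes(text, bodyStart, text.length()));
        }
        return messages;
    }

    /** Re-encodes a slice of the text back to its original bytes. */
    private static byte[] bytes(String text, int from, int to) {
        return text.substring(from, to).getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

In this sketch the charset is only a splitting aid and never leaves the method, which matches the suggestion above that it should not be reported in the container's contentType metadata.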



[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206418#comment-16206418
 ] 

Tim Allison commented on TIKA-2471:
---

That looks totally hosed.  Thank you for opening this and supplying an example 
triggering file. 

bq. But more to the point, what is the idea behind setting the headers in the 
MboxParser if they're going to be set by the RFC822Parser in any case?

TIKA-1244 brought that behavior in.  Before that, emails weren't treated as 
embedded files if I understand correctly.

bq.  why does the parser force Windows-1252 as the charset?
Again, no idea, but I suspect that was because of the rfc822 method of 
encoding.  Are you able to share an example where this corrupts the content?
