[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172283#comment-15172283
 ] 

Tim Allison edited comment on TIKA-1865 at 2/29/16 9:55 PM:
------------------------------------------------------------

I pulled a set of .msg files from Common Crawl.  Several of the files were
truncated or not actually MSG files, and many were "contacts" rather than
actual email messages.  The corpus is limited in many ways, so ymmv.

I did three things:
1) Dump the most obvious fields for the sender email address (attached).
Finding: in general, this works well with SMTP emails; for Exchange, things
get dicey.
2) For those emails with a header "From:" field, try to find all properties of
type String (ASCII or Unicode) that contained that email address.  I was
hoping this would identify new fields beyond the obvious ones...it didn't.
3) Find all property fields in Exchange emails that contained an email address
and weren't a recipient chunk.  I was hoping this would lead to common
patterns for Exchange emails not already picked up by the known properties,
but it didn't.
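
For reference, steps 2 and 3 boil down to something like the sketch below.
This is not the code I ran; it assumes the string properties have already been
read into a name-to-value map with recipient chunks left out, and the class
name, method names and map keys are made up for illustration.  The regex is
deliberately loose.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class EmailPropertyScan {
    // Loose pattern to flag candidates; it is not an RFC 5322 validator.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Step 2: names of properties whose value contains the known "From:" address.
    static List<String> propertiesContaining(Map<String, String> props, String fromAddress) {
        String needle = fromAddress.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            String value = e.getValue();
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Step 3: names of properties whose value contains anything email-like.
    static List<String> propertiesWithAnyEmail(Map<String, String> props) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            String value = e.getValue();
            if (value != null && EMAIL.matcher(value).find()) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical property labels and values, not taken from a real file.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("header-from", "Jane Doe <jane@example.com>");
        props.put("sender-address", "/O=EXAMPLE/OU=SITE/CN=RECIPIENTS/CN=JDOE");
        System.out.println(propertiesContaining(props, "jane@example.com"));
        System.out.println(propertiesWithAnyEmail(props));
    }
}
```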

In short, I think the best bet to extract the sender's email address is the
strategy that I recommended above.  I think we may also want to pull out the
sender's Exchange ID (different metadata property!), because that could be
useful as an identifier.
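
A rough sketch of what "different metadata property" could look like, assuming
the raw sender value is either an SMTP address or an Exchange-style
distinguished name starting with "/O=".  The two key names are hypothetical
placeholders, not existing Tika metadata keys, and the SMTP check is
intentionally crude.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class SenderAddressRouting {
    // Hypothetical key names, for illustration only.
    static final String EMAIL_KEY = "sender-email";
    static final String EXCHANGE_ID_KEY = "sender-exchange-id";

    // Puts the raw sender value under one of two keys, or under neither
    // when it looks like neither an Exchange DN nor an SMTP address.
    static Map<String, String> route(String rawSender) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (rawSender == null) {
            return metadata;
        }
        String value = rawSender.trim();
        int at = value.indexOf('@');
        if (value.toUpperCase(Locale.ROOT).startsWith("/O=")) {
            // Exchange-style distinguished name: keep it as an identifier.
            metadata.put(EXCHANGE_ID_KEY, value);
        } else if (at > 0 && at < value.length() - 1) {
            // Something on both sides of '@': treat it as an SMTP address.
            metadata.put(EMAIL_KEY, value);
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(route("jane@example.com"));
        System.out.println(route("/O=EXAMPLE/OU=SITE/CN=RECIPIENTS/CN=JDOE"));
    }
}
```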

-Finally, is there an easy way to tell if an msg file is a message, a post, an 
appointment or a contact?- We should probably add extraction for this: 
MAPIProperty.MESSAGE_CLASS
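
Once the message class string is available, telling the four kinds apart is
mostly a prefix check.  The sketch below assumes the value has already been
read from the file and uses the standard MAPI prefixes (IPM.Note, IPM.Post,
IPM.Appointment, IPM.Contact).  The class name, method name and returned
labels are made up, and comparing case-insensitively is a choice of the
sketch.

```java
import java.util.Locale;

final class MessageClassSketch {
    // Maps a MAPI message class string to a coarse kind.
    static String kindOf(String messageClass) {
        if (messageClass == null) {
            return "unknown";
        }
        String mc = messageClass.trim().toUpperCase(Locale.ROOT);
        if (isClassOrSubclass(mc, "IPM.NOTE")) {
            return "message";
        }
        if (isClassOrSubclass(mc, "IPM.POST")) {
            return "post";
        }
        if (isClassOrSubclass(mc, "IPM.APPOINTMENT")) {
            return "appointment";
        }
        if (isClassOrSubclass(mc, "IPM.CONTACT")) {
            return "contact";
        }
        return "other";
    }

    // True when mc is exactly the base class or extends it with ".Something".
    private static boolean isClassOrSubclass(String mc, String base) {
        return mc.equals(base) || mc.startsWith(base + ".");
    }

    public static void main(String[] args) {
        System.out.println(kindOf("IPM.Note"));
        System.out.println(kindOf("IPM.Contact"));
    }
}
```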


> Save sender email address in Outlook MSG metadata
> -------------------------------------------------
>
>                 Key: TIKA-1865
>                 URL: https://issues.apache.org/jira/browse/TIKA-1865
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Windows 7 x64, jre 1.8.0_60 x64
>            Reporter: Luis Filipe Nassif
>         Attachments: report.xlsx
>
>
> The sender email address is lost when extracting metadata from Outlook msg
> files. Currently only the sender name is extracted. The address is important
> information for search engines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
