Hi Joe,

Are you looking to pay this person to help or are you looking for someone with the same "itch" as you?
(Not that I am volunteering either way - it's not my area.)

Regards,
Dave

On Aug 21, 2012, at 2:33 PM, Joe Wicentowski wrote:

> Hi all,
>
> I hadn't heard from anyone about the question I posed last week --
> regarding POI/HSMF's problems identifying dates in Outlook .msg files.
> Is there a better forum for me to post this? Should I file a bug?
> Ideally, I'd like to find someone who can help complete the fix that
> Nick Burch began in POI's SVN trunk.
>
> Thanks for any pointers about the best way to proceed,
> Joe
>
> On Thu, Aug 16, 2012 at 6:52 PM, Joe Wicentowski <[email protected]> wrote:
>> Hi all,
>>
>> I'm building an application
>> that relies on Tika to extract text from Outlook 2007 .msg files.
>> Tika relies on POI's HSMF libraries, which is why I'm writing to this
>> list about a problem: HSMF is not pulling out the date of many of my
>> Outlook messages.
>>
>> For example, when I look at one of my message files (.msg) in Outlook,
>> it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
>> I process the same message with Tika, no date appears in the output
>> [1].
>>
>> In comparison, I tried using a different tool, ruby-msg
>> (http://code.google.com/p/ruby-msg/), to process the same message, and
>> ruby-msg did pull out the date [2]. This experiment shows that the
>> date *is* in the .msg file, and that Tika is failing to pick it up.
>>
>> Nick Burch from the Tika mailing list took a close, hands-on look at
>> my .msg file, determined the cause, and outlined a path to the fix:
>>
>>> I think I've figured out what's wrong. It looks like Outlook stores
>>> properties with a fixed size of 0-8 bytes in a different chunk in the file,
>>> which we weren't processing.
>>>
>>> If you wanted to tackle it, that'd be great! You'll want to take a look at
>>> PropertiesChunk, and fill in the TODOs for readProperties and
>>> writeProperties, then add unit tests.
>>> See:
>>>
>>> http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
>>>
>>> When that's all done and working, then
>>> the final step is to update MAPIMessage to read some of the values as needed
>>> out of the properties.
>>>
>>> The info I've been working with comes from this blog post:
>>> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
>>>
>>> (That links into suitable bits of the public documentation)
>>>
>>> I suspect it's under a day's work. I've put in place the basics, just needs
>>> someone to flesh it out.
>>
>> While Nick kindly tracked down the cause, unfortunately I lack the
>> Java chops to complete the solution.
>>
>> Would anyone here be kind enough to assist me with this?
>>
>> I'm happy to test any attempted fixes, and I'm happy to provide more
>> info, like sample Outlook files (.msg files). My hope is that this
>> fix will allow POI to "just work" for more users who are evaluating
>> it.
>>
>> Thank you in advance,
>> Joe
>>
>>
>> [1] Tika output showing no date, retrieved via the following command:
>>
>> java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html
>>
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta name="Message-Bcc" content="" />
>> <meta name="subject" content="Inquiry" />
>> <meta name="Content-Length" content="40960" />
>> <meta name="Message-Recipient-Address" content="[email protected]" />
>> <meta name="Message-From" content="History Mailbox" />
>> <meta name="Author" content="History Mailbox" />
>> <meta name="Message-To" content="'Snip'" />
>> <meta name="Message-Cc" content="" />
>> <meta name="Content-Type" content="application/vnd.ms-outlook" />
>> <meta name="resourceName" content="RE Inquiry.msg" />
>> </head>
>> <body>
>> <h1>RE: Inquiry</h1>
>> <dl>
>> <dt>From</dt>
>> <dd>History Mailbox</dd>
>> <dt>To</dt>
>> <dd>'Snip'</dd>
>> <dt>Recipients</dt>
>> <dd>[email protected]</dd>
>> </dl>
>> <p>Dear Snip</p>
>> ...
>>
>> [2] The ruby-msg output -- notice the "Date:" line:
>>
>> From: "History Mailbox" <[email protected]>
>> To: "Snip" <[email protected]>
>> Subject: RE: Inquiry
>> Date: Fri, 22 Jun 2012 12:11:00 -0000
>> Message-ID: <[email protected]>
>> In-Reply-To:
>> <CAJ4nNe1FPo7Q=10dbk8sdzprarzykjv6skv3nyg5l2li13b...@mail.gmail.com>
>> Priority: 0
>> Thread-Topic: Inquiry
>> Content-Type: multipart/alternative;
>> boundary="----_=_NextPart_001_8149ed38.4fec8c61"
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
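
For anyone picking up the PropertiesChunk TODOs mentioned in the quoted message, here is a rough sketch of what decoding one fixed-size date property might look like. It is not POI code and does not use POI's API. It assumes the layout described in the public .msg documentation as I understand it: each fixed-size entry is 16 bytes, little-endian, made up of a 2-byte type, a 2-byte property id, 4 bytes of flags and an 8-byte value, and a time value (type 0x0040) is a Windows FILETIME counting 100 ns ticks since 1601-01-01 UTC. The class name, method names and the property id used in main (0x0039) are placeholders of mine, so please check them against the documentation before relying on them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper, not part of POI: decodes one fixed-size property entry.
final class FixedPropertySketch {
    // Assumed from the public MAPI documentation: property type code for a time value.
    static final int PT_SYSTIME = 0x0040;
    // Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
    static final long FILETIME_EPOCH_OFFSET_SECONDS = 11_644_473_600L;
    // Assumed size of one fixed-size entry: tag (4) + flags (4) + value (8).
    static final int ENTRY_SIZE = 16;

    /** Converts a FILETIME (100 ns ticks since 1601-01-01 UTC) to an Instant. */
    static Instant fileTimeToInstant(long fileTime) {
        long seconds = Math.floorDiv(fileTime, 10_000_000L) - FILETIME_EPOCH_OFFSET_SECONDS;
        long nanos = Math.floorMod(fileTime, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    /** Returns the time held in a 16-byte entry, or empty if the entry is not a time. */
    static Optional<Instant> readSystemTime(byte[] entry) {
        if (entry.length < ENTRY_SIZE) {
            throw new IllegalArgumentException("expected at least 16 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(entry).order(ByteOrder.LITTLE_ENDIAN);
        int type = buf.getShort() & 0xFFFF; // low half of the property tag
        buf.getShort();                     // property id, ignored in this sketch
        buf.getInt();                       // flags, ignored in this sketch
        long value = buf.getLong();
        return type == PT_SYSTIME ? Optional.of(fileTimeToInstant(value)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Build a sample entry carrying the send time reported by ruby-msg above.
        Instant sent = Instant.parse("2012-06-22T12:11:00Z");
        long fileTime = (sent.getEpochSecond() + FILETIME_EPOCH_OFFSET_SECONDS) * 10_000_000L;
        byte[] entry = ByteBuffer.allocate(ENTRY_SIZE).order(ByteOrder.LITTLE_ENDIAN)
                .putShort((short) PT_SYSTIME)
                .putShort((short) 0x0039) // placeholder property id
                .putInt(0)
                .putLong(fileTime)
                .array();
        System.out.println(readSystemTime(entry));
    }
}
```

A real readProperties would also need to find the properties stream, skip its header and loop over every entry, none of which is shown here.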
