Tucker Barbour created TIKA-2875:
------------------------------------
Summary: Support Google Takeout MBOX format for GChat Messages
Key: TIKA-2875
URL: https://issues.apache.org/jira/browse/TIKA-2875
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.20
Environment: java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
Reporter: Tucker Barbour
Attachments: Sample.mbox
The [Google Takeout|https://takeout.google.com] tool lets a user export
Gmail and GChat messages as an MBOX archive. Tika's content type detection
correctly identifies this format as MBOX. However, the provided MBOX parser does
not seem to support the format of the `From` header used for GChat messages. I've
attached an example chat to the ticket (Sample.mbox). In it, the From
header includes a from address followed by the sent timestamp. As I understand
it, this is a valid From header format. I would expect the Tika MBOX parser to
parse this From header and set the message's sent time to the timestamp it
contains.
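
To make the expected behaviour concrete, here is a minimal sketch of splitting such a From line into an address and a sent time. It is not Tika's MboxParser code. The class name `MboxFromLine`, its fields, and the `parse` method are hypothetical, and the line layout (`From <address> <day> <month> <date> <time> <offset> <year>`) is an assumption, since the contents of Sample.mbox are not reproduced in this message.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
final class MboxFromLine {

    // "From ", then a whitespace-free address token, then the rest as the timestamp.
    private static final Pattern FROM_LINE =
            Pattern.compile("^From (\\S+)\\s+(.*\\S)\\s*$");

    // Assumed timestamp layout, e.g. "Tue May 14 10:15:30 +0000 2019".
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss Z yyyy", Locale.ENGLISH);

    final String address;
    final OffsetDateTime sent;

    private MboxFromLine(String address, OffsetDateTime sent) {
        this.address = address;
        this.sent = sent;
    }

    // Returns the parsed address and sent time, or empty if the line
    // does not match the assumed layout.
    static Optional<MboxFromLine> parse(String line) {
        Matcher m = FROM_LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        // Collapse runs of spaces so padded day-of-month values still parse.
        String timestamp = m.group(2).replaceAll("\\s+", " ");
        try {
            OffsetDateTime sent = OffsetDateTime.parse(timestamp, TIMESTAMP);
            return Optional.of(new MboxFromLine(m.group(1), sent));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Under these assumptions, calling `MboxFromLine.parse(...)` on the first line of each message would yield the value I would expect the parser to record as the sent time. Lines that do not match the layout return an empty result, so they could fall back to whatever the parser does today.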
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)