Tucker Barbour created TIKA-2875:
------------------------------------

             Summary: Support Google Takeout MBOX format for GChat Messages
                 Key: TIKA-2875
                 URL: https://issues.apache.org/jira/browse/TIKA-2875
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.20
         Environment: java version "1.8.0_181"
                      Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
                      Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
            Reporter: Tucker Barbour
         Attachments: Sample.mbox

The [Google Takeout|https://takeout.google.com] tool allows a user to export 
Gmail and GChat messages as an MBOX archive. Tika's content type detection 
correctly identifies these archives as MBOX. However, the provided MBOX parser 
does not seem to support the format of the `From` header used for GChat 
messages. I've attached an example chat (Sample.mbox) to the ticket. In it, 
the `From` header includes both a sender address and the sent timestamp. As I 
understand it, this is a valid `From` header format. I would expect the Tika 
MBOX parser to parse this header and set the message's sent time to the 
timestamp it contains.
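
To make the expected behaviour concrete, here is a minimal sketch of pulling 
the sender address and sent time out of a `From` line. It is not Tika's 
implementation. The class name `MboxFromLine`, its methods, and the sample 
line are hypothetical, and it assumes a traditional asctime-style timestamp 
such as `Thu Jan 17 10:15:30 2019`. The exact layout in Sample.mbox may 
differ (for example, it may carry a UTC offset), in which case the formatter 
pattern would need adjusting.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class MboxFromLine {

    // "From <address> <timestamp>": group 1 is the address, group 2 the rest of the line.
    private static final Pattern FROM_LINE =
            Pattern.compile("^From (\\S+)\\s+(.+?)\\s*$");

    // Assumed asctime-style layout, e.g. "Thu Jan 17 10:15:30 2019".
    private static final DateTimeFormatter ASCTIME =
            DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss uuuu", Locale.ENGLISH);

    // Returns the sender address, or empty if the line is not a From line.
    static Optional<String> address(String line) {
        Matcher m = FROM_LINE.matcher(line);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Returns the sent time, or empty if the line or its timestamp does not match.
    static Optional<LocalDateTime> sentTime(String line) {
        Matcher m = FROM_LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        // asctime pads single-digit days with an extra space ("Jan  7"), so collapse runs.
        String timestamp = m.group(2).replaceAll("\\s+", " ");
        try {
            return Optional.of(LocalDateTime.parse(timestamp, ASCTIME));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Returning `Optional.empty()` for an unrecognised timestamp, rather than 
throwing, lets a caller still treat the line as a message boundary and simply 
leave the sent time unset.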



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
