Hi everyone!
I'm not sure whether this is a design decision, a bug, something not
implemented, or malformed input on my side.
My use case is the following: users who forward their mail outside the domain
I manage have the option to report spam as an attachment, so the spam
classifier can still learn from it. I prefer attaching over forwarding because
each MUA forwards in its own way, which makes it difficult to reliably
reconstruct some of the headers and message structure that can be important for
the spam classifier.
Each incoming message is processed with the Sieve extensions of RFC 5703 to
extract the attachment. I ask users to produce a file with an .eml extension
through any export-like feature of their MUA.
Below is an example EML file produced by Tuta (Thunderbird does a similar thing
with a different encoding for the body). I'll use a bunch of = characters as a
separator in the rest of this.
===============================================
[...bunch of header...]
X-MS-Exchange-CrossTenant-rms-persistedconsumerorg:
00000000-0000-0000-0000-0000000000
00
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYSPR04MB7035
Content-Type: multipart/related;
boundary="------------79Bu5A16qPEYcVIZL@tutanota"
--------------79Bu5A16qPEYcVIZL@tutanota
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64
DQo8ZGl2IHN0eWxlPSJmb250LWZhbWlseTogQXB0b3MsIEFwdG9zX0VtYmVkZGVkRm9udCwgQXB0b3
NfTVNGb250U2VydmljZSwgQ2FsaWJyaSwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6
[...base64 encoding of the body...]
===============================================
Once attached and sent, the received message looks like this:
===============================================
[...headers of the actual email...]
X-Infomaniak-Routing: alpha
This is a multi-part message in MIME format.
--------------5jyKrhQ08xXUif7LHhgg648N
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
--------------5jyKrhQ08xXUif7LHhgg648N
Content-Type: message/rfc822; name="a.eml"
Content-Disposition: attachment; filename="a.eml"
Content-Transfer-Encoding: 7bit
[...headers of the spam....]
Subject: Re: //Re: AI + Ranking ///
Thread-Topic: //Re: AI + Ranking ///
[....]
Content-Type: multipart/related;
boundary="------------79Bu5A16qPEYcVIZL@tutanota"
--------------79Bu5A16qPEYcVIZL@tutanota
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64
DQo8ZGl2IHN0eWxlPSJmb250LWZhbWlseTogQXB0b3MsIEFwdG9zX0VtYmVkZGVkRm9udCwgQXB0b3
NfTVNGb250U2VydmljZSwgQ2FsaWJyaSwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6
[...base64 encoding of the spam....]
--------------79Bu5A16qPEYcVIZL@tutanota--
--------------5jyKrhQ08xXUif7LHhgg648N--
===============================================
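As a sanity check on the encoding, the two base64 lines quoted above can be
decoded outside of Dovecot. This is only a sketch using Python's standard
library, not anything Pigeonhole does internally, and it covers just the quoted
prefix, since the rest of the body is elided here.

```python
import base64

# The two base64 lines quoted in the sample above, copied verbatim.
# The remainder of the body is elided in this post, so only this prefix
# is decoded.
B64_LINES = [
    "DQo8ZGl2IHN0eWxlPSJmb250LWZhbWlseTogQXB0b3MsIEFwdG9zX0VtYmVkZGVkRm9udCwgQXB0b3",
    "NfTVNGb250U2VydmljZSwgQ2FsaWJyaSwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6",
]


def decode_prefix(lines):
    """Join the wrapped lines, base64-decode them, and read the bytes as UTF-8."""
    # validate=True makes b64decode raise on characters outside the alphabet
    # instead of silently skipping them.
    raw = base64.b64decode("".join(lines), validate=True)
    return raw.decode("utf-8")


decoded = decode_prefix(B64_LINES)
print(repr(decoded))
```

The prefix decodes to the start of an HTML <div> with a font-family style, so at
least this part of the body is valid base64 and valid UTF-8.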
I face two issues:
1. Only the base64-decoded string can be retrieved through
foreverypart/extracttext;
2. It has a lot of newlines.
- 2. is not a big deal. The original spam has a lot of HTML junk, and each line
of HTML without an inner text node is stripped, but its newline is kept. I
don't know if this is intended.
- 1. is a bigger deal. To try to confirm it, I made an oversimplified
version of the Sieve script (I'm skipping the configuration, but the plugins
are loaded, etc.; please ask if it would be useful).
===============================================
[...require...]
foreverypart {
extracttext "eml";
debug_log "PART ============================== ${eml}";
}
===============================================
mail_debug is activated. In the logs, I can read:
===============================================
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve: DEBUG: PART ==============================
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve: Hello,
Info: sieve:
Info: sieve: Hope you're doing well.
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve:
Info: sieve: Would you like attract more traffic with our AEO + GEO + SEO
services. AI-driven search is here—don’t miss out.
===============================================
So multiple parts are analyzed, but only the last one yields anything. The RFC
says that "If the transfer encoding or character set is unrecognized by the
implementation or recognized but invalid, an empty string will result." But
whether that applies here is beyond my knowledge.
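To see how many parts a generic MIME parser finds in a message nested like the
one above, here is a minimal sketch using Python's standard email package. It
is not Pigeonhole's parser. The sample is a hand-written stand-in: the
addresses, subjects, boundaries, the HTML body, and the outer multipart/mixed
type (which sits in the headers I elided) are all assumptions.

```python
import base64
from email import message_from_string, policy

HTML = "<p>Hello,</p>"  # hypothetical stand-in for the spam body
B64 = base64.b64encode(HTML.encode("utf-8")).decode("ascii")

# Same nesting as the received mail: an outer multipart with an empty
# text/plain part and a message/rfc822 attachment that wraps a
# multipart/related message with one base64 text/html part.
SAMPLE = "\n".join([
    "From: reporter@example.org",
    "Subject: spam report",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="OUTER"',  # assumed outer type
    "",
    "This is a multi-part message in MIME format.",
    "--OUTER",
    "Content-Type: text/plain; charset=UTF-8; format=flowed",
    "Content-Transfer-Encoding: 7bit",
    "",
    "",
    "--OUTER",
    'Content-Type: message/rfc822; name="a.eml"',
    'Content-Disposition: attachment; filename="a.eml"',
    "Content-Transfer-Encoding: 7bit",
    "",
    "From: spammer@example.com",
    "Subject: placeholder",
    'Content-Type: multipart/related; boundary="INNER"',
    "",
    "--INNER",
    "Content-Type: text/html; charset=UTF-8",
    "Content-Transfer-Encoding: base64",
    "",
    B64,
    "--INNER--",
    "--OUTER--",
    "",
])


def list_parts(raw):
    """Return (content type, text) for every node of the tree, depth first."""
    msg = message_from_string(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        # Only text/* leaves carry text here; multipart containers and the
        # message/rfc822 wrapper are reported with an empty string.
        is_text = part.get_content_maintype() == "text"
        text = part.get_content() if is_text else ""
        rows.append((part.get_content_type(), text.strip()))
    return rows


rows = list_parts(SAMPLE)
for ctype, text in rows:
    print(f"{ctype:20} {text!r}")
```

This lists five nodes (multipart/mixed, text/plain, message/rfc822,
multipart/related, text/html), and only the last one carries any text. If
foreverypart visits the same nodes, that would be consistent with the five
debug_log lines above, but I have not confirmed that this is how Pigeonhole
counts parts.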
I enabled Sieve tracing at the "matching" level, and the trace file looks like
this (it is from another test, but with the same result):
===============================================
Sieve trace log for message delivery:
Username: REDACTED
Session ID: ozNMOEPNl2nlAAAAO14Lzw
Sender: REDACTED
Final recipient: REDACTED
Default mailbox: INBOX
## Started executing script 'move-spam'
33: foreverypart loop begin
33: loop ends at line 36
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = ""
36: debug_log "PART ============================== "
36: foreverypart loop end
36: switched to next message part
36: looping back to line 35
35: extracttext command
36: assign 'eml' [0] = "Hi,
Just checking in to see if you had a chance to review my earlier message.
[...body spam....]
36: debug_log "PART ============================== ????????????Hi, ?? ??Just
checking in t..."
36: foreverypart loop end
36: no more message parts
36: exiting loops at line 36
## Finished executing script 'move-spam'
===============================================
Hypotheses:
A. Could it be because the mail headers are indistinguishable from the part
headers?
B. Or is it related to Pigeonhole not yet implementing the enclose extension,
which does this the other way around?
I hope this message is not too hard to read with all the code/logs. Please tell
me if there is a better way of doing it.
And thanks in advance for any help, because I don't really know what to do, and
taking a (superficial) look at the Pigeonhole code didn't help.