Hi everyone!

I'm not sure whether this is a design decision, a bug, something not 
implemented, or malformed input on my side.

My use case is the following: users who forward their mail outside the domain 
I manage can report spam as an attachment, so the spam classifier can still 
learn from it. I prefer an attachment over a forward because each MUA forwards 
in its own way, which makes it difficult to reliably reconstruct some of the 
headers and the email structure that can be important for the spam classifier.

Each incoming message is processed with the Sieve extensions of RFC 5703 to 
extract the attachment. I ask users to produce a file with the .eml extension 
through any export-like feature of their MUA.

Below is an example EML file produced by Tuta (Thunderbird does a similar thing 
with a different encoding for the body). I'll use a bunch of = characters as a 
separator in the rest of this.

===============================================
[...bunch of header...]
X-MS-Exchange-CrossTenant-rms-persistedconsumerorg: 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYSPR04MB7035
Content-Type: multipart/related; 
boundary="------------79Bu5A16qPEYcVIZL@tutanota"

--------------79Bu5A16qPEYcVIZL@tutanota
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64

DQo8ZGl2IHN0eWxlPSJmb250LWZhbWlseTogQXB0b3MsIEFwdG9zX0VtYmVkZGVkRm9udCwgQXB0b3
NfTVNGb250U2VydmljZSwgQ2FsaWJyaSwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6
[...base64 encoding of the body...]
===============================================

Once attached and sent, the received message looks like this:
===============================================
[...headers of the actual email...]
X-Infomaniak-Routing: alpha

This is a multi-part message in MIME format.
--------------5jyKrhQ08xXUif7LHhgg648N
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit


--------------5jyKrhQ08xXUif7LHhgg648N
Content-Type: message/rfc822; name="a.eml"
Content-Disposition: attachment; filename="a.eml"
Content-Transfer-Encoding: 7bit

[...headers of the spam....]
Subject: Re: //Re: AI + Ranking ///
Thread-Topic: //Re: AI + Ranking ///
[....]
Content-Type: multipart/related; 
boundary="------------79Bu5A16qPEYcVIZL@tutanota"

--------------79Bu5A16qPEYcVIZL@tutanota
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: base64

DQo8ZGl2IHN0eWxlPSJmb250LWZhbWlseTogQXB0b3MsIEFwdG9zX0VtYmVkZGVkRm9udCwgQXB0b3
NfTVNGb250U2VydmljZSwgQ2FsaWJyaSwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6
[...base64 encoding of the spam....]
--------------79Bu5A16qPEYcVIZL@tutanota--

--------------5jyKrhQ08xXUif7LHhgg648N--
===============================================

I face two issues:
1. Only the base64-decoded body can be retrieved through 
foreverypart/extracttext;
2. The extracted text contains a lot of newlines.

- 2. is not a big deal. The original spam has a lot of HTML junk; each line of 
HTML without an inner text node is stripped, but its newline is kept. I don't 
know if this is intended.
- 1. is a bigger deal. To try to confirm it, I made an over-simplified version 
of the Sieve script (I skip the configuration, but the plugins are loaded etc.; 
please ask if it would be useful).

===============================================
[...require...]
foreverypart {
    extracttext "eml";
    debug_log "PART ============================== ${eml}";
}
===============================================

mail_debug is activated. In the logs, I can read:
===============================================
Info: sieve: DEBUG: PART ============================== 
Info: sieve: DEBUG: PART ============================== 
Info: sieve: DEBUG: PART ============================== 
Info: sieve: DEBUG: PART ============================== 
Info: sieve: DEBUG: PART ============================== 
Info: sieve: 
Info: sieve:   
Info: sieve: 
Info: sieve:   
Info: sieve: 
Info: sieve: 
Info: sieve: Hello, 
Info: sieve: 
Info: sieve: Hope you're doing well. 
Info: sieve: 
Info: sieve: 
Info: sieve: 
Info: sieve: 
Info: sieve: Would you like attract more traffic with our AEO + GEO + SEO 
services. AI-driven search is here—don’t miss out. 
===============================================

So multiple parts are analyzed, but only the last one yields anything. The RFC 
says that "If the transfer encoding or character set is unrecognized by the 
implementation or recognized but invalid, an empty string will result." But 
that is beyond my knowledge.
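
To get a feel for which parts a depth-first walk visits, here is a minimal 
Python sketch. It uses Python's standard email parser, not Pigeonhole, so it 
only illustrates the MIME nesting. The message is a shortened stand-in for the 
one above; the boundaries "outer" and "inner", the one-line HTML body and the 
describe_parts helper are made up for the illustration.

```python
import base64
import email
from email import policy

# Hypothetical, shortened stand-in for the report message: same nesting,
# invented boundaries ("outer", "inner") and a one-line HTML body.
HTML_B64 = base64.b64encode(b"<div>Hello,</div>").decode("ascii")
RAW = "\n".join([
    "Subject: spam report",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="outer"',
    "",
    "This is a multi-part message in MIME format.",
    "--outer",
    "Content-Type: text/plain; charset=UTF-8; format=flowed",
    "Content-Transfer-Encoding: 7bit",
    "",
    "",
    "--outer",
    'Content-Type: message/rfc822; name="a.eml"',
    'Content-Disposition: attachment; filename="a.eml"',
    "Content-Transfer-Encoding: 7bit",
    "",
    "Subject: Re: //Re: AI + Ranking ///",
    'Content-Type: multipart/related; boundary="inner"',
    "",
    "--inner",
    "Content-Type: text/html; charset=UTF-8",
    "Content-Transfer-Encoding: base64",
    "",
    HTML_B64,
    "--inner--",
    "",
    "--outer--",
    "",
])


def describe_parts(raw):
    """Return (content type, extracted text) for each part, depth first."""
    msg = email.message_from_string(raw, policy=policy.default)
    result = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            text = part.get_content()  # undoes base64 and decodes the charset
        else:
            text = ""  # multipart/* and message/rfc822 are only containers
        result.append((part.get_content_type(), text.strip()))
    return result


for ctype, text in describe_parts(RAW):
    print(ctype, repr(text))
```

With Python's parser the walk visits five parts (multipart/mixed, text/plain, 
message/rfc822, multipart/related, text/html) and only the last one carries 
text. That would be consistent with the five iterations in my logs, but I 
cannot tell from this whether Pigeonhole counts the parts the same way.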

I enabled Sieve tracing at the "matching" level, and the trace file looks like 
this (it comes from another test, but the result was the same):

===============================================
Sieve trace log for message delivery:

  Username: REDACTED
  Session ID: ozNMOEPNl2nlAAAAO14Lzw
  Sender: REDACTED
  Final recipient: REDACTED
  Default mailbox: INBOX


      ## Started executing script 'move-spam'
  33: foreverypart loop begin
  33:   loop ends at line 36
  35: extracttext command
  36:   assign 'eml' [0] = ""
  36: debug_log "PART ============================== "
  36: foreverypart loop end
  36:   switched to next message part
  36:   looping back to line 35
  35: extracttext command
  36:   assign 'eml' [0] = ""
  36: debug_log "PART ============================== "
  36: foreverypart loop end
  36:   switched to next message part
  36:   looping back to line 35
  35: extracttext command
  36:   assign 'eml' [0] = ""
  36: debug_log "PART ============================== "
  36: foreverypart loop end
  36:   switched to next message part
  36:   looping back to line 35
  35: extracttext command
  36:   assign 'eml' [0] = ""
  36: debug_log "PART ============================== "
  36: foreverypart loop end
  36:   switched to next message part
  36:   looping back to line 35
  35: extracttext command
  36:   assign 'eml' [0] = "Hi,



Just checking in to see if you had a chance to review my earlier message.
[...body spam....]
  36: debug_log "PART ============================== ????????????Hi, ??  ??Just 
checking in t..."
  36: foreverypart loop end
  36:   no more message parts
  36:   exiting loops at line 36
      ## Finished executing script 'move-spam'
===============================================

Hypotheses:
A. Could it be because the mail headers are indistinguishable from the part 
headers?
B. Or is it related to Pigeonhole not yet implementing the enclose extension, 
which does this the other way around?
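
Regarding hypothesis A, here is a minimal Python sketch of how a MIME parser 
can keep the two header blocks apart. Again, this is Python's standard email 
parser and not Pigeonhole; the message is a shortened stand-in with invented 
boundaries, and split_attachment is a hypothetical helper name.

```python
import base64
import email
from email import policy

# Hypothetical, shortened stand-in for the report message; the boundaries
# ("outer", "inner") and the one-line HTML body are invented.
HTML_B64 = base64.b64encode(b"<div>Hello,</div>").decode("ascii")
RAW = "\n".join([
    "Subject: spam report",
    'Content-Type: multipart/mixed; boundary="outer"',
    "",
    "--outer",
    'Content-Type: message/rfc822; name="a.eml"',
    'Content-Disposition: attachment; filename="a.eml"',
    "Content-Transfer-Encoding: 7bit",
    "",
    "Subject: Re: //Re: AI + Ranking ///",
    "Thread-Topic: //Re: AI + Ranking ///",
    'Content-Type: multipart/related; boundary="inner"',
    "",
    "--inner",
    "Content-Type: text/html; charset=UTF-8",
    "Content-Transfer-Encoding: base64",
    "",
    HTML_B64,
    "--inner--",
    "",
    "--outer--",
    "",
])


def split_attachment(raw):
    """Return (part headers, embedded headers, embedded source) or None."""
    msg = email.message_from_string(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            inner = part.get_payload(0)  # the embedded message itself
            return part.keys(), inner.keys(), inner.as_string()
    return None


part_headers, spam_headers, spam_source = split_attachment(RAW)
print("part headers:", part_headers)
print("spam headers:", spam_headers)
```

Here the headers of the message/rfc822 part and the headers of the attached 
spam come out as two separate lists, because the blank line that ends the part 
headers separates them. Whether extracttext is meant to expose the second 
block at all is something I cannot tell from this.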

I hope this message is not too hard to read with all the code/logs. Please tell 
me if there is a better way of doing it.

And thanks in advance for any help, because I don't really know what to do, 
and taking a (superficial) look at the Pigeonhole code didn't help.