Note that when you're talking about aggregate reports, records are per
        sending IP + disposition + reason + authentication results

(Theoretically, recipient domain also enters into the equation, but at the
moment nobody generates a report in which it varies.)
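
To make the record key concrete, here is a minimal sketch of how per-message results might collapse into records. The field names (source_ip, disposition, reason, dkim, spf) and the aggregate() helper are illustrative assumptions, not any particular generator's implementation.

```python
from collections import Counter

def aggregate(messages):
    """Collapse per-message results into records keyed by
    sending IP + disposition + reason + authentication results."""
    counts = Counter()
    for m in messages:
        # Hypothetical field names; a real generator's data model may differ.
        key = (m["source_ip"], m["disposition"], m["reason"], m["dkim"], m["spf"])
        counts[key] += 1
    return counts

# Made-up sample data using documentation-range IP addresses.
messages = [
    {"source_ip": "192.0.2.1", "disposition": "none", "reason": "", "dkim": "pass", "spf": "pass"},
    {"source_ip": "192.0.2.1", "disposition": "none", "reason": "", "dkim": "pass", "spf": "pass"},
    {"source_ip": "198.51.100.7", "disposition": "reject", "reason": "", "dkim": "fail", "spf": "fail"},
]
records = aggregate(messages)
for key, count in records.items():
    print(key, count)
```

Three messages become two records here: the record count tracks distinct keys, not mail volume.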

That could mean several records per sending IP, but in practice it doesn't;
most sending IPs get a single record. So 1,000,000 records will usually
be at least 500,000 sending IPs. Even big senders with lots of mail going
through forwarders and mailing lists come nowhere near that in valid mail.
The result is that truncated reports consist overwhelmingly of abusive
mail, and in particular of abusive mail sent from botnets.

Because these abusive IPs send few messages per IP, and most valid IPs
send large numbers of messages per IP, you can easily truncate half or more
of the records and still end up with a report that represents the majority
of the actual mail. So in order to determine statistical significance, you
probably don't want to know how many report records are omitted, but how
many pieces of mail are omitted.
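
A rough sketch of that arithmetic, with entirely made-up numbers, and assuming the generator keeps the highest-volume records when it truncates (how any real generator picks which records to drop is not stated here):

```python
def coverage(record_counts, keep):
    """Return (fraction of records kept, fraction of messages kept)
    when only the `keep` largest records survive truncation."""
    ordered = sorted(record_counts, reverse=True)
    kept = ordered[:keep]
    return len(kept) / len(ordered), sum(kept) / sum(ordered)

# Hypothetical traffic: 10 legitimate IPs at 10,000 messages each,
# plus 990 botnet IPs at 2 messages each, one record per IP.
counts = [10_000] * 10 + [2] * 990

rec_frac, msg_frac = coverage(counts, keep=500)
print(f"records kept: {rec_frac:.1%}, messages kept: {msg_frac:.1%}")
```

With these numbers, dropping half the records still leaves about 99% of the messages in the report, which is why the omitted-record count says little about significance.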

There are two reasons that reports are truncated. First, there's the effort
to keep the reports to a size that people can effectively receive and
process. That's easily achieved by truncating reports after generation.
Second, report generators need to keep the resource consumption of the
reports to a manageable level. That generally requires just ignoring things
at some point. Therefore, as a report generator, I'm reluctant to volunteer
to count up anything about the data I'm not reporting on. However, if I were
going to count something, it would be messages, not report lines -- messages
are both more useful for calculating significance and cheaper to count.
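
As an illustration of the cost difference, here is a hypothetical capped accumulator (CappedReport is an invented name, not a description of any real generator). Once the record cap is hit, counting omitted messages takes a single integer, whereas counting omitted records would mean remembering every distinct key that was turned away.

```python
class CappedReport:
    """Sketch of a generator that stops creating records at a fixed cap."""

    def __init__(self, max_records):
        self.max_records = max_records
        self.records = {}          # record key -> message count
        self.omitted_messages = 0  # one integer, regardless of traffic

    def add(self, key):
        if key in self.records:
            self.records[key] += 1
        elif len(self.records) < self.max_records:
            self.records[key] = 1
        else:
            # No room for a new record: count the message and forget the key.
            self.omitted_messages += 1

report = CappedReport(max_records=2)
for key in ["a", "b", "a", "c", "c", "d"]:
    report.add(key)
print(report.records, report.omitted_messages)
```

In this run the omitted keys "c" and "d" are never stored, so the number of omitted records is unknown to the generator, while the number of omitted messages is tracked at constant cost.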

        Elizabeth Zwicky 

On Aug 18, 2012, at 6:50 AM, <[email protected]> wrote:

> It is possible for some sites to choose an arbitrary Aggregate Report size
> and to truncate reports at that size. Google currently truncates at
> 1,000,000 records. The problem is that without knowing how many records
> are missing, we do not know if we can trust the sent data. For example, if
> 1,000,000 records are reported for example.com, and 20 were omitted, that
> is not statistically significant enough to worry about, but if 100,000
> were omitted the actually reported data may be misleading and should
> probably not be used.
> I suggest that the Aggregate Reports should contain an indication
> of the data included and omitted. Perhaps:
> 
>       <records_available>1405671</records_available>
>       <records_reported>1000000</records_reported>
> 
> Best Regards,
> --Bryan Costales
> _______________________________________________
> dmarc-discuss mailing list
> [email protected]
> http://www.dmarc.org/mailman/listinfo/dmarc-discuss
> 
> NOTE: Participating in this list means you agree to the DMARC Note Well terms 
> (http://www.dmarc.org/note_well.html)

