Elizabeth,

An excellent point: counting messages not reported would be both easier and more valuable than counting records not reported. So I ask: should this information be provided and, if so, should it appear at the top of the report along with begin/end, or at the end of the report as part of a new section called something like "summary"?
Then, armed with such information, one could decide whether to process the report, ignore it, or interpolate it.

Best Regards,
--Bryan Costales

>------------
> Quoting Elizabeth Zwicky <[email protected]>
> Subject: Re: [dmarc-discuss] Aggregate Report Missing Data
>------------
>
> Note that when you're talking about aggregate reports, records are per
> sending ip + disposition + reason + authentication results
>
> (Theoretically, recipient domain also enters into the equation, but at the
> moment nobody generates a report in which it varies.)
>
> That could be a number of records per sending IP, but in practice it isn't;
> most sending IPs get a single record. So 1,000,000 records will usually
> be at least 500,000 sending IPs. Even big
> senders with lots of mail going through forwarders and mailing lists come nowhere
> near that in valid mail. The result is that truncated reports consist in the vast majority
> of abusive mail, and in particular of abusive mail sent from botnets.
>
> Because these abusive IPs send few messages per IP, and most valid IPs
> send large numbers of messages per IP, you can easily truncate half or more of the records
> and still end up with a report that represents the majority of the actual mail. So in order to
> determine statistical significance, you probably don't want to know
> how many report records are omitted, but how many pieces of mail are omitted.
>
> There are two reasons that reports are truncated. First, there's the effort to keep the
> reports to a size that people can effectively receive and process. That's easily achieved
> by truncating reports after generation. Second, report generators need to keep the resource
> consumption of the reports to a manageable level. That generally requires just ignoring
> things at some point. Therefore, as a report generator, I'm reluctant to volunteer to count up
> anything about the data I'm not reporting on. However, if I were going to count something,
> it would be messages, not report lines -- messages are both more useful for calculating
> significance and lower cost to count.
>
> Elizabeth Zwicky
>
> On Aug 18, 2012, at 6:50 AM, <[email protected]> wrote:
>
> > It is possible for some sites to choose an arbitrary Aggregate Report size and
> > to truncate reports at that size. Google currently truncates at 1,000,000 records.
> > The problem is that without knowing how many records are missing, we do not know
> > if we can trust the sent data. For example, if 1,000,000 records are reported
> > for example.com, and 20 were omitted, that is not statistically significant
> > enough to worry about, but if 100,000 were omitted the data actually reported
> > may be misleading and should probably not be used.
> > I suggest that the Aggregate Reports should contain an indication
> > of the data included and omitted. Perhaps:
> >
> > <records_available>1405671</records_available>
> > <records_reported>1000000</records_reported>
> >
> > Best Regards,
> > --Bryan Costales
> > _______________________________________________
> > dmarc-discuss mailing list
> > [email protected]
> > http://www.dmarc.org/mailman/listinfo/dmarc-discuss
> >
> > NOTE: Participating in this list means you agree to the DMARC Note Well terms (http://www.dmarc.org/note_well.html)
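
To make the records-versus-messages point concrete, here is a minimal sketch on purely synthetic numbers. Everything in it is hypothetical and is not part of any report format: the `Coverage` class, the `truncate` and `decide` helpers, the two thresholds, and the assumption that a generator keeps the highest-volume records first. It only illustrates that a report can lose about half of its records while still covering nearly all of the mail, and how a receiver might then choose between processing, ignoring, or interpolating.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """Hypothetical summary of what a truncated report kept and dropped."""
    records_reported: int
    records_omitted: int
    messages_reported: int
    messages_omitted: int

    @property
    def record_fraction(self) -> float:
        total = self.records_reported + self.records_omitted
        return self.records_reported / total if total else 1.0

    @property
    def message_fraction(self) -> float:
        total = self.messages_reported + self.messages_omitted
        return self.messages_reported / total if total else 1.0


def truncate(message_counts: list[int], max_records: int) -> Coverage:
    """Keep at most max_records records, largest first (an assumed policy)."""
    ordered = sorted(message_counts, reverse=True)
    kept, dropped = ordered[:max_records], ordered[max_records:]
    return Coverage(len(kept), len(dropped), sum(kept), sum(dropped))


def decide(cov: Coverage, process_at: float = 0.95,
           ignore_below: float = 0.50) -> str:
    """Pick an action from message coverage; both thresholds are illustrative."""
    if cov.message_fraction >= process_at:
        return "process"
    if cov.message_fraction < ignore_below:
        return "ignore"
    return "interpolate"


# Synthetic mix: 5 high-volume senders and 1,000 botnet-like IPs,
# one record each, with 2 messages per botnet-like IP.
counts = [10_000] * 5 + [2] * 1_000
cov = truncate(counts, max_records=505)

print(f"records kept:  {cov.record_fraction:.1%}")   # 50.2%
print(f"messages kept: {cov.message_fraction:.1%}")  # 98.1%
print(decide(cov))                                   # process
```

With only the record counts, this report would look half empty; with the message counts, it is clear that about 2% of the mail is missing.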
