#7123: cryptanalysis of the shift cipher
----------------------------+-----------------------------------------------
Reporter: mvngu | Owner: somebody
Type: enhancement | Status: needs_work
Priority: major | Milestone: sage-4.1.3
Component: cryptography | Keywords:
Work_issues: | Author: Minh Van Nguyen
Reviewer: | Merged:
----------------------------+-----------------------------------------------
Comment(by mvngu):
Replying to [comment:7 rbeezer]:
> 1. For a chi-square used for goodness-of-fit between distributions, see
Hogg and Tanis, Probability and Statistical Inference, 6th Ed., Section 8.4.
Use the actual counts of each letter in a candidate decipher and call
these "observed."
Let's call these actual counts the "character count".
[[BR]][[BR]]
> Then multiply the frequencies from {{{frequency_distribution()}}} by the
length of the message.
There are two types of frequencies. The characteristic frequency
distribution (or characteristic frequency, for short) of an alphabet is
the expected frequency probability distribution for that alphabet. The
book by Beker and Piper 1982 contains a table describing this probability
distribution. The book by Lewand 2000 also contains such a table. The
frequency distribution of a message `M` (or frequency distribution, for
short) gives the ratio `R` of character occurrences over message length:
{{{
R = (number of occurrences of a character) / (message length)
}}}
The characteristic frequency can be thought of as the expected
probability, while the frequency distribution is the observed probability.
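The ratio above can be sketched in plain Python (outside Sage). The function `frequency_distribution` below is a hypothetical standalone analogue of the Sage method of the same name, not the actual implementation:

```python
from collections import Counter

def frequency_distribution(message):
    # R = (number of occurrences of a character) / (message length),
    # computed for each character that appears in the message.
    n = len(message)
    return {c: count / n for c, count in Counter(message).items()}
```

For the message `ABCABF` this gives `A` and `B` a ratio of 1/3 each, and `C` and `F` a ratio of 1/6 each, matching the Sage session below.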
[[BR]][[BR]]
> Call these theoretical counts the "expected" and carry them as non-
integers, even if we know they should be integral.
With the above distinction between characteristic frequency (of an
alphabet) and frequency distribution (of a message), I interpret the
theoretical counts you mention here as the result of multiplying each
probability value in the characteristic frequency by the message length.
For example:
{{{
sage: A = AlphabeticStrings()
sage: M = A.encoding("abcabf")
sage: L = len(M); L
6
sage: CF = A.characteristic_frequency()
sage: sorted(CF.items())
[('A', 0.0820000000000000),
('B', 0.0150000000000000),
('C', 0.0280000000000000),
('D', 0.0430000000000000),
('E', 0.127000000000000),
('F', 0.0220000000000000),
('G', 0.0200000000000000),
('H', 0.0610000000000000),
('I', 0.0700000000000000),
('J', 0.00200000000000000),
('K', 0.00800000000000000),
('L', 0.0400000000000000),
('M', 0.0240000000000000),
('N', 0.0670000000000000),
('O', 0.0750000000000000),
('P', 0.0190000000000000),
('Q', 0.00100000000000000),
('R', 0.0600000000000000),
('S', 0.0630000000000000),
('T', 0.0910000000000000),
('U', 0.0280000000000000),
('V', 0.0100000000000000),
('W', 0.0230000000000000),
('X', 0.00100000000000000),
('Y', 0.0200000000000000),
('Z', 0.00100000000000000)]
sage: keys = CF.keys()
sage: for k in keys:
....: CF[k] *= L
....:
sage: sorted(CF.items())
[('A', 0.492000000000000),
('B', 0.0900000000000000),
('C', 0.168000000000000),
('D', 0.258000000000000),
('E', 0.762000000000000),
('F', 0.132000000000000),
('G', 0.120000000000000),
('H', 0.366000000000000),
('I', 0.420000000000000),
('J', 0.0120000000000000),
('K', 0.0480000000000000),
('L', 0.240000000000000),
('M', 0.144000000000000),
('N', 0.402000000000000),
('O', 0.450000000000000),
('P', 0.114000000000000),
('Q', 0.00600000000000000),
('R', 0.360000000000000),
('S', 0.378000000000000),
('T', 0.546000000000000),
('U', 0.168000000000000),
('V', 0.0600000000000000),
('W', 0.138000000000000),
('X', 0.00600000000000000),
('Y', 0.120000000000000),
('Z', 0.00600000000000000)]
}}}
Multiplying each probability value in the frequency distribution by the
message length would result in the counts of character occurrences:
{{{
sage: char_counts = M.character_count()
sage: sorted(char_counts.items())
[(A, 2), (B, 2), (C, 1), (F, 1)]
sage: FD = M.frequency_distribution()
sage: sorted(FD.items())
[(A, 0.333333333333333),
(B, 0.333333333333333),
(C, 0.166666666666667),
(F, 0.166666666666667)]
sage: keys = FD.keys()
sage: for k in keys:
....: FD[k] *= L
....:
sage: sorted(FD.items())
[(A, 2.00000000000000),
(B, 2.00000000000000),
(C, 1.00000000000000),
(F, 1.00000000000000)]
sage: sorted(char_counts.items())
[(A, 2), (B, 2), (C, 1), (F, 1)]
}}}
[[BR]][[BR]]
> Then the statistic is the sum over all the letters of
>
> {{{(observed - expected)^2/expected}}}
This agrees with Pearson's chi-square test at
http://en.wikipedia.org/wiki/Goodness_of_fit
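For reference, here is a minimal plain-Python sketch of that statistic for scoring a candidate decipherment. It assumes a dictionary `expected_freq` of per-letter probabilities (e.g. the Beker and Piper table); the function name and signature are illustrative, not part of any existing API:

```python
from collections import Counter

def chi_square(candidate, expected_freq):
    # observed: actual character counts of the candidate decipherment
    observed = Counter(candidate)
    n = len(candidate)
    # expected count for each letter = probability * message length,
    # carried as a non-integer; statistic is
    # sum over letters of (observed - expected)^2 / expected
    return sum((observed.get(c, 0) - p * n) ** 2 / (p * n)
               for c, p in expected_freq.items())
```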
[[BR]][[BR]]
> I would use the keyword "chisquare" for this.
No problem.
[[BR]][[BR]]
> 2. The sum of squared differences should probably be computed as the
sum over all letters using counts, not frequencies, i.e.
>
> {{{(observed - expected)^2}}}
No problem.
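Under the same assumptions as above (an illustrative `expected_freq` table of per-letter probabilities), this sum of squared differences over counts can be sketched as:

```python
from collections import Counter

def squared_differences(candidate, expected_freq):
    # sum over letters of (observed count - expected count)^2,
    # where expected count = probability * message length
    observed = Counter(candidate)
    n = len(candidate)
    return sum((observed.get(c, 0) - p * n) ** 2
               for c, p in expected_freq.items())
```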
[[BR]][[BR]]
> This is equivalent in this case to the coefficient of determination, or
more informally "r-squared." Another name is "residual sum of squares."
>
> 3. Not at all clear to me how [SavHar99] justifies the use of their
expression and naming it a chi-square statistic. If anything, it looks to
me like a noncentral chi-square statistic. But it may do a good job of
"scoring" candidate decipherments. Can you be sure whether they mean to
square just the numerator of each term ({{{FD^2/CF}}}), or square the
whole fraction ({{{(FD/CF)^2}}})?
The description at
http://starbase.trincoll.edu/~crypto/historical/caesar.html
is mostly textual and ambiguous. Instead of spending time teasing out the
"right" formula from that textual description, I would implement the
statistics you suggested above, as they are much more concrete and well
specified.
--
Ticket URL: <http://trac.sagemath.org/sage_trac/ticket/7123#comment:9>
Sage <http://www.sagemath.org>
Sage: Creating a Viable Open Source Alternative to Magma, Maple, Mathematica,
and MATLAB