#7123: cryptanalysis of the shift cipher
----------------------------+-----------------------------------------------
   Reporter:  mvngu         |       Owner:  somebody       
       Type:  enhancement   |      Status:  needs_work     
   Priority:  major         |   Milestone:  sage-4.1.3     
  Component:  cryptography  |    Keywords:                 
Work_issues:                |      Author:  Minh Van Nguyen
   Reviewer:                |      Merged:                 
----------------------------+-----------------------------------------------

Comment(by mvngu):

 Replying to [comment:7 rbeezer]:
 > 1.  For a chi-square used for goodness-of-fit between distributions see
 Hogg and Tannis, Probbilit and Stastical Inference, 6th Ed., Section 8.4.
 Use the actual counts of each letter in a candidate decipher and call
 these "observed."

 Let's call these actual counts the "character count".
 [[BR]][[BR]]


 > Then multiply frequences from {{{frequency_distribution()}}} times the
 length of the message.

 There are two types of frequencies. The characteristic frequency
 distribution (or characteristic frequency, for short) of an alphabet is
 the expected frequency probability distribution for that alphabet. The
 book by Beker and Piper 1982 contains a table describing this probability
 distribution. The book by Lewand 2000 also contains such a table. The
 frequency distribution of a message `M` (or frequency distribution, for
 short) gives the ratio `R` of character occurrences over message length:
 {{{
 R = (number of occurrences of a character) / (message length)
 }}}
 The characteristic frequency can be thought of as the expected
 probability, while the frequency distribution is the observed probability.
 [[BR]][[BR]]


 > Call these theoretical counts the "expected" and carry them as non-
 integers, even if we know they should be integral.

 With the above distinction between characteristic frequency (of an
 alphabet) and frequency distribution (of a message), I interpret the
 theoretical counts you mention here as the result of multiplying each
 probability value in the characteristic frequency by the message length.
 For example:
 {{{
 sage: A = AlphabeticStrings()
 sage: M = A.encoding("abcabf")
 sage: L = len(M); L
 6
 sage: CF = A.characteristic_frequency()
 sage: sorted(CF.items())

 [('A', 0.0820000000000000),
  ('B', 0.0150000000000000),
  ('C', 0.0280000000000000),
  ('D', 0.0430000000000000),
  ('E', 0.127000000000000),
  ('F', 0.0220000000000000),
  ('G', 0.0200000000000000),
  ('H', 0.0610000000000000),
  ('I', 0.0700000000000000),
  ('J', 0.00200000000000000),
  ('K', 0.00800000000000000),
  ('L', 0.0400000000000000),
  ('M', 0.0240000000000000),
  ('N', 0.0670000000000000),
  ('O', 0.0750000000000000),
  ('P', 0.0190000000000000),
  ('Q', 0.00100000000000000),
  ('R', 0.0600000000000000),
  ('S', 0.0630000000000000),
  ('T', 0.0910000000000000),
  ('U', 0.0280000000000000),
  ('V', 0.0100000000000000),
  ('W', 0.0230000000000000),
  ('X', 0.00100000000000000),
  ('Y', 0.0200000000000000),
  ('Z', 0.00100000000000000)]
 sage: keys = CF.keys()
 sage: for k in keys:
 ....:     CF[k] *= L
 ....:
 sage: sorted(CF.items())

 [('A', 0.492000000000000),
  ('B', 0.0900000000000000),
  ('C', 0.168000000000000),
  ('D', 0.258000000000000),
  ('E', 0.762000000000000),
  ('F', 0.132000000000000),
  ('G', 0.120000000000000),
  ('H', 0.366000000000000),
  ('I', 0.420000000000000),
  ('J', 0.0120000000000000),
  ('K', 0.0480000000000000),
  ('L', 0.240000000000000),
  ('M', 0.144000000000000),
  ('N', 0.402000000000000),
  ('O', 0.450000000000000),
  ('P', 0.114000000000000),
  ('Q', 0.00600000000000000),
  ('R', 0.360000000000000),
  ('S', 0.378000000000000),
  ('T', 0.546000000000000),
  ('U', 0.168000000000000),
  ('V', 0.0600000000000000),
  ('W', 0.138000000000000),
  ('X', 0.00600000000000000),
  ('Y', 0.120000000000000),
  ('Z', 0.00600000000000000)]
 }}}

 Multiplying each probability value in the frequency distribution by the
 message length would result in the counts of character occurrences:
 {{{
 sage: char_counts = M.character_count()
 sage: sorted(char_counts.items())
 [(A, 2), (B, 2), (C, 1), (F, 1)]
 sage: FD = M.frequency_distribution()
 sage: sorted(FD.items())

 [(A, 0.333333333333333),
  (B, 0.333333333333333),
  (C, 0.166666666666667),
  (F, 0.166666666666667)]
 sage: keys = FD.keys()
 sage: for k in keys:
 ....:     FD[k] *= L
 ....:
 sage: sorted(FD.items())

 [(A, 2.00000000000000),
  (B, 2.00000000000000),
  (C, 1.00000000000000),
  (F, 1.00000000000000)]
 sage: sorted(char_counts.items())
 [(A, 2), (B, 2), (C, 1), (F, 1)]
 }}}
 [[BR]][[BR]]


 >  Then the statistic is the sum over all the letters of
 >
 > {{{(observed - expected)^2/expected}}}

 This agrees with Pearson's chi-square test at

 http://en.wikipedia.org/wiki/Goodness_of_fit
 [[BR]][[BR]]


 > I would use the keyword "chisquare" for this.

 No problem.
 [[BR]][[BR]]


 > 2.  The sum of squared differences should probably be computed as the
 sum over all letters using counts, not frequencies, ie
 >
 > {{{(observed - expected)^2}}}

 No problem.
 [[BR]][[BR]]


 > This is equivalent in this case to the coefficient of determination, or
 more informally "r-squared."  Another name is "residual sum of squares."
 >
 > 3.  Not at all clear to me how [SavHar99] justifies the use of their
 expression and naming it a chi-square statistic.  If anything, it looks to
 me like a noncentral chisquare statistic.  But it may do a good job of
 "scoring" candidate decipherments.  Can you be sure if they mean to square
 just the numerator of each term ({{{FD^2/CD}}}), or square the whole
 fraction ({{{FD/CF}}})?
 The description at

 http://starbase.trincoll.edu/~crypto/historical/caesar.html

 is mostly textual and ambiguous. I would implement the above statistics
 you suggested as they are much more concrete and well specified, instead
 of spending time teasing out the "right" formula from the textual
 description.

-- 
Ticket URL: <http://trac.sagemath.org/sage_trac/ticket/7123#comment:9>
Sage <http://www.sagemath.org>
Sage: Creating a Viable Open Source Alternative to Magma, Maple, Mathematica, 
and MATLAB

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"sage-trac" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/sage-trac?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to