On Jan 7, 5:10 pm, dgoldsmith_89 <[EMAIL PROTECTED]> wrote:
> On Jan 7, 2:54 pm, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>
> > On Jan 7, 4:37 pm, dgoldsmith_89 <[EMAIL PROTECTED]> wrote:
> >
> > > Can anyone point me to a downloadable open source English dictionary
> > > suitable for programmatic use with python: I'm programming a puzzle
> > > generator, and I need to be able to generate more or less complete
> > > lists of English words, alphabetized. Thanks! DG
> >
> > www.puzzlers.org has numerous word lists & dictionaries in text
> > format that can be downloaded. I recommend you insert them into
> > some form of database. I have most of them in an Access db and
> > it's 95 MB. That's a worst case as I also have some value-added
> > stuff, the OSPD alone would be a lot smaller.
> >
> > <http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:start>
>
> Sorry for my ignorance: I can query an Access DB w/ standard SQL
> queries (and this is how I would access it w/ Python)?

Yes, if you have the appropriate way to link to the DB. I use Windows
and ODBC from Win32. I don't know what you would use on a Mac.

As Paul McGuire said, you could easily do this with SQLite3.
Personally, I always use Access since my job requires it and I find
it much more convenient. I often use crosstab queries, which I think
SQLite3 doesn't support. Typically, I'll write complex queries in
Access and simple SELECT statements in Python to grab them.

Here's my anagram locator (the [signature] field is an example of the
value-added stuff I mentioned).

## I took a somewhat different approach. Instead of in a file,
## I've got my word list (562456 words) in an MS-Access database.
## And instead of calculating the signature on the fly, I did it
## once and added the signature as a second field:
##
## TABLE CONS_alpha_only_signature_unique
## --------------------------------------
## CONS       text   75
## signature  text   26
##
## The signature is a 26 character string where each character is
## the count of occurrences of the matching letter. Luckily, in
## only a single case was there more than 9 occurrences of any
## given letter, which turned out not to be a word but a series of
## words concatenated, so I just deleted it from the database
## (lots of crap in the original word list I used).
##
## Example:
##
## CONS      signature
## aah       20000001000000000000000000   # 'a' occurs twice & 'h' once
## aahed     20011001000000000000000000
## aahing    20000011100001000000000000
## aahs      20000001000000000010000000
## aaii      20000000200000000000000000
## aaker     20001000001000000100000000
## aal       20000000000100000000000000
## aalborg   21000010000100100100000000
## aalesund  20011000000101000010100000
##
## Any words with identical signatures must be anagrams.
##
## Once this had been set up, I wrote a whole bunch of queries
## to use this table.
## I use the normal Access drag and drop
## design, but the SQL can be extracted from each, so I can
## simply open the query from Python or I can grab the SQL
## and build it inside the program. The example
##
##   signatures_anagrams_select_signature
##
## is hard coded for criteria 9 & 10 and should be built inside
## Python so the criteria can be changed dynamically.
##
##
## QUERY signatures_anagrams_longest
## ---------------------------------
## SELECT   Len([CONS]) AS Expr1,
##          Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
##          Cons_alpha_only_signature_unique.signature
## FROM     Cons_alpha_only_signature_unique
## GROUP BY Len([CONS]),
##          Cons_alpha_only_signature_unique.signature
## HAVING   (((Count(Cons_alpha_only_signature_unique.CONS))>1))
## ORDER BY Len([CONS]) DESC ,
##          Count(Cons_alpha_only_signature_unique.CONS) DESC;
##
## This is why I don't use SQLite3: I must have crosstab queries.
##
## QUERY signatures_anagram_summary
## --------------------------------
## TRANSFORM Count(signatures_anagrams_longest.signature) AS CountOfsignature
## SELECT    signatures_anagrams_longest.Expr1 AS [length of word]
## FROM      signatures_anagrams_longest
## GROUP BY  signatures_anagrams_longest.Expr1
## PIVOT     signatures_anagrams_longest.CountOfCONS;
##
##
## QUERY signatures_anagrams_select_signature
## ------------------------------------------
## SELECT   Len([CONS]) AS Expr1,
##          Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
##          Cons_alpha_only_signature_unique.signature
## FROM     Cons_alpha_only_signature_unique
## GROUP BY Len([CONS]),
##          Cons_alpha_only_signature_unique.signature
## HAVING   (((Len([CONS]))=9) AND
##          ((Count(Cons_alpha_only_signature_unique.CONS))=10))
## ORDER BY Len([CONS]) DESC ,
##          Count(Cons_alpha_only_signature_unique.CONS) DESC;
##
## QUERY signatures_lookup_by_anagram_select_signature
## ---------------------------------------------------
## SELECT     signatures_anagrams_select_signature.Expr1,
##            signatures_anagrams_select_signature.CountOfCONS,
##            Cons_alpha_only_signature_unique.CONS,
##            Cons_alpha_only_signature_unique.signature
## FROM       signatures_anagrams_select_signature
## INNER JOIN Cons_alpha_only_signature_unique
## ON         signatures_anagrams_select_signature.signature
##            = Cons_alpha_only_signature_unique.signature;
##
##
## Now it's a simple matter to use the ODBC from Win32 to extract
## the query output into Python.

import dbi
import odbc

con = odbc.odbc("words")
cursor = con.cursor()

## This first section grabs the anagram summary. Note that
## queries act just like tables (as long as they don't have
## internal dependencies). I read somewhere you can get the
## field names, but here I put them in by hand.

##cursor.execute("SELECT * FROM signatures_anagram_summary")
##
##results = cursor.fetchall()
##
##for i in results:
##    for j in i:
##        print '%4s' % (str(j)),
##    print

## (if this wraps, each line is 116 characters)
##           2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    23
##     2   259  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##     3   487   348   218   150   102  None  None  None  None  None  None  None  None  None  None  None  None  None
##     4  1343   718   398   236   142   101    51    26    25     9     8     3     2  None  None  None  None  None
##     5  3182  1424   777   419   274   163   106    83    53    23    20    10     6     4     5     1     3     1
##     6  5887  2314  1051   545   302   170   114    54    43    21    15     6     5     4     4     2  None  None
##     7  7321  2251   886   390   151    76    49    37    14     7     5     1     1     1  None  None  None  None
##     8  6993  1505   452   166    47    23     8     6     4     2     2  None  None  None  None  None  None  None
##     9  5127   830   197    47    17     6  None  None     1  None  None  None  None  None  None  None  None  None
##    10  2975   328    66     8     2  None  None  None  None  None  None  None  None  None  None  None  None  None
##    11  1579   100     5     4     2  None  None  None  None  None  None  None  None  None  None  None  None  None
##    12   781    39     2     1  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    13   326    11     2  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    14   166     2  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    15    91  None     1  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    16    60  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    17    35  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    18    24  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    19    11  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    20     6  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    21     6  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None
##    22     4  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None  None

## From the query we have the word size as row header and size of
## anagram set as column header. The data value is the count of
## how many different anagram sets match the row/column header.
##
## For example, there are 7321 different 7-letter signatures that
## have 2-member anagram sets. There is 1 5-letter signature having
## a 23-member anagram set.
##
## We can then pick any of these, say the single 10-member anagram
## set of 9-letter words, and query out the anagrams:

cursor.execute("SELECT * FROM signatures_lookup_by_anagram_select_signature")
results = cursor.fetchall()

for i in results:
    for j in i:
        print j,
    print

## 9 10 anoretics 10101000100001100111000000
## 9 10 atroscine 10101000100001100111000000
## 9 10 certosina 10101000100001100111000000
## 9 10 creations 10101000100001100111000000
## 9 10 narcotise 10101000100001100111000000
## 9 10 ostracine 10101000100001100111000000
## 9 10 reactions 10101000100001100111000000
## 9 10 secration 10101000100001100111000000
## 9 10 tinoceras 10101000100001100111000000
## 9 10 tricosane 10101000100001100111000000

## Nifty, eh?

>
> DG
--
http://mail.python.org/mailman/listinfo/python-list
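
For anyone without Access, here is a rough sketch of the signature idea described in the post. It is not the poster's code: it is written for Python 3 rather than the Python 2 used above, and it swaps in an in-memory SQLite table (as Paul McGuire's suggestion would have it) for the Access database. The names `signature`, `anagram_sets` and the table `words` are made up for illustration.

```python
import sqlite3
from collections import Counter
from string import ascii_lowercase


def signature(word):
    """Return a 26-character string of per-letter counts, 'a' through 'z'."""
    counts = Counter(c for c in word.lower() if c in ascii_lowercase)
    if any(n > 9 for n in counts.values()):
        # One digit per letter only works while every count stays below 10.
        raise ValueError("more than 9 occurrences of a letter: %r" % word)
    return "".join(str(counts[c]) for c in ascii_lowercase)


def anagram_sets(words):
    """Group words by signature using a hypothetical in-memory SQLite table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE words (cons TEXT PRIMARY KEY, signature TEXT)")
    con.executemany(
        "INSERT OR IGNORE INTO words VALUES (?, ?)",
        [(w, signature(w)) for w in words],
    )
    # Keep only signatures shared by two or more words, i.e. real anagram sets.
    rows = con.execute(
        "SELECT w.cons, w.signature FROM words AS w "
        "JOIN (SELECT signature FROM words GROUP BY signature "
        "      HAVING COUNT(*) > 1) AS s ON s.signature = w.signature "
        "ORDER BY w.signature, w.cons"
    ).fetchall()
    con.close()
    sets = {}
    for cons, sig in rows:
        sets.setdefault(sig, []).append(cons)
    return sets


if __name__ == "__main__":
    sample = ["creations", "reactions", "narcotise", "aah", "aal"]
    for sig, members in anagram_sets(sample).items():
        print(sig, " ".join(members))
```

Storing the signature once, as the post does, means the anagram lookup becomes a plain GROUP BY on a text column. The crosstab summary is the part that has no direct counterpart here; this sketch does not attempt it.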