On Mon, Jul 27, 2009 at 5:55 AM, Gwern Branwen <[email protected]> wrote:
...
> But I couldn't figure out the best way to marry it with SRS. I figured
> that one viable approach might be to take a corpus, take a set of
> foreign vocab which it is mandatory for the user to know, and then
> generate the 'minimal' learning path. That is, it'd create thousands
> of cards, each covering the next rarest word, and the user could just
> work his way through them linearly.
>
> (Actually, maybe this approach isn't as bad as I thought. I've been
> generating large numbers of cards for memorizing poems, and it hasn't
> worked out too badly as long as I didn't use the randomization plugin.
> Hm. I should look into whether the guy's software could be repurposed
> for this. A static set of cards could work well: imagine such a
> generated card deck for someone learning French: she could choose one
> targeted at _In Search of Lost Time_, or she could pick a deck
> targeted at Rene Descartes if her interests inclined that way.)
>
> * There's some extra stuff about translating parts of sentences into
> English to focus on a particular word, but I think this is extra - a
> hack to get around the fact that a 'small' corpus like the New
> Testament often isn't going to give you sentences which have *only*
> one unknown word. By translating, you can take a sentence with
> multiple unknown words and turn it into a sentence with only one
> unknown word.
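To make the static-deck idea above concrete before the update: once a word ordering exists, card generation is just pairing each new word with an example sentence from the corpus. A minimal sketch ('Card' and 'makeDeck' are names invented here; a real deck would want the original untokenized sentence plus a translation):

import Data.List (find)

-- A card pairs a new word with one (tokenized) example sentence containing it,
-- in learning order.
type Card = (String, [String])

makeDeck :: [String] -> [[String]] -> [Card]
makeDeck ordered sents = [ (w, s) | w <- ordered, Just s <- [find (w `elem`) sents] ]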
So, to update. I discovered that the problem is indeed NP-hard or even EXP-hard: when I finished up the program, I found that runtime on a corpus of a few hundred words was going to be on the order of weeks. I switched to a heuristic which is roughly frequency-based, and it seems to give reasonable results.

The results look like this: given a random hardwired list of known English words, if one feeds in the text of Frank Herbert's _Dune Messiah_, one gets:

[02:15 AM] 0Mb$ cat /home/gwern/doc/herbert/fh-dune-messiah.txt | ./hcorpus 20
he said paul his she her not him had for at alia no from what asked they there have stilgar

(Paul, Alia, and Stilgar are major characters, frequently mentioned.) Eyeballing them, these look like reasonable words to know. If we 'learn' these 20 words (= putting them in the hardwired known list), then our next batch of 20 words looks like:

[02:18 AM] 0Mb$ cat /home/gwern/doc/herbert/fh-dune-messiah.txt | ./hcorpus 20
scytale thought know we do could will your must chani one now by irulan eyes ghola fremen then out them

(Irulan, Chani, and Scytale are minor characters; Fremen and ghola are _Dune_ neologisms.) These too look plausible.

On the TODO list:

- read known words from a file (a sketch of this and the next item follows below)
- after printing out the top nth word, print out the sentences which are now translatable by it
- efficiency hacks: finding the top 20 words in a single book takes ~6s, but if we run on Frank Herbert's entire corpus (only 10x the size), memory use blows up and I don't know how long it takes, which is obviously unacceptable

--
gwern
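A sketch of the first two TODO items ('readKnown' and 'nowTranslatable' are invented names, and the one-known-word-per-line file format is an assumption); both should slot straight into the program below:

import Data.Char (toLower)
import qualified System.IO.UTF8

-- Read one known word per line, lower-cased to match the corpus processing.
readKnown :: FilePath -> IO [String]
readKnown f = fmap (map (map toLower) . lines) (System.IO.UTF8.readFile f)

-- Sentences which become translatable once 'new' is learned: every word in
-- them is either already known or is the new word itself.
nowTranslatable :: String -> [String] -> [[String]] -> [[String]]
nowTranslatable new known = filter (all (`elem` (new:known)))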
-- based on http://jtauber.com/blog/2008/02/10/a_new_kind_of_graded_reader/
import Data.Char (isPunctuation, toLower)
import Data.List (sort, sortBy)
import qualified Data.Set as Set
import qualified Data.Map as Map
import Control.Parallel.Strategies
import Data.Function (on)
import Data.Maybe (listToMaybe)
import Data.List.Split (splitWhen)
import System.IO.UTF8 (getContents, putStrLn)
import System.Environment (getArgs)

main :: IO ()
main = do depth <- fmap (read . head) getArgs
          corpus <- System.IO.UTF8.getContents
          let pcorpus = processCorpus corpus
          let knownwords = map (map toLower) ["You", "dont", "see", "more", "than", "that", "The", "first", "episode", "of", "Kare", "Kano", "is", "rotten", "with", "Evangelion", "visual", "motifs", "the", "trains", "the", "spotlights", "and", "telephone", "poles", "and", "wires", "the", "masks", "and", "this", "is", "how", "everyone", "sees", "me", "etc", "a", "it", "did", "are", "to", "in", "I", "Dune", "was", "Stalin", "Mussolini", "Hitler", "Churchill", "beginning", "That", "all", "be", "like", "on", "an", "Its", "But", "only", "you", "themes", "into", "as", "my", "human", "paradox", "he", "said", "paul", "his", "she", "her", "not", "him", "had", "for", "at", "alia", "no", "from", "what", "asked", "they", "there", "have", "stilgar"]
          let optimalwords = answer depth pcorpus knownwords
          System.IO.UTF8.putStrLn optimalwords

-- | Clean up: we don't want "Je suis." to look different from "Je suis"...
--
-- > stripPunctuation "Greetings, fellow human flesh-sacks!" ~> "Greetings fellow human fleshsacks"
stripPunctuation :: String -> String
stripPunctuation = filter (not . isPunctuation)

-- | Turn a single big document into a stream of sentences of individual words;
-- lower-case so we don't get multiple hits for 'He', 'he', etc.
processCorpus :: String -> [[String]]
processCorpus = pmap (sort . words . stripPunctuation) . splitWhen (=='.') . map toLower

-- | Parallel map.
pmap :: (NFData b) => (a -> b) -> [a] -> [b]
pmap = parMap rnf

-- | Index each sentence and reduce it to the set of its distinct words.
sentences :: (NFData a, Ord a) => [[a]] -> Map.Map Int (Set.Set a)
sentences = Map.fromList . zip [(0::Int)..] . pmap Set.fromList

-- | Fractional division of integral values.
fidiv :: (Integral a, Fractional b) => a -> a -> b
fidiv = (/) `on` fromIntegral

swap :: (a, b) -> (b, a)
swap = uncurry (flip (,))

-- | Score each unknown word: every sentence contributes 1/|sentence| to each of
-- its words, so words concentrated in short (nearly-translatable) sentences win;
-- return the single highest-scoring word.
ranks :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Maybe (Rational, v)
ranks s = listToMaybe . sortBy (flip compare) . pmap swap . Map.toList
            . Map.fromListWith (+) $
              [(word, 1 `fidiv` Set.size wrds) | (_sentenceId, wrds) <- Map.toList s,
                                                 word <- Set.toList wrds]

-- | Greedy approximation: repeatedly take the top-ranked word, mark it known by
-- deleting it from every sentence, and recurse for the next word.
approximation :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Int -> [v]
approximation _ 0 = []
approximation s n = case ranks s of
                      Nothing -> []
                      Just (_value, word) ->
                        let withoutWord = Map.map (Set.delete word) s
                        in word : approximation withoutWord (n-1)

-- Do not use pmap in this function on pain of death; GHC is broken?
process :: (Ord v, NFData v) => [[v]] -> [Int] -> [[v]]
process ss ns = map (approximation $ sentences ss) ns

getBest :: [Int] -> [[String]] -> String
getBest x y = unlines . last $ process y x

-- | Drop all known words, and then drop any sentences left empty.
filterKnown :: [String] -> [[String]] -> [[String]]
filterKnown known = filter (not . null) . pmap (filter (`notElem` known))

answer :: Int -> [[String]] -> [String] -> String
answer depth corp known = let corp' = filterKnown known corp
                          in getBest [1..depth] corp'
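For anyone wanting to reproduce the runs above: assuming the listing is saved as hcorpus.hs and the split, parallel, and utf8-string packages are installed, building and running it should look something like:

$ ghc --make -threaded hcorpus.hs -o hcorpus
$ cat /home/gwern/doc/herbert/fh-dune-messiah.txt | ./hcorpus 20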
