On Mon, Jul 27, 2009 at 5:55 AM, Gwern Branwen<[email protected]> wrote:
...
> But I couldn't figure out the best way to marry it with SRS. I figured
> that one viable approach might be to take a corpus, take a set of
> foreign vocab which it is mandatory for the user to have, and then
> generate the 'minimal' learning path. That is, it'd create thousands
> of cards, each covering the next most rare word, and the user could
> just work his way through them linearly.
>
> (Actually, maybe this approach isn't as bad as I thought. I've been
> generating large numbers of cards for memorizing poems, and it hasn't
> worked out too bad as long as I didn't use the randomization plugin.
> Hm. I should look into whether the guy's software could be repurposed
> for this. A static set of cards could work well: imagine such a
> generated card deck for someone learning French: she can choose from
> one targeted at _In Search of Lost Time_, or she could pick a deck
> targeted at Rene Descartes if her interests inclined that way.)
>
> * There's some extra stuff about translating parts of sentences into
> English to focus on a particular word, but I think this is extra - a
> hack to get around the fact that a 'small' corpus like the New
Testament often isn't going to give you sentences which have *only*
> one unknown word. By translating, you can take a sentence with
> multiple unknown words and translate it into a sentence with only one
> unknown word.

So, to update. I discovered that the problem is indeed NP-hard or even
EXP-hard when I finished up the program and found that runtime on a corpus
of a few hundred words was going to be on the order of weeks; I
switched to a heuristic which is kind of frequency-based, and seems to
give reasonable results.

The results look kind of like this: given a random hardwired list of
English words, if one feeds in the text of Frank Herbert's _Dune_, one
gets this:

[02:15 AM] 0Mb$ cat /home/gwern/doc/herbert/fh-dune-messiah.txt | ./hcorpus 20
he
said
paul
his
she
her
not
him
had
for
at
alia
no
from
what
asked
they
there
have
stilgar

(Paul, Alia, and Stilgar are major characters, frequently mentioned.)

Eyeballing, these look like reasonable words to know. If we 'learn'
these 20 words (=putting them in the hardwired known list), then our
next batch of 20 words looks like

[02:18 AM] 0Mb$ cat /home/gwern/doc/herbert/fh-dune-messiah.txt | ./hcorpus 20
scytale
thought
know
we
do
could
will
your
must
chani
one
now
by
irulan
eyes
ghola
fremen
then
out
them

(Irulan, Chani, Scytale are minor characters; Fremen & ghola are
_Dune_ neologisms.)

These too look plausible.

On the TODO list is
- read known words from a file
- after printing out the top nth word, print out sentences which are
now translatable by it
- efficiency hacks; the top 20 words on a single book takes ~6s, but
if we run on Frank Herbert's entire corpus (only 10x the size), memory
use blows up and I dunno how long it takes, which is obviously
unacceptable

-- 
gwern

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mnemosyne-proj-users" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/mnemosyne-proj-users?hl=en
-~----------~----~----~----~------~----~------~--~---

-- based on http://jtauber.com/blog/2008/02/10/a_new_kind_of_graded_reader/

import Data.Char (isPunctuation, toLower)
import Data.Function (on)
import Data.List -- (nub, sort)
import qualified Data.Map as Map
import Data.Maybe
import qualified Data.Set as Set
import System.Environment (getArgs)
import Text.Read (readMaybe)

import Control.Parallel.Strategies
import Data.List.Split (splitWhen)

import System.IO.UTF8 (getContents, putStrLn)

-- | Read the requested depth from the first command-line argument, read the
-- corpus from stdin (UTF-8), and print the next batch of words to learn.
main :: IO ()
main = do args <- getArgs
          -- 'fmap (read . head) getArgs' was doubly partial: it crashed with a
          -- pattern-match error on no arguments and a 'Prelude.read' error on a
          -- non-numeric one. Fail with a usage message instead.
          depth <- case args of
                     (a:_) | Just n <- readMaybe a -> return n
                     _ -> error "usage: hcorpus <depth> (depth must be an integer)"
          corpus <- System.IO.UTF8.getContents
          let pcorpus = processCorpus corpus
          -- Words the user already knows; hardwired for now (see TODO: read
          -- these from a file). Lower-cased to match 'processCorpus' output.
          let knownwords = map (map toLower) ["You", "dont", "see", "more", "than", "that", "The", "first", "episode", "of", "Kare", "Kano", "is", "rotten", "with", "Evangelion", "visual", "motifs", "the", "trains", "the", "spotlights", "and", "telephone", "poles", "and", "wires", "the", "masks", "and", "this", "is", "how", "everyone", "sees", "me", "etc", "a", "it", "did", "are", "to", "in", "I", "Dune", "was", "Stalin", "Mussolini", "Hitler", "Churchill", "beginning", "That", "all", "be", "like", "on", "an", "Its", "But", "only", "you", "themes", "into", "as", "my", "human", "paradox","he","said","paul","his","she","her","not","him","had","for","at","alia","no","from","what","asked","they","there","have","stilgar"]
          let optimalwords = answer depth pcorpus knownwords
          System.IO.UTF8.putStrLn optimalwords

-- | Drop every punctuation character, so that "Je suis." does not look
-- different from "Je suis"...
--
-- > stripPunctuation "Greetings, fellow human flesh-sacks!" ~> "Greetings fellow human fleshsacks"
stripPunctuation :: String -> String
stripPunctuation s = [c | c <- s, not (isPunctuation c)]

-- | Break one big document into sentences (split on '.'), each sentence a
-- sorted list of punctuation-free words; everything is lower-cased first so
-- 'He' and 'he' don't count as separate words.
processCorpus :: String -> [[String]]
processCorpus doc = pmap tokenize (splitWhen (== '.') lowered)
  where lowered      = map toLower doc
        tokenize sen = sort (words (stripPunctuation sen))

-- | Parallel map: each result is evaluated to normal form in parallel.
pmap :: (NFData b) => (a -> b) -> [a] -> [b]
pmap f xs = parMap rnf f xs

-- | Index each sentence by its position in the corpus, with the words of
-- each sentence deduplicated into a set.
sentences :: (NFData a, Ord a) => [[a]] -> Map.Map Int (Set.Set a)
sentences ss = Map.fromList (zip [0 :: Int ..] (pmap Set.fromList ss))

-- | Fractional division of two integral values.
fidiv :: (Integral a, Fractional b) => a -> a -> b
fidiv x y = fromIntegral x / fromIntegral y

-- | Exchange the components of a pair. (Identical to 'Data.Tuple.swap'.)
swap :: (a, b) -> (b, a)
swap (x, y) = (y, x)

-- | Score every word by summing 1/|sentence| over each sentence that
-- contains it (words in short sentences count for more), and return the
-- top-scoring word, if any, paired with its score.
ranks :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Maybe (Rational, v)
ranks s = listToMaybe (sortBy (flip compare) scored)
  where contributions = [ (word, 1 `fidiv` Set.size wrds)
                        | (_sentenceId, wrds) <- Map.toList s
                        , word <- Set.toList wrds ]
        scored        = pmap swap (Map.toList (Map.fromListWith (+) contributions))

-- | Greedy heuristic: pick the top-ranked word, delete it from every
-- sentence, and recurse, up to @n@ times. Stops early when no ranked word
-- remains (all sentence sets empty).
approximation :: (NFData v, Ord k, Ord v) => Map.Map k (Set.Set v) -> Int -> [v]
approximation s n
    -- Guard instead of the old '_ 0 = []' pattern: a negative n never
    -- matched 0 and recursed until the entire vocabulary was exhausted.
    | n <= 0 = []
    | otherwise =
        case ranks s of
          Nothing -> []
          Just (_value, word) ->
                let withoutWord = Map.map (Set.delete word) s
                in word : approximation withoutWord (n-1)

-- | Run 'approximation' at every requested depth over the same sentence
-- index (built once, shared lazily across all depths).
-- do not use parmap in this function on pain of death; GHC is broken?
process :: (Ord v, NFData v) => [[v]] -> [Int] -> [[v]]
process ss ns = let indexed = sentences ss
                in map (approximation indexed) ns

-- | Render the deepest 'process' result, one word per line. Returns the
-- empty string when no depths are requested — the previous unguarded
-- 'last' crashed on an empty list (e.g. depth 0, giving '[1..0]' = []).
getBest :: [Int] -> [[String]] -> String
getBest x y = case process y x of
                [] -> ""
                rs -> unlines (last rs)

-- | Remove every known word from every sentence, then drop sentences left
-- empty. Membership is tested against a Set built once, replacing the
-- original O(|known|) 'notElem' list scan per word.
filterKnown :: [String] -> [[String]] -> [[String]]
filterKnown known = filter (not . null) . pmap (filter (`Set.notMember` knownSet))
  where knownSet = Set.fromList known

-- | Top-level driver: strip the already-known words from the corpus, then
-- compute the best @depth@ new words to learn.
answer :: Int -> [[String]] -> [String] -> String
answer depth corp known = getBest [1 .. depth] unknownCorp
  where unknownCorp = filterKnown known corp

Reply via email to