Thanks Linas... Ruiting was getting stuck on this stuff so I asked Curtis for help, but he's been stuck too...
Curtis, I guess you should try your best to follow Linas's recent instructions before trying anything more radical... thanks all!

On Sun, Jun 4, 2017 at 12:00 AM, Linas Vepstas <[email protected]> wrote:
>
> On Wed, May 31, 2017 at 9:13 AM, Curtis Faith <[email protected]> wrote:
>>
>> I am sending this a bit earlier than I would normally, as I am still trying
>> to figure out the contribution of the Guile garbage collector to performance
>> and am not quite done. I have also not yet measured a test over significant
>> data for a before-and-after comparison, for example.
>>
>> But since Ben sent Linas an email about this problem, I'll send my work in
>> progress, so Linas has a bit more information to answer Ben's question.
>>
>> So, the following represents a partial analysis of the SQL issues I found.
>> I removed the code for fetch-atom and store-atom, and while this is
>> considerably faster, there still seem to be many-second pauses in
>> processing that are not related to the number of words in the sentence in
>> any way I can easily correlate.
>
> The atomspace SQL backend has 8 writeback queues, which, by default, are
> allowed to be quite large. They normally drain in under a second, but if
> there are a lot of atomspace threads putting atoms into them, the drains can
> take tens of seconds, very rarely minutes or longer, e.g. if threads are
> dumping atoms in at the same rate that the writers are draining the queues.
>
> Also: postgres on SSDs is about 10x faster than postgres on spinning disks,
> for me.
>>
>> -----------------------
>>
>> My 6,000 sentence test file finished over the weekend. It took about 48
>> hours.
>
> That's insanely slow. My best guess is that you failed to tune postgres
> according to the instructions in the README. Without this tuning, you will
> get one or two disk writes for every atom, which, for spinning media,
> takes about one disk rotation per atom, i.e. roughly 15 milliseconds on a
> 7200 RPM disk drive.
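The disk-latency arithmetic here can be checked directly. All figures below are the thread's own estimates (15 ms/write rounded up to ~100 atoms/sec; 6000 sentences x 16 parses x ~20 words x ~3.5 writes/word); the snippet is only a sanity check, not anything from the actual codebase:

```python
# Back-of-the-envelope from the thread: ~15 ms per synchronous write on a
# 7200 RPM disk gives ~66 writes/sec, rounded in the thread to ~100 atoms/sec.
atoms_per_sec = 100

# Corpus estimate: 6000 sentences x 16 parses/sentence x ~20 words/parse
# x ~3.5 atom writes per word.
total_atoms = 6000 * 16 * 20 * 3.5   # 6.72 million atom writes

hours = total_atoms / atoms_per_sec / 3600
print(f"{total_atoms / 1e6:.2f}M atoms at {atoms_per_sec}/sec is about {hours:.0f} hours")
```

At the un-rounded ~66 writes/sec the same corpus takes closer to 28 hours, which brackets the observed 48-hour run once queue stalls are added in.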
> That works out to maybe 100 atoms/second.
>
> For 6000 sentences x 16 parses/sentence x 20 words/parse x 3.5 atom
> writes/word = 6.7M atoms.
>
> At 100 atoms/sec, this would be 18 hours. After postgres tuning, this
> should be, I dunno, 3x or 5x faster?? I'm getting less than an hour for
> Pride and Prejudice, I believe. Not sure.
>
>> For some context, the input corpus, Pride and Prejudice, has
>> approximately:
>>
>>   120 K words
>>   700 K bytes
>>     6 K sentences
>>     2 K paragraphs
>>    25 average words per sentence
>>
>> So it is a fair-sized novel, but only a tiny, tiny fraction of what we'd
>> like to feed the language learning system.
>>
>> In looking at the Postgres statistics for updates on a freshly initialized
>> database, I found:
>>
>>   SELECT relname, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del
>>   FROM pg_stat_user_tables;
>>
>>    relname    | idx_scan  | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
>>   ------------+-----------+---------------+-----------+-----------+-----------
>>    atoms      |  95789119 |      90072278 |   3288742 |         0 |         0
>>    values     |         4 |             0 |         0 |         0 |         0
>>    typecodes  |         0 |             0 |       228 |         0 |         0
>>    valuations | 131581719 |      84520666 |  44519326 |         0 |  41926158
>>    spaces     |   3288742 |       3288742 |         2 |         0 |         0
>>   (5 rows)
>>
>> So roughly 90M fetches for Atoms and 85M for Valuations while parsing a
>> single file of 6K sentences.
>
> There are 16 parses per sentence. Besides observing each word once per
> parse, there are also observations of word-pairs, with the number of
> word-pairs being about equal to the number of words (per parse).
>
> My back-of-the-envelope above suggests that there should have been about 7M
> updates. You are seeing 10x as many. I don't understand this at all.
>
> Oh wait. I see the problem. You are using a version that also does "clique
> counting". I had a goofy argument with Ben about this in HK, and later added
> clique counting.
> This increases the number of atom writes by 10x or 20x, and
> creates absolutely immense databases, so I later on disabled this code. If
> you pull the latest and greatest, this is disabled.
>
>> We have 3.3M atoms inserted and 90.0M index tuple fetches for the atom
>> table.
>>
>> In digging into the code, I found we have an explicit delete and insert
>> for every single increment of every atom count maintained by the system. The
>> count-increment code is also recursive for the trees defined by each pair,
>> which will in practice be many different atoms, because even though we only
>> increment the count for one atom in the tree, the store-atom and fetch-atom
>> calls update the truth values for every atom in the tree that has a truth
>> value.
>
> That's not correct; or it shouldn't be: the fetch only fetches the values
> for the fetched atom, and not for the others. It cannot, must not, update the
> others, as that would leave the atomspace contents garbled and incorrect.
> That code is conservative, with the goal of not corrupting any truth values
> when fetching atoms.
>
>> The pair counting code has this comment:
>>
>>   ; ---------------------------------------------------------------------
>>   ; update-pair-counts -- count occurrences of word-pairs in a parse.
>>   ;
>>   ; Note that this might throw an exception...
>>   ;
>>   ; The structures that get created and incremented are of the form
>>   ;
>>   ;   EvaluationLink
>>   ;      PredicateNode "*-Sentence Word Pair-*"
>>   ;      ListLink
>>   ;         WordNode "lefty"  -- or whatever words these are.
>>   ;         WordNode "righty"
>>   ;
>>   ;   ExecutionLink
>>   ;      SchemaNode "*-Pair Distance-*"
>>   ;      ListLink
>>   ;         WordNode "lefty"
>>   ;         WordNode "righty"
>>   ;      NumberNode 3
>>
>> So every pair has 11 atoms.
>
> Yes, I disabled this later on, maybe a few weeks ago.
>
>> The truth value gets set only at the top level, but the fetch and store end
>> up retrieving and storing the valuations for the entire tree where there are
>> truth values.
>
> That should not be happening, at all. If it is, then it's a bug. Can you point
> out where this is happening?
>
>> Further, the counting code is storing and fetching the truth values
>> for each unique count value each time it gets incremented. So we'll have (1:
>> an insert for 1), then (2: a delete of the 1, and an insert for 2), then (3:
>> a delete of the 2, and an insert for 3), (4: a delete of the 3, and an insert
>> for 4), etc.
>
> If you want to change this to an SQL UPDATE, that's fine. UPDATEs add a
> fair amount of code complexity. Perhaps the UPDATE will run a little faster
> than the DELETE/INSERT, but I'm not convinced that's a bottleneck.
>
> To summarize:
> 1) configure postgres as described in the README
> 2) get the newer relex, which stubs out the stupid clique-counting idea.
>
> --linas
>
> p.s. You can configure postgres to not commit to disk by saying fsync=off.
> This gives a 10x performance improvement, or more, but if postgres crashes,
> you will have a corrupted database that's not recoverable. This happened to
> me last month, when I accidentally ran out of RAM (while clique-counting), so
> I am now older, wiser, and $500 poorer after buying some SSD disks.
>
>> And this is done for the EvaluationLink, ExecutionLink, and four WordNodes
>> in the above example.
>>
>> I wanted to verify this assumption, so I created a fresh database and
>> turned on statement logging in Postgres. It resulted in the attached log
>> file for a test of a four-word sentence:
>>
>>   (observe-text "This is a test.")
>>
>> We got 5419 lines in the log file (not counting the 333 lines of startup
>> creations of types etc.), each of which represents a single SQL statement.
>> There are:
>>
>>    813 BEGIN,
>>    813 COMMIT,
>>   2127 SELECT,
>>    720 DELETE, and
>>    946 INSERT
>>
>> statements for a four-word sentence.
>> (See the attached text log for details.)
>>
>> There are 133 new atoms created and 93 valuations after the call to
>> observe-text is finished.

--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
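The delete-then-insert counting pattern Curtis describes, and the single-statement UPDATE Linas mentions as an alternative, can be sketched in miniature. This toy uses Python's sqlite3 with a hypothetical counts table, not the real atomspace schema; it only illustrates the two patterns the thread contrasts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE counts (atom TEXT PRIMARY KEY, n INTEGER)")

def increment_delete_insert(atom):
    # Pattern described in the thread: every increment does a SELECT of the
    # old value, a DELETE of the old row, and an INSERT of the new row --
    # three statements (plus BEGIN/COMMIT) per count bump.
    row = con.execute("SELECT n FROM counts WHERE atom = ?", (atom,)).fetchone()
    n = row[0] if row else 0
    con.execute("DELETE FROM counts WHERE atom = ?", (atom,))
    con.execute("INSERT INTO counts VALUES (?, ?)", (atom, n + 1))

def increment_update(atom):
    # Single-statement alternative: an upsert that updates the row in place.
    con.execute(
        "INSERT INTO counts VALUES (?, 1) "
        "ON CONFLICT(atom) DO UPDATE SET n = n + 1", (atom,))

for _ in range(3):
    increment_delete_insert("word-pair-a")
for _ in range(3):
    increment_update("word-pair-b")

print(con.execute("SELECT atom, n FROM counts ORDER BY atom").fetchall())
# Both counters end at 3, but the first pattern issued 3 SELECTs,
# 3 DELETEs, and 3 INSERTs to get there.
```

As Linas notes, the upsert trims statement traffic but adds code complexity, and whether it matters depends on where the real bottleneck (disk commits vs. statement count) actually is.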
