On Wed, May 31, 2017 at 9:13 AM, Curtis Faith <[email protected]> wrote:
> I am sending this a bit earlier than I would normally, as I am still trying
> to figure out the contribution of the Guile garbage collector to
> performance, and am not quite done. I have also not yet measured a test
> over significant data for a before-and-after comparison, for example.
>
> But since Ben sent Linas an email about this problem, I'll send my work in
> progress, so Linas has a bit more information to answer Ben's question.
>
> So, the following represents a partial analysis of the SQL issues I found.
> I removed the code for fetch-atom and store-atom, and while this is
> considerably faster, there still seem to be many-second pauses in
> processing that are not related to the number of words in the sentence in
> any way I can easily correlate.

The atomspace SQL backend has 8 writeback queues, which, by default, are
allowed to be quite large. They normally drain in under a second, but if
there are a lot of atomspace threads putting atoms into them, the drains can
take tens of seconds, very rarely minutes or longer, e.g. if threads are
dumping atoms in at the same rate that the writers are draining the queues.

Also: postgres on SSDs is about 10x faster than postgres on spinning disks,
for me.

> -----------------------
>
> My 6,000 sentence test file finished over the weekend. It took about 48
> hours.

That's insanely slow. My best guess is that you failed to tune postgres
according to the instructions in the README. Without this tuning, you will
get one or two disk writes for every atom, which, for spinning media, takes
about one seek plus rotation for each atom, i.e. about 15 milliseconds on a
7200 RPM disk drive. That works out to maybe 100 atoms/second. For
6000 sentences x 16 parses/sentence x 20 words/parse x 3.5 atom writes/word
= 6.7M atoms; at 100 atoms/sec, this would be about 18 hours.

After postgres tuning, this should be, I dunno, 3x or 5x faster?? I'm
getting less than an hour for Pride and Prejudice, I believe. Not sure.
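The back-of-the-envelope above works out as follows. A quick check; the
per-sentence figures (16 parses, 20 words/parse, 3.5 atom writes/word) are
the rough estimates stated above, not measurements:

```python
# Quick check of the throughput estimate above.  All inputs are the rough
# estimates from the text, not measured values.
sentences = 6_000
parses_per_sentence = 16
words_per_parse = 20
atom_writes_per_word = 3.5

total_atom_writes = (sentences * parses_per_sentence
                     * words_per_parse * atom_writes_per_word)
print(total_atom_writes)        # 6720000.0, i.e. ~6.7M atom writes

# At ~100 atoms/sec on untuned postgres over spinning media:
hours = total_atom_writes / 100 / 3600
print(round(hours, 1))          # 18.7 -- consistent with the ~18 hours above
```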
> For some context, the input corpus, Pride and Prejudice, has approximately:
>
>   120 K words
>   700 K bytes
>     6 K sentences
>     2 K paragraphs
>    25 average words per sentence
>
> So it is a fair-sized novel, but only a tiny, tiny fraction of what we'd
> like to feed the language learning system.
>
> In looking at the Postgres statistics for updates on a freshly initialized
> database, I found:
>
>   SELECT relname, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del
>   FROM pg_stat_user_tables;
>
>     relname   | idx_scan  | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
>   ------------+-----------+---------------+-----------+-----------+-----------
>    atoms      |  95789119 |      90072278 |   3288742 |         0 |         0
>    values     |         4 |             0 |         0 |         0 |         0
>    typecodes  |         0 |             0 |       228 |         0 |         0
>    valuations | 131581719 |      84520666 |  44519326 |         0 |  41926158
>    spaces     |   3288742 |       3288742 |         2 |         0 |         0
>   (5 rows)
>
> So roughly 90M fetches for Atoms and 85M for Valuations while parsing a
> single file of 6K sentences.

There are 16 parses per sentence. Besides observing each word once per
parse, there are also observations of word-pairs, with the number of
word-pairs being about equal to the number of words (per parse). My
back-of-the-envelope above suggests that there should have been about 7M
updates. You are seeing 10x as many. I don't understand this at all.

Oh wait. I see the problem. You are using a version that also does "clique
counting". I had a goofy argument with Ben about this in HK, and later
added clique counting. This increases the number of atom writes by 10x or
20x, and creates absolutely immense databases, so I later on disabled this
code. If you pull the latest and greatest, this is disabled.

> We have 3.3M atoms inserted and 90.0M index tuple fetches for the atom
> table.
>
> In digging into the code, I found we have an explicit delete and insert
> for every single increment of every atom count maintained by the system.
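For what it's worth, the valuations row of the stats table above can be
checked directly against the ~6.7M-write back-of-the-envelope estimate from
earlier in the thread:

```python
# Cross-check: valuations insert+delete traffic from the pg_stat table
# above, versus the ~6.7M-write back-of-the-envelope estimate.
estimated_writes = 6_000 * 16 * 20 * 3.5    # = 6.72M, from the estimate in the text

valuation_inserts = 44_519_326              # n_tup_ins for 'valuations'
valuation_deletes = 41_926_158              # n_tup_del for 'valuations'

ratio = (valuation_inserts + valuation_deletes) / estimated_writes
print(round(ratio, 1))    # 12.9 -- inside the 10x-20x blowup attributed to clique counting
```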
> The count increment code is also recursive for the trees defined by each
> pair, which will in practice be many different atoms, because even though
> we only increment the count for one atom in the tree, the store-atom and
> fetch-atom calls update the truth values for every atom in the tree that
> has a truth value.

That's not correct; or it shouldn't be: the fetch only fetches the values
for the fetched atom, and not for the others. It cannot, must not, update
the others, as that would leave the atomspace contents garbled and
incorrect. That code is conservative, with the goal of not corrupting any
truth values when fetching atoms.

> The pair counting code has this comment:
>
>   ; ---------------------------------------------------------------------
>   ; update-pair-counts -- count occurrences of word-pairs in a parse.
>   ;
>   ; Note that this might throw an exception...
>   ;
>   ; The structures that get created and incremented are of the form
>   ;
>   ;     EvaluationLink
>   ;         PredicateNode "*-Sentence Word Pair-*"
>   ;         ListLink
>   ;             WordNode "lefty"    -- or whatever words these are.
>   ;             WordNode "righty"
>   ;
>   ;     ExecutionLink
>   ;         SchemaNode "*-Pair Distance-*"
>   ;         ListLink
>   ;             WordNode "lefty"
>   ;             WordNode "righty"
>   ;         NumberNode 3
>
> So every pair has 11 atoms.

Yes, I disabled this later on, maybe a few weeks ago.

> The truth value gets set only at the top level, but the fetch and store
> end up retrieving and storing the valuations for the entire tree where
> there are truth values.

That should not be happening, at all. If it is, then it's a bug. Can you
point out where this is happening?

> Further, the counting code is storing and fetching the truth values for
> each unique count value each time it gets incremented. So we'll have
> (1: an insert for 1), then (2: a delete of the 1, and an insert for 2),
> then (3: a delete of the 2, and an insert for 3), (4: a delete of the 3,
> and an insert for 4), etc.

If you want to change this to an SQL UPDATE, that's fine.
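To make the two increment patterns concrete, here is a minimal sketch using
sqlite3 so it is self-contained; the `counts` table and its columns are
invented for illustration and are not the real atomspace schema:

```python
import sqlite3

# Illustrative sketch of the two counter-increment patterns discussed
# above.  Table and column names are made up for the example; the real
# backend targets Postgres with a different schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (atom_uuid INTEGER PRIMARY KEY, count REAL)")
conn.execute("INSERT INTO counts VALUES (1, 0.0)")

# Pattern 1: what the statement log shows -- a DELETE followed by an
# INSERT for every single increment.
def increment_delete_insert(uuid, new_count):
    conn.execute("DELETE FROM counts WHERE atom_uuid = ?", (uuid,))
    conn.execute("INSERT INTO counts VALUES (?, ?)", (uuid, new_count))

# Pattern 2: a single UPDATE, incrementing in place.
def increment_update(uuid):
    conn.execute("UPDATE counts SET count = count + 1 WHERE atom_uuid = ?", (uuid,))

increment_delete_insert(1, 1.0)
increment_update(1)
print(conn.execute("SELECT count FROM counts WHERE atom_uuid = 1").fetchone()[0])  # 2.0
```

The UPDATE form at least halves the statement count and round-trips per
increment, whatever the server-side cost difference turns out to be.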
UPDATEs add a fair amount of code complexity. Perhaps the UPDATE will run a
little faster than the DELETE/INSERT, but I'm not convinced that's a
bottleneck.

To summarize:

1) configure postgres as described in the README
2) get the newer relex, which stubs out the stupid clique-counting idea.

--linas

p.s. you can configure postgres to not commit to disk, by saying fsync=off.
This gives a 10x performance improvement, or more, but if postgres crashes,
you will have a corrupted database that's not recoverable. This happened to
me last month, when I accidentally ran out of RAM (while clique-counting),
so I am now older, wiser, and $500 poorer after buying some SSD disks.

> And this is done for the EvaluationLink, ExecutionLink and four WordNodes
> in the above example.
>
> I wanted to verify this assumption, so I created a fresh database and
> turned on statement logging in Postgres. It resulted in the attached log
> file for a test of a four-word sentence:
>
>   (observe-text "This is a test.")
>
> We got 5419 lines in the log file (not counting the 333 lines of startup
> creations of types etc.), each of which represents a single SQL statement.
> There are:
>
>    813 BEGIN,
>    813 COMMIT,
>   2127 SELECT,
>    720 DELETE, and
>    946 INSERT
>
> statements for a four-word sentence. (See the attached text log for
> details.)
>
> There are 133 new atoms created and 93 valuations after the call to
> observe-text is finished.
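For reference, the fsync tradeoff in the p.s. above is a postgresql.conf
setting. A sketch of the relevant knobs; the values are illustrative, not
the README's actual recommendations, and fsync = off carries exactly the
corruption risk described:

```
# postgresql.conf -- illustrative values only; see the opencog README
# for the actually recommended tuning.
shared_buffers = 8GB         # default is tiny; raise to a large fraction of RAM
synchronous_commit = off     # batches commits; loses only the most recent
                             # transactions on a crash, no corruption
fsync = off                  # ~10x faster, but a crash can corrupt the
                             # database beyond recovery
```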
