Thanks Linas... Ruiting was getting stuck on this stuff so I asked Curtis for help, but he's been stuck too...
Curtis, I guess you should try your best to follow Linas's recent instructions before trying anything more radical... thanks all!

On Sun, Jun 4, 2017 at 12:00 AM, Linas Vepstas <[email protected]> wrote:
>
> On Wed, May 31, 2017 at 9:13 AM, Curtis Faith <[email protected]> wrote:
>>
>> I am sending this a bit earlier than I would normally, as I am still trying
>> to figure out the contribution of the Guile garbage collector to performance
>> and am not quite done. I have also not yet measured a test over significant
>> data for a before-and-after comparison, for example.
>>
>> But since Ben sent Linas an email about this problem, I'll send my work in
>> progress, so Linas has a bit more information to answer Ben's question.
>>
>> So, the following represents a partial analysis of the SQL issues I found.
>> I removed the code for fetch-atom and store-atom, and while this is
>> considerably faster, there still seem to be many-second pauses in
>> processing that are not related to the number of words in the sentence in
>> any way I can easily correlate.
>
> The atomspace SQL backend has 8 writeback queues, which, by default, are
> allowed to be quite large. They normally drain in under a second, but if
> there are a lot of atomspace threads putting atoms into them, the drains can
> take tens of seconds, very rarely minutes or longer, e.g. if threads are
> dumping atoms in at the same rate that the writers are draining the queues.
>
> Also: postgres on SSDs is about 10x faster than postgres on spinning disks,
> for me.
>>
>> -----------------------
>>
>> My 6,000 sentence test file finished over the weekend. It took about 48
>> hours.
>
> That's insanely slow. My best guess is that you failed to tune postgres
> according to the instructions in the README. Without this tuning, you will
> get one or two disk writes for every atom, which, for spinning media,
> takes about one disk rotation per atom, i.e. roughly 15 milliseconds on a
> 7200 RPM disk drive.
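The disk-latency arithmetic here can be checked directly. All figures below are the thread's own estimates (15 ms/write rounded up to ~100 atoms/sec; 6000 sentences x 16 parses x ~20 words x ~3.5 writes/word); the snippet is only a sanity check, not anything from the actual codebase:

```python
# Back-of-the-envelope from the thread: ~15 ms per synchronous write on a
# 7200 RPM disk gives ~66 writes/sec, rounded in the thread to ~100 atoms/sec.
atoms_per_sec = 100

# Corpus estimate: 6000 sentences x 16 parses/sentence x ~20 words/parse
# x ~3.5 atom writes per word.
total_atoms = 6000 * 16 * 20 * 3.5   # 6.72 million atom writes

hours = total_atoms / atoms_per_sec / 3600
print(f"{total_atoms / 1e6:.2f}M atoms at {atoms_per_sec}/sec is about {hours:.0f} hours")
```

At the un-rounded ~66 writes/sec the same corpus takes closer to 28 hours, which brackets the observed 48-hour run once queue stalls are added in.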
> That works out to maybe 100 atoms/second.
>
> For 6000 sentences x 16 parses/sentence x 20 words/parse x 3.5 atom
> writes/word = 6.7M atoms.
>
> At 100 atoms/sec, this would be 18 hours. After postgres tuning, this
> should be, I dunno, 3x or 5x faster?? I'm getting less than an hour for
> Pride and Prejudice, I believe. Not sure.
>
>> For some context, the input corpus, Pride and Prejudice, has
>> approximately:
>>
>>   120 K words
>>   700 K bytes
>>     6 K sentences
>>     2 K paragraphs
>>    25 average words per sentence
>>
>> So it is a fair-sized novel, but only a tiny, tiny fraction of what we'd
>> like to feed the language learning system.
>>
>> In looking at the Postgres statistics for updates on a freshly initialized
>> database, I found:
>>
>>   SELECT relname, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del
>>   FROM pg_stat_user_tables;
>>
>>    relname    | idx_scan  | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
>>   ------------+-----------+---------------+-----------+-----------+-----------
>>    atoms      |  95789119 |      90072278 |   3288742 |         0 |         0
>>    values     |         4 |             0 |         0 |         0 |         0
>>    typecodes  |         0 |             0 |       228 |         0 |         0
>>    valuations | 131581719 |      84520666 |  44519326 |         0 |  41926158
>>    spaces     |   3288742 |       3288742 |         2 |         0 |         0
>>   (5 rows)
>>
>> So roughly 90M fetches for Atoms and 85M for Valuations while parsing a
>> single file of 6K sentences.
>
> There are 16 parses per sentence. Besides observing each word once per
> parse, there are also observations of word-pairs, with the number of
> word-pairs being about equal to the number of words (per parse).
>
> My back-of-the-envelope above suggests that there should have been about 7M
> updates. You are seeing 10x as many. I don't understand this at all.
>
> Oh wait. I see the problem. You are using a version that also does "clique
> counting". I had a goofy argument with Ben about this in HK, and later added
> clique counting.
> This increases the number of atom writes by 10x or 20x, and
> creates absolutely immense databases, so I later on disabled this code. If
> you pull the latest and greatest, this is disabled.
>
>> We have 3.3M atoms inserted and 90.0M index tuple fetches for the atom
>> table.
>>
>> In digging into the code, I found we have an explicit delete and insert
>> for every single increment of every atom count maintained by the system. The
>> count-increment code is also recursive for the trees defined by each pair,
>> which will in practice be many different atoms, because even though we only
>> increment the count for one atom in the tree, the store-atom and fetch-atom
>> calls update the truth values for every atom in the tree that has a truth
>> value.
>
> That's not correct; or it shouldn't be: the fetch only fetches the values
> for the fetched atom, and not for the others. It cannot, must not, update the
> others, as that would leave the atomspace contents garbled and incorrect.
> That code is conservative, with the goal of not corrupting any truth values
> when fetching atoms.
>
>> The pair counting code has this comment:
>>
>>   ; ---------------------------------------------------------------------
>>   ; update-pair-counts -- count occurrences of word-pairs in a parse.
>>   ;
>>   ; Note that this might throw an exception...
>>   ;
>>   ; The structures that get created and incremented are of the form
>>   ;
>>   ;   EvaluationLink
>>   ;      PredicateNode "*-Sentence Word Pair-*"
>>   ;      ListLink
>>   ;         WordNode "lefty"  -- or whatever words these are.
>>   ;         WordNode "righty"
>>   ;
>>   ;   ExecutionLink
>>   ;      SchemaNode "*-Pair Distance-*"
>>   ;      ListLink
>>   ;         WordNode "lefty"
>>   ;         WordNode "righty"
>>   ;      NumberNode 3
>>
>> So every pair has 11 atoms.
>
> Yes, I disabled this later on, maybe a few weeks ago.
>
>> The truth value gets set only at the top level, but the fetch and store end
>> up retrieving and storing the valuations for the entire tree where there are
>> truth values.
>
> That should not be happening, at all. If it is, then it's a bug. Can you point
> out where this is happening?
>
>> Further, the counting code is storing and fetching the truth values
>> for each unique count value each time it gets incremented. So we'll have (1:
>> an insert for 1), then (2: a delete of the 1, and an insert for 2), then (3:
>> a delete of the 2, and an insert for 3), (4: a delete of the 3, and an insert
>> for 4), etc.
>
> If you want to change this to an SQL UPDATE, that's fine. UPDATEs add a
> fair amount of code complexity. Perhaps the UPDATE will run a little faster
> than the DELETE/INSERT, but I'm not convinced that's a bottleneck.
>
> To summarize:
> 1) configure postgres as described in the README
> 2) get the newer relex, which stubs out the stupid clique-counting idea.
>
> --linas
>
> p.s. You can configure postgres to not commit to disk by saying fsync=off.
> This gives a 10x performance improvement, or more, but if postgres crashes,
> you will have a corrupted database that's not recoverable. This happened to
> me last month, when I accidentally ran out of RAM (while clique-counting), so
> I am now older, wiser, and $500 poorer after buying some SSD disks.
>
>> And this is done for the EvaluationLink, ExecutionLink, and four WordNodes
>> in the above example.
>>
>> I wanted to verify this assumption, so I created a fresh database and
>> turned on statement logging in Postgres. It resulted in the attached log
>> file for a test of a four-word sentence:
>>
>>   (observe-text "This is a test.")
>>
>> We got 5419 lines in the log file (not counting the 333 lines of startup
>> creations of types etc.), each of which represents a single SQL statement.
>> There are:
>>
>>    813 BEGIN,
>>    813 COMMIT,
>>   2127 SELECT,
>>    720 DELETE, and
>>    946 INSERT
>>
>> statements for a four-word sentence.
>> (See the attached text log for details.)
>>
>> There are 133 new atoms created and 93 valuations after the call to
>> observe-text is finished.

--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
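The delete-then-insert counting pattern Curtis describes, and the single-statement UPDATE Linas mentions as an alternative, can be sketched in miniature. This toy uses Python's sqlite3 with a hypothetical counts table, not the real atomspace schema; it only illustrates the two patterns the thread contrasts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE counts (atom TEXT PRIMARY KEY, n INTEGER)")

def increment_delete_insert(atom):
    # Pattern described in the thread: every increment does a SELECT of the
    # old value, a DELETE of the old row, and an INSERT of the new row --
    # three statements (plus BEGIN/COMMIT) per count bump.
    row = con.execute("SELECT n FROM counts WHERE atom = ?", (atom,)).fetchone()
    n = row[0] if row else 0
    con.execute("DELETE FROM counts WHERE atom = ?", (atom,))
    con.execute("INSERT INTO counts VALUES (?, ?)", (atom, n + 1))

def increment_update(atom):
    # Single-statement alternative: an upsert that updates the row in place.
    con.execute(
        "INSERT INTO counts VALUES (?, 1) "
        "ON CONFLICT(atom) DO UPDATE SET n = n + 1", (atom,))

for _ in range(3):
    increment_delete_insert("word-pair-a")
for _ in range(3):
    increment_update("word-pair-b")

print(con.execute("SELECT atom, n FROM counts ORDER BY atom").fetchall())
# Both counters end at 3, but the first pattern issued 3 SELECTs,
# 3 DELETEs, and 3 INSERTs to get there.
```

As Linas notes, the upsert trims statement traffic but adds code complexity, and whether it matters depends on where the real bottleneck (disk commits vs. statement count) actually is.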
