> On 27 Jul 2015, at 13:46, Jan Lehnardt <[email protected]> wrote:
>
>> On 26 Jul 2015, at 19:03, Jan Lehnardt <[email protected]> wrote:
>>
>>> On 26 Jul 2015, at 14:47, Jan Lehnardt <[email protected]> wrote:
>>>
>>> Hey all,
>>>
>>> I’m trying to upgrade a database from 1.6.1 to 2.0.0/master/0c579b98 and
>>> I’m seeing a number of issues.
>>>
>>> Any help is greatly appreciated. Since this is our official upgrade path
>>> for 2.0, this has to be rock-solid.
>>>
>>> Feel free to break out individual issues into new threads if that helps
>>> keep things organised.
>>>
>>> Scroll down for detailed information about the database and machine
>>> configurations.
>>>
>>>
>>> ## The Scenario
>>>
>>> Replication is running on 2.0, pulling from 1.6.1 over the EC2 internal IP
>>> address.
>>>
>>> ## The Issues
>>>
>>> 1. Repeated log entries for “write quorum for <targetdb> failed”. I’ve seen
>>> this in other contexts as well. Why is this happening, and should it?
>>>
>>>
>>> 2. Getting a lot of “cassim_metadata_cache changes listener died” from all
>>> nodes about every 5 seconds. What’s up with these?
>>>
>>> - 2015-07-26 08:30:34.400 [error] Undefined emulator Error in process <0.14633.26> on node '[email protected]' with exit value:
>>> {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}
>>>
>>> - 2015-07-26 08:30:39.401 [notice] [email protected] <0.314.0> cassim_metadata_cache changes listener died
>>> {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}
>>
>> Alexander pointed to
>> https://github.com/apache/couchdb-fabric/commit/b6659c8344c9a028b5ab451be41a991801c2ab3d#diff-2af86e058b4e7a4a99a7c5a12da6debdR96
>> which is part of Adam’s recent work on COUCHDB-2724.
>>
>> Adam, any insights? :)
>
> Bob says this should fix it:
> https://gist.github.com/rnewson/b9efd4f45e1c62315816
>
> In the meantime, I reverted the changes optimisation commit on fabric, and now
> I’m getting this once replication has caught up with the existing update
> sequence and starts replicating newer documents:
>
> https://gist.github.com/janl/75804904dad73d17ed0e
>
> During which I found out that there *are* a few small attachments in the
> source database, sorry for the confusion about this earlier.
>
> I still see function_clause errors after the revert. Bob suggests waiting for
> Adam to comment.

Bob’s latest commits fixed the replication issue, but I’d love to hear about
the other things I mentioned.

Best
Jan
--

>
> Best
> Jan
> --
>
>>
>> Best
>> Jan
>> --
>>
>>>
>>> 3. A number of “Replicator, request PUT to
>>> "http://0.0.0.0:15984/<target>/edbef049aae9c8828f336534984e5e4f" failed due
>>> to error {error,req_timedout}” entries. This happens for regular docs,
>>> local docs, and _bulk_docs. The machine is basically idle (see below for
>>> details): the three beam.smp processes hover at 200-250% CPU each and IO is
>>> 98% idle (it’s mostly logs being written).
>>>
>>>
>>> 4. Two issues from couch_replicator_api_wrap.erl:
>>>
>>> - 2015-07-26 08:22:49.849 [error] Undefined <0.3546.0> gen_server <0.3546.0> terminated with reason: no function clause matching
>>> couch_replicator_api_wrap:'-update_docs/4-fun-2-'(400, [{"Server","MochiWeb/1.0 (Any of you quaids got a smint?)"},{"Date","Sun, 26 Jul 2015 08:22:49 G..."},...], null,
>>> [<<"{\"_id\":\"12345678\",\"_rev\":\"1050-ee6c7d54276b43bc937470e44e0283f2\",...
>>>
>>> - 2015-07-26 08:30:08.514 [notice] [email protected] <0.6360.26> Retrying GET to
>>> http://172.31.10.115:5984/generic_db_name/12348765?revs=true&open_revs=%5B%228-b2826209867a286c76e6a2762f10b1e0%22%5D&latest=true
>>> in 1.0 seconds due to error
>>> {function_clause,[{couch_replicator_api_wrap,run_user_fun,4},{couch_replicator_api_wrap,receive_docs,4},{couch_replicator_api_wrap,receive_docs_loop,6},{couch_replicator_api_wrap,'-open_doc_revs/6-fun-4-',7}]}
>>>
>>>
>>> 5. Eventually, replication reliably stops with an “invalid_ejson” error,
>>> but I don’t yet know if that’s because of the api_wrap issue or something
>>> else.
>>>
>>>
>>> 6. Replication has stopped numerous times until I got here. I haven’t had
>>> time to look into why; I have all the logs, but they total 130MB, so it’ll
>>> be a while.
>>>
>>>
>>> 7. When replication ran, it replicated at a rate of about 1000 docs/s,
>>> which felt a little slow, but I have no experience there yet.
>>>
>>>
>>> ## Source Database Info
>>>
>>> {
>>>   "db_name": "generic_db_name",
>>>   "doc_count": 6808004,
>>>   "doc_del_count": 18856,
>>>   "update_seq": 8044450,
>>>   "purge_seq": 0,
>>>   "compact_running": false,
>>>   "disk_size": 16293904519,
>>>   "data_size": 11711402577,
>>>   "instance_start_time": "1437834202967309",
>>>   "disk_format_version": 6,
>>>   "committed_update_seq": 8044450
>>> }
>>>
>>> Mostly small-ish docs, no big outliers, no attachments.
>>>
>>> Source machine info:
>>>
>>> Amazon EC2 m3.xlarge, 4 cores, 64bit, 16GB RAM, 100GB SSD, 3000 provisioned
>>> iops. FFM Availability Zone.
>>>
>>> Standard EC2 Ubuntu, Erlang R16B03 (I know, but that’s not the problem
>>> here, this couch behaves fine).
>>>
>>> Target machine info:
>>>
>>> Amazon EC2 m4.10xlarge, 40 cores, 64bit, 160GB RAM, 100GB SSD, 3000 iops
>>> (not provisioned), 10GigE networking, FFM AZ.
>>>
>>> The latency between both instances is very small, and copying a file
>>> between them runs at 100 to 200MB/s.
>>>
>>> Standard EC2 Amazon Linux (Redhat/Fedora derivative), Erlang R14B04.
>>> CouchDB 2.0 running as dev/run
>>>
>>>
>>> Thanks!
>>> Jan
>>> --
>>>

--
Professional Support for Apache CouchDB:
http://www.neighbourhood.ie/couchdb-support/
