On December 15, 2013 S. Dale Morrey wrote:
> Now let's hope my choice of RDS for this project (mysql cuz I'm an idiot and
> a cheap one at that), doesn't choke on the fact that I'm cramming 20+GB of
> data into a single table.

I kind of doubt it, but if it does choke, you could always try out Postgres.
I hear it's able to handle things better at times, especially large data
sets. I've really never tried more than about 10% of that in a single table
in any DB, but I know larger tables exist, and I imagine larger tables exist
in open source databases like MySQL or PGSQL. Good luck either way! :)

--- Dan

On Sun, Dec 15, 2013 at 1:48 AM, S. Dale Morrey <[email protected]> wrote:

> So this seems to be working out really well now.
> I've got the entire thing operational and the finalized data looks about
> as I would expect it to.
> Plus...
>
> Total execution time to this step: 892.853 seconds
> Total blocks complete: 17099 of 274910
>
> 15 minutes for ~17,000 blocks.
> That's 68,000 blocks per hour!
> It's going to be a LOT less than several days to get this DB fed. More
> like 4 hours. That's for the entire blockchain including transactions,
> folks!
>
> Now let's hope my choice of RDS for this project (mysql cuz I'm an idiot
> and a cheap one at that) doesn't choke on the fact that I'm cramming
> 20+GB of data into a single table.
>
> Thanks for all the help!
>
>
> On Fri, Dec 13, 2013 at 11:20 AM, S. Dale Morrey <[email protected]> wrote:
> >
> > Thanks Levi. That's some very sage advice.
> >
> > To be clear about where I'm coming from: I already wrote an app in
> > node.js that did exactly what I needed it to do, i.e. stuff the entire
> > tx chain of bitcoin into an RDS so I can query it later using SQL-style
> > queries (part of a service I'm working on similar to blockchain.info,
> > but meant for merchants to quickly look up balances).
> >
> > This is sort of my own "hello world" for node. :)
> >
> > The problem I am trying to solve here is that the application is
> > horribly slow. Therefore I decided to refactor it (actually, rewrite
> > from scratch might be a better term) into individual execution units,
> > string them together with message queues, and have each unit run on its
> > own Amazon spot instance. This gives me the ability to bring more
> > dedicated execution units online to handle the various workflow stages
> > depending on queue size and work remaining.
> >
> > With the original version of the application, I was looking at over a
> > month and possibly much longer to get everything into the database. My
> > goal is to take that down to a few hours at most. This is possible
> > because there are many, many possible points of parallelization.
> >
> > Unfortunately, doing it this way would also swamp my datasources, so I
> > needed to stagger the calls a little bit so I don't get cut off/banned.
> > At the moment I have 2 datasource providers (what the clients are
> > actually connecting to), but I wanted to be able to bring more online
> > to handle the workload if needed.
> >
> > So there are a few steps involved.
> >
> > Execution Unit #1
> > The first step is to get the total number of blocks. Next, the
> > blockhash at n++ is queried for and all the hashes are placed into an
> > array. When 7kb of blockhashes are in the array, a new SQS message is
> > sent. The array is cleared and gathering begins anew.
> >
> > Execution Unit #2
> > Reads the message queue, fetching arrays of block hashes. The array is
> > treated like a stack and a hash is popped off the top. We then query
> > the datasource for the actual block referenced by the hash and obtain
> > the tx hashes contained in it. The rest proceeds as in Execution Unit
> > #1, but messages are placed into a different queue.
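[For illustration, a rough node.js sketch of the Execution Unit #1 producer
described above. This is not the code from the thread: getBlockCount() and
getBlockHash() are hypothetical stand-ins for whatever calls the datasource
actually exposes, the queue URL is a placeholder, and the queue side uses
the aws-sdk SQS client.]

  // Execution Unit #1 (sketch): gather block hashes, batch them up to ~7KB,
  // and push each batch onto an SQS queue.
  var AWS = require('aws-sdk');
  var sqs = new AWS.SQS({ region: 'us-east-1' });
  var QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/.../blockhash-queue'; // placeholder

  // Stand-ins for the real datasource calls (bitcoind JSON-RPC, a block
  // explorer API, etc.); replace with whatever the providers expose.
  function getBlockCount(cb) { cb(null, 274910); }
  function getBlockHash(n, cb) { cb(null, 'blockhash-' + n); }

  function flush(batch, cb) {
    sqs.sendMessage({ QueueUrl: QUEUE_URL, MessageBody: JSON.stringify(batch) }, cb);
  }

  getBlockCount(function (err, count) {
    if (err) throw err;
    var batch = [];
    (function next(n) {
      if (n >= count) {
        if (batch.length) flush(batch, function () {});  // send the final partial batch
        return;
      }
      getBlockHash(n, function (err, hash) {
        if (err) throw err;
        batch.push(hash);
        if (JSON.stringify(batch).length >= 7 * 1024) {  // roughly 7KB per message
          flush(batch, function (err) {
            if (err) throw err;
            batch = [];
            next(n + 1);
          });
        } else {
          next(n + 1);
        }
      });
    })(0);
  });

[Execution Unit #2 would have the same shape, except it starts from
sqs.receiveMessage() on this queue and writes batches of tx hashes to a
second queue instead of gathering blockhashes.]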
> >
> > Execution Unit #3
> > Reads the tx queue, fetching the tx hashes. Query the datasource for
> > full tx's for each hash. Do some data transformations on the tx's and
> > stuff them into an RDS such as mysql. [A sketch of this consumer step
> > appears at the end of the message.]
> >
> > It sounds simple enough, but I have a limited number of datasource
> > providers. One is mine and I can control it; the other is a public
> > resource and I have to be very careful not to overwhelm them. More than
> > 1 query per second from a single IP address, or 2 queries per second
> > from multiple IPs on the same account, will trigger a disconnect.
> >
> > There are currently ~275,000 blocks. If the average block contains
> > 10 tx, that's over 2.75M queries on top of the blockhash queries.
> > Furthermore, a tx consists of 2 parts, a txin and a txout. Txouts are
> > simple endpoints, but txins contain a reference to the hash of the
> > source tx and an offset of the txout it originates from. So with the
> > exception of coinbase transactions I'm looking at a minimum of 2 and an
> > average of probably 5 or 10 reverse lookups per tx. The blockchain
> > itself is only about 10GB right now. I can see the final datastore
> > being >100GB without even really trying.
> >
> > This is making me think I might be better off getting my information
> > directly from the p2p network instead of using a single datasource or
> > even a handful. But I don't want to try and implement the low-level
> > details of bitcoind in javascript, and all the libraries I've seen seem
> > to have major issues in current versions of node. So this is what I've
> > got :(
> >
> > In the back of my mind I'm seeing an easier and possibly faster way to
> > accomplish this using a scatter-gather technique, but the idea itself
> > is not clearly formed yet.
> >
> >
> > On Thu, Dec 12, 2013 at 3:20 PM, Levi Pearson <[email protected]> wrote:
> >>
> >> On Thu, Dec 12, 2013 at 2:16 PM, S. Dale Morrey <[email protected]> wrote:
> >>
> >> > Now I've got to figure out how to slow down the requests and
> >> > gradually feed them to the server. Batching will help somewhat, but
> >> > the max I can send in a batch is 100, and even then it's going to
> >> > quickly overwhelm the server to send them out without a delay
> >> > between sends.
> >>
> >> My previous email suggested setting the timeout to i * 500ms, which
> >> will space them 500ms apart. You can easily space them more the
> >> obvious way. To introduce a batching-style periodic delay of, say, 5
> >> seconds for every 100 requests, you would simply add another term to
> >> your delay calculation by doing an integer divide of i by 100 and
> >> multiplying by 5 seconds.
> >>
> >> > Another option is to get rid of the loop entirely and just have the
> >> > callback calling in a cycle, but I'm afraid that's going to smash
> >> > the stack :(
> >>
> >> One of the most frustrating aspects of callback-style programming, but
> >> perhaps fortuitous for you in this instance, is that each callback
> >> executes with a fresh stack. It's in no way connected to the stack
> >> frame that generated the closure; indeed, the callback might be a
> >> top-level function and not a nested closure at all. So if each
> >> invocation updates its loop variables and sets a timeout callback for
> >> 500ms in the future to invoke itself, you'll use no extra stack space
> >> at all. This will also eliminate the problem of the closures always
> >> referring to the loop index variable of a loop that already completed!
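[Putting Levi's two suggestions into code, roughly. These are two
alternatives, not one program; fetchTx(), handleResult(), and the hashes
array are hypothetical stand-ins for whatever request and processing code
is actually in use.]

  // Stand-ins; replace with the real request/processing functions.
  function fetchTx(hash, cb) { cb(null, { hash: hash }); }
  function handleResult(err, tx) { if (err) console.error(err); }
  var hashes = ['tx-hash-1', 'tx-hash-2', 'tx-hash-3'];

  // Loop version: requests spaced 500ms apart, with an extra 5-second pause
  // added for every 100 requests, as described above.
  hashes.forEach(function (hash, i) {
    var delay = i * 500 + Math.floor(i / 100) * 5000;
    setTimeout(function () { fetchTx(hash, handleResult); }, delay);
  });

  // Callback-cycle version: no loop at all. Each invocation handles one hash
  // and schedules the next via setTimeout, which runs the callback on a
  // fresh stack, so the "recursion" never grows the call stack and there is
  // no shared loop index for the closures to misread.
  function run(list, i) {
    if (i >= list.length) return;
    fetchTx(list[i], function (err, tx) {
      handleResult(err, tx);
      setTimeout(function () { run(list, i + 1); }, 500);
    });
  }
  run(hashes, 0);

[The constants are easy to adjust to whatever rate the datasource providers
will actually tolerate.]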
> >> --Levi

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
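[And the consumer step promised above: a rough node.js sketch of Execution
Unit #3, reading batches of tx hashes from SQS, querying the datasource,
and inserting rows into MySQL. The SQS calls use the aws-sdk client; the
mysql driver, the connection details, the txs table, and fetchTx() are all
hypothetical placeholders, not anything confirmed by the thread.]

  // Execution Unit #3 (sketch): pull a batch of hashes off the queue, look
  // each one up, transform, and insert into the RDS.
  var AWS = require('aws-sdk');
  var mysql = require('mysql');   // the node-mysql driver
  var sqs = new AWS.SQS({ region: 'us-east-1' });
  var QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/.../txhash-queue';   // placeholder

  var db = mysql.createConnection({ host: 'my-rds-endpoint', user: 'app',
                                    password: 'secret', database: 'chain' }); // placeholders

  // Stand-in for the real datasource lookup.
  function fetchTx(hash, cb) { cb(null, { hash: hash, value: 0 }); }

  function drain(hashes, done) {
    if (hashes.length === 0) return done();
    var hash = hashes.pop();                    // treat the array like a stack
    fetchTx(hash, function (err, tx) {
      if (err) return done(err);
      db.query('INSERT INTO txs SET ?', { hash: tx.hash, value: tx.value },
        function (err) {
          if (err) return done(err);
          // pause between datasource calls so the provider isn't overwhelmed
          setTimeout(function () { drain(hashes, done); }, 1000);
        });
    });
  }

  function poll() {
    sqs.receiveMessage({ QueueUrl: QUEUE_URL, MaxNumberOfMessages: 1,
                         WaitTimeSeconds: 20 }, function (err, data) {
      if (err) { console.error(err); return setTimeout(poll, 5000); }
      if (!data.Messages || data.Messages.length === 0) return poll();
      var msg = data.Messages[0];
      drain(JSON.parse(msg.Body), function (err) {
        if (err) console.error(err);            // a real version would retry instead
        sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle },
          function () { poll(); });
      });
    });
  }

  poll();

[Execution Unit #2 is the same consumer pattern, with a block lookup in
place of the tx lookup and a sendMessage() to the tx queue in place of the
INSERT.]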
