Hi Rich, I am very curious about your work. With the Node 0.10 release I have been searching for an ETL tool built on Node streams. I am looking for the Node.js equivalent of http://pandas.pydata.org/ (an amazing Python tool for ETL).
Could you post an example of what you've accomplished already? Thank you.

On Friday, 22 April 2011 at 22:58:41 UTC+2, Rich Schiavi wrote:

Luke,

Not sure if my other reply got sent, but the approach I've explored, and which is working well, is to stage each part of the ETL process instead of trying to interleave the db writes.

Basically, I'm ingesting about 300MB worth of XML that needs all sorts of element transforms before it gets into MySQL. I first parse it with SAX, generating each set of objects. From those objects, instead of doing row-by-row MySQL inserts, I dump them out to TSV files, which can then be loaded far more efficiently with MySQL's LOAD DATA INFILE.

For me, though, the key to not slamming the database (or Node) is to make sure each stage completes before starting any async/parallel database writes. For ETL this is probably better anyway, since it avoids failures and partial writes if parsing dies partway through.

Let me know if you have questions; I can post my example code. It's surprisingly small for the amount of work it does: 300MB out to MySQL load files in about 160 seconds.
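[Editor's note: a minimal sketch of the staged approach Rich describes, assuming the "sax" npm module, an invented <user>/<id>/<name> record layout, and placeholder MySQL credentials, database, and table names. Stage 1 streams the XML through a SAX parser and appends one tab-separated line per record; only once the TSV file is fully written does stage 2 bulk-load it with LOAD DATA INFILE via the mysql command-line client.]

    // Stage 1: stream the XML through a SAX parser and write TSV rows.
    // The <user>/<id>/<name> element names are invented for this example.
    var fs = require('fs');
    var sax = require('sax');                    // npm install sax
    var exec = require('child_process').exec;

    var out = fs.createWriteStream('users.tsv');
    var parser = sax.createStream(true);         // strict mode
    var row = null, field = null;

    parser.on('opentag', function (node) {
      if (node.name === 'user') row = { id: '', name: '' };
      else if (row) field = node.name;
    });
    parser.on('text', function (text) {
      if (row && field && row.hasOwnProperty(field)) row[field] += text.trim();
    });
    parser.on('closetag', function (name) {
      if (name === 'user' && row) {
        out.write(row.id + '\t' + row.name + '\n');   // one TSV line per record
        row = null;
      }
      field = null;
    });

    parser.on('end', function () {
      out.end(function () {
        // Stage 2: only after the TSV is fully on disk, bulk-load it in one go.
        // The credentials, database, and table names here are placeholders.
        exec('mysql --local-infile=1 -u etl etl_db -e ' +
             '"LOAD DATA LOCAL INFILE \'users.tsv\' INTO TABLE users"',
             function (err) {
               if (err) throw err;
               console.log('load complete');
             });
      });
    });

    fs.createReadStream('big.xml').pipe(parser);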
On Mar 15, 1:26 am, Luke Monahan <[email protected]> wrote:

A quick update. I figured out that I could call Stream.pause() on the internal stream in the CSV library very easily. With that in mind, it wasn't hard to keep a "currentlyRunning" counter of CSV rows that had been parsed but not yet inserted into the database. Once currentlyRunning gets above 100k I pause the stream, and I resume it when the counter drops back below 10k.

This way everything runs smoothly without huge amounts of memory. I can drop the number of riak client connections down to a saner level (4 or so works for me) and the whole lot is much faster.

Thanks,
Luke.

On Mar 10, 9:40 am, Luke Monahan <[email protected]> wrote:

node-inspector is working well, thanks.

I've been able to dump the heap before and during processing. I'm seeing large numbers of String, object, and closure constructors, which between them contribute 99% of the heap. The tool doesn't let me drill down any further than that, as far as I can tell.

But the total heap size is well and truly less than a problem -- 20MB or so with a largeish file -- whereas the operating system is telling me the entire process is using over 200MB at the same time. Am I misunderstanding the output? Or is this pointing in another direction, such as a recursive call that is creating a huge stack?

I've managed to make it work fine by cutting my CSV files into smaller chunks (100MB each) and processing them serially, making sure the number of unsaved CSV rows is 0 before starting on the next file. I've also worked out why I can't use the protobuf API (I need to install http://code.google.com/p/protobuf-for-node/). I'll spend a little while getting that working, but then I'll probably move on to the more interesting parts of my idea than just loading up data.

Thanks again,
Luke.

On Mar 10, 2:51 am, Nicholas Campbell <[email protected]> wrote:

Valgrind is good, but there is also https://github.com/dannycoates/node-inspector. I haven't yet had a need for it, but I know others who have used it and love it.

- Nick Campbell
http://digitaltumbleweed.com

On Wed, Mar 9, 2011 at 2:08 AM, Luke Monahan <[email protected]> wrote:

Thanks for the suggestions. It looks like a custom stream would be a good idea, as the "pump" between streams already has throttling baked in. I haven't looked at this in depth yet, so I'll do that shortly to figure out whether it's viable. A custom CSV parser that specifically supports throttling will be my next option.

I'm using the REST API, as the protobuf support in riak-js doesn't seem to work for me -- I just get an error trying to instantiate the connection. I might have to switch to git head or something instead of the npm package to try that out.

I'll see if I can profile to find where the memory is specifically going before all the above anyway. I just found I could only get about 200MB of the file fully processed and saved before "Allocation failed - process out of memory", which I think is the ~1GB V8 heap limit. There's obviously more to the story than just rows of CSV being kept in memory. Is there a recommended tool or method for tracing memory leaks in Node.js? The valgrind approach is all I could find in this group and it seemed beyond me...

Luke.
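[Editor's note: a minimal sketch of the high/low-water-mark throttling Luke settles on in his Mar 15 update, assuming a "rowStream" that emits one parsed CSV row per 'data' event and a "saveRow(row, callback)" function that performs the async database write; both are placeholders for whatever CSV parser and client you actually use.]

    // Pause the source when too many rows are parsed but not yet written,
    // and resume once the write backlog has drained back below a low mark.
    var HIGH_WATER = 100000;   // pause parsing above this many pending writes
    var LOW_WATER  = 10000;    // resume parsing once we drop back below this

    var pending = 0;
    var paused = false;

    rowStream.on('data', function (row) {
      pending++;
      if (!paused && pending > HIGH_WATER) {
        rowStream.pause();     // stop the flood while the DB catches up
        paused = true;
      }
      saveRow(row, function (err) {
        if (err) console.error('write failed', err);
        pending--;
        if (paused && pending < LOW_WATER) {
          rowStream.resume();  // backlog has drained enough, keep parsing
          paused = false;
        }
      });
    });

    rowStream.on('end', function () {
      // 'end' only means parsing finished; wait for pending writes to drain.
      var timer = setInterval(function () {
        if (pending === 0) { clearInterval(timer); console.log('all rows saved'); }
      }, 100);
    });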
On Mar 9, 5:11 pm, Nicholas Campbell <[email protected]> wrote:

Luke,

A couple of GB of data shouldn't crush the db when writing to it (in theory ;D).

Have you identified Riak as the slowdown? What backend are you using for Riak? Could it be the module? Are you using proto_bufs or the REST API? Have you used the inspector? Where is the memory increase happening -- in Node, due to the slowdown/backup?

- Nick Campbell
http://digitaltumbleweed.com

On Tue, Mar 8, 2011 at 11:28 PM, Marco Rogers <[email protected]> wrote:

Luke. ETL is an interesting use case for node, and one I haven't heard a lot of people talking about. Some quick thoughts.

- Your pooling tactic is a good idea in general. Use as many connections as your db can handle for a high write load.
- It sounds like you're aware that your main problem is throttling the rows coming from the CSV. Node fs streams are fast :) You're in the right neighborhood with the pause() method. You want to couple this with whatever method you're using to determine load on the DB.
- You could read a number of items from the csv stream, then pause it and process that batch. Resume when the batch is done.
- Or you could let the stream flow until some alert tells you the database is saturated with writes, then call pause on the csv stream until it cools down.

If this is an infrequent job, I would go with the second option. If you want to run this longer term, I would probably explore the first, and tweak the batch size and pool size to keep throughput high.

Lastly, I don't think you're missing anything magical with node. It's supposed to be pretty low level, and people are building handy utilities on top of it. It might help to read more about streams and maybe think about writing your own.

http://nodejs.org/docs/v0.4.2/api/streams.html

Your custom stream could sit between the csv file stream and the database-writing code. It would handle the batching or throttling any way you wanted.

I'm not sure this is actually helpful. You can always gist some code and ask more specific questions too. Feel free to share your experiences with this as well.

:Marco
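[Editor's note: a sketch of the kind of custom stream Marco suggests, sitting between the CSV parser and the database writes. It uses the Writable class from Node's later streams API (the API has moved on since the 0.4 docs linked above), so piping into it gives batching plus built-in backpressure: the upstream source stays paused while each batch insert is outstanding. "insertBatch" (an async bulk insert) and the batch size are placeholders.]

    var Writable = require('stream').Writable;
    var util = require('util');

    function BatchWriter(batchSize, insertBatch) {
      Writable.call(this, { objectMode: true, highWaterMark: batchSize * 2 });
      this.batchSize = batchSize;
      this.insertBatch = insertBatch;
      this.batch = [];
    }
    util.inherits(BatchWriter, Writable);

    BatchWriter.prototype._write = function (row, enc, done) {
      this.batch.push(row);
      if (this.batch.length < this.batchSize) return done();  // keep accumulating
      var rows = this.batch;
      this.batch = [];
      this.insertBatch(rows, done);  // upstream is throttled until this completes
    };

    // Usage, with whatever row stream and bulk insert you already have:
    //   csvRowStream.pipe(new BatchWriter(500, insertBatch));
    // Note: flush any final partial batch yourself when the source ends
    // (or implement _final on newer Node versions).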
