Hi Rich, I am very curious about your work. With the Node 0.10 release I have been searching for an ETL tool built on Node streams. I am looking for the Node.js equivalent of http://pandas.pydata.org/ (an amazing Python tool for ETL).
Could you post an example of what you've accomplished already? Thank you.

On Friday, 22 April 2011 at 22:58:41 UTC+2, Rich Schiavi wrote:

Luke,

Not sure if my other reply got sent, but the approach I've explored, and which is working well, is to stage each part of the ETL process instead of trying to interleave the db writes.

Basically, I'm ingesting about 300MB worth of XML that needs all sorts of element transforms before it gets into MySQL. I first parse it with SAX, generating each set of objects. From those objects, instead of doing row-by-row MySQL inserts, I dump them out to TSV files, which can then be loaded far more efficiently with MySQL's LOAD DATA INFILE.

For me, though, the key to not slamming the database (or Node) is to make sure each stage completes before starting any async/parallel database writes. For ETL this is probably better anyway, since it avoids failures and partial writes if parsing dies partway through.

Let me know if you have questions; I can post my example code. It's surprisingly small for the amount of work it does: 300MB out to MySQL load files in about 160 seconds.
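[Editor's note: a minimal sketch of the staged approach Rich describes, assuming the "sax" npm module, an invented <user>/<id>/<name> record layout, and placeholder MySQL credentials, database, and table names. Stage 1 streams the XML through a SAX parser and appends one tab-separated line per record; only once the TSV file is fully written does stage 2 bulk-load it with LOAD DATA INFILE via the mysql command-line client.]

    // Stage 1: stream the XML through a SAX parser and write TSV rows.
    // The <user>/<id>/<name> element names are invented for this example.
    var fs = require('fs');
    var sax = require('sax');                    // npm install sax
    var exec = require('child_process').exec;

    var out = fs.createWriteStream('users.tsv');
    var parser = sax.createStream(true);         // strict mode
    var row = null, field = null;

    parser.on('opentag', function (node) {
      if (node.name === 'user') row = { id: '', name: '' };
      else if (row) field = node.name;
    });
    parser.on('text', function (text) {
      if (row && field && row.hasOwnProperty(field)) row[field] += text.trim();
    });
    parser.on('closetag', function (name) {
      if (name === 'user' && row) {
        out.write(row.id + '\t' + row.name + '\n');   // one TSV line per record
        row = null;
      }
      field = null;
    });

    parser.on('end', function () {
      out.end(function () {
        // Stage 2: only after the TSV is fully on disk, bulk-load it in one go.
        // The credentials, database, and table names here are placeholders.
        exec('mysql --local-infile=1 -u etl etl_db -e ' +
             '"LOAD DATA LOCAL INFILE \'users.tsv\' INTO TABLE users"',
             function (err) {
               if (err) throw err;
               console.log('load complete');
             });
      });
    });

    fs.createReadStream('big.xml').pipe(parser);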
On Mar 15, 1:26 am, Luke Monahan <[email protected]> wrote:

A quick update. I figured out that I could call Stream.pause() on the internal stream in the CSV library very easily. With that in mind, it wasn't hard to keep a "currentlyRunning" counter of CSV rows that had been parsed but not yet inserted into the database. Once currentlyRunning gets above 100k I pause the stream, and I resume it when the counter drops back below 10k.

This way everything runs smoothly without huge amounts of memory. I can drop the number of riak client connections down to a saner level (4 or so works for me) and the whole lot is much faster.

Thanks,
Luke.

On Mar 10, 9:40 am, Luke Monahan <[email protected]> wrote:

node-inspector is working well, thanks.

I've been able to dump the heap before and during processing. I'm seeing large numbers of String, object, and closure constructors, which between them contribute 99% of the heap. The tool doesn't let me drill down any further than that, as far as I can tell.

But the total heap size is well and truly less than a problem -- 20MB or so with a largeish file -- whereas the operating system is telling me the entire process is using over 200MB at the same time. Am I misunderstanding the output? Or is this pointing in another direction, such as a recursive call that is creating a huge stack?

I've managed to make it work fine by cutting my CSV files into smaller chunks (100MB each) and processing them serially, making sure the number of unsaved CSV rows is 0 before starting on the next file. I've also worked out why I can't use the protobuf API (I need to install http://code.google.com/p/protobuf-for-node/). I'll spend a little while getting that working, but then I'll probably move on to the more interesting parts of my idea than just loading up data.

Thanks again,
Luke.

On Mar 10, 2:51 am, Nicholas Campbell <[email protected]> wrote:

Valgrind is good, but there is also https://github.com/dannycoates/node-inspector. I haven't yet had a need for it, but I know others who have used it and love it.

- Nick Campbell
http://digitaltumbleweed.com

On Wed, Mar 9, 2011 at 2:08 AM, Luke Monahan <[email protected]> wrote:

Thanks for the suggestions. It looks like a custom stream would be a good idea, as the "pump" between streams already has throttling baked in. I haven't looked at this in depth yet, so I'll do that shortly to figure out whether it's viable. A custom CSV parser that specifically supports throttling will be my next option.

I'm using the REST API, as the protobuf support in riak-js doesn't seem to work for me -- I just get an error trying to instantiate the connection. I might have to switch to git head or something instead of the npm package to try that out.

I'll see if I can profile to find where the memory is specifically going before all the above anyway. I just found I could only get about 200MB of the file fully processed and saved before "Allocation failed - process out of memory", which I think is the ~1GB V8 heap limit. There's obviously more to the story than just rows of CSV being kept in memory. Is there a recommended tool or method for tracing memory leaks in Node.js? The valgrind approach is all I could find in this group and it seemed beyond me...

Luke.
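[Editor's note: a minimal sketch of the high/low-water-mark throttling Luke settles on in his Mar 15 update, assuming a "rowStream" that emits one parsed CSV row per 'data' event and a "saveRow(row, callback)" function that performs the async database write; both are placeholders for whatever CSV parser and client you actually use.]

    // Pause the source when too many rows are parsed but not yet written,
    // and resume once the write backlog has drained back below a low mark.
    var HIGH_WATER = 100000;   // pause parsing above this many pending writes
    var LOW_WATER  = 10000;    // resume parsing once we drop back below this

    var pending = 0;
    var paused = false;

    rowStream.on('data', function (row) {
      pending++;
      if (!paused && pending > HIGH_WATER) {
        rowStream.pause();     // stop the flood while the DB catches up
        paused = true;
      }
      saveRow(row, function (err) {
        if (err) console.error('write failed', err);
        pending--;
        if (paused && pending < LOW_WATER) {
          rowStream.resume();  // backlog has drained enough, keep parsing
          paused = false;
        }
      });
    });

    rowStream.on('end', function () {
      // 'end' only means parsing finished; wait for pending writes to drain.
      var timer = setInterval(function () {
        if (pending === 0) { clearInterval(timer); console.log('all rows saved'); }
      }, 100);
    });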
On Mar 9, 5:11 pm, Nicholas Campbell <[email protected]> wrote:

Luke,

A couple of GB of data shouldn't crush the db when writing to it (in theory ;D).

Have you identified Riak as the slowdown? What backend are you using for Riak? Could it be the module? Are you using proto_bufs or the REST API? Have you used the inspector? Where is the memory increase happening -- in Node, due to the slowdown/backup?

- Nick Campbell
http://digitaltumbleweed.com

On Tue, Mar 8, 2011 at 11:28 PM, Marco Rogers <[email protected]> wrote:

Luke. ETL is an interesting use case for node, and one I haven't heard a lot of people talking about. Some quick thoughts.

- Your pooling tactic is a good idea in general. Use as many connections as your db can handle for a high write load.
- It sounds like you're aware that your main problem is throttling the rows coming from the CSV. Node fs streams are fast :) You're in the right neighborhood with the pause() method. You want to couple this with whatever method you're using to determine load on the DB.
- You could read a number of items from the csv stream, then pause it and process that batch. Resume when the batch is done.
- Or you could let the stream flow until some alert tells you the database is saturated with writes, then call pause on the csv stream until it cools down.

If this is an infrequent job, I would go with the second option. If you want to run this longer term, I would probably explore the first, and tweak the batch size and pool size to keep throughput high.

Lastly, I don't think you're missing anything magical with node. It's supposed to be pretty low level, and people are building handy utilities on top of it. It might help to read more about streams and maybe think about writing your own.

http://nodejs.org/docs/v0.4.2/api/streams.html

Your custom stream could sit between the csv file stream and the database-writing code. It would handle the batching or throttling any way you wanted.

I'm not sure this is actually helpful. You can always gist some code and ask more specific questions too. Feel free to share your experiences with this as well.

:Marco
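[Editor's note: a sketch of the kind of custom stream Marco suggests, sitting between the CSV parser and the database writes. It uses the Writable class from Node's later streams API (the API has moved on since the 0.4 docs linked above), so piping into it gives batching plus built-in backpressure: the upstream source stays paused while each batch insert is outstanding. "insertBatch" (an async bulk insert) and the batch size are placeholders.]

    var Writable = require('stream').Writable;
    var util = require('util');

    function BatchWriter(batchSize, insertBatch) {
      Writable.call(this, { objectMode: true, highWaterMark: batchSize * 2 });
      this.batchSize = batchSize;
      this.insertBatch = insertBatch;
      this.batch = [];
    }
    util.inherits(BatchWriter, Writable);

    BatchWriter.prototype._write = function (row, enc, done) {
      this.batch.push(row);
      if (this.batch.length < this.batchSize) return done();  // keep accumulating
      var rows = this.batch;
      this.batch = [];
      this.insertBatch(rows, done);  // upstream is throttled until this completes
    };

    // Usage, with whatever row stream and bulk insert you already have:
    //   csvRowStream.pipe(new BatchWriter(500, insertBatch));
    // Note: flush any final partial batch yourself when the source ends
    // (or implement _final on newer Node versions).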
