I did see the pull request. I'm out of town at the moment and probably won't 
get a chance to review it until the first of next week.

Nathan Palmer 

Sent from my Phone

On Jan 21, 2011, at 2:01 AM, Simone Busoli <simone.bus...@gmail.com> wrote:

> The discussion was taken off the list, but there was a requirement for a branching 
> operation which streams data correctly. The mail thread is below. There is a 
> pending pull request on the main repo.
> 
> On Fri, Jan 21, 2011 at 07:56, Simone Busoli <simone.bus...@gmail.com> wrote:
> Cool, great to hear that!
> 
> On Fri, Jan 21, 2011 at 06:42, Shannon Marsh <sjmars...@gmail.com> wrote:
> Simone,
>  
> We finally got to implement the MultiThreadedBranchingOperation in our ETL 
> process and it works very well. 
>  
> BEFORE:
> 2011-01-21 04:04:19,410 [1] INFO  WA.LS.Migration.ETL.Runner [(null)] - 
> Process Name PartyRegisterETLProcess. Current Memory Usage: 1495 MB.
> 2011-01-21 04:04:19,410 [1] INFO  WA.LS.Migration.ETL.Runner [(null)] - 
> Process Name PartyRegisterETLProcess. Peak Memory Usage: 1691 MB.
> 
> AFTER:
> 2011-01-20 18:54:30,500 [10] INFO  WA.LS.Migration.ETL.Runner [(null)] - 
> Process Name PartyRegisterETLProcess. Current Memory Usage: 48 MB.
> 2011-01-20 18:54:30,500 [10] INFO  WA.LS.Migration.ETL.Runner [(null)] - 
> Process Name PartyRegisterETLProcess. Peak Memory Usage: 59 MB.
> Thanks again,
>  
> Shannon
> On Mon, Jan 3, 2011 at 7:25 PM, Simone Busoli <simone.bus...@gmail.com> wrote:
> Hi Shannon, in the meantime I improved it a bit by removing the need for 
> using the threaded pipeline, and I also noticed that a non-caching 
> single-threaded pipeline has recently been added, so you can use either of the two. 
> I also submitted a pull request to the main repository.
> 
> Regards, Simone
> 
> On Sun, Jan 2, 2011 at 23:35, Shannon Marsh <sjmars...@gmail.com> wrote:
> Hi Simone,
> 
>  
> 
> Thanks, that seems like a more robust solution.  I agree that there could be 
> a problem if there were too many child operations, but in our case our ETL 
> process is fairly simple.  The problem was just the number of records we 
> were dealing with.  I guess it comes down to using the right tool for the 
> job.  This solution gives us another tool to choose from.  Can’t wait to try 
> it out when I’m back in the office next week.
> 
>  
> 
> Regards,
> 
> Shannon
> 
>  
> 
> From: Simone Busoli [mailto:simone.bus...@gmail.com] 
> Sent: Sunday, 2 January 2011 10:51 AM
> 
> 
> To: Shannon Marsh
> Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does 
> not?
>  
> 
> Hi Shannon, I pushed a change which includes a new operation that optimizes 
> memory consumption in branching scenarios, although it performs worse than 
> the other one in terms of duration due to thread synchronization (which, BTW, 
> shouldn't be a big problem as long as you don't branch to too many child 
> operations). It's called MultiThreadedBranchingOperation. Take into account 
> that it needs the multi-threaded pipeline runner, because the single-threaded 
> one relies on the caching enumerable, which caches all the input rows 
> and exhibits the same issue you have described.
> My branch on github is here.
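> 
> To give an idea of how it works, here is a simplified, hypothetical C# sketch 
> (not the code in the pull request; the only Rhino ETL members it assumes are 
> Row, Row.Clone() and IOperation.Execute()): a single pass over the input feeds 
> a small bounded queue per branch, and each branch consumes its queue on its own 
> thread, so no branch ever forces the whole input to be cached.
> 
>     using System.Collections.Concurrent;
>     using System.Collections.Generic;
>     using System.Linq;
>     using System.Threading.Tasks;
>     using Rhino.Etl.Core;
>     using Rhino.Etl.Core.Operations;
> 
>     static class StreamingBranchSketch
>     {
>         // Hypothetical sketch only -- not the actual MultiThreadedBranchingOperation.
>         public static void Branch(IEnumerable<Row> rows, IList<IOperation> branches)
>         {
>             // one small bounded queue per branch; producers block instead of caching everything
>             var queues = branches.Select(_ => new BlockingCollection<Row>(100)).ToList();
>             var workers = branches.Select((op, i) => Task.Factory.StartNew(() =>
>             {
>                 // drain the branch so its side effects (e.g. a bulk insert) actually run
>                 foreach (var ignored in op.Execute(queues[i].GetConsumingEnumerable())) { }
>             })).ToArray();
> 
>             foreach (Row row in rows)           // a single pass over the input
>                 foreach (var queue in queues)
>                     queue.Add(row.Clone());     // each branch gets its own copy of the row
> 
>             foreach (var queue in queues) queue.CompleteAdding();
>             Task.WaitAll(workers);
>         }
>     }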
> 
> On Wed, Dec 29, 2010 at 12:43, Simone Busoli <simone.bus...@gmail.com> wrote:
> 
> Hi Shannon, thanks for the update. Unfortunately your solution won't work 
> correctly in the general case: as you stated, the chunking you have 
> implemented implies executing the operations more than once, which is not the 
> desired behavior. I will look into it today to find out whether there is a more 
> general solution to the problem.
> 
> Simone
> 
>  
> 
> On Tue, Dec 28, 2010 at 00:49, Shannon Marsh <sjmars...@gmail.com> wrote:
> 
> Hi Simone,
> 
>  
> 
> I did manage to make some progress with this; however, I am on holiday at the 
> moment so I haven’t been into the office to test whether my solution works on a 
> large scale.
> 
>  
> 
> I modified the BranchingOperation code to “chunk” the data coming through the 
> pipeline, and was able to work around the issue with the operations being 
> called multiple times by adding a line to the SqlBulkInsertOperation to check 
> the dictionary before adding the key/value pair in the “CreateInputSchema” 
> method.  See attached files for changes. 
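> 
> To illustrate the kind of check I mean (a purely hypothetical illustration with 
> made-up names; the attached files contain the actual change): the schema 
> dictionary registers a column only if it isn’t already there, so the operation 
> can run once per chunk without throwing “key has already been added”.
> 
>     using System;
>     using System.Collections.Generic;
> 
>     // Hypothetical illustration only -- not the real SqlBulkInsertOperation source.
>     class InputSchemaSketch
>     {
>         private readonly IDictionary<string, Type> schema = new Dictionary<string, Type>();
> 
>         public void AddColumn(string column, Type type)
>         {
>             if (schema.ContainsKey(column))  // the added check: already registered by an earlier chunk
>                 return;
>             schema.Add(column, type);
>         }
>     }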
> 
>  
> 
> These changes seem to make the Fibonacci Branching performance test pass when 
> setting the number of rows to over a million.  If I monitor the memory usage 
> while the test is running it seems to peak at a much lower amount. 
> 
>  
> 
> I will be back in the office on the 10th of January, so I will be able to test it 
> with our ETL application then.  Rather than actually modifying the Rhino ETL 
> source as I have done in testing, I was planning on just extending these 
> operation classes and overriding the required methods.  Assuming that it all 
> works, I would also look at making the “chunk” size configurable.
> 
>  
> 
> Regards,
> 
>  
> 
> Shannon
> 
>  
> 
>  
> 
> From: Simone Busoli [mailto:simone.bus...@gmail.com] 
> Sent: Monday, 27 December 2010 5:53 PM
> To: Shannon Marsh
> 
> 
> Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does 
> not?
> 
>  
> 
> Hi Shannon, any news about this issue?
> 
> On Mon, Dec 13, 2010 at 21:55, Simone Busoli <simone.bus...@gmail.com> wrote:
> 
> Sure, keep us informed of the progress; in any case, in the next few days I 
> might find some time to look into it, too.
> 
>  
> 
> On Sun, Dec 12, 2010 at 22:45, Shannon Marsh <sjmars...@gmail.com> wrote:
> 
> Thanks, 
> 
>  
> 
> I was looking at the code to see if I could batch the rows (chunking).  Mixed 
> results so far.  I seem to have problems re-iterating through the operations. 
>  The first batch of rows goes through perfectly, but the 2nd batch fails when 
> trying to call the same instance of the operation again, failing in the 
> PrepareMapping method on SqlBulkInsertOperation ("...key has already been 
> added, etc.").  I'll continue to investigate and get back to you when I find 
> a solution.  I am thinking I may need to clone the operations for each batch. 
>  In the meantime the memory upgrade to our server should get us out of 
> trouble.
> 
> Regards,
> 
> Shannon
> 
> On Mon, Dec 13, 2010 at 3:00 AM, Simone Busoli <simone.bus...@gmail.com> 
> wrote:
> 
> Hi Shannon, I see your point. The tricky part here is that we need to provide 
> the same set of rows to each operation without iterating through them more 
> than once. I haven't tried it, but maybe branching them in batches is doable. 
> Have you looked at the code?
> 
>  
> 
> On Sun, Dec 12, 2010 at 01:05, Shannon Marsh <sjmars...@gmail.com> wrote:
> 
> Hi Simone,
> 
>  
> 
> Reading the entire post, zvolkov talks about the problem of a “file is so huge 
> as to not fit into memory” and wanting to be “pulling and pushing one record at a 
> time but never trying to accumulate all records in memory”.  That is what 
> sparked my interest, as it sounded exactly like the problem we were experiencing. 
> 
>  
> 
> Later zvolkov says “maybe cache only a few rows but not all” and webpaul 
> says “If you make an IEnumerable that copies the rows one at a time and feed 
> that to the operation.Execute”. 
> 
>  
> 
> So my understanding of how the fix would work was that it would take a copy 
> of the row and serve it out to each branch consuming the row, then repeat 
> this for each row in the pipeline.  Something like this.
> 
>  
> 
> Branch 1 – Operation 1 – Execute on Row 1.
> 
> Branch 1 – Operation 2 – Execute on Row 1.
> 
> Branch 1 – Operation n – Execute on Row 1.
> 
>  
> 
> Branch 2 – Operation 1 – Execute on Row 1.
> 
> Branch 2 – Operation 2 – Execute on Row 1.
> 
> Branch 2 – Operation n – Execute on Row 1.
> 
>  
> 
> Branch n – Operation 1 – Execute on Row 1.
> 
> Branch n – Operation 2 – Execute on Row 1.
> 
> Branch n – Operation n – Execute on Row 1.
> 
>  
> 
> Branch 1 – Operation 1 – Execute on Row 2.
> 
> Branch 1 – Operation 2 – Execute on Row 2.
> 
> Branch 1 – Operation n – Execute on Row 2.
> 
>  
> 
> Branch 2 – Operation 1 – Execute on Row 2.
> 
> Branch 2 – Operation 2 – Execute on Row 2.
> 
> Branch 2 – Operation n – Execute on Row 2.
> 
>  
> 
> Branch n – Operation 1 – Execute on Row 2.
> 
> Branch n – Operation 2 – Execute on Row 2.
> 
> Branch n – Operation n – Execute on Row 2.
> 
>  
> 
> …etc, etc… for each row.
> 
>  
> 
> With that approach I don’t think you would ever need to accumulate rows in 
> memory.   To be honest though I haven’t considered the technicalities of 
> implementing this and whether it is achievable with the IEnumerable model.  
> This was just how I imagined it would work after reading this post.
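> 
> In code, the behaviour I imagined looks roughly like this (purely illustrative 
> C#; it ignores how the IEnumerable pipeline model is really wired up, and the 
> method and parameter names are made up):
> 
>     using System.Collections.Generic;
>     using Rhino.Etl.Core;
>     using Rhino.Etl.Core.Operations;
> 
>     static class RowByRowSketch
>     {
>         // Purely illustrative -- not how BranchingOperation is actually written.
>         public static void Execute(IEnumerable<Row> rows, IEnumerable<IOperation> branchOperations)
>         {
>             foreach (Row row in rows)                      // one row at a time
>                 foreach (IOperation operation in branchOperations)
>                     // hand each operation a single-row enumerable and drain it so it runs;
>                     // only the current row (and its clones) is ever held in memory
>                     foreach (var ignored in operation.Execute(new[] { row.Clone() })) { }
>         }
>     }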
> 
>  
> 
> What seems to happen is that streaming does occur, but only for the first 
> branch.  The second and subsequent branches have to wait until the first 
> branch has consumed all the rows before they start, meaning that all the rows 
> need to be cached in RAM during the first branch to be available for the 
> later branches.  Like this…
> 
>  
> 
> Branch 1 – Operation 1 – Execute on Row 1.
> 
> Branch 1 – Operation 2 – Execute on Row 1.
> 
> Branch 1 – Operation n – Execute on Row 1.
> 
>  
> 
> Branch 1 – Operation 1 – Execute on Row 2.
> 
> Branch 1 – Operation 2 – Execute on Row 2.
> 
> Branch 1 – Operation n – Execute on Row 2.
> 
>  
> 
> (When branch 1 is complete - all rows cached in RAM)
> 
>  
> 
> Branch 2 – Operation 1 – Execute on Row 1.
> 
> Branch 2 – Operation 2 – Execute on Row 1.
> 
> Branch 2 – Operation n – Execute on Row 1.
> 
>  
> 
> Branch 2 – Operation 1 – Execute on Row 2.
> 
> Branch 2 – Operation 2 – Execute on Row 2.
> 
> Branch 2 – Operation n – Execute on Row 2.
> 
>  
> 
> Branch n – Operation 1 – Execute on Row 1.
> 
> Branch n – Operation 2 – Execute on Row 1.
> 
> Branch n – Operation n – Execute on Row 1.
> 
>  
> 
> Branch n – Operation 1 – Execute on Row 2.
> 
> Branch n – Operation 2 – Execute on Row 2.
> 
> Branch n – Operation n – Execute on Row 2.
> 
>  
> 
> So again, have I completely misunderstood how it should work?  Or could 
> we be doing something wrong that causes all the rows to be cached in memory?
> 
>  
> 
> Thanks Again,
> 
>  
> 
> Shannon
> 
>  
> 
> From: Simone B. [mailto:simone.bus...@gmail.com] 
> Sent: Friday, 10 December 2010 11:40 AM
> To: ShannonM
> Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does 
> not?
> 
>  
> 
> Hi Shannon, can you explain what behavior you would expect?
> 
> On Fri, Dec 10, 2010 at 01:19, ShannonM <sjmars...@gmail.com> wrote:
> 
> Hello,
> 
> I realise this is an old post but we seem to be experiencing a similar
> issue with memory usage.  Our ETL process works with approximately 1
> million records per source table.  For a standard straight through
> process there are no problems.  The rows just stream through and load
> to the database and memory peaks at approx 110MB.  However, wherever
> we use a BranchingOperation in our process, the rows accumulate in
> memory while executing the first branch (presumably so that they are
> available for the remaining branches).  The subsequent branches do not
> execute until the first one has completed.
> 
> The problem is that this usually consumes approx 1.6GB of memory and,
> depending on other processes running on our server, can sometimes cause
> a memory exception: "System.AccessViolationException: Attempted to read
> or write protected memory. This is often an indication that other
> memory is corrupt.". We can work around this issue by re-starting
> services (e.g. SQL Server) or re-booting the server prior to running
> the ETL process to ensure there are no rogue processes hogging
> memory.  We are also considering moving to a 64-bit OS and adding more
> RAM.
> 
> I investigated the Fibonacci Branching tests in the Rhino ETL source
> and they seem to behave exactly as I described.  If I debug the test
> named CanBranchThePipelineEfficiently() I can actually duplicate the
> scenario, and you can see that all the rows are cached in memory after
> the first branch executes.
> 
> Your post seems to indicate that you fixed this issue by using the
> caching enumerable.  Am I misunderstanding your post, or could we be
> doing something wrong?
> 
> While we will probably go ahead with the server upgrade anyway, it would
> be nice to make our ETL process more efficient and not consume so much
> memory if it can be avoided.
> 
> 
> 
> On Jul 5 2009, 9:22 pm, Simone Busoli <simone.bus...@gmail.com> wrote:
> > Fixed. What it's doing now is wrap the input enumerable into a caching
> > enumerable and then feed a clone of each row into the operations making up
> > the branch.
> >
> >
> >
> 
> > On Thu, Jul 2, 2009 at 16:53, webpaul <goo...@webpaul.net> wrote:
> >
> > > Looking at the code, I think the only reason it isn't yield returning
> > > right now is so it can copy the rows. If you make an IEnumerable that
> > > copies the rows one at a time and feed that to the operation.Execute I
> > > think that is all that is needed.
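> >
> > > Something along these lines, say (a hypothetical sketch only; it assumes
> > > System.Collections.Generic plus Rhino.Etl.Core's Row and Row.Clone()):
> > >
> > >     static IEnumerable<Row> CopyOneAtATime(IEnumerable<Row> rows)
> > >     {
> > >         foreach (Row row in rows)
> > >             yield return row.Clone();  // copy each row lazily instead of caching the whole set
> > >     }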
> >
> > > On Jul 2, 7:17 am, zvolkov <zvol...@gmail.com> wrote:
> > > > uhh... maybe cache only a few rows but not all? Assuming I branch in
> > > > 2, at worst I will need to cache as many rows as the disparity between
> > > > the two consumers of my two output streams... Makes sense?
> 

-- 
You received this message because you are subscribed to the Google Groups 
"Rhino Tools Dev" group.
To post to this group, send email to rhino-tools-dev@googlegroups.com.
To unsubscribe from this group, send email to 
rhino-tools-dev+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/rhino-tools-dev?hl=en.
