Hi Miles, glad it helped. If you push your changes to your repo and then file a pull request, Nathan will review your code.
On Sun, Jan 23, 2011 at 02:26, Miles Waller <miles.wal...@gmail.com> wrote:

Thanks Simon - that is really useful. By chance, I have just got a requirement for a branching operation that needs to be memory efficient, so this will save me loads of time.

Also, I'm using the non-caching single-threaded pipeline to avoid memory issues, for an ETL that merges several files together. I reworked the join implementation so that the various legs of the merge run independently rather than one after the other. I got about a 20-30% speed increase.

Happy to share if it's useful for anyone.

Cheers

Miles

On Fri, Jan 21, 2011 at 7:01 AM, Simone Busoli <simone.bus...@gmail.com> wrote:

It was taken off the list, but there was a requirement for a branching operation which streams data correctly. The mail thread is below. There is a pending pull request on the main repo.

On Fri, Jan 21, 2011 at 07:56, Simone Busoli <simone.bus...@gmail.com> wrote:

Cool, great to hear that!

On Fri, Jan 21, 2011 at 06:42, Shannon Marsh <sjmars...@gmail.com> wrote:

Simone,

We finally got to implement the MultiThreadedBranchingOperation in our ETL process and it works very well.

BEFORE:
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 1495 MB.
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 1691 MB.

AFTER:
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 48 MB.
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 59 MB.

Thanks again,

Shannon

On Mon, Jan 3, 2011 at 7:25 PM, Simone Busoli <simone.bus...@gmail.com> wrote:

Hi Shannon, in the meanwhile I improved it a bit by removing the need for the threaded pipeline, and I also noticed that a non-caching single-threaded pipeline has recently been added, so you can use either of the two. I also forwarded a pull request to the main repository.

Regards, Simone

On Sun, Jan 2, 2011 at 23:35, Shannon Marsh <sjmars...@gmail.com> wrote:

Hi Simone,

Thanks, that seems like a more robust solution. I agree that there could be a problem if there were too many child operations, but in our case our ETL process is fairly simplistic; the problem was just the number of records we were dealing with. I guess it comes down to using the right tool for the job, and this solution gives us another tool to choose from. Can't wait to try out this solution when I'm back in the office next week.

Regards,

Shannon

*From:* Simone Busoli [mailto:simone.bus...@gmail.com]
*Sent:* Sunday, 2 January 2011 10:51 AM
*To:* Shannon Marsh
*Subject:* Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, I pushed a change which includes a new operation that optimizes memory consumption in branching scenarios, although it performs worse than the existing one in terms of duration due to thread synchronization (which, by the way, shouldn't be a big problem as long as you don't branch into too many child operations). It's called MultiThreadedBranchingOperation. Take into account that it needs the multi-threaded pipeline runner, because the single-threaded one relies on the caching enumerable, which caches all the input rows and exhibits the same issue you have described.

My branch on GitHub is here: <https://github.com/simoneb/rhino-etl>.
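A rough sketch of the general idea behind such an operation (not the code from Simone's branch): each child operation drains its own bounded queue on a dedicated thread, and the parent pushes a clone of every row as it arrives, so the input never accumulates. It assumes Rhino ETL's AbstractOperation, IOperation, Row and Row.Clone() as discussed in this thread; the class name, the Add method, the queue bound and the use of .NET 4's BlockingCollection are illustrative assumptions.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Rough sketch only (not the MultiThreadedBranchingOperation from Simone's branch).
// Each child operation consumes rows from its own bounded queue on its own thread,
// so rows stream through instead of being cached for replay.
public class StreamingBranchSketch : AbstractOperation
{
    private readonly List<IOperation> children = new List<IOperation>();

    public StreamingBranchSketch Add(IOperation child)
    {
        children.Add(child);
        return this;
    }

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        var queues = new List<BlockingCollection<Row>>();
        var workers = new List<Thread>();

        foreach (IOperation child in children)
        {
            IOperation childOperation = child;            // local copy, avoids captured loop variable issues
            var queue = new BlockingCollection<Row>(1000); // bounded, so memory stays flat (arbitrary bound)
            queues.Add(queue);

            var worker = new Thread(() =>
            {
                // Execute is lazy, so force enumeration to actually drive the child.
                foreach (Row ignored in childOperation.Execute(queue.GetConsumingEnumerable())) { }
            });
            worker.Start();
            workers.Add(worker);
        }

        foreach (Row row in rows)
        {
            // Each child gets its own copy of the row, like the stock BranchingOperation does.
            foreach (var queue in queues)
                queue.Add(row.Clone());
        }

        foreach (var queue in queues) queue.CompleteAdding();
        foreach (var worker in workers) worker.Join();

        yield break; // branches are terminal here; nothing flows downstream
    }
}
```

As Simone notes above, the real operation also needs the multi-threaded pipeline runner, since the single-threaded one re-introduces the caching enumerable and with it the memory problem.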
On Wed, Dec 29, 2010 at 12:43, Simone Busoli <simone.bus...@gmail.com> wrote:

Hi Shannon, thanks for the update. Unfortunately your solution won't work correctly in the general case: as you stated, the chunking you have implemented implies executing the operations more than once, which is not the desired behavior. I will look into it today to find out if there is a more general solution to the problem.

Simone

On Tue, Dec 28, 2010 at 00:49, Shannon Marsh <sjmars...@gmail.com> wrote:

Hi Simone,

I did manage to make some progress with this, however I am on holiday at the moment so I haven't been into the office to test whether my solution works on a large scale.

I modified the BranchingOperation code to "chunk" the data coming through the pipeline, and was able to work around the issue with the operations being called multiple times by adding a line to the SqlBulkInsertOperation to check the dictionary before adding the key/value pair in the "CreateInputSchema" method. See the attached files for the changes.

These changes seem to make the Fibonacci branching performance test pass when setting the number of rows to over a million. If I monitor the memory usage while the test is running, it seems to peak at a much lower amount.

I will be back in the office on 10th January, so I will be able to test it with our ETL application then. Rather than actually modifying the Rhino ETL source as I have done in testing, I was planning on just extending these operation classes and overriding the required methods. Assuming that it all works, I would also look at making the "chunk" size configurable.

Regards,

Shannon

*From:* Simone Busoli [mailto:simone.bus...@gmail.com]
*Sent:* Monday, 27 December 2010 5:53 PM
*To:* Shannon Marsh
*Subject:* Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, any news about this issue?

On Mon, Dec 13, 2010 at 21:55, Simone Busoli <simone.bus...@gmail.com> wrote:

Sure, keep us informed of the progress; in any case, in the next few days I might find some time to look into it too.

On Sun, Dec 12, 2010 at 22:45, Shannon Marsh <sjmars...@gmail.com> wrote:

Thanks,

I was looking at the code to see if I could batch the rows (chunking). Mixed results so far. I seem to have problems re-iterating through the operations: the first batch of rows goes through perfectly, but the 2nd batch fails when trying to call the same instance of the operation again, failing in the PrepareMapping method on SqlBulkInsertOperation ("...key has already been added, etc."). I'll continue to investigate and get back to you when I find a solution. I am thinking I may need to clone the operations for each batch. In the meantime, the memory upgrade to our server should get us out of trouble.

Regards,

Shannon

On Mon, Dec 13, 2010 at 3:00 AM, Simone Busoli <simone.bus...@gmail.com> wrote:

Hi Shannon, I see your point. The tricky part here is that we need to provide the same set of rows to each operation without iterating through them more than once. I didn't try that; maybe branching them in batches is doable. Have you looked at the code?
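The attached files are not reproduced here, but the batching idea is simple to picture. The helper below is hypothetical (written from the description above, not taken from Shannon's changes): it splits the row stream into fixed-size batches that a modified branching operation could replay per child, keeping only one batch in memory at a time.

```csharp
using System.Collections.Generic;
using Rhino.Etl.Core;

// Hypothetical chunking helper, sketched from Shannon's description rather than
// from the attached files: yields fixed-size batches of rows so a modified
// branching operation can replay one small batch per child at a time, trading
// repeated child invocations for a bounded memory footprint.
public static class ChunkingSketch
{
    public static IEnumerable<List<Row>> InChunksOf(IEnumerable<Row> rows, int chunkSize)
    {
        var chunk = new List<Row>(chunkSize);
        foreach (var row in rows)
        {
            chunk.Add(row);
            if (chunk.Count == chunkSize)
            {
                yield return chunk;
                chunk = new List<Row>(chunkSize);
            }
        }
        if (chunk.Count > 0)
            yield return chunk; // last, possibly partial, batch
    }
}
```

Because each batch re-enters the child operations, any per-execution state (such as the schema dictionary that SqlBulkInsertOperation builds in CreateInputSchema) needs a guard against adding the same key twice, which is exactly the tweak Shannon describes.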
On Sun, Dec 12, 2010 at 01:05, Shannon Marsh <sjmars...@gmail.com> wrote:

Hi Simone,

Reading the entire post, zvolkov talks about the problem of a "file so huge as to not fit into memory" and wanting "pulling and pushing one record at a time but never trying to accumulate all records in memory". That is what sparked my interest, as it sounded exactly like the problem we were experiencing.

Later zvolkov says "maybe cache only a few rows but not all", and webpaul says "If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute".

So my understanding of how the fix would work was that it would take a copy of each row and serve it out to every branch consuming the row, then repeat this for each row in the pipeline. Something like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.

Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

...and so on for each row.

With that approach I don't think you would ever need to accumulate rows in memory. To be honest, though, I haven't considered the technicalities of implementing this or whether it is achievable with the IEnumerable model; this is just how I imagined it would work after reading the post.

What seems to happen is that streaming does occur, but only for the first branch. The second and subsequent branches have to wait until the first branch has consumed all the rows before they start, meaning that all the rows need to be cached in RAM during the first branch so they are available for the later branches. Like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.

(When branch 1 is complete, all rows are cached in RAM.)

Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

So again, I may have completely misunderstood how it should work, or could we be doing something wrong that causes all the rows to be cached in memory?

Thanks again,

Shannon
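That ordering follows naturally from a pull-based branching operation: each child pulls the whole input through its own call to Execute, so the parent has to hold (or memoize) every row in order to replay it for the second and later children. The sketch below only illustrates that shape; it is not Rhino ETL's actual BranchingOperation, and the class name and Add method are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Illustration only (not the real BranchingOperation): when every child pulls the
// entire input through Execute, the parent must keep all rows around so later
// children can replay them, which matches the "first branch finishes, then the
// rest start" ordering and the memory peak described above.
public class CachingBranchIllustration : AbstractOperation
{
    private readonly List<IOperation> children = new List<IOperation>();

    public CachingBranchIllustration Add(IOperation child)
    {
        children.Add(child);
        return this;
    }

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        // Materialize everything up front: memory grows with the full row count.
        var cached = rows.ToList();

        foreach (var child in children)
        {
            // Child 1 runs to completion over the whole cached set before child 2 starts.
            var copies = cached.Select(r => r.Clone());
            foreach (var ignored in child.Execute(copies)) { }
        }
        yield break;
    }
}
```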
*From:* Simone B. [mailto:simone.bus...@gmail.com]
*Sent:* Friday, 10 December 2010 11:40 AM
*To:* ShannonM
*Subject:* Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, can you explain what behavior you would expect?

On Fri, Dec 10, 2010 at 01:19, ShannonM <sjmars...@gmail.com> wrote:

Hello,

I realise this is an old post, but we seem to be experiencing a similar issue with memory usage. Our ETL process works with approximately 1 million records per source table. For a standard straight-through process there are no problems: the rows just stream through and load to the database, and memory peaks at approx 110 MB. However, wherever we use a BranchingOperation in our process the rows accumulate in memory while executing the first branch (presumably so they are available for the remaining branches). The subsequent branches do not execute until the first one has completed.

The problem is that this usually consumes approx 1.6 GB of memory and, depending on other processes running on our server, can sometimes cause a memory exception: "System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt." We can work around this issue by restarting services (e.g. SQL Server) or rebooting the server prior to running the ETL process, to ensure there are no rogue processes hogging memory. We are also considering moving to a 64-bit OS and adding more RAM.

I investigated the Fibonacci branching tests in the Rhino ETL source and it seems to behave exactly as I described. If I debug the test named CanBranchThePipelineEfficiently() I can actually duplicate the scenario, and you can see that all the rows are cached in memory after the first branch executes.

Your post seems to indicate that you fixed this issue by using the caching enumerable. Am I misunderstanding your post, or could we be doing something wrong?

While we will probably go ahead with the server upgrade anyway, it would be nice to make our ETL process more efficient and not consume so much memory if it can be avoided.

On Jul 5 2009, 9:22 pm, Simone Busoli <simone.bus...@gmail.com> wrote:

Fixed. What it's doing now is wrap the input enumerable into a caching enumerable and then feed a clone of each row into the operations making up the branch.

On Thu, Jul 2, 2009 at 16:53, webpaul <goo...@webpaul.net> wrote:

Looking at the code, I think the only reason it isn't yield returning right now is so it can copy the rows. If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute, I think that is all that is needed.

On Jul 2, 7:17 am, zvolkov <zvol...@gmail.com> wrote:

Uhh... maybe cache only a few rows but not all? Assuming I branch into 2, at worst I will need to cache as many rows as the disparity between the two consumers of my two output streams... Makes sense?
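The fix described above hinges on that caching enumerable. The sketch below shows the general shape of such a memoizing wrapper, assuming a single-threaded pipeline; it is written from the description in this thread, not copied from Rhino ETL's own CachingEnumerable, so the class name and details are approximations.

```csharp
using System.Collections;
using System.Collections.Generic;

// Rough sketch of a memoizing "caching enumerable" in the spirit of the fix
// described above (not Rhino ETL's actual CachingEnumerable): rows are cached as
// the first consumer pulls them, so later consumers replay them from the cache.
// Not thread-safe; assumes a single-threaded pipeline.
public class CachingEnumerableSketch<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> source;
    private readonly List<T> cache = new List<T>();
    private bool exhausted;

    public CachingEnumerableSketch(IEnumerable<T> source)
    {
        this.source = source.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (true)
        {
            if (index < cache.Count)
            {
                yield return cache[index++];    // replay what is already cached
            }
            else if (!exhausted && source.MoveNext())
            {
                cache.Add(source.Current);       // the first consumer fills the cache
                yield return cache[index++];
            }
            else
            {
                exhausted = true;
                yield break;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

It also makes the trade-off in the original report visible: once a second branch starts pulling, every row the first branch has already consumed is still sitting in the cache, which is why peak memory tracks the full data set.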
--
You received this message because you are subscribed to the Google Groups "Rhino Tools Dev" group.
To post to this group, send email to rhino-tools-dev@googlegroups.com.
To unsubscribe from this group, send email to rhino-tools-dev+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rhino-tools-dev?hl=en.