Hi Miles, glad it helped. If you push your changes to your repo and then file a pull request, Nathan will review your code.
On Sun, Jan 23, 2011 at 02:26, Miles Waller <miles.wal...@gmail.com> wrote:

Thanks Simon - that is really useful. By chance, I have just got a requirement for a branching operation that needs to be memory efficient, so this will save me loads of time.

Also, I'm using the non-caching single-threaded pipeline to avoid memory issues, for an ETL that merges several files together. I reworked the join implementation so that the various legs of the merge run independently rather than one after the other. I got about a 20-30% speed increase.

Happy to share if it's useful for anyone.

Cheers

Miles

On Fri, Jan 21, 2011 at 7:01 AM, Simone Busoli <simone.bus...@gmail.com> wrote:

It was taken off the list, but there was a requirement for a branching operation which streams data correctly. The mail thread is below. There is a pending pull request on the main repo.

On Fri, Jan 21, 2011 at 07:56, Simone Busoli <simone.bus...@gmail.com> wrote:

Cool, great to hear that!

On Fri, Jan 21, 2011 at 06:42, Shannon Marsh <sjmars...@gmail.com> wrote:

Simone,

We finally got to implement the MultiThreadedBranchingOperation in our ETL process and it works very well.

BEFORE:
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 1495 MB.
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 1691 MB.

AFTER:
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 48 MB.
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 59 MB.

Thanks again,

Shannon

On Mon, Jan 3, 2011 at 7:25 PM, Simone Busoli <simone.bus...@gmail.com> wrote:

Hi Shannon, in the meanwhile I improved it a bit by removing the need for the threaded pipeline, and I also noticed that a non-caching single-threaded pipeline has recently been added, so you can use either of the two. I also forwarded a pull request to the main repository.

Regards, Simone

On Sun, Jan 2, 2011 at 23:35, Shannon Marsh <sjmars...@gmail.com> wrote:

Hi Simone,

Thanks, that seems like a more robust solution. I agree that there could be a problem if there were too many child operations, but in our case our ETL process is fairly simplistic; the problem was just the number of records we were dealing with. I guess it comes down to using the right tool for the job, and this solution gives us another tool to choose from. Can't wait to try out this solution when I'm back in the office next week.

Regards,

Shannon

*From:* Simone Busoli [mailto:simone.bus...@gmail.com]
*Sent:* Sunday, 2 January 2011 10:51 AM
*To:* Shannon Marsh
*Subject:* Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, I pushed a change which includes a new operation that optimizes memory consumption in branching scenarios, although it performs worse than the existing one in terms of duration due to thread synchronization (which, by the way, shouldn't be a big problem as long as you don't branch into too many child operations). It's called MultiThreadedBranchingOperation. Take into account that it needs the multi-threaded pipeline runner, because the single-threaded one relies on the caching enumerable, which caches all the input rows and exhibits the same issue you have described.

My branch on GitHub is here: <https://github.com/simoneb/rhino-etl>.
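A rough sketch of the general idea behind such an operation (not the code from Simone's branch): each child operation drains its own bounded queue on a dedicated thread, and the parent pushes a clone of every row as it arrives, so the input never accumulates. It assumes Rhino ETL's AbstractOperation, IOperation, Row and Row.Clone() as discussed in this thread; the class name, the Add method, the queue bound and the use of .NET 4's BlockingCollection are illustrative assumptions.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Rough sketch only (not the MultiThreadedBranchingOperation from Simone's branch).
// Each child operation consumes rows from its own bounded queue on its own thread,
// so rows stream through instead of being cached for replay.
public class StreamingBranchSketch : AbstractOperation
{
    private readonly List<IOperation> children = new List<IOperation>();

    public StreamingBranchSketch Add(IOperation child)
    {
        children.Add(child);
        return this;
    }

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        var queues = new List<BlockingCollection<Row>>();
        var workers = new List<Thread>();

        foreach (IOperation child in children)
        {
            IOperation childOperation = child;            // local copy, avoids captured loop variable issues
            var queue = new BlockingCollection<Row>(1000); // bounded, so memory stays flat (arbitrary bound)
            queues.Add(queue);

            var worker = new Thread(() =>
            {
                // Execute is lazy, so force enumeration to actually drive the child.
                foreach (Row ignored in childOperation.Execute(queue.GetConsumingEnumerable())) { }
            });
            worker.Start();
            workers.Add(worker);
        }

        foreach (Row row in rows)
        {
            // Each child gets its own copy of the row, like the stock BranchingOperation does.
            foreach (var queue in queues)
                queue.Add(row.Clone());
        }

        foreach (var queue in queues) queue.CompleteAdding();
        foreach (var worker in workers) worker.Join();

        yield break; // branches are terminal here; nothing flows downstream
    }
}
```

As Simone notes above, the real operation also needs the multi-threaded pipeline runner, since the single-threaded one re-introduces the caching enumerable and with it the memory problem.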
On Wed, Dec 29, 2010 at 12:43, Simone Busoli <simone.bus...@gmail.com> wrote:

Hi Shannon, thanks for the update. Unfortunately your solution won't work correctly in the general case: as you stated, the chunking you have implemented implies executing the operations more than once, which is not the desired behavior. I will look into it today to find out if there is a more general solution to the problem.

Simone

On Tue, Dec 28, 2010 at 00:49, Shannon Marsh <sjmars...@gmail.com> wrote:

Hi Simone,

I did manage to make some progress with this, however I am on holiday at the moment so I haven't been into the office to test whether my solution works on a large scale.

I modified the BranchingOperation code to "chunk" the data coming through the pipeline, and was able to work around the issue with the operations being called multiple times by adding a line to the SqlBulkInsertOperation to check the dictionary before adding the key/value pair in the "CreateInputSchema" method. See the attached files for the changes.

These changes seem to make the Fibonacci branching performance test pass when setting the number of rows to over a million. If I monitor the memory usage while the test is running, it seems to peak at a much lower amount.

I will be back in the office on 10th January, so I will be able to test it with our ETL application then. Rather than actually modifying the Rhino ETL source as I have done in testing, I was planning on just extending these operation classes and overriding the required methods. Assuming that it all works, I would also look at making the "chunk" size configurable.

Regards,

Shannon

*From:* Simone Busoli [mailto:simone.bus...@gmail.com]
*Sent:* Monday, 27 December 2010 5:53 PM
*To:* Shannon Marsh
*Subject:* Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, any news about this issue?

On Mon, Dec 13, 2010 at 21:55, Simone Busoli <simone.bus...@gmail.com> wrote:

Sure, keep us informed of the progress; in any case, in the next few days I might find some time to look into it too.

On Sun, Dec 12, 2010 at 22:45, Shannon Marsh <sjmars...@gmail.com> wrote:

Thanks,

I was looking at the code to see if I could batch the rows (chunking). Mixed results so far. I seem to have problems re-iterating through the operations: the first batch of rows goes through perfectly, but the 2nd batch fails when trying to call the same instance of the operation again, failing in the PrepareMapping method on SqlBulkInsertOperation ("...key has already been added, etc."). I'll continue to investigate and get back to you when I find a solution. I am thinking I may need to clone the operations for each batch. In the meantime, the memory upgrade to our server should get us out of trouble.

Regards,

Shannon

On Mon, Dec 13, 2010 at 3:00 AM, Simone Busoli <simone.bus...@gmail.com> wrote:

Hi Shannon, I see your point. The tricky part here is that we need to provide the same set of rows to each operation without iterating through them more than once. I didn't try that; maybe branching them in batches is doable. Have you looked at the code?
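The attached files are not reproduced here, but the batching idea is simple to picture. The helper below is hypothetical (written from the description above, not taken from Shannon's changes): it splits the row stream into fixed-size batches that a modified branching operation could replay per child, keeping only one batch in memory at a time.

```csharp
using System.Collections.Generic;
using Rhino.Etl.Core;

// Hypothetical chunking helper, sketched from Shannon's description rather than
// from the attached files: yields fixed-size batches of rows so a modified
// branching operation can replay one small batch per child at a time, trading
// repeated child invocations for a bounded memory footprint.
public static class ChunkingSketch
{
    public static IEnumerable<List<Row>> InChunksOf(IEnumerable<Row> rows, int chunkSize)
    {
        var chunk = new List<Row>(chunkSize);
        foreach (var row in rows)
        {
            chunk.Add(row);
            if (chunk.Count == chunkSize)
            {
                yield return chunk;
                chunk = new List<Row>(chunkSize);
            }
        }
        if (chunk.Count > 0)
            yield return chunk; // last, possibly partial, batch
    }
}
```

Because each batch re-enters the child operations, any per-execution state (such as the schema dictionary that SqlBulkInsertOperation builds in CreateInputSchema) needs a guard against adding the same key twice, which is exactly the tweak Shannon describes.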
On Sun, Dec 12, 2010 at 01:05, Shannon Marsh <sjmars...@gmail.com> wrote:

Hi Simone,

Reading the entire post, zvolkov talks about the problem of a "file so huge as to not fit into memory" and wanting "pulling and pushing one record at a time but never trying to accumulate all records in memory". That is what sparked my interest, as it sounded exactly like the problem we were experiencing.

Later zvolkov says "maybe cache only a few rows but not all", and webpaul says "If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute".

So my understanding of how the fix would work was that it would take a copy of each row and serve it out to every branch consuming the row, then repeat this for each row in the pipeline. Something like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.

Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

...and so on for each row.

With that approach I don't think you would ever need to accumulate rows in memory. To be honest, though, I haven't considered the technicalities of implementing this or whether it is achievable with the IEnumerable model; this is just how I imagined it would work after reading the post.

What seems to happen is that streaming does occur, but only for the first branch. The second and subsequent branches have to wait until the first branch has consumed all the rows before they start, meaning that all the rows need to be cached in RAM during the first branch so they are available for the later branches. Like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.

(When branch 1 is complete, all rows are cached in RAM.)

Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

So again, I may have completely misunderstood how it should work, or could we be doing something wrong that causes all the rows to be cached in memory?

Thanks again,

Shannon
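That ordering follows naturally from a pull-based branching operation: each child pulls the whole input through its own call to Execute, so the parent has to hold (or memoize) every row in order to replay it for the second and later children. The sketch below only illustrates that shape; it is not Rhino ETL's actual BranchingOperation, and the class name and Add method are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Illustration only (not the real BranchingOperation): when every child pulls the
// entire input through Execute, the parent must keep all rows around so later
// children can replay them, which matches the "first branch finishes, then the
// rest start" ordering and the memory peak described above.
public class CachingBranchIllustration : AbstractOperation
{
    private readonly List<IOperation> children = new List<IOperation>();

    public CachingBranchIllustration Add(IOperation child)
    {
        children.Add(child);
        return this;
    }

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        // Materialize everything up front: memory grows with the full row count.
        var cached = rows.ToList();

        foreach (var child in children)
        {
            // Child 1 runs to completion over the whole cached set before child 2 starts.
            var copies = cached.Select(r => r.Clone());
            foreach (var ignored in child.Execute(copies)) { }
        }
        yield break;
    }
}
```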
*From:* Simone B. [mailto:simone.bus...@gmail.com]
*Sent:* Friday, 10 December 2010 11:40 AM
*To:* ShannonM
*Subject:* Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, can you explain what behavior you would expect?

On Fri, Dec 10, 2010 at 01:19, ShannonM <sjmars...@gmail.com> wrote:

Hello,

I realise this is an old post, but we seem to be experiencing a similar issue with memory usage. Our ETL process works with approximately 1 million records per source table. For a standard straight-through process there are no problems: the rows just stream through and load to the database, and memory peaks at approx 110 MB. However, wherever we use a BranchingOperation in our process the rows accumulate in memory while executing the first branch (presumably so they are available for the remaining branches). The subsequent branches do not execute until the first one has completed.

The problem is that this usually consumes approx 1.6 GB of memory and, depending on other processes running on our server, can sometimes cause a memory exception: "System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt." We can work around this issue by restarting services (e.g. SQL Server) or rebooting the server prior to running the ETL process, to ensure there are no rogue processes hogging memory. We are also considering moving to a 64-bit OS and adding more RAM.

I investigated the Fibonacci branching tests in the Rhino ETL source and it seems to behave exactly as I described. If I debug the test named CanBranchThePipelineEfficiently() I can actually duplicate the scenario, and you can see that all the rows are cached in memory after the first branch executes.

Your post seems to indicate that you fixed this issue by using the caching enumerable. Am I misunderstanding your post, or could we be doing something wrong?

While we will probably go ahead with the server upgrade anyway, it would be nice to make our ETL process more efficient and not consume so much memory if it can be avoided.

On Jul 5 2009, 9:22 pm, Simone Busoli <simone.bus...@gmail.com> wrote:

Fixed. What it's doing now is wrap the input enumerable into a caching enumerable and then feed a clone of each row into the operations making up the branch.

On Thu, Jul 2, 2009 at 16:53, webpaul <goo...@webpaul.net> wrote:

Looking at the code, I think the only reason it isn't yield returning right now is so it can copy the rows. If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute, I think that is all that is needed.

On Jul 2, 7:17 am, zvolkov <zvol...@gmail.com> wrote:

Uhh... maybe cache only a few rows but not all? Assuming I branch into 2, at worst I will need to cache as many rows as the disparity between the two consumers of my two output streams... Makes sense?
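The fix described above hinges on that caching enumerable. The sketch below shows the general shape of such a memoizing wrapper, assuming a single-threaded pipeline; it is written from the description in this thread, not copied from Rhino ETL's own CachingEnumerable, so the class name and details are approximations.

```csharp
using System.Collections;
using System.Collections.Generic;

// Rough sketch of a memoizing "caching enumerable" in the spirit of the fix
// described above (not Rhino ETL's actual CachingEnumerable): rows are cached as
// the first consumer pulls them, so later consumers replay them from the cache.
// Not thread-safe; assumes a single-threaded pipeline.
public class CachingEnumerableSketch<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> source;
    private readonly List<T> cache = new List<T>();
    private bool exhausted;

    public CachingEnumerableSketch(IEnumerable<T> source)
    {
        this.source = source.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (true)
        {
            if (index < cache.Count)
            {
                yield return cache[index++];    // replay what is already cached
            }
            else if (!exhausted && source.MoveNext())
            {
                cache.Add(source.Current);       // the first consumer fills the cache
                yield return cache[index++];
            }
            else
            {
                exhausted = true;
                yield break;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

It also makes the trade-off in the original report visible: once a second branch starts pulling, every row the first branch has already consumed is still sitting in the cache, which is why peak memory tracks the full data set.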
--
You received this message because you are subscribed to the Google Groups "Rhino Tools Dev" group.
To post to this group, send email to rhino-tools-dev@googlegroups.com.
To unsubscribe from this group, send email to rhino-tools-dev+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rhino-tools-dev?hl=en.