Matt,

Using coordinators to kick off the Hive jobs when the Sqoop outputs become available would be an option to keep things simple. The only constraint is that you'll need to model that assuming all your inputs/outputs have a fixed frequency.
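A minimal sketch of that approach, assuming a daily frequency and a `_SUCCESS` done-flag written by the Sqoop import; the app name, dataset name, and HDFS paths are illustrative placeholders, not anything from your setup:

```xml
<!-- Hypothetical coordinator: triggers the Hive transformation workflow
     once the day's Sqoop output directory for one table lands in HDFS.
     All names and paths below are placeholders. -->
<coordinator-app name="hive-transform-coord"
                 frequency="${coord:days(1)}"
                 start="2012-07-01T00:00Z" end="2013-07-01T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="sqoop_out" frequency="${coord:days(1)}"
             initial-instance="2012-07-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///staging/mysql1/table_a/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- coordinator waits for this flag file in the directory -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="sqoop_out">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/hive-transform-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

One coordinator per table keeps each Hive transformation independent of the serial Sqoop workflow, at the cost of fixing every dataset to the same daily cadence.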
Thx

On Thu, Jul 19, 2012 at 12:03 PM, Mona Chitnis <[email protected]> wrote:

> Matt,
>
> Virag's illustration explains the approach very well.
>
> However, you mentioned a requirement of 'forking but not requiring all of
> the forked nodes to rejoin the primary workflow'. The fork-join pair
> construct in Oozie will mandate the forked Hive extractions to ultimately
> join the main workflow. So is there any different requirement of
> asynchronous behavior that is not getting fulfilled yet?
>
> --
> Mona Chitnis
>
>
> On 7/19/12 11:35 AM, "Virag Kothari" <[email protected]> wrote:
>
>> Matt,
>>
>> To me, the nested forks option you are considering looks good. It's also
>> better to have the join in pairs.
>> For example, if you have S1, S2, S3 as your serial Sqoop extractions
>> and H1, H2, H3 as the corresponding asynchronous Hive extractions,
>> then you can have:
>>
>> S1 -> Fork1
>> Fork1 -> {S2, H1}
>> S2 -> Fork2
>> Fork2 -> {S3, H2}
>> S3 -> H3
>> {H3, H2} -> Join2
>> {Join2, H1} -> Join1
>>
>> However, I have not encountered many workflows using Sqoop and Hive. So
>> in terms of workflow design, you can get opinions from other people in
>> the community.
>>
>> Thanks,
>> Virag
>>
>>
>> On 7/18/12 10:52 AM, "Matt Goeke" <[email protected]> wrote:
>>
>>> Let me see if I can give a better summary of what we are trying to do.
>>> Our use case is such that we have a set of MySQL instances and we would
>>> like to control the number of connections that we establish to them for
>>> Sqoop extractions. Within each instance we can have several tables we
>>> are targeting for that daily extraction. Our ETL process involves the
>>> mentioned Sqoop table extractions into a Hive warehouse and then a
>>> transformation from the Hive staging area into a date-partitioned set
>>> of Hive tables (with a few column name transformations as well). We
>>> would like to establish an Oozie workflow per MySQL instance and use
>>> the DAG to properly queue Sqoop table extractions such that no more
>>> than one Sqoop action is happening at any time. The issue I am running
>>> into is that I need to find a way to have the Hive transformation run
>>> asynchronously from the serial Sqoop queue. In other words, I would
>>> like to avoid 1) having the next Sqoop table extraction wait on the
>>> previous Hive transformation, and 2) having to move all of the Hive
>>> transformations to the bottom of the DAG (I would like to be able to
>>> run them as soon as the Sqoop table has been extracted).
>>>
>>> I have tinkered with the thought of having a coordinator job staged for
>>> every Hive transformation and then using a data availability clause to
>>> allow it to run, but this gets more difficult when you are trying to
>>> watch data folders that have been directly imported into Hive. The
>>> other route I have looked into is a series of nested forks in which I
>>> call the Hive transformation and the next Sqoop action in parallel from
>>> a completed Sqoop action.
>>>
>>> Let me know if there are any documented best practices around these
>>> kinds of flows or if I need to try to balance this across more than
>>> just Oozie.
>>>
>>> --
>>> Matt Goeke
>>>
>>> On Tue, Jul 17, 2012 at 3:07 PM, Virag Kothari <[email protected]> wrote:
>>>
>>>> Matt,
>>>> It's always better to have a join for the corresponding fork. I think
>>>> it would be better if you clarify in the question more about your
>>>> workflow design and the requirement for asynchronous spikes.
>>>>
>>>> Thanks,
>>>> Virag
>>>>
>>>>
>>>> On 7/17/12 2:30 PM, "Matt Goeke" <[email protected]> wrote:
>>>>
>>>>> Virag,
>>>>>
>>>>> Thanks for the response. I have read the workflow spec, and while I
>>>>> realize there is the ability to fork within a workflow, my issue is
>>>>> that all forks must be paired with joins. What I was looking for was
>>>>> some way to fork but not require all of the forked nodes to rejoin
>>>>> the primary workflow (hence some of the nodes becoming asynchronous
>>>>> spikes). I feel like this capability might already exist and this
>>>>> might just be an issue of workflow/subworkflow composition.
>>>>>
>>>>> --
>>>>> Matt Goeke
>>>>>
>>>>> On Tue, Jul 17, 2012 at 2:00 PM, Virag Kothari <[email protected]> wrote:
>>>>>
>>>>>> Hi Matt,
>>>>>> I think you can fork the Hive actions using the fork/join control
>>>>>> nodes in Oozie:
>>>>>>
>>>>>> http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
>>>>>>
>>>>>> I have no idea why the attachment doesn't work.
>>>>>>
>>>>>> Thanks,
>>>>>> Virag
>>>>>>
>>>>>>
>>>>>> On 7/17/12 12:13 PM, "Matt Goeke" <[email protected]> wrote:
>>>>>>
>>>>>>> Apparently when I put an Imgur link in the reply the spam score
>>>>>>> gets high enough that the delivery is denied... is there any way
>>>>>>> to link an image? Also, if not, then is there anything I can
>>>>>>> clarify in the question that would make it more straightforward?
>>>>>>>
>>>>>>> --
>>>>>>> Matt Goeke
>>>>>>>
>>>>>>> On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <[email protected]> wrote:
>>>>>>>
>>>>>>>> The attachment hasn't come through. This had happened with an
>>>>>>>> earlier email with the Oozie Meetup slides attachments too. Any
>>>>>>>> solutions?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mona Chitnis
>>>>>>>>
>>>>>>>> From: Matt Goeke <[email protected]>
>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Subject: Oozie: asynchronous forking
>>>>>>>>
>>>>>>>> All,
>>>>>>>>
>>>>>>>> Does anyone know if it is possible to do asynchronous forking in
>>>>>>>> Oozie? Currently we are running a set of ETL extractions that are
>>>>>>>> pairs of actions (a Sqoop action then a Hive transformation), but
>>>>>>>> we would like to have the Sqoop actions be serial and the Hive
>>>>>>>> actions be called asynchronously when the paired Sqoop job
>>>>>>>> finishes. The reason the Sqoop actions are serial is that we would
>>>>>>>> like to limit the number of concurrent mappers hitting the data
>>>>>>>> source; we could do this through the fair scheduler, but that
>>>>>>>> would require a pool per data source. Attached is a picture of the
>>>>>>>> suggested ETL flow.
>>>>>>>>
>>>>>>>> If anyone has any suggestions on best practices around this I
>>>>>>>> would love to hear them.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Matt

--
Alejandro
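The nested fork/join layout Virag sketched above (S1 -> Fork1 -> {S2, H1}, and so on) could be expressed roughly like this in workflow XML. This is only a skeleton under that design: the workflow name is made up, the Sqoop/Hive action bodies are elided as comments, and every fork is paired with its mandatory join:

```xml
<!-- Hypothetical skeleton of the nested fork/join design discussed above.
     Sqoop (S1-S3) and Hive (H1-H3) action bodies are elided. -->
<workflow-app name="serial-sqoop-async-hive" xmlns="uri:oozie:workflow:0.2">
  <start to="S1"/>

  <action name="S1">
    <!-- sqoop action body for table 1 -->
    <ok to="fork1"/><error to="fail"/>
  </action>
  <fork name="fork1"><path start="S2"/><path start="H1"/></fork>

  <action name="S2">
    <!-- sqoop action body for table 2 -->
    <ok to="fork2"/><error to="fail"/>
  </action>
  <fork name="fork2"><path start="S3"/><path start="H2"/></fork>

  <action name="S3">
    <!-- sqoop action body for table 3 -->
    <ok to="H3"/><error to="fail"/>
  </action>
  <action name="H3">
    <!-- hive transformation for table 3 -->
    <ok to="join2"/><error to="fail"/>
  </action>
  <action name="H2">
    <!-- hive transformation for table 2 -->
    <ok to="join2"/><error to="fail"/>
  </action>
  <join name="join2" to="join1"/>

  <action name="H1">
    <!-- hive transformation for table 1 -->
    <ok to="join1"/><error to="fail"/>
  </action>
  <join name="join1" to="end"/>

  <kill name="fail"><message>ETL workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```

Note the trade-off this thread circles around: each Sqoop action starts as soon as its predecessor finishes, and each Hive action starts as soon as its own Sqoop import is done, but the joins mean the workflow as a whole only completes once every Hive branch has rejoined.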
