For option 1, I think the dataset id is not a unique identifier. Couldn't multiple transactions in one job work on the same dataset?
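For concreteness, here is a minimal sketch of what option 2 from the quoted thread below could look like. All names here (OpTracker, TxnContext, notifyEntityCommit) are hypothetical illustrations, not AsterixDB's actual classes: a single transaction context keeps a map from datasetId to that dataset's primary operation tracker, and the dataset id carried in each entity commit log record routes the decrement.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names for illustration only; not AsterixDB's actual API.
class OpTracker {
    final AtomicInteger activeOps = new AtomicInteger();

    void beginOp() { activeOps.incrementAndGet(); }

    void completeOp() { activeOps.decrementAndGet(); }
}

class TxnContext {
    // Option 2: one transaction spanning several datasets keeps a map from
    // datasetId to that dataset's primary operation tracker.
    private final Map<Integer, OpTracker> trackers = new HashMap<>();

    void registerPrimaryIndex(int datasetId, OpTracker tracker) {
        trackers.put(datasetId, tracker);
    }

    // Called when an entity commit log record is flushed; the dataset id
    // carried in the log record routes the decrement to the right tracker.
    void notifyEntityCommit(int datasetId) {
        trackers.get(datasetId).completeOp();
    }
}
```

Note that this sketch keys the map on datasetId alone, so it does not by itself answer the uniqueness objection above: if two transactions in one job touched the same dataset, the key would have to include the transaction id as well.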
Steven

On Fri, Nov 17, 2017 at 11:38 AM, abdullah alamoudi <[email protected]> wrote:

So, there are three options to do this:
1. Each of these operators works on a specific dataset, so we can pass the datasetId to the JobEventListenerFactory when requesting the transaction id.
2. We make one transaction work for multiple datasets by using a map from datasetId to primary opTracker, and use it when commits are reported by the log flusher thread.
3. Prevent a job from having multiple transactions. (For the record, I dislike this option, since the price we pay is very high IMO.)

Cheers,
Abdullah.

On Nov 17, 2017, at 11:32 AM, Steven Jacobs <[email protected]> wrote:

Well, we've solved the problem when there is only one transaction id per job. The operators can fetch the transaction ids from the JobEventListenerFactory (you can find this in master now). The issue is, when we are trying to combine multiple job specs into one feed job, the operators at runtime don't have a memory of which "job spec" they originally belonged to, which could tell them which of the transaction ids they should use.

Steven

On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi <[email protected]> wrote:

I think that this works, and it seems like the question is how the different operators in the job can get their transaction ids.

~Abdullah.

On Nov 17, 2017, at 11:21 AM, Steven Jacobs <[email protected]> wrote:

From the conversation, it seems like nobody has the full picture to propose the design?
For deployed jobs, the idea is to use the same job specification but create a new Hyracks job and Asterix transaction for each execution.

Steven

On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi <[email protected]> wrote:

I can e-meet anytime (moved to Sunnyvale).
We can also look at a proposed design and see if it can work.
Back to my question, how were you planning to change the transaction id if we forget about the case with multiple datasets (feed job)?

On Nov 17, 2017, at 10:38 AM, Steven Jacobs <[email protected]> wrote:

Maybe it would be good to have a meeting about this with all interested parties?

I can be on-campus at UCI on Tuesday if that would be a good day to meet.

Steven

On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi <[email protected]> wrote:

Also, I was wondering how you would do the same for a single dataset (non-feed). How would you get the transaction id and change it when you re-run?

On Nov 17, 2017 7:12 AM, "Murtadha Hubail" <[email protected]> wrote:

For atomic transactions, the change was merged yesterday. For entity-level transactions, it should be a very small change.

Cheers,
Murtadha

On Nov 17, 2017, at 6:07 PM, abdullah alamoudi <[email protected]> wrote:

I understand that is not the case right now, but isn't that what you're working on?

Cheers,
Abdullah.

On Nov 17, 2017, at 7:04 AM, Murtadha Hubail <[email protected]> wrote:

A transaction context can register multiple primary indexes. Since each entity commit log contains the dataset id, you can decrement the active operations on the operation tracker associated with that dataset id.

On 17/11/2017, 5:52 PM, "abdullah alamoudi" <[email protected]> wrote:

Can you illustrate how a deadlock can happen?
I am anxious to know.
Moreover, the reason for the multiple transaction ids in feeds is not simply that we compile them differently.

How would a commit operator know which dataset's active operation counter to decrement if they shared the same id, for example?

On Nov 16, 2017, at 9:46 PM, Xikui Wang <[email protected]> wrote:

Yes. That deadlock could happen. Currently, we have one-to-one mappings between jobs and transactions, except for the feeds.

@Abdullah, after some digging into the code, I think we can probably use a single transaction id for the job which feeds multiple datasets? See if I can convince you. :)

The reason we have multiple transaction ids in feeds is that we compile each connection job separately and combine them into a single feed job. A new transaction id is created and assigned to each connection job, so for the combined job we have to handle the different transactions as they are embedded in the connection job specifications. But what if we created a single transaction id for the combined job? That transaction id would be embedded into each connection so they can write logs freely, but the transaction would be started and committed only once, as there is only one feed job. In this way, we won't need the multiTransactionJobletEventListener, and the transaction id can be removed from the job specification easily as well (for Steven's change).
Best,
Xikui

On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey <[email protected]> wrote:

I worry about deadlocks. The waits-for graph may not understand that making t1 wait will also make t2 wait, since they may share a thread - right? Or do we have jobs and transactions separately represented there now?

On Nov 16, 2017 3:10 PM, "abdullah alamoudi" <[email protected]> wrote:

We are using multiple transactions in a single job in the case of feeds, and I think that this is the correct way.
Having a single job for a feed that feeds into multiple datasets is a good thing, since job resources/feed resources are consolidated.

Here are some points:
- We can't use the same transaction id to feed multiple datasets. The only other option is to have multiple jobs, each feeding a different dataset.
- Having multiple jobs (in addition to the extra memory and CPU used) would then force us either to read data from external sources multiple times, parse records multiple times, etc., or to have synchronization between the different jobs and the feed source within AsterixDB. IMO, this is far more complicated than having multiple transactions within a single job, and the costs far outweigh the benefits.

P.S.: We are also using this for bucket connections in Couchbase Analytics.
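Mike's deadlock worry above can be made concrete with a toy waits-for graph (illustrative code only, not AsterixDB's actual lock manager). Suppose t1 and t2 share a job thread, and t1 blocks on a lock held by t2: t2 can never reach the point where it releases the lock, yet a graph with one node per transaction sees only the edge t1 -> t2 and reports no cycle. Collapsing the nodes by the job that runs them exposes the self-wait:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy waits-for graph, for illustration of the shared-thread problem only.
class WaitsForGraph {
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addWait(String waiter, String holder) {
        edges.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    // A deadlock is a cycle: some node can reach itself via wait edges.
    boolean hasCycle() {
        for (String start : edges.keySet()) {
            if (reachable(start, start, new HashSet<>())) {
                return true;
            }
        }
        return false;
    }

    private boolean reachable(String from, String target, Set<String> seen) {
        for (String next : edges.getOrDefault(from, Set.of())) {
            if (next.equals(target)) {
                return true;
            }
            if (seen.add(next) && reachable(next, target, seen)) {
                return true;
            }
        }
        return false;
    }

    // Re-label each transaction node with the job (thread) that runs it,
    // modeling the fact that blocking one transaction blocks them all.
    WaitsForGraph collapseBy(Map<String, String> txnToJob) {
        WaitsForGraph g = new WaitsForGraph();
        edges.forEach((waiter, holders) -> holders.forEach(holder ->
                g.addWait(txnToJob.get(waiter), txnToJob.get(holder))));
        return g;
    }
}
```

The per-transaction view finds no cycle for the t1 -> t2 edge, while the per-job view turns it into a self-wait, which is exactly the case a detector with only transaction nodes would miss.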
On Nov 16, 2017, at 2:57 PM, Till Westmann <[email protected]> wrote:

If there are a number of issues with supporting multiple transaction ids and no clear benefits/use-cases, I'd vote for simplification :)
Also, code that's not being used has a tendency to "rot", and so I think that its usefulness might be limited by the time we'd find a use for this functionality.

My 2c,
Till

On 16 Nov 2017, at 13:57, Xikui Wang wrote:

I'm separating the connections into different jobs in some of my experiments... but that was intended for the experimental settings (i.e., not for master now)...

I think the interesting question here is whether we want to allow one Hyracks job to carry multiple transactions. I personally think that should be allowed, as the transaction and the job are two separate concepts, but I couldn't find such use cases other than the feeds. Does anyone have a good example of this?

Another question is, if we do allow multiple transactions in a single Hyracks job, how do we enable the commit runtime to obtain the correct TXN id without having it embedded as part of the job specification?
Best,
Xikui

On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi <[email protected]> wrote:

I am curious as to how feeds will work without this?

~Abdullah.

On Nov 16, 2017, at 12:43 PM, Steven Jacobs <[email protected]> wrote:

Hi all,
We currently have MultiTransactionJobletEventListenerFactory, which allows one Hyracks job to run multiple Asterix transactions together.

This class is only used by feeds, and feeds are in the process of changing to no longer need this feature. As part of the work on pre-deploying job specifications to be used by multiple Hyracks jobs, I've been working on removing the transaction id from the job specifications, since we use a new transaction for each invocation of a deployed job.

There is currently no clear way to remove the transaction id from the job spec while keeping the option for MultiTransactionJobletEventListenerFactory.

The question for the group is: do we see a need to maintain this class when it will no longer be used by any current code?
Or, in other words, is there a strong possibility that in the future we will want multiple transactions to share a single Hyracks job, meaning that it is worth figuring out how to maintain this class?

Steven
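Xikui's question in the thread (how the commit runtime obtains the correct TXN id without it being embedded in the job specification) combined with Steven's plan for deployed jobs (a fresh transaction per invocation of the same specification) suggests a per-job lookup. A hedged sketch under that assumption; TxnIdRegistry and its methods are invented names, not actual AsterixDB or Hyracks API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: the deployed job specification carries no transaction
// id. Each Hyracks job invocation is assigned a fresh id when it starts, and
// commit runtimes look the id up by job id at execution time.
class TxnIdRegistry {
    private final AtomicLong nextTxnId = new AtomicLong();
    private final Map<String, Long> jobToTxn = new ConcurrentHashMap<>();

    // Called from a job event listener when an invocation of the deployed
    // spec starts: mint a new transaction id for this run.
    long jobStarted(String jobId) {
        long txnId = nextTxnId.incrementAndGet();
        jobToTxn.put(jobId, txnId);
        return txnId;
    }

    // Called by a commit runtime: the operator knows only its job id, not
    // any transaction id baked into the specification.
    long txnIdFor(String jobId) {
        return jobToTxn.get(jobId);
    }
}
```

The design choice this illustrates is the one Steven describes for the single-transaction case: the binding from job to transaction lives in the runtime listener, so the same specification can be re-run with a new transaction each time.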
