Keep in mind that one option looks up the map once per job while the other looks it up once per record.
Cheers,
Abdullah.

On Nov 17, 2017, at 2:23 PM, Xikui Wang wrote:

If I understand Abdullah's proposal correctly, for option 1 you can create a dataset-id to transaction-id map in the MultiTransactionJobletEventListener. When committing, the commit runtime can take the dataset-id, ask for the transaction-id, and commit the sub-transaction. Here we are making the assumption that a feed will not be connected to the same dataset twice in a feed job, which is a fair assumption in most cases.

But this doesn't really solve the problem, right? IMHO, if we all agree that there is a one-to-one mapping from transaction to Hyracks job, we should probably use a single transaction id for the combined job, i.e., option 2. As Murtadha suggested, we can now register multiple resources with the transaction context, thanks to the patch he merged yesterday (it took me some time to catch up on the transaction codebase, so sorry for joining late). I think this can offer us a nice and clean solution.

Best,
Xikui

On Fri, Nov 17, 2017 at 11:58 AM, Steven Jacobs wrote:

If that's true, then that solution seems best to me, but we had discussed this earlier and Xikui mentioned that it might not be true.
@Xikui?
Steven

On Fri, Nov 17, 2017 at 11:55 AM, abdullah alamoudi wrote:

Right now, they can't, so the datasetId can be safely used.

On Nov 17, 2017, at 11:51 AM, Steven Jacobs wrote:

For option 1, I think the dataset id is not a unique identifier. Couldn't multiple transactions in one job work on the same dataset?

Steven

On Fri, Nov 17, 2017 at 11:38 AM, abdullah alamoudi wrote:

So, there are three options to do this:
1. Each of these operators works on a specific dataset.
So we can pass the datasetId to the JobEventListenerFactory when requesting the transaction id.
2. We make one transaction work for multiple datasets by using a map from datasetId to the primary opTracker, and use it when the log flusher thread reports commits.
3. Prevent a job from having multiple transactions. (For the record, I dislike this option since the price we pay is very high, IMO.)

Cheers,
Abdullah.

On Nov 17, 2017, at 11:32 AM, Steven Jacobs wrote:

Well, we've solved the problem when there is only one transaction id per job. The operators can fetch the transaction ids from the JobEventListenerFactory (you can find this in master now). The issue is that when we try to combine multiple job specs into one feed job, the operators at runtime have no memory of which "job spec" they originally belonged to, which could tell them which one of the transaction ids they should use.

Steven

On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi wrote:

I think that this works, and it seems like the question is how the different operators in the job can get their transaction ids.

~Abdullah.

On Nov 17, 2017, at 11:21 AM, Steven Jacobs wrote:

From the conversation, it seems like nobody has the full picture to propose the design?
For deployed jobs, the idea is to use the same job specification but create a new Hyracks job and a new Asterix transaction for each execution.

Steven

On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi wrote:

I can e-meet anytime (moved to Sunnyvale).
We can also look at a proposed design and see if it can work.
Back to my question: how were you planning to change the transaction id if we set aside the case with multiple datasets (the feed job)?

On Nov 17, 2017, at 10:38 AM, Steven Jacobs wrote:

Maybe it would be good to have a meeting about this with all interested parties?

I can be on campus at UCI on Tuesday if that would be a good day to meet.

Steven

On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi wrote:

Also, I was wondering how you would do the same for a single dataset (non-feed). How would you get the transaction id and change it when you re-run?

On Nov 17, 2017, 7:12 AM, Murtadha Hubail wrote:

For atomic transactions, the change was merged yesterday. For entity-level transactions, it should be a very small change.

Cheers,
Murtadha

On Nov 17, 2017, at 6:07 PM, abdullah alamoudi wrote:

I understand that is not the case right now, but is it what you're working on?

Cheers,
Abdullah.

On Nov 17, 2017, at 7:04 AM, Murtadha Hubail wrote:

A transaction context can register multiple primary indexes.
Since each entity commit log contains the dataset id, you can decrement the active operations on the operation tracker associated with that dataset id.

On 17/11/2017, 5:52 PM, abdullah alamoudi wrote:

Can you illustrate how a deadlock can happen? I am anxious to know.
Moreover, the reason for the multiple transaction ids in feeds is not simply that we compile them differently.

How would a commit operator know which dataset's active-operation counter to decrement if they shared the same id, for example?

On Nov 16, 2017, at 9:46 PM, Xikui Wang wrote:

Yes, that deadlock could happen. Currently, we have one-to-one mappings between jobs and transactions, except for feeds.

@Abdullah, after some digging into the code, I think we can probably use a single transaction id for a job that feeds multiple datasets. See if I can convince you. :)

The reason we have multiple transaction ids in feeds is that we compile each connection job separately and combine them into a single feed job. A new transaction id is created and assigned to each connection job, so for the combined job we have to handle the different transactions as they are embedded in the connection job specifications. But what if we create a single transaction id for the combined job?
That transaction id will be embedded into each connection so they can write logs freely, but the transaction will be started and committed only once, since there is only one feed job. This way, we won't need the MultiTransactionJobletEventListener, and the transaction id can easily be removed from the job specification as well (for Steven's change).

Best,
Xikui

On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey wrote:

I worry about deadlocks. The waits-for graph may not understand that making t1 wait will also make t2 wait, since they may share a thread - right? Or do we have jobs and transactions separately represented there now?

On Nov 16, 2017, 3:10 PM, abdullah alamoudi wrote:

We are using multiple transactions in a single job in the case of feeds, and I think this is the correct way.
Having a single job for a feed that feeds into multiple datasets is a good thing, since job resources/feed resources are consolidated.

Here are some points:
- We can't use the same transaction id to feed multiple datasets. The only other option is to have multiple jobs, each feeding a different dataset.
- Having multiple jobs (in addition to the extra memory and CPU used) would then force us either to read data from external sources multiple times, parse records multiple times, etc., or to have synchronization between the different jobs and the feed source within AsterixDB. IMO, this is far more complicated than having multiple transactions within a single job, and the costs far outweigh the benefits.

P.S.: We are also using this for bucket connections in Couchbase Analytics.

On Nov 16, 2017, at 2:57 PM, Till Westmann wrote:

If there are a number of issues with supporting multiple transaction ids and no clear benefits/use cases, I'd vote for simplification :)
Also, code that's not being used has a tendency to "rot", so I think its usefulness might be limited by the time we found a use for this functionality.

My 2c,
Till

On 16 Nov 2017, at 13:57, Xikui Wang wrote:

I'm separating the connections into different jobs in some of my experiments... but that was intended for the experimental settings (i.e., not for master now)...
I think the interesting question here is whether we want to allow one Hyracks job to carry multiple transactions. I personally think that should be allowed, since a transaction and a job are two separate concepts, but I couldn't find use cases other than feeds. Does anyone have a good example of this?

Another question is, if we do allow multiple transactions in a single Hyracks job, how do we enable the commit runtime to obtain the correct transaction id without having it embedded as part of the job specification?

Best,
Xikui

On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi wrote:

I am curious as to how feeds will work without this?

~Abdullah.

On Nov 16, 2017, at 12:43 PM, Steven Jacobs wrote:

Hi all,
We currently have the MultiTransactionJobletEventListenerFactory, which allows one Hyracks job to run multiple Asterix transactions together.

This class is only used by feeds, and feeds are in the process of changing to no longer need this feature.
As part of the work on pre-deploying job specifications to be reused by multiple Hyracks jobs, I've been working on removing the transaction id from the job specifications, since we use a new transaction for each invocation of a deployed job.

There is currently no clear way to remove the transaction id from the job spec while keeping the option of the MultiTransactionJobletEventListenerFactory.

The question for the group is: do we see a need to maintain this class when it will no longer be used by any current code? Or, in other words, is there a strong possibility that in the future we will want multiple transactions to share a single Hyracks job, meaning that it is worth figuring out how to maintain this class?

Steven
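Option 1 from the thread, a datasetId-to-transaction-id map held by the joblet event listener and consulted by the commit runtime, could be sketched roughly as below. All class and method names here (`TxnRegistrySketch`, `CommitRuntimeSketch`) are hypothetical illustrations, not the actual AsterixDB/Hyracks API, and the sketch assumes (as the thread does) that a feed job never connects to the same dataset twice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the real AsterixDB API) of option 1: the joblet
// event listener keeps a datasetId -> transaction-id map, built once when the
// combined feed job starts. A feed job never connects to the same dataset
// twice, so the dataset id is a safe key.
class TxnRegistrySketch {
    private final Map<Integer, Long> datasetToTxnId = new HashMap<>();

    // Called once per connection job when the combined feed job is assembled.
    void register(int datasetId, long txnId) {
        datasetToTxnId.put(datasetId, txnId);
    }

    long getTxnId(int datasetId) {
        Long txnId = datasetToTxnId.get(datasetId);
        if (txnId == null) {
            throw new IllegalStateException("no transaction for dataset " + datasetId);
        }
        return txnId;
    }
}

// A commit operator that resolves its transaction id once in open(), so the
// map is consulted once per job rather than once per record (the trade-off
// Abdullah notes at the top of the thread).
class CommitRuntimeSketch {
    private final TxnRegistrySketch registry;
    private final int datasetId;
    private long cachedTxnId = -1;
    long recordsCommitted = 0;

    CommitRuntimeSketch(TxnRegistrySketch registry, int datasetId) {
        this.registry = registry;
        this.datasetId = datasetId;
    }

    void open() {
        cachedTxnId = registry.getTxnId(datasetId); // single lookup per job
    }

    void commitRecord() {
        recordsCommitted++; // uses cachedTxnId; no per-record map lookup
    }

    long txnId() {
        return cachedTxnId;
    }
}
```

Caching the id in `open()` is what makes the lookup once-per-job; a runtime that queried the registry inside `commitRecord()` would pay the map lookup once per record instead.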

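Murtadha's point, that one transaction context can register multiple primary indexes and that each entity commit log carries its dataset id, suggests the following rough shape for option 2 (a single transaction id for the whole combined feed job). The names (`OpTrackerSketch`, `SingleTxnContextSketch`) are again invented for illustration and do not match the real codebase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-dataset operation tracker (names invented for this
// sketch): counts in-flight operations against one primary index.
class OpTrackerSketch {
    private int activeOps = 0;
    void begin() { activeOps++; }
    void complete() { activeOps--; }
    int active() { return activeOps; }
}

// Option-2 sketch: one transaction context for the whole combined feed job.
// It registers an operation tracker per dataset; when the log flusher
// processes an entity commit log, the dataset id carried in the log record
// selects the tracker to decrement, so a single transaction id suffices.
class SingleTxnContextSketch {
    private final long txnId;
    private final Map<Integer, OpTrackerSketch> trackers = new HashMap<>();

    SingleTxnContextSketch(long txnId) { this.txnId = txnId; }

    void registerDataset(int datasetId) {
        trackers.put(datasetId, new OpTrackerSketch());
    }

    void onOperationStart(int datasetId) {
        trackers.get(datasetId).begin();
    }

    // Invoked by the log flusher thread once an entity commit log record is
    // durable; datasetIdFromLog is read from the record itself, which is what
    // answers Abdullah's question about which counter to decrement.
    void onEntityCommitFlushed(int datasetIdFromLog) {
        trackers.get(datasetIdFromLog).complete();
    }

    int activeOps(int datasetId) { return trackers.get(datasetId).active(); }
    long txnId() { return txnId; }
}
```

Under this shape, the transaction is begun and committed once for the feed job, and no per-connection transaction id ever needs to live in the job specification.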