So, there are three options to do this:

1. Each of these operators works on a specific dataset, so we can pass the datasetId to the JobEventListenerFactory when requesting the transaction id.
2. We make one transaction work for multiple datasets by keeping a map from datasetId to primary opTracker and using it when the log flusher thread reports commits.
3. Prevent a job from having multiple transactions. (For the record, I dislike this option, since the price we pay is very high IMO.)
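To make option 2 concrete, here is a rough sketch of a transaction context that maps each datasetId to its primary operation tracker, so the thread reporting commits can decrement the right dataset's active-operation count from the dataset id carried in each entity commit log record. The class and method names here are illustrative only, not the actual AsterixDB classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical operation tracker: counts active operations on one dataset's
// primary index.
class OperationTracker {
    private final AtomicInteger activeOperations = new AtomicInteger();
    void beginOperation() { activeOperations.incrementAndGet(); }
    void completeOperation() { activeOperations.decrementAndGet(); }
    int getActiveOperationCount() { return activeOperations.get(); }
}

// Hypothetical sketch of option 2: a single transaction serving multiple
// datasets by mapping datasetId -> primary op tracker.
class MultiDatasetTransactionContext {
    private final long txnId;
    private final Map<Integer, OperationTracker> trackers = new ConcurrentHashMap<>();

    MultiDatasetTransactionContext(long txnId) { this.txnId = txnId; }

    long getTxnId() { return txnId; }

    void register(int datasetId, OperationTracker tracker) {
        trackers.put(datasetId, tracker);
        tracker.beginOperation();
    }

    // Called when a commit for this transaction is observed; the entity
    // commit log record carries the datasetId, which selects the tracker
    // whose active-operation count should be decremented.
    void notifyCommit(int datasetId) {
        OperationTracker t = trackers.get(datasetId);
        if (t != null) {
            t.completeOperation();
        }
    }
}
```

The point of the sketch is that nothing forces one transaction per dataset: as long as the commit record identifies the dataset, one transaction id can fan out to the correct tracker.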
Cheers,
Abdullah.

> On Nov 17, 2017, at 11:32 AM, Steven Jacobs <[email protected]> wrote:
>
> Well, we've solved the problem when there is only one transaction id per
> job. The operators can fetch the transaction ids from the
> JobEventListenerFactory (you can find this in master now). The issue is
> that when we try to combine multiple job specs into one feed job, the
> operators at runtime have no memory of which "job spec" they originally
> belonged to, which could tell them which one of the transaction ids they
> should use.
>
> Steven

>> On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi <[email protected]> wrote:
>>
>> I think that this works, and it seems like the question is how the
>> different operators in the job can get their transaction ids.
>>
>> ~Abdullah.

>>> On Nov 17, 2017, at 11:21 AM, Steven Jacobs <[email protected]> wrote:
>>>
>>> From the conversation, it seems like nobody has the full picture to
>>> propose the design?
>>> For deployed jobs, the idea is to use the same job specification but
>>> create a new Hyracks job and Asterix transaction for each execution.
>>>
>>> Steven

>>>> On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi <[email protected]> wrote:
>>>>
>>>> I can e-meet anytime (I moved to Sunnyvale). We can also look at a
>>>> proposed design and see if it can work.
>>>> Back to my question: how were you planning to change the transaction
>>>> id if we forget about the case with multiple datasets (the feed job)?

>>>>> On Nov 17, 2017, at 10:38 AM, Steven Jacobs <[email protected]> wrote:
>>>>>
>>>>> Maybe it would be good to have a meeting about this with all
>>>>> interested parties?
>>>>>
>>>>> I can be on campus at UCI on Tuesday if that would be a good day to
>>>>> meet.
>>>>>
>>>>> Steven

>>>>>> On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>
>>>>>> Also, I was wondering how you would do the same for a single
>>>>>> dataset (non-feed). How would you get the transaction id and change
>>>>>> it when you re-run?

>>>>>>> On Nov 17, 2017 7:12 AM, "Murtadha Hubail" <[email protected]> wrote:
>>>>>>>
>>>>>>> For atomic transactions, the change was merged yesterday. For
>>>>>>> entity-level transactions, it should be a very small change.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Murtadha

>>>>>>>> On Nov 17, 2017, at 6:07 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I understand that is not the case right now, but is that what
>>>>>>>> you're working on?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Abdullah.

>>>>>>>>> On Nov 17, 2017, at 7:04 AM, Murtadha Hubail <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> A transaction context can register multiple primary indexes.
>>>>>>>>> Since each entity commit log contains the dataset id, you can
>>>>>>>>> decrement the active operations on the operation tracker
>>>>>>>>> associated with that dataset id.

>>>>>>>>>> On 17/11/2017, 5:52 PM, "abdullah alamoudi" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Can you illustrate how a deadlock can happen? I am anxious to
>>>>>>>>>> know. Moreover, the reason for the multiple transaction ids in
>>>>>>>>>> feeds is not simply that we compile them differently.
>>>>>>>>>>
>>>>>>>>>> How would a commit operator know which dataset's active
>>>>>>>>>> operation counter to decrement if they shared the same id, for
>>>>>>>>>> example?

>>>>>>>>>>> On Nov 16, 2017, at 9:46 PM, Xikui Wang <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes. That deadlock could happen. Currently, we have one-to-one
>>>>>>>>>>> mappings between jobs and transactions, except for feeds.
>>>>>>>>>>>
>>>>>>>>>>> @Abdullah, after some digging into the code, I think we can
>>>>>>>>>>> probably use a single transaction id for a job that feeds
>>>>>>>>>>> multiple datasets. See if I can convince you. :)
>>>>>>>>>>>
>>>>>>>>>>> The reason we have multiple transaction ids in feeds is that
>>>>>>>>>>> we compile each connection job separately and combine them
>>>>>>>>>>> into a single feed job. A new transaction id is created and
>>>>>>>>>>> assigned to each connection job; thus, for the combined job,
>>>>>>>>>>> we have to handle the different transactions as they are
>>>>>>>>>>> embedded in the connection job specifications. But what if we
>>>>>>>>>>> created a single transaction id for the combined job? That
>>>>>>>>>>> transaction id would be embedded into each connection so they
>>>>>>>>>>> could write logs freely, but the transaction would be started
>>>>>>>>>>> and committed only once, as there is only one feed job. This
>>>>>>>>>>> way, we wouldn't need MultiTransactionJobletEventListener, and
>>>>>>>>>>> the transaction id could easily be removed from the job
>>>>>>>>>>> specification as well (for Steven's change).
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Xikui

>>>>>>>>>>>> On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I worry about deadlocks. The waits-for graph may not
>>>>>>>>>>>> understand that making t1 wait will also make t2 wait, since
>>>>>>>>>>>> they may share a thread, right? Or do we have jobs and
>>>>>>>>>>>> transactions separately represented there now?

>>>>>>>>>>>>> On Nov 16, 2017 3:10 PM, "abdullah alamoudi" <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are using multiple transactions in a single job in the
>>>>>>>>>>>>> case of feeds, and I think that this is the correct way.
>>>>>>>>>>>>> Having a single job for a feed that feeds into multiple
>>>>>>>>>>>>> datasets is a good thing, since job resources/feed resources
>>>>>>>>>>>>> are consolidated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are some points:
>>>>>>>>>>>>> - We can't use the same transaction id to feed multiple
>>>>>>>>>>>>> datasets. The only other option is to have multiple jobs,
>>>>>>>>>>>>> each feeding a different dataset.
>>>>>>>>>>>>> - Having multiple jobs (in addition to the extra memory and
>>>>>>>>>>>>> CPU used) would then force us either to read data from
>>>>>>>>>>>>> external sources multiple times, parse records multiple
>>>>>>>>>>>>> times, etc., or to synchronize the different jobs and the
>>>>>>>>>>>>> feed source within AsterixDB. IMO, this is far more
>>>>>>>>>>>>> complicated than having multiple transactions within a
>>>>>>>>>>>>> single job, and the costs far outweigh the benefits.
>>>>>>>>>>>>>
>>>>>>>>>>>>> P.S.
>>>>>>>>>>>>> We are also using this for bucket connections in Couchbase
>>>>>>>>>>>>> Analytics.

>>>>>>>>>>>>>> On Nov 16, 2017, at 2:57 PM, Till Westmann <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If there are a number of issues with supporting multiple
>>>>>>>>>>>>>> transaction ids and no clear benefits/use cases, I'd vote
>>>>>>>>>>>>>> for simplification :)
>>>>>>>>>>>>>> Also, code that's not being used has a tendency to "rot",
>>>>>>>>>>>>>> so I think that its usefulness might be limited by the time
>>>>>>>>>>>>>> we'd find a use for this functionality.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My 2c,
>>>>>>>>>>>>>> Till

>>>>>>>>>>>>>>> On 16 Nov 2017, at 13:57, Xikui Wang wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm separating the connections into different jobs in some
>>>>>>>>>>>>>>> of my experiments... but that was intended for the
>>>>>>>>>>>>>>> experimental settings (i.e., not for master now)...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think the interesting question here is whether we want
>>>>>>>>>>>>>>> to allow one Hyracks job to carry multiple transactions. I
>>>>>>>>>>>>>>> personally think that should be allowed, as the
>>>>>>>>>>>>>>> transaction and the job are two separate concepts, but I
>>>>>>>>>>>>>>> couldn't find any such use case other than feeds. Does
>>>>>>>>>>>>>>> anyone have a good example of this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Another question is, if we do allow multiple transactions
>>>>>>>>>>>>>>> in a single Hyracks job, how do we enable the commit
>>>>>>>>>>>>>>> runtime to obtain the correct TXN id without having it
>>>>>>>>>>>>>>> embedded as part of the job specification?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Xikui

>>>>>>>>>>>>>>>> On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am curious as to how feeds will work without this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ~Abdullah.

>>>>>>>>>>>>>>>>> On Nov 16, 2017, at 12:43 PM, Steven Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>> We currently have MultiTransactionJobletEventListenerFactory,
>>>>>>>>>>>>>>>>> which allows one Hyracks job to run multiple Asterix
>>>>>>>>>>>>>>>>> transactions together.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This class is only used by feeds, and feeds are in the
>>>>>>>>>>>>>>>>> process of changing to no longer need this feature. As
>>>>>>>>>>>>>>>>> part of the work on pre-deploying job specifications to
>>>>>>>>>>>>>>>>> be used by multiple Hyracks jobs, I've been working on
>>>>>>>>>>>>>>>>> removing the transaction id from the job specifications,
>>>>>>>>>>>>>>>>> as we use a new transaction for each invocation of a
>>>>>>>>>>>>>>>>> deployed job.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is currently no clear way to remove the
>>>>>>>>>>>>>>>>> transaction id from the job spec and keep the option for
>>>>>>>>>>>>>>>>> MultiTransactionJobletEventListenerFactory.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The question for the group is: do we see a need to
>>>>>>>>>>>>>>>>> maintain this class, which will no longer be used by any
>>>>>>>>>>>>>>>>> current code? Or, in other words, is there a strong
>>>>>>>>>>>>>>>>> possibility that in the future we will want multiple
>>>>>>>>>>>>>>>>> transactions to share a single Hyracks job, meaning that
>>>>>>>>>>>>>>>>> it is worth figuring out how to maintain this class?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Steven
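[Editor's note: the single-transaction-id idea discussed in the thread could look roughly like the following sketch. The names are hypothetical, not the actual AsterixDB/Hyracks API: the combined feed job's event listener factory holds one transaction id that every operator fetches at runtime, and a deployed job swaps in a fresh id per invocation rather than reading one baked into the job specification.]

```java
import java.io.Serializable;

// Hypothetical sketch: one transaction id per Hyracks job, held by the
// job's event listener factory instead of being embedded per connection
// in the job specification.
class JobEventListenerFactory implements Serializable {
    private final long txnId;

    JobEventListenerFactory(long txnId) { this.txnId = txnId; }

    // The single id shared by all operators (including commit runtimes)
    // in this job; fetched at runtime, not compiled into the spec.
    long getTxnId() { return txnId; }

    // For deployed jobs: reuse the same job specification, but create a
    // fresh factory carrying a new transaction id for each invocation.
    JobEventListenerFactory newInstance(long newTxnId) {
        return new JobEventListenerFactory(newTxnId);
    }
}
```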
