So, there are three options to do this:

1. Each of these operators works on a specific dataset, so we can pass the datasetId to the JobEventListenerFactory when requesting the transaction id.
2. We make one transaction work for multiple datasets by keeping a map from datasetId to primary opTracker and using it when the log flusher thread reports commits.
3. Prevent a job from having multiple transactions. (For the record, I dislike this option, since the price we pay is very high IMO.)
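To make option 2 concrete, here is a rough sketch of a transaction context that maps each datasetId to its primary operation tracker, so the thread reporting commits can decrement the right dataset's active-operation count from the dataset id carried in each entity commit log record. The class and method names here are illustrative only, not the actual AsterixDB classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical operation tracker: counts active operations on one dataset's
// primary index.
class OperationTracker {
    private final AtomicInteger activeOperations = new AtomicInteger();
    void beginOperation() { activeOperations.incrementAndGet(); }
    void completeOperation() { activeOperations.decrementAndGet(); }
    int getActiveOperationCount() { return activeOperations.get(); }
}

// Hypothetical sketch of option 2: a single transaction serving multiple
// datasets by mapping datasetId -> primary op tracker.
class MultiDatasetTransactionContext {
    private final long txnId;
    private final Map<Integer, OperationTracker> trackers = new ConcurrentHashMap<>();

    MultiDatasetTransactionContext(long txnId) { this.txnId = txnId; }

    long getTxnId() { return txnId; }

    void register(int datasetId, OperationTracker tracker) {
        trackers.put(datasetId, tracker);
        tracker.beginOperation();
    }

    // Called when a commit for this transaction is observed; the entity
    // commit log record carries the datasetId, which selects the tracker
    // whose active-operation count should be decremented.
    void notifyCommit(int datasetId) {
        OperationTracker t = trackers.get(datasetId);
        if (t != null) {
            t.completeOperation();
        }
    }
}
```

The point of the sketch is that nothing forces one transaction per dataset: as long as the commit record identifies the dataset, one transaction id can fan out to the correct tracker.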
Cheers,
Abdullah.

> On Nov 17, 2017, at 11:32 AM, Steven Jacobs <[email protected]> wrote:
>
> Well, we've solved the problem when there is only one transaction id per
> job. The operators can fetch the transaction ids from the
> JobEventListenerFactory (you can find this in master now). The issue is
> that when we try to combine multiple job specs into one feed job, the
> operators at runtime have no memory of which "job spec" they originally
> belonged to, which could tell them which one of the transaction ids they
> should use.
>
> Steven

>> On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi <[email protected]> wrote:
>>
>> I think that this works, and it seems like the question is how the
>> different operators in the job can get their transaction ids.
>>
>> ~Abdullah.

>>> On Nov 17, 2017, at 11:21 AM, Steven Jacobs <[email protected]> wrote:
>>>
>>> From the conversation, it seems like nobody has the full picture to
>>> propose the design?
>>> For deployed jobs, the idea is to use the same job specification but
>>> create a new Hyracks job and Asterix transaction for each execution.
>>>
>>> Steven

>>>> On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi <[email protected]> wrote:
>>>>
>>>> I can e-meet anytime (I moved to Sunnyvale). We can also look at a
>>>> proposed design and see if it can work.
>>>> Back to my question: how were you planning to change the transaction
>>>> id if we forget about the case with multiple datasets (the feed job)?

>>>>> On Nov 17, 2017, at 10:38 AM, Steven Jacobs <[email protected]> wrote:
>>>>>
>>>>> Maybe it would be good to have a meeting about this with all
>>>>> interested parties?
>>>>>
>>>>> I can be on campus at UCI on Tuesday if that would be a good day to
>>>>> meet.
>>>>>
>>>>> Steven

>>>>>> On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi <[email protected]> wrote:
>>>>>>
>>>>>> Also, I was wondering how you would do the same for a single
>>>>>> dataset (non-feed). How would you get the transaction id and change
>>>>>> it when you re-run?

>>>>>>> On Nov 17, 2017 7:12 AM, "Murtadha Hubail" <[email protected]> wrote:
>>>>>>>
>>>>>>> For atomic transactions, the change was merged yesterday. For
>>>>>>> entity-level transactions, it should be a very small change.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Murtadha

>>>>>>>> On Nov 17, 2017, at 6:07 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I understand that is not the case right now, but is that what
>>>>>>>> you're working on?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Abdullah.

>>>>>>>>> On Nov 17, 2017, at 7:04 AM, Murtadha Hubail <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> A transaction context can register multiple primary indexes.
>>>>>>>>> Since each entity commit log contains the dataset id, you can
>>>>>>>>> decrement the active operations on the operation tracker
>>>>>>>>> associated with that dataset id.

>>>>>>>>>> On 17/11/2017, 5:52 PM, "abdullah alamoudi" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Can you illustrate how a deadlock can happen? I am anxious to
>>>>>>>>>> know. Moreover, the reason for the multiple transaction ids in
>>>>>>>>>> feeds is not simply that we compile them differently.
>>>>>>>>>>
>>>>>>>>>> How would a commit operator know which dataset's active
>>>>>>>>>> operation counter to decrement if they shared the same id, for
>>>>>>>>>> example?

>>>>>>>>>>> On Nov 16, 2017, at 9:46 PM, Xikui Wang <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes. That deadlock could happen. Currently, we have one-to-one
>>>>>>>>>>> mappings between jobs and transactions, except for feeds.
>>>>>>>>>>>
>>>>>>>>>>> @Abdullah, after some digging into the code, I think we can
>>>>>>>>>>> probably use a single transaction id for a job that feeds
>>>>>>>>>>> multiple datasets. See if I can convince you. :)
>>>>>>>>>>>
>>>>>>>>>>> The reason we have multiple transaction ids in feeds is that
>>>>>>>>>>> we compile each connection job separately and combine them
>>>>>>>>>>> into a single feed job. A new transaction id is created and
>>>>>>>>>>> assigned to each connection job; thus, for the combined job,
>>>>>>>>>>> we have to handle the different transactions as they are
>>>>>>>>>>> embedded in the connection job specifications. But what if we
>>>>>>>>>>> created a single transaction id for the combined job? That
>>>>>>>>>>> transaction id would be embedded into each connection so they
>>>>>>>>>>> could write logs freely, but the transaction would be started
>>>>>>>>>>> and committed only once, as there is only one feed job. This
>>>>>>>>>>> way, we wouldn't need MultiTransactionJobletEventListener, and
>>>>>>>>>>> the transaction id could easily be removed from the job
>>>>>>>>>>> specification as well (for Steven's change).
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Xikui

>>>>>>>>>>>> On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I worry about deadlocks. The waits-for graph may not
>>>>>>>>>>>> understand that making t1 wait will also make t2 wait, since
>>>>>>>>>>>> they may share a thread, right? Or do we have jobs and
>>>>>>>>>>>> transactions separately represented there now?

>>>>>>>>>>>>> On Nov 16, 2017 3:10 PM, "abdullah alamoudi" <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are using multiple transactions in a single job in the
>>>>>>>>>>>>> case of feeds, and I think that this is the correct way.
>>>>>>>>>>>>> Having a single job for a feed that feeds into multiple
>>>>>>>>>>>>> datasets is a good thing, since job resources/feed resources
>>>>>>>>>>>>> are consolidated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are some points:
>>>>>>>>>>>>> - We can't use the same transaction id to feed multiple
>>>>>>>>>>>>> datasets. The only other option is to have multiple jobs,
>>>>>>>>>>>>> each feeding a different dataset.
>>>>>>>>>>>>> - Having multiple jobs (in addition to the extra memory and
>>>>>>>>>>>>> CPU used) would then force us either to read data from
>>>>>>>>>>>>> external sources multiple times, parse records multiple
>>>>>>>>>>>>> times, etc., or to synchronize the different jobs and the
>>>>>>>>>>>>> feed source within AsterixDB. IMO, this is far more
>>>>>>>>>>>>> complicated than having multiple transactions within a
>>>>>>>>>>>>> single job, and the costs far outweigh the benefits.
>>>>>>>>>>>>>
>>>>>>>>>>>>> P.S.
>>>>>>>>>>>>> We are also using this for bucket connections in Couchbase
>>>>>>>>>>>>> Analytics.

>>>>>>>>>>>>>> On Nov 16, 2017, at 2:57 PM, Till Westmann <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If there are a number of issues with supporting multiple
>>>>>>>>>>>>>> transaction ids and no clear benefits/use cases, I'd vote
>>>>>>>>>>>>>> for simplification :)
>>>>>>>>>>>>>> Also, code that's not being used has a tendency to "rot",
>>>>>>>>>>>>>> so I think that its usefulness might be limited by the time
>>>>>>>>>>>>>> we'd find a use for this functionality.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My 2c,
>>>>>>>>>>>>>> Till

>>>>>>>>>>>>>>> On 16 Nov 2017, at 13:57, Xikui Wang wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm separating the connections into different jobs in some
>>>>>>>>>>>>>>> of my experiments... but that was intended for the
>>>>>>>>>>>>>>> experimental settings (i.e., not for master now)...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think the interesting question here is whether we want
>>>>>>>>>>>>>>> to allow one Hyracks job to carry multiple transactions. I
>>>>>>>>>>>>>>> personally think that should be allowed, as the
>>>>>>>>>>>>>>> transaction and the job are two separate concepts, but I
>>>>>>>>>>>>>>> couldn't find any such use case other than feeds. Does
>>>>>>>>>>>>>>> anyone have a good example of this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Another question is, if we do allow multiple transactions
>>>>>>>>>>>>>>> in a single Hyracks job, how do we enable the commit
>>>>>>>>>>>>>>> runtime to obtain the correct TXN id without having it
>>>>>>>>>>>>>>> embedded as part of the job specification?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Xikui

>>>>>>>>>>>>>>>> On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am curious as to how feeds will work without this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ~Abdullah.

>>>>>>>>>>>>>>>>> On Nov 16, 2017, at 12:43 PM, Steven Jacobs <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>> We currently have MultiTransactionJobletEventListenerFactory,
>>>>>>>>>>>>>>>>> which allows one Hyracks job to run multiple Asterix
>>>>>>>>>>>>>>>>> transactions together.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This class is only used by feeds, and feeds are in the
>>>>>>>>>>>>>>>>> process of changing to no longer need this feature. As
>>>>>>>>>>>>>>>>> part of the work on pre-deploying job specifications to
>>>>>>>>>>>>>>>>> be used by multiple Hyracks jobs, I've been working on
>>>>>>>>>>>>>>>>> removing the transaction id from the job specifications,
>>>>>>>>>>>>>>>>> as we use a new transaction for each invocation of a
>>>>>>>>>>>>>>>>> deployed job.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is currently no clear way to remove the
>>>>>>>>>>>>>>>>> transaction id from the job spec and keep the option for
>>>>>>>>>>>>>>>>> MultiTransactionJobletEventListenerFactory.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The question for the group is: do we see a need to
>>>>>>>>>>>>>>>>> maintain this class, which will no longer be used by any
>>>>>>>>>>>>>>>>> current code? Or, in other words, is there a strong
>>>>>>>>>>>>>>>>> possibility that in the future we will want multiple
>>>>>>>>>>>>>>>>> transactions to share a single Hyracks job, meaning that
>>>>>>>>>>>>>>>>> it is worth figuring out how to maintain this class?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Steven
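[Editor's note: the single-transaction-id idea discussed in the thread could look roughly like the following sketch. The names are hypothetical, not the actual AsterixDB/Hyracks API: the combined feed job's event listener factory holds one transaction id that every operator fetches at runtime, and a deployed job swaps in a fresh id per invocation rather than reading one baked into the job specification.]

```java
import java.io.Serializable;

// Hypothetical sketch: one transaction id per Hyracks job, held by the
// job's event listener factory instead of being embedded per connection
// in the job specification.
class JobEventListenerFactory implements Serializable {
    private final long txnId;

    JobEventListenerFactory(long txnId) { this.txnId = txnId; }

    // The single id shared by all operators (including commit runtimes)
    // in this job; fetched at runtime, not compiled into the spec.
    long getTxnId() { return txnId; }

    // For deployed jobs: reuse the same job specification, but create a
    // fresh factory carrying a new transaction id for each invocation.
    JobEventListenerFactory newInstance(long newTxnId) {
        return new JobEventListenerFactory(newTxnId);
    }
}
```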
