Keep in mind that one option looks up the map once per job while the other looks it up once per record.
Cheers,
Abdullah.

On Nov 17, 2017, at 2:23 PM, Xikui Wang wrote:

If I understand Abdullah's proposal correctly, for option 1 you can create a dataset-id to transaction-id map in the MultiTransactionJobletEventListener. When committing, the commit runtime can take the dataset-id, ask for the transaction-id, and commit the sub-transaction. Here we are making the assumption that a feed will not be connected to the same dataset twice in a feed job, which is a fair assumption in most cases.

But this doesn't really solve the problem, right? IMHO, if we all agree that there is a one-to-one mapping from transaction to Hyracks job, we should probably use a single transaction id for the combined job, i.e., option 2. As Murtadha suggested, we can now register multiple resources with the transaction context, thanks to the patch he merged yesterday (it took me some time to catch up on the transaction codebase, so sorry for joining late). I think this can offer us a nice and clean solution.

Best,
Xikui

On Fri, Nov 17, 2017 at 11:58 AM, Steven Jacobs wrote:

If that's true, then that solution seems best to me, but we had discussed this earlier and Xikui mentioned that it might not be true.
@Xikui?
Steven

On Fri, Nov 17, 2017 at 11:55 AM, abdullah alamoudi wrote:

Right now, they can't, so the datasetId can be safely used.

On Nov 17, 2017, at 11:51 AM, Steven Jacobs wrote:

For option 1, I think the dataset id is not a unique identifier. Couldn't multiple transactions in one job work on the same dataset?

Steven

On Fri, Nov 17, 2017 at 11:38 AM, abdullah alamoudi wrote:

So, there are three options to do this:
1. Each of these operators works on a specific dataset.
So we can pass the datasetId to the JobEventListenerFactory when requesting the transaction id.
2. We make one transaction work for multiple datasets by using a map from datasetId to the primary opTracker, and use it when the log flusher thread reports commits.
3. Prevent a job from having multiple transactions. (For the record, I dislike this option since the price we pay is very high, IMO.)

Cheers,
Abdullah.

On Nov 17, 2017, at 11:32 AM, Steven Jacobs wrote:

Well, we've solved the problem when there is only one transaction id per job. The operators can fetch the transaction ids from the JobEventListenerFactory (you can find this in master now). The issue is that when we try to combine multiple job specs into one feed job, the operators at runtime have no memory of which "job spec" they originally belonged to, which could tell them which one of the transaction ids they should use.

Steven

On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi wrote:

I think that this works, and it seems like the question is how the different operators in the job can get their transaction ids.

~Abdullah.

On Nov 17, 2017, at 11:21 AM, Steven Jacobs wrote:

From the conversation, it seems like nobody has the full picture to propose the design?
For deployed jobs, the idea is to use the same job specification but create a new Hyracks job and a new Asterix transaction for each execution.

Steven

On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi wrote:

I can e-meet anytime (moved to Sunnyvale).
We can also look at a proposed design and see if it can work.
Back to my question: how were you planning to change the transaction id if we set aside the case with multiple datasets (the feed job)?

On Nov 17, 2017, at 10:38 AM, Steven Jacobs wrote:

Maybe it would be good to have a meeting about this with all interested parties?

I can be on campus at UCI on Tuesday if that would be a good day to meet.

Steven

On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi wrote:

Also, I was wondering how you would do the same for a single dataset (non-feed). How would you get the transaction id and change it when you re-run?

On Nov 17, 2017, 7:12 AM, Murtadha Hubail wrote:

For atomic transactions, the change was merged yesterday. For entity-level transactions, it should be a very small change.

Cheers,
Murtadha

On Nov 17, 2017, at 6:07 PM, abdullah alamoudi wrote:

I understand that is not the case right now, but is it what you're working on?

Cheers,
Abdullah.

On Nov 17, 2017, at 7:04 AM, Murtadha Hubail wrote:

A transaction context can register multiple primary indexes.
Since each entity commit log contains the dataset id, you can decrement the active operations on the operation tracker associated with that dataset id.

On 17/11/2017, 5:52 PM, abdullah alamoudi wrote:

Can you illustrate how a deadlock can happen? I am anxious to know.
Moreover, the reason for the multiple transaction ids in feeds is not simply that we compile them differently.

How would a commit operator know which dataset's active-operation counter to decrement if they shared the same id, for example?

On Nov 16, 2017, at 9:46 PM, Xikui Wang wrote:

Yes, that deadlock could happen. Currently, we have one-to-one mappings between jobs and transactions, except for feeds.

@Abdullah, after some digging into the code, I think we can probably use a single transaction id for a job that feeds multiple datasets. See if I can convince you. :)

The reason we have multiple transaction ids in feeds is that we compile each connection job separately and combine them into a single feed job. A new transaction id is created and assigned to each connection job, so for the combined job we have to handle the different transactions as they are embedded in the connection job specifications. But what if we create a single transaction id for the combined job?
That transaction id will be embedded into each connection so they can write logs freely, but the transaction will be started and committed only once, since there is only one feed job. This way, we won't need the MultiTransactionJobletEventListener, and the transaction id can easily be removed from the job specification as well (for Steven's change).

Best,
Xikui

On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey wrote:

I worry about deadlocks. The waits-for graph may not understand that making t1 wait will also make t2 wait, since they may share a thread - right? Or do we have jobs and transactions separately represented there now?

On Nov 16, 2017, 3:10 PM, abdullah alamoudi wrote:

We are using multiple transactions in a single job in the case of feeds, and I think this is the correct way.
Having a single job for a feed that feeds into multiple datasets is a good thing, since job resources/feed resources are consolidated.

Here are some points:
- We can't use the same transaction id to feed multiple datasets. The only other option is to have multiple jobs, each feeding a different dataset.
- Having multiple jobs (in addition to the extra memory and CPU used) would then force us either to read data from external sources multiple times, parse records multiple times, etc., or to have synchronization between the different jobs and the feed source within AsterixDB. IMO, this is far more complicated than having multiple transactions within a single job, and the costs far outweigh the benefits.

P.S.: We are also using this for bucket connections in Couchbase Analytics.

On Nov 16, 2017, at 2:57 PM, Till Westmann wrote:

If there are a number of issues with supporting multiple transaction ids and no clear benefits/use cases, I'd vote for simplification :)
Also, code that's not being used has a tendency to "rot", so I think its usefulness might be limited by the time we found a use for this functionality.

My 2c,
Till

On 16 Nov 2017, at 13:57, Xikui Wang wrote:

I'm separating the connections into different jobs in some of my experiments... but that was intended for the experimental settings (i.e., not for master now)...
I think the interesting question here is whether we want to allow one Hyracks job to carry multiple transactions. I personally think that should be allowed, since a transaction and a job are two separate concepts, but I couldn't find use cases other than feeds. Does anyone have a good example of this?

Another question is, if we do allow multiple transactions in a single Hyracks job, how do we enable the commit runtime to obtain the correct transaction id without having it embedded as part of the job specification?

Best,
Xikui

On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi wrote:

I am curious as to how feeds will work without this?

~Abdullah.

On Nov 16, 2017, at 12:43 PM, Steven Jacobs wrote:

Hi all,
We currently have the MultiTransactionJobletEventListenerFactory, which allows one Hyracks job to run multiple Asterix transactions together.

This class is only used by feeds, and feeds are in the process of changing to no longer need this feature.
As part of the work on pre-deploying job specifications to be reused by multiple Hyracks jobs, I've been working on removing the transaction id from the job specifications, since we use a new transaction for each invocation of a deployed job.

There is currently no clear way to remove the transaction id from the job spec while keeping the option of the MultiTransactionJobletEventListenerFactory.

The question for the group is: do we see a need to maintain this class when it will no longer be used by any current code? Or, in other words, is there a strong possibility that in the future we will want multiple transactions to share a single Hyracks job, meaning that it is worth figuring out how to maintain this class?

Steven
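Option 1 from the thread, a datasetId-to-transaction-id map held by the joblet event listener and consulted by the commit runtime, could be sketched roughly as below. All class and method names here (`TxnRegistrySketch`, `CommitRuntimeSketch`) are hypothetical illustrations, not the actual AsterixDB/Hyracks API, and the sketch assumes (as the thread does) that a feed job never connects to the same dataset twice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the real AsterixDB API) of option 1: the joblet
// event listener keeps a datasetId -> transaction-id map, built once when the
// combined feed job starts. A feed job never connects to the same dataset
// twice, so the dataset id is a safe key.
class TxnRegistrySketch {
    private final Map<Integer, Long> datasetToTxnId = new HashMap<>();

    // Called once per connection job when the combined feed job is assembled.
    void register(int datasetId, long txnId) {
        datasetToTxnId.put(datasetId, txnId);
    }

    long getTxnId(int datasetId) {
        Long txnId = datasetToTxnId.get(datasetId);
        if (txnId == null) {
            throw new IllegalStateException("no transaction for dataset " + datasetId);
        }
        return txnId;
    }
}

// A commit operator that resolves its transaction id once in open(), so the
// map is consulted once per job rather than once per record (the trade-off
// Abdullah notes at the top of the thread).
class CommitRuntimeSketch {
    private final TxnRegistrySketch registry;
    private final int datasetId;
    private long cachedTxnId = -1;
    long recordsCommitted = 0;

    CommitRuntimeSketch(TxnRegistrySketch registry, int datasetId) {
        this.registry = registry;
        this.datasetId = datasetId;
    }

    void open() {
        cachedTxnId = registry.getTxnId(datasetId); // single lookup per job
    }

    void commitRecord() {
        recordsCommitted++; // uses cachedTxnId; no per-record map lookup
    }

    long txnId() {
        return cachedTxnId;
    }
}
```

Caching the id in `open()` is what makes the lookup once-per-job; a runtime that queried the registry inside `commitRecord()` would pay the map lookup once per record instead.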

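Murtadha's point, that one transaction context can register multiple primary indexes and that each entity commit log carries its dataset id, suggests the following rough shape for option 2 (a single transaction id for the whole combined feed job). The names (`OpTrackerSketch`, `SingleTxnContextSketch`) are again invented for illustration and do not match the real codebase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-dataset operation tracker (names invented for this
// sketch): counts in-flight operations against one primary index.
class OpTrackerSketch {
    private int activeOps = 0;
    void begin() { activeOps++; }
    void complete() { activeOps--; }
    int active() { return activeOps; }
}

// Option-2 sketch: one transaction context for the whole combined feed job.
// It registers an operation tracker per dataset; when the log flusher
// processes an entity commit log, the dataset id carried in the log record
// selects the tracker to decrement, so a single transaction id suffices.
class SingleTxnContextSketch {
    private final long txnId;
    private final Map<Integer, OpTrackerSketch> trackers = new HashMap<>();

    SingleTxnContextSketch(long txnId) { this.txnId = txnId; }

    void registerDataset(int datasetId) {
        trackers.put(datasetId, new OpTrackerSketch());
    }

    void onOperationStart(int datasetId) {
        trackers.get(datasetId).begin();
    }

    // Invoked by the log flusher thread once an entity commit log record is
    // durable; datasetIdFromLog is read from the record itself, which is what
    // answers Abdullah's question about which counter to decrement.
    void onEntityCommitFlushed(int datasetIdFromLog) {
        trackers.get(datasetIdFromLog).complete();
    }

    int activeOps(int datasetId) { return trackers.get(datasetId).active(); }
    long txnId() { return txnId; }
}
```

Under this shape, the transaction is begun and committed once for the feed job, and no per-connection transaction id ever needs to live in the job specification.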