If I understand Abdullah's proposal correctly, for option 1, you can create
a dataset-id to transaction-id map in the
MultiTransactionJobletEventListener. When committing, the commit runtime
can use its dataset id to look up the transaction id and commit the
sub-transaction. This assumes that a feed will not be connected to the
same dataset twice within a feed job, which is a fair assumption in most
cases.
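A minimal sketch of what that map could look like (the class and method names here are illustrative assumptions, not the actual AsterixDB listener API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the joblet event listener keeps a dataset-id ->
// transaction-id map, and the commit runtime uses its dataset id to look
// up the sub-transaction it should commit.
class DatasetTxnRegistry {

    private final Map<Integer, Long> datasetToTxn = new HashMap<>();

    // Called once per connection when the combined feed job starts.
    // Relies on the assumption that a feed is never connected to the same
    // dataset twice in one job, so each dataset id maps to one txn id.
    void registerTransaction(int datasetId, long txnId) {
        if (datasetToTxn.putIfAbsent(datasetId, txnId) != null) {
            throw new IllegalStateException(
                    "dataset " + datasetId + " already mapped to a transaction");
        }
    }

    // Called by the commit runtime at commit time.
    long getTxnId(int datasetId) {
        Long txnId = datasetToTxn.get(datasetId);
        if (txnId == null) {
            throw new IllegalStateException(
                    "no transaction registered for dataset " + datasetId);
        }
        return txnId;
    }
}
```

The uniqueness check is exactly where the "no dataset connected twice" assumption shows up: if it ever breaks, registration fails loudly instead of silently overwriting a transaction id.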

But this doesn't really solve the problem, right? IMHO, if we all agree
that there is a one-to-one mapping from transaction to Hyracks job, we
should probably use a single transaction id for the combined job, i.e.,
option 2... As Murtadha suggested, with the patch he merged yesterday we
can now register multiple resources with the transaction context (it took
me some time to catch up on the transaction codebase, so sorry for joining
late). I think this can offer us a nice and clean solution.
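As a rough sketch of what option 2 could look like (all names here are illustrative assumptions, not the actual transaction codebase): a single transaction context for the combined job registers one operation tracker per dataset, and the log flusher uses the dataset id carried in each entity commit log record to pick the tracker to decrement:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of option 2: one transaction id for the whole feed
// job, with per-dataset operation trackers registered on a single context.
class SingleTxnContext {

    private final long txnId;
    // datasetId -> count of active operations on that dataset's op tracker.
    private final Map<Integer, Integer> activeOps = new HashMap<>();

    SingleTxnContext(long txnId) {
        this.txnId = txnId;
    }

    long getTxnId() {
        return txnId;
    }

    // Each connection registers its target dataset when the job starts.
    void registerDataset(int datasetId) {
        activeOps.putIfAbsent(datasetId, 0);
    }

    void beginOperation(int datasetId) {
        activeOps.merge(datasetId, 1, Integer::sum);
    }

    // The log flusher reads the dataset id from the entity commit log
    // record and decrements the matching tracker, so all connections can
    // share one transaction id and still track per-dataset operations.
    void completeOperation(int datasetId) {
        activeOps.merge(datasetId, -1, Integer::sum);
    }

    int activeOperations(int datasetId) {
        return activeOps.getOrDefault(datasetId, 0);
    }
}
```

This is only a sketch of the bookkeeping; the point is that because the dataset id already travels with each entity commit log record, one transaction id suffices to route commits to the right operation tracker.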

Best,
Xikui

On Fri, Nov 17, 2017 at 11:58 AM, Steven Jacobs <[email protected]> wrote:

> If that's true, then that solution seems best to me, but we had discussed
> this earlier and Xikui mentioned that that might not be true.
> @Xikui?
> Steven
>
> On Fri, Nov 17, 2017 at 11:55 AM, abdullah alamoudi <[email protected]>
> wrote:
>
> > Right now, they can't, so datasetId can be safely used.
> > > On Nov 17, 2017, at 11:51 AM, Steven Jacobs <[email protected]> wrote:
> > >
> > > For option 1, I think the dataset id is not a unique identifier.
> Couldn't
> > > multiple transactions in one job work on the same dataset?
> > >
> > > Steven
> > >
> > > On Fri, Nov 17, 2017 at 11:38 AM, abdullah alamoudi <
> [email protected]>
> > > wrote:
> > >
> > >> So, there are three options to do this:
> > >> 1. Each of these operators works on a specific dataset. So we can
> pass
> > >> the datasetId to the JobEventListenerFactory when requesting the
> > >> transaction id.
> > >> 2. We make one transaction work for multiple datasets by using a map
> from
> > >> datasetId to primary opTracker and use it when reporting commits by
> the
> > log
> > >> flusher thread.
> > >> 3. Prevent a job from having multiple transactions. (For the record, I
> > >> dislike this option since the price we pay is very high IMO)
> > >>
> > >> Cheers,
> > >> Abdullah.
> > >>
> > >>> On Nov 17, 2017, at 11:32 AM, Steven Jacobs <[email protected]>
> wrote:
> > >>>
> > >>> Well, we've solved the problem when there is only one transaction id
> > per
> > >>> job. The operators can fetch the transaction ids from the
> > >>> JobEventListenerFactory (you can find this in master now). The issue
> > is,
> > >>> when we are trying to combine multiple job specs into one feed job,
> the
> > >>> operators at runtime don't have a memory of which "job spec" they
> > >>> originally belonged to, which could tell them which of the
> > >>> transaction ids they should use.
> > >>>
> > >>> Steven
> > >>>
> > >>> On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi <
> > [email protected]>
> > >>> wrote:
> > >>>
> > >>>>
> > >>>> I think that this works, and it seems like the question is how different
> > >>>> operators in the job can get their transaction ids.
> > >>>>
> > >>>> ~Abdullah.
> > >>>>
> > >>>>> On Nov 17, 2017, at 11:21 AM, Steven Jacobs <[email protected]>
> > wrote:
> > >>>>>
> > >>>>> From the conversation, it seems like nobody has the full picture to
> > >>>> propose
> > >>>>> the design?
> > >>>>> For deployed jobs, the idea is to use the same job specification
> but
> > >>>> create
> > >>>>> a new Hyracks job and Asterix Transaction for each execution.
> > >>>>>
> > >>>>> Steven
> > >>>>>
> > >>>>> On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi <
> > >> [email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> I can e-meet anytime (moved to Sunnyvale). We can also look at a
> > >>>> proposed
> > >>>>>> design and see if it can work.
> > >>>>>> Back to my question, how were you planning to change the
> transaction
> > >> id
> > >>>> if
> > >>>>>> we forget about the case with multiple datasets (feed job)?
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Nov 17, 2017, at 10:38 AM, Steven Jacobs <[email protected]>
> > >> wrote:
> > >>>>>>>
> > >>>>>>> Maybe it would be good to have a meeting about this with all
> > >> interested
> > >>>>>>> parties?
> > >>>>>>>
> > >>>>>>> I can be on-campus at UCI on Tuesday if that would be a good day
> to
> > >>>> meet.
> > >>>>>>>
> > >>>>>>> Steven
> > >>>>>>>
> > >>>>>>> On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi <
> > >> [email protected]
> > >>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Also, I was wondering how you would do the same for a single
> dataset
> > >>>>>>>> (non-feed). How would you get the transaction id and change it
> > when
> > >>>> you
> > >>>>>>>> re-run?
> > >>>>>>>>
> > >>>>>>>> On Nov 17, 2017 7:12 AM, "Murtadha Hubail" <[email protected]
> >
> > >>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> For atomic transactions, the change was merged yesterday. For
> > >> entity
> > >>>>>>>> level
> > >>>>>>>>> transactions, it should be a very small change.
> > >>>>>>>>>
> > >>>>>>>>> Cheers,
> > >>>>>>>>> Murtadha
> > >>>>>>>>>
> > >>>>>>>>>> On Nov 17, 2017, at 6:07 PM, abdullah alamoudi <
> > >> [email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I understand that is not the case right now, but is that what
> > >>>>>>>>>> you're working on?
> > >>>>>>>>>>
> > >>>>>>>>>> Cheers,
> > >>>>>>>>>> Abdullah.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> On Nov 17, 2017, at 7:04 AM, Murtadha Hubail <
> > >> [email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> A transaction context can register multiple primary indexes.
> > >>>>>>>>>>> Since each entity commit log contains the dataset id, you can
> > >>>>>>>> decrement
> > >>>>>>>>> the active operations on
> > >>>>>>>>>>> the operation tracker associated with that dataset id.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 17/11/2017, 5:52 PM, "abdullah alamoudi" <
> > [email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Can you illustrate how a deadlock can happen? I am anxious to
> > >> know.
> > >>>>>>>>>>> Moreover, the reason for the multiple transaction ids in
> feeds
> > is
> > >>>>>>>> not
> > >>>>>>>>> simply because we compile them differently.
> > >>>>>>>>>>>
> > >>>>>>>>>>> How would a commit operator know which dataset active
> operation
> > >>>>>>>>> counter to decrement if they share the same id for example?
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On Nov 16, 2017, at 9:46 PM, Xikui Wang <[email protected]>
> > wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Yes. That deadlock could happen. Currently, we have
> one-to-one
> > >>>>>>>>> mappings for
> > >>>>>>>>>>>> the jobs and transactions, except for the feeds.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> @Abdullah, after some digging into the code, I think
> probably
> > we
> > >>>> can
> > >>>>>>>>> use a
> > >>>>>>>>>>>> single transaction id for the job which feeds multiple
> > datasets?
> > >>>> See
> > >>>>>>>>> if I
> > >>>>>>>>>>>> can convince you. :)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The reason we have multiple transaction ids in feeds is that
> > we
> > >>>>>>>> compile
> > >>>>>>>>>>>> each connection job separately and combine them into a
> single
> > >> feed
> > >>>>>>>>> job. A
> > >>>>>>>>>>>> new transaction id is created and assigned to each
> connection
> > >> job,
> > >>>>>>>>> thus for
> > >>>>>>>>>>>> the combined job, we have to handle the different
> transactions
> > >> as
> > >>>>>>>> they
> > >>>>>>>>>>>> are embedded in the connection job specifications. But, what
> > if
> > >> we
> > >>>>>>>>> create a
> > >>>>>>>>>>>> single transaction id for the combined job? That transaction
> > id
> > >>>> will
> > >>>>>>>> be
> > >>>>>>>>>>>> embedded into each connection so they can write logs freely,
> > but
> > >>>> the
> > >>>>>>>>>>>> transaction will be started and committed only once as there
> > is
> > >>>> only
> > >>>>>>>>> one
> > >>>>>>>>>>>> feed job. In this way, we won't need
> > >>>> multiTransactionJobletEventLis
> > >>>>>>>>> tener
> > >>>>>>>>>>>> and the transaction id can be removed from the job
> > specification
> > >>>>>>>>> easily as
> > >>>>>>>>>>>> well (for Steven's change).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Xikui
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey <
> > [email protected]
> > >>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I worry about deadlocks. The waits-for graph may not
> > >> understand
> > >>>>>>>> that
> > >>>>>>>>>>>>> making t1 wait will also make t2 wait since they may share
> a
> > >>>> thread
> > >>>>>>>> -
> > >>>>>>>>>>>>> right?  Or do we have jobs and transactions separately
> > >>>> represented
> > >>>>>>>>> there
> > >>>>>>>>>>>>> now?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Nov 16, 2017 3:10 PM, "abdullah alamoudi" <
> > >>>> [email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> We are using multiple transactions in a single job in case
> > of
> > >>>> feed
> > >>>>>>>>> and I
> > >>>>>>>>>>>>>> think that this is the correct way.
> > >>>>>>>>>>>>>> Having a single job for a feed that feeds into multiple
> > >> datasets
> > >>>>>>>> is a
> > >>>>>>>>>>>>> good
> > >>>>>>>>>>>>>> thing since job resources/feed resources are consolidated.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Here are some points:
> > >>>>>>>>>>>>>> - We can't use the same transaction id to feed multiple
> > >>>> datasets.
> > >>>>>>>> The
> > >>>>>>>>>>>>> only
> > >>>>>>>>>>>>>> other option is to have multiple jobs each feeding a
> > different
> > >>>>>>>>> dataset.
> > >>>>>>>>>>>>>> - Having multiple jobs (in addition to the extra resources
> > >> used,
> > >>>>>>>>> memory
> > >>>>>>>>>>>>>> and CPU) would then force us to either read data from
> > >> external
> > >>>>>>>>> sources
> > >>>>>>>>>>>>>> multiple times, parse records multiple times, etc.,
> > >>>>>>>>>>>>>> or to synchronize the different jobs with the
> > >>>>>>>>>>>>>> feed source within AsterixDB. IMO, this is far more
> > >> complicated
> > >>>>>>>> than
> > >>>>>>>>>>>>> having
> > >>>>>>>>>>>>>> multiple transactions within a single job and the cost far
> > >>>>>> outweighs
> > >>>>>>>>> the
> > >>>>>>>>>>>>>> benefits.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> P.S,
> > >>>>>>>>>>>>>> We are also using this for bucket connections in Couchbase
> > >>>>>>>> Analytics.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Nov 16, 2017, at 2:57 PM, Till Westmann <
> > [email protected]
> > >>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> If there are a number of issues with supporting multiple
> > >>>>>>>> transaction
> > >>>>>>>>> ids
> > >>>>>>>>>>>>>>> and no clear benefits/use-cases, I’d vote for
> > simplification
> > >> :)
> > >>>>>>>>>>>>>>> Also, code that’s not being used has a tendency to "rot"
> > and
> > >>>> so I
> > >>>>>>>>> think
> > >>>>>>>>>>>>>>> that its usefulness might be limited by the time we’d
> > find a
> > >>>> use
> > >>>>>>>>> for
> > >>>>>>>>>>>>>>> this functionality.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> My 2c,
> > >>>>>>>>>>>>>>> Till
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 16 Nov 2017, at 13:57, Xikui Wang wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I'm separating the connections into different jobs in
> some
> > >> of
> > >>>> my
> > >>>>>>>>>>>>>>>> experiments... but that was intended to be used for the
> > >>>>>>>>> experimental
> > >>>>>>>>>>>>>>>> settings (i.e., not for master now)...
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I think the interesting question here is whether we want
> > to
> > >>>>>> allow
> > >>>>>>>>> one
> > >>>>>>>>>>>>>>>> Hyracks job to carry multiple transactions. I personally
> > >> think
> > >>>>>>>> that
> > >>>>>>>>>>>>>> should
> > >>>>>>>>>>>>>>>> be allowed as the transaction and job are two separate
> > >>>> concepts,
> > >>>>>>>>> but I
> > >>>>>>>>>>>>>>>> couldn't find such use cases other than the feeds. Does
> > >> anyone
> > >>>>>>>>> have a
> > >>>>>>>>>>>>>> good
> > >>>>>>>>>>>>>>>> example on this?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Another question is, if we do allow multiple
> transactions
> > >> in a
> > >>>>>>>>> single
> > >>>>>>>>>>>>>>>> Hyracks job, how do we enable the commit runtime to
> > >>>>>>>>>>>>>>>> obtain the correct TXN id without having it embedded as
> > >>>>>>>>>>>>>>>> part of the job specification?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>> Xikui
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi <
> > >>>>>>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I am curious as to how feeds will work without this?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> ~Abdullah.
> > >>>>>>>>>>>>>>>>>> On Nov 16, 2017, at 12:43 PM, Steven Jacobs <
> > >>>> [email protected]
> > >>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>> We currently have
> > >>>>>>>>>>>>>>>>>> MultiTransactionJobletEventListenerFactory,
> > >>>>>>>>> which
> > >>>>>>>>>>>>>>>>> allows
> > >>>>>>>>>>>>>>>>>> for one Hyracks job to run multiple Asterix
> transactions
> > >>>>>>>>> together.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> This class is only used by feeds, and feeds are in the
> > >>>>>>>>>>>>>>>>>> process of
> > >>>>>>>>>>>>>> changing to
> > >>>>>>>>>>>>>>>>>> no longer need this feature. As part of the work in
> > >>>>>>>> pre-deploying
> > >>>>>>>>>>>>> job
> > >>>>>>>>>>>>>>>>>> specifications to be used by multiple hyracks jobs,
> I've
> > >>>> been
> > >>>>>>>>>>>>> working
> > >>>>>>>>>>>>>> on
> > >>>>>>>>>>>>>>>>>> removing the transaction id from the job
> specifications,
> > >> as
> > >>>> we
> > >>>>>>>>> use a
> > >>>>>>>>>>>>>> new
> > >>>>>>>>>>>>>>>>>> transaction for each invocation of a deployed job.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> There is currently no clear way to remove the
> > transaction
> > >> id
> > >>>>>>>> from
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>> job
> > >>>>>>>>>>>>>>>>>> spec and keep the option for
> > >>>>>>>>>>>>>>>>>> MultiTransactionJobletEventListenerFactory.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The question for the group is, do we see a need to
> > >> maintain
> > >>>>>>>> this
> > >>>>>>>>>>>>> class
> > >>>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>>> will no longer be used by any current code? Or, in
> > >>>>>>>>>>>>>>>>>> other words, is
> > >>>>>>>>>>>>>> there
> > >>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>> strong possibility that in the future we will want
> > >> multiple
> > >>>>>>>>>>>>>> transactions
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>> share a single Hyracks job, meaning that it is worth
> > >>>> figuring
> > >>>>>>>> out
> > >>>>>>>>>>>>> how
> > >>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>> maintain this class?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Steven
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>
