For option 1, I think the dataset id is not a unique identifier. Couldn't multiple transactions in one job work on the same dataset?
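For concreteness, here is a minimal sketch of what option 2 from the quoted thread below could look like. All names here (OpTracker, TxnContext, notifyEntityCommit) are hypothetical illustrations, not AsterixDB's actual classes: a single transaction context keeps a map from datasetId to that dataset's primary operation tracker, and the dataset id carried in each entity commit log record routes the decrement.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names for illustration only; not AsterixDB's actual API.
class OpTracker {
    final AtomicInteger activeOps = new AtomicInteger();

    void beginOp() { activeOps.incrementAndGet(); }

    void completeOp() { activeOps.decrementAndGet(); }
}

class TxnContext {
    // Option 2: one transaction spanning several datasets keeps a map from
    // datasetId to that dataset's primary operation tracker.
    private final Map<Integer, OpTracker> trackers = new HashMap<>();

    void registerPrimaryIndex(int datasetId, OpTracker tracker) {
        trackers.put(datasetId, tracker);
    }

    // Called when an entity commit log record is flushed; the dataset id
    // carried in the log record routes the decrement to the right tracker.
    void notifyEntityCommit(int datasetId) {
        trackers.get(datasetId).completeOp();
    }
}
```

Note that this sketch keys the map on datasetId alone, so it does not by itself answer the uniqueness objection above: if two transactions in one job touched the same dataset, the key would have to include the transaction id as well.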
Steven

On Fri, Nov 17, 2017 at 11:38 AM, abdullah alamoudi <[email protected]> wrote:

So, there are three options to do this:
1. Each of these operators works on a specific dataset, so we can pass the datasetId to the JobEventListenerFactory when requesting the transaction id.
2. We make one transaction work for multiple datasets by using a map from datasetId to primary opTracker, and use it when commits are reported by the log flusher thread.
3. Prevent a job from having multiple transactions. (For the record, I dislike this option, since the price we pay is very high IMO.)

Cheers,
Abdullah.

On Nov 17, 2017, at 11:32 AM, Steven Jacobs <[email protected]> wrote:

Well, we've solved the problem when there is only one transaction id per job. The operators can fetch the transaction ids from the JobEventListenerFactory (you can find this in master now). The issue is, when we are trying to combine multiple job specs into one feed job, the operators at runtime don't have a memory of which "job spec" they originally belonged to, which could tell them which of the transaction ids they should use.

Steven

On Fri, Nov 17, 2017 at 11:25 AM, abdullah alamoudi <[email protected]> wrote:

I think that this works, and it seems like the question is how the different operators in the job can get their transaction ids.

~Abdullah.

On Nov 17, 2017, at 11:21 AM, Steven Jacobs <[email protected]> wrote:

From the conversation, it seems like nobody has the full picture to propose the design?
For deployed jobs, the idea is to use the same job specification but create a new Hyracks job and Asterix transaction for each execution.

Steven

On Fri, Nov 17, 2017 at 11:10 AM, abdullah alamoudi <[email protected]> wrote:

I can e-meet anytime (moved to Sunnyvale).
We can also look at a proposed design and see if it can work.
Back to my question, how were you planning to change the transaction id if we forget about the case with multiple datasets (feed job)?

On Nov 17, 2017, at 10:38 AM, Steven Jacobs <[email protected]> wrote:

Maybe it would be good to have a meeting about this with all interested parties?

I can be on-campus at UCI on Tuesday if that would be a good day to meet.

Steven

On Fri, Nov 17, 2017 at 9:36 AM, abdullah alamoudi <[email protected]> wrote:

Also, I was wondering how you would do the same for a single dataset (non-feed). How would you get the transaction id and change it when you re-run?

On Nov 17, 2017 7:12 AM, "Murtadha Hubail" <[email protected]> wrote:

For atomic transactions, the change was merged yesterday. For entity-level transactions, it should be a very small change.

Cheers,
Murtadha

On Nov 17, 2017, at 6:07 PM, abdullah alamoudi <[email protected]> wrote:

I understand that is not the case right now, but isn't that what you're working on?

Cheers,
Abdullah.

On Nov 17, 2017, at 7:04 AM, Murtadha Hubail <[email protected]> wrote:

A transaction context can register multiple primary indexes. Since each entity commit log contains the dataset id, you can decrement the active operations on the operation tracker associated with that dataset id.

On 17/11/2017, 5:52 PM, "abdullah alamoudi" <[email protected]> wrote:

Can you illustrate how a deadlock can happen?
I am anxious to know.
Moreover, the reason for the multiple transaction ids in feeds is not simply that we compile them differently.

How would a commit operator know which dataset's active operation counter to decrement if they shared the same id, for example?

On Nov 16, 2017, at 9:46 PM, Xikui Wang <[email protected]> wrote:

Yes. That deadlock could happen. Currently, we have one-to-one mappings between jobs and transactions, except for the feeds.

@Abdullah, after some digging into the code, I think we can probably use a single transaction id for the job which feeds multiple datasets? See if I can convince you. :)

The reason we have multiple transaction ids in feeds is that we compile each connection job separately and combine them into a single feed job. A new transaction id is created and assigned to each connection job, so for the combined job we have to handle the different transactions as they are embedded in the connection job specifications. But what if we created a single transaction id for the combined job? That transaction id would be embedded into each connection so they can write logs freely, but the transaction would be started and committed only once, as there is only one feed job. In this way, we won't need the multiTransactionJobletEventListener, and the transaction id can be removed from the job specification easily as well (for Steven's change).
Best,
Xikui

On Thu, Nov 16, 2017 at 4:26 PM, Mike Carey <[email protected]> wrote:

I worry about deadlocks. The waits-for graph may not understand that making t1 wait will also make t2 wait, since they may share a thread - right? Or do we have jobs and transactions separately represented there now?

On Nov 16, 2017 3:10 PM, "abdullah alamoudi" <[email protected]> wrote:

We are using multiple transactions in a single job in the case of feeds, and I think that this is the correct way.
Having a single job for a feed that feeds into multiple datasets is a good thing, since job resources/feed resources are consolidated.

Here are some points:
- We can't use the same transaction id to feed multiple datasets. The only other option is to have multiple jobs, each feeding a different dataset.
- Having multiple jobs (in addition to the extra memory and CPU used) would then force us either to read data from external sources multiple times, parse records multiple times, etc., or to have synchronization between the different jobs and the feed source within AsterixDB. IMO, this is far more complicated than having multiple transactions within a single job, and the costs far outweigh the benefits.

P.S.: We are also using this for bucket connections in Couchbase Analytics.
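Mike's deadlock worry above can be made concrete with a toy waits-for graph (illustrative code only, not AsterixDB's actual lock manager). Suppose t1 and t2 share a job thread, and t1 blocks on a lock held by t2: t2 can never reach the point where it releases the lock, yet a graph with one node per transaction sees only the edge t1 -> t2 and reports no cycle. Collapsing the nodes by the job that runs them exposes the self-wait:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy waits-for graph, for illustration of the shared-thread problem only.
class WaitsForGraph {
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addWait(String waiter, String holder) {
        edges.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    // A deadlock is a cycle: some node can reach itself via wait edges.
    boolean hasCycle() {
        for (String start : edges.keySet()) {
            if (reachable(start, start, new HashSet<>())) {
                return true;
            }
        }
        return false;
    }

    private boolean reachable(String from, String target, Set<String> seen) {
        for (String next : edges.getOrDefault(from, Set.of())) {
            if (next.equals(target)) {
                return true;
            }
            if (seen.add(next) && reachable(next, target, seen)) {
                return true;
            }
        }
        return false;
    }

    // Re-label each transaction node with the job (thread) that runs it,
    // modeling the fact that blocking one transaction blocks them all.
    WaitsForGraph collapseBy(Map<String, String> txnToJob) {
        WaitsForGraph g = new WaitsForGraph();
        edges.forEach((waiter, holders) -> holders.forEach(holder ->
                g.addWait(txnToJob.get(waiter), txnToJob.get(holder))));
        return g;
    }
}
```

The per-transaction view finds no cycle for the t1 -> t2 edge, while the per-job view turns it into a self-wait, which is exactly the case a detector with only transaction nodes would miss.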
On Nov 16, 2017, at 2:57 PM, Till Westmann <[email protected]> wrote:

If there are a number of issues with supporting multiple transaction ids and no clear benefits/use-cases, I'd vote for simplification :)
Also, code that's not being used has a tendency to "rot", and so I think that its usefulness might be limited by the time we'd find a use for this functionality.

My 2c,
Till

On 16 Nov 2017, at 13:57, Xikui Wang wrote:

I'm separating the connections into different jobs in some of my experiments... but that was intended for the experimental settings (i.e., not for master now)...

I think the interesting question here is whether we want to allow one Hyracks job to carry multiple transactions. I personally think that should be allowed, as the transaction and the job are two separate concepts, but I couldn't find such use cases other than the feeds. Does anyone have a good example of this?

Another question is, if we do allow multiple transactions in a single Hyracks job, how do we enable the commit runtime to obtain the correct TXN id without having it embedded as part of the job specification?
Best,
Xikui

On Thu, Nov 16, 2017 at 1:01 PM, abdullah alamoudi <[email protected]> wrote:

I am curious as to how feeds will work without this?

~Abdullah.

On Nov 16, 2017, at 12:43 PM, Steven Jacobs <[email protected]> wrote:

Hi all,
We currently have MultiTransactionJobletEventListenerFactory, which allows one Hyracks job to run multiple Asterix transactions together.

This class is only used by feeds, and feeds are in the process of changing to no longer need this feature. As part of the work on pre-deploying job specifications to be used by multiple Hyracks jobs, I've been working on removing the transaction id from the job specifications, since we use a new transaction for each invocation of a deployed job.

There is currently no clear way to remove the transaction id from the job spec while keeping the option for MultiTransactionJobletEventListenerFactory.

The question for the group is: do we see a need to maintain this class when it will no longer be used by any current code?
Or, in other words, is there a strong possibility that in the future we will want multiple transactions to share a single Hyracks job, meaning that it is worth figuring out how to maintain this class?

Steven
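Xikui's question in the thread (how the commit runtime obtains the correct TXN id without it being embedded in the job specification) combined with Steven's plan for deployed jobs (a fresh transaction per invocation of the same specification) suggests a per-job lookup. A hedged sketch under that assumption; TxnIdRegistry and its methods are invented names, not actual AsterixDB or Hyracks API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: the deployed job specification carries no transaction
// id. Each Hyracks job invocation is assigned a fresh id when it starts, and
// commit runtimes look the id up by job id at execution time.
class TxnIdRegistry {
    private final AtomicLong nextTxnId = new AtomicLong();
    private final Map<String, Long> jobToTxn = new ConcurrentHashMap<>();

    // Called from a job event listener when an invocation of the deployed
    // spec starts: mint a new transaction id for this run.
    long jobStarted(String jobId) {
        long txnId = nextTxnId.incrementAndGet();
        jobToTxn.put(jobId, txnId);
        return txnId;
    }

    // Called by a commit runtime: the operator knows only its job id, not
    // any transaction id baked into the specification.
    long txnIdFor(String jobId) {
        return jobToTxn.get(jobId);
    }
}
```

The design choice this illustrates is the one Steven describes for the single-transaction case: the binding from job to transaction lives in the runtime listener, so the same specification can be re-run with a new transaction each time.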
