The function itself takes one record in and produces N records (where the normal case is N=1, but you are right, there's nothing to stop it from being all of the records in some dataset). The normal case for a join would be something like adding some additional fields to an incoming record by matching the record against other datasets - e.g., geolocating a Tweet (as Jianfeng mentioned). Thus, normally one record comes in to the function, some processing happens, and a fatter version of the record goes out. Since the record hasn't been stored yet when the processing occurs, it can't see itself during the processing - it doesn't exist in stored form yet.

A weird case would be outputting multiple records for one incoming record - and given our loose transaction model, if those records are going to be stored in the same dataset that's being read by the function for a join, the join could perhaps see some of the records being generated. HOWEVER: the function doesn't get run on them - they are not coming in the front door of the feed - so if an incoming record indeed generates N records in its place, it could see the original dataset plus N-1 of the other records. But that's still not infinite. (Nor is it normal for the join to be a self-join or for the result to have cardinality > 1. :-))
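For concreteness, a rough AQL sketch of that "normal" enrichment case might look like the following - the Cities dataset, its fields, and the lookup predicate are made up purely for illustration; the point is just the one-record-in, one-fatter-record-out join against a *different* dataset:

create function enrich_tweet($t) {
    let $matches := (for $c in dataset Cities
                     // hypothetical lookup dataset and predicate
                     where $c.name = $t.location
                     return $c)
    return {
        "id": $t.id,
        "username": $t.username,
        "location": $t.location,
        "text": $t.text,
        "timestamp": $t.timestamp,
        // the "fatter" part: a field pulled in by the join
        "country": $matches[0].country
    }
};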

On 12/9/15 9:48 AM, abdullah alamoudi wrote:
But if the function actually takes a single record and performs a join, effectively producing a collection of records that feeds into the same dataset, wouldn't that create a chance for an infinite loop that would eventually fill up the storage and explode the dataset?

One thing to note is that in the current implementation, feed connections are translated into insert statements that go through the query compiler, which means a materialize operator will be introduced.
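To make that concrete - and this is only an illustrative sketch, not the actual compiled plan - each incoming batch is handled roughly as if it were an ordinary insert statement like the one below, which is why the compiler's usual insert-pipeline machinery (including the materialize operator) comes into play:

insert into dataset Tweets (
    // IncomingBatch is not a real dataset; it is a hypothetical stand-in
    // for the records arriving on the feed in this illustration
    for $t in dataset IncomingBatch
    return feed_processor($t)
);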

Cheers,
Abdullah.

Amoudi, Abdullah.

On Wed, Dec 9, 2015 at 9:40 AM, Mike Carey <[email protected]> wrote:

Hmmm....  I'm not sure where the Halloween problem is in this case - for a given record being ingested, it's not in the dataset yet, and won't get to move further through the pipeline to the point where it IS in the dataset until after the query evaluation is over, the result has been computed, and the new object (the one to be inserted) has been determined.  At least that's how it should work.  There should thus be no way for the ingestion pipeline query to see a record twice in a self-join scenario, because it won't be in play in the dataset yet (it's not part of "self") - right?  (Or is there a subtlety that I'm missing?)

Cheers,
Mike


On 12/9/15 6:59 AM, abdullah alamoudi wrote:

The only problem I see is the Halloween problem in the case of a self-join, hence the need for materialization (not sure if it is possible in this case, but definitely possible in general). Other than that, I don't think there is any problem.

Cheers,
Abdullah
On Dec 8, 2015 11:51 PM, "Mike Carey" <[email protected]> wrote:

(I am still completely not seeing a problem here.)
On 12/8/15 10:20 PM, abdullah alamoudi wrote:

The plan is to mostly use upsert in the future since we can do some optimizations with it that we can't do with an insert. We should also support deletes, and probably allow a mix of the three operations within the same feed. This is a work in progress right now, but before I go far, I am stabilizing some other parts of the feeds.
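For reference, a rough statement-level sketch of the three operations a feed might eventually mix (the values are made up, and the upsert piece is the work in progress mentioned above, so treat the syntax as illustrative):

// insert fails if a record with this primary key already exists
insert into dataset Tweets (
    { "id": "t1", "username": "alice", "location": "irvine",
      "text": "hello", "timestamp": "2015-12-08T22:00:00" }
);

// upsert replaces an existing record with the same key instead of failing
upsert into dataset Tweets (
    { "id": "t1", "username": "alice", "location": "irvine",
      "text": "hello again", "timestamp": "2015-12-08T22:05:00" }
);

// delete removes matching records
delete $t from dataset Tweets where $t.id = "t1";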

Cheers,
Abdullah.


Amoudi, Abdullah.

On Tue, Dec 8, 2015 at 10:11 PM, Ildar Absalyamov <[email protected]> wrote:

Abdullah,

OK, now I see what problems it will cause.
Kinda related question: could the feed implement the “upsert” semantics that you’ve been working on, instead of “insert” semantics?

On Dec 8, 2015, at 21:52, abdullah alamoudi <[email protected]>
wrote:

I think that we probably should restrict feed-applied functions somehow (this needs further thought and discussion), and I know for sure that we don't today. As for the case you present, I would imagine that it could be allowed theoretically, but I think everyone sees why it should be disallowed.

One thing to keep in mind is that we introduce a materialize if the dataset is part of an insert pipeline. Now think about how this would work with a continuous feed. One choice would be for the feed to materialize all records to be inserted and, once the feed stops, start inserting them - but I still think we should not allow it.

My 2c,
Any opposing argument?


Amoudi, Abdullah.

On Tue, Dec 8, 2015 at 6:28 PM, Ildar Absalyamov <[email protected]> wrote:

Hi All,

As a part of feed ingestion we do allow preprocessing incoming data with AQL UDFs.
I was wondering if we somehow restrict the kind of UDFs that could be used? Do we allow joins in these UDFs? Especially joins with the same dataset, which is used for intake. Ex:

create type TweetType as open {
    id: string,
    username: string,
    location: string,
    text: string,
    timestamp: string
};

create dataset Tweets(TweetType) primary key id;

create function feed_processor($x) {
    for $y in dataset Tweets
    // self-join with Tweets dataset on some predicate($x, $y)
    return $y
};

create feed TweetFeed
apply function feed_processor;

The query above fails at runtime, but I was wondering whether it could theoretically work at all.

Best regards,
Ildar


Best regards,

Ildar



