Re: Why does `now()` produce different times within the same query?

Edward Capriolo Sat, 03 Dec 2016 08:02:29 -0800

On Saturday, December 3, 2016, Edward Capriolo <edlinuxg...@gmail.com>
wrote:


>
>
> On Saturday, December 3, 2016, Jonathan Haddad <j...@jonhaddad.com>
> wrote:
>
>> That isn't what the original thread is about. The thread is about the
>> timestamp portion of the UUID being different.
>>
>> Having UUID() return the same thing for all rows in a batch would be the
>> unexpected thing virtually every time.
>> On Sat, Dec 3, 2016 at 7:09 AM Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> This isn't about using the same UUID though. It's about the timestamp
>>>> bits in the UUID.
>>>>
>>>> What is the use case for generating multiple UUIDs in a single row? Why
>>>> do you need to extract the timestamp out of both?
>>>> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <
>>>>> sylv...@datastax.com> wrote:
>>>>>
>>>>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <
>>>>>> edlinuxg...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> I am not sure you saw my reply on the thread, but I believe everyone's
>>>>>>> needs can be met. I will copy that here:
>>>>>>>
>>>>>>
>>>>>> I saw it, but the real problem that was raised initially was not about
>>>>>> UDFs or about allowing both behaviors. It's a matter of people being
>>>>>> confused by the behavior of a non-UDF function, now(), and suggesting it
>>>>>> should be changed.
>>>>>>
>>>>>> The Hive idea is interesting I guess, and we can switch to discussing
>>>>>> that, but it's a different problem really and I'm not fond of derailing
>>>>>> threads. I will just note though that if we're not talking about a
>>>>>> confusion issue but rather how to get a timeuuid to be fixed within a
>>>>>> statement, then there is a much more trivial solution: generate it
>>>>>> client side. The `now()` function is a small convenience but there is
>>>>>> nothing you cannot do without it client side, and that actually basically
>>>>>> stands for almost any use of (non aggregate) function in Cassandra
>>>>>> currently.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> "Food for thought: Hive's UDFs introduced an annotation
>>>>>>> @UDFType(deterministic = false)
>>>>>>>
>>>>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>>>>
>>>>>>> The effect is the query planner can see when such a UDF is in use
>>>>>>> and determine the value once at the start of a very long query."
>>>>>>>
>>>>>>> Essentially Hive had a similar if not identical problem: during a
>>>>>>> long-running distributed process like map/reduce, some users wanted the
>>>>>>> semantics of:
>>>>>>>
>>>>>>> 1) Each call should have a new timestamp
>>>>>>>
>>>>>>> While other users wanted the semantics of:
>>>>>>>
>>>>>>> 2) Each call should generate the same timestamp
>>>>>>>
>>>>>>> The solution implemented was to add an annotation to udf such that
>>>>>>> the query planner would pick up the annotation and act accordingly.
>>>>>>>
>>>>>>> (Here is a related issue: https://issues.apache.org/jira/browse/HIVE-1986)
>>>>>>>
>>>>>>> As a result you can essentially implement two UDFs:
>>>>>>>
>>>>>>> @UDFType(deterministic = false)
>>>>>>> public class UDFNow
>>>>>>>
>>>>>>> and for the other people
>>>>>>>
>>>>>>> @UDFType(deterministic = true)
>>>>>>> public class UDFNowOnce extends UDFNow
>>>>>>>
>>>>>>> Both use cases are met in a sensible way.
>>>>>>>
>>>>>>
>>>>>>
>>>>> The `now()` function is a small convenience but there is nothing you
>>>>> cannot do without it client side, and that actually basically stands for
>>>>> almost any use of (non aggregate) function in Cassandra currently.
>>>>>
>>>>> Cassandra's changing philosophy over which entity (client, server, or
>>>>> driver) should create such information does not make this problem easy.
>>>>>
>>>>> If you take into account that you have users who do not understand all
>>>>> the intricacies of UUIDs, the problem is compounded. I.e., how does one
>>>>> generate a UUID in each of C#, Python, Java, etc., with the 47 random
>>>>> bits of bla bla? That is not super easy information to find. Maybe you
>>>>> find a Stack Overflow post that actually gives bad advice, etc.
>>>>>
>>>>> Many times in Cassandra you are using a UUID because you do not have a
>>>>> unique key in the insert and you wish to create one. If you are inserting
>>>>> more than a single record using that same UUID and you do not want the
>>>>> burden of generating it yourself, you would have to do
>>>>> write>>read>>write, which is an anti-pattern.
>>>>>
>>>>
>>> Not multiple ids for a single row. The same id for multiple inserts in a
>>> batch.
>>>
>>> For example, let's say I have an application where my data has no unique
>>> key.
>>>
>>> Table poke
>>> poker, pokee, time
>>>
>>> Suppose I consume pokes from Kafka, build a batch of 30k, and insert them.
>>> You probably want to denormalize into two tables:
>>> Primary key (poker, time)
>>> Primary key (pokee, time)
>>>
>>> It makes sense that they all have the same uuid if you want it to be the
>>> uuid of the batch. This would make it easy to correlate all the events.
>>> Easy to delete them all as well.
>>>
>>> The "do it client side" argument is totally valid, but it has been used
>>> as a justification for not adding features, many of which are eventually
>>> added anyway.
>>>
>>>
>>>
>>>
>>> --
>>> Sorry this was sent from mobile. Will do less grammar and spell check
>>> than usual.
>>>
>>
> Debatable.
>
> Cassandra, for example, always said batch mutations happen... all at
> once... but it was not until snaptree that you could see effects of half a
> batch. Even now a multi-partition batch does not happen all at once.
>
> What people expect does not always align with reality. Point me to a
> unit test that documents said behavior and proves it does not change.
>
> Maybe people expect a query planner to fold constants, many people might
> think a smart query engine could memoize calls to the same function with
> no args, many expect that things happen in isolation.
>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>

"A new unique timeuuid (at the time where the statement using it is
executed)."

This indicates that each statement has one unique timeuuid. Calling the
function twice in one statement and getting different results disagrees
with the documentation.
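
To make the two readings concrete, here is a rough Java sketch of "one value
per call" versus "one value per statement". It is only an illustration under
my own assumptions: the class name StatementScopedNow and the helper once()
are made up, nothing here is Cassandra or Hive code, and UUID.randomUUID()
is just a stand-in for now() (it produces a random version 4 UUID, not a
timeuuid).

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from Cassandra or Hive.
final class StatementScopedNow {

    /** Wraps a supplier so its first result is cached and returned on every later call. */
    static <T> Supplier<T> once(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = source.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        // Stand-in for now(): a random (version 4) UUID, not a real timeuuid.
        Supplier<UUID> perCall = UUID::randomUUID;

        // Reading 1: every call asks the source again, so values normally differ.
        System.out.println("per call equal:      " + perCall.get().equals(perCall.get()));

        // Reading 2: the value is fixed on first use and reused for the whole statement.
        Supplier<UUID> perStatement = once(perCall);
        System.out.println("per statement equal: " + perStatement.get().equals(perStatement.get()));
    }
}
```

A planner that treated the function as evaluated once per statement would
behave like perStatement: every reference inside the statement sees the same
value, and a new statement gets a fresh wrapper.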



-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.
