Re: Semantic cleanup: How to adding two bytearray

Julien Le Dem Fri, 14 Jan 2011 14:41:39 -0800

Maps are sometimes used to represent JSON or similar data structures.
The resulting Pig objects are Maps with String keys whose values are any of
String, Number, Map, or Bag (nested recursively).
Julien
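
As a rough illustration of the point above, here is a minimal Python sketch
(not Pig code) of why a JSON-derived map does not fit a single declared value
type. The sample document and its field names are made up for the example, and
reading Python's list/dict as stand-ins for Pig's Bag/Map is only a loose
analogy.

```python
import json

# Hypothetical JSON record; the field names are invented for illustration.
raw = '{"name": "pig", "score": 1.5, "count": 3, "tags": ["a", "b"], "meta": {"k": "v"}}'
doc = json.loads(raw)

# Every key is a string, as in a Pig map.
all_string_keys = all(isinstance(k, str) for k in doc)

# The values, however, have several different runtime types,
# so no single value type describes the whole map.
value_types = sorted({type(v).__name__ for v in doc.values()})

print(all_string_keys)
print(value_types)
```

A map like this is the heterogeneous case that a typed map with one declared
value type cannot describe.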


On 1/14/11 2:15 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

fwiw most of our maps wind up being mixes of string->double and
string->string.  Sometimes string->map and string->bag. Having non-string
keys would really help us, but I know that was pulled for a reason.

D

On Fri, Jan 14, 2011 at 2:00 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> I think the big win of static typing is that from examining the script
> alone you can know the output:
>
> A = load 'bla' using BinStorage();
> B = foreach A generate $0 + $1;
>
> With static typing $0 and $1 will both be viewed as bytearrays and thus
> will be cast to doubles, regardless of how BinStorage actually instantiated
> them.  With dynamic types we cannot know the answers without knowing the
> data that is fed through.
>
> The downside of the static typing case is that we explicitly allow unknown
> types in maps:
>
> A = load 'bla' using AvroStorage(); -- assume bla has a schema of m:map
>                                     -- and that m has two keys, k1 and k2,
>                                     -- both with integer values
> B = foreach A generate m#'k1' + m#'k2';
>
> Using static types, B.$0 will be a double, even though the underlying types
> are ints.  Users will not see that as intuitive even though the semantic is
> clear.  In the dynamic model proposed by Daniel, B.$0 will be an int.
>
> We are mitigating this case by allowing typed maps (where the value type of
> the map is declarable) in 0.9.  But maps with heterogeneous value types will
> still suffer from this issue.
>
> I vote for static types for several reasons:
>
> 1) I like being able to know the output of the script by examining the
> script alone.  It provides a clear semantic that we can explain to users.
> 2) It's less of a maintenance cost, as the need to deal with dynamic type
> discovery is confined to the cast operator.  If we go all out on dynamic
> types, every expression operator has to be able to manage dynamism for
> bytearrays.
> 3) In my experience almost all maps are string->string so once we allow
> typed maps I suspect people will start using them heavily.
>
> I'm not sure there's a performance gain either way, since in both cases we
> have to manage the case where we think something is a bytearray and it turns
> out to be something else.
>
> Alan.
>
>
>
> On Jan 14, 2011, at 1:34 PM, Dmitriy Ryaboy wrote:
>
>  Agreed with what Scott said about procedurally building schemas, and what
>> Olga said about static typing.
>>
>> Daniel, I am not sure what you mean about run-time typing on a row by row
>> basis.  Certainly winding up with columns that are sometimes doubles,
>> sometimes floats, and sometimes ints can only lead to unexpected bugs?
>>
>> I know Yahoo went through a lot of pain with the LoadStore rework in 0.7
>> (heck I am still dealing with it), but seems like breaking compatibility in
>> a minor way in order to clean up semantics is ok given that we had a
>> "stable" version in between. I don't think conversion would be too onerous,
>> especially if declaring schemas is simplified.
>>
>> We can just say that odd versions can break apis and even can't :).
>>
>> D
>>
>> On Fri, Jan 14, 2011 at 12:27 PM, Scott Carey <sc...@richrelevance.com> wrote:
>>
>>
>>>
>>> On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>>>
>>>  How is runtime detection done? I worry that if 1.txt contains:
>>>> 1, 2
>>>> 1.1, 2.2
>>>>
>>>> We get into a situation where addition of the fields in the first tuple
>>>> produces integers, and adding the fields of the second tuple produces
>>>> doubles.
>>>>
>>>> A more invasive but perhaps easier to reason about solution might be to be
>>>> stricter about types, and require bytearrays to be cast to whatever type
>>>> they are supposed to be if you want to add / delete / do non-byte-things to
>>>> them.
>>>>
>>>> This is a problem if UDFs that output tuples or bags don't specify schemas
>>>> (and specifying schemas of tuples and bags is fairly onerous right now in
>>>> Java). I am not sure what the solution here is, other than finding a clean,
>>>> less onerous way of declaring schemas, fixing up everything in builtin and
>>>> piggybank to only use the new clean sparkly api, and documenting the heck
>>>> out of it.
>>>>
>>>
>>> A longer term approach would likely strive to make schema specification of
>>> inputs and outputs for UDFs declarative and restrict the scope of the
>>> unknown.  Building schema data structures procedurally is NotFun(tm).
>>> All languages could support a string based schema representation, and many
>>> could use more type-safe declarations like Java annotations.  I think
>>> there is a long-term opportunity to make Pig's type system easier to work
>>> with and higher performance, but it's no small project.  Pig certainly isn't
>>> alone with these sorts of issues.
>>>
>>>
>>>> D
>>>>
>>>> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>>>>
>>>>  One goal of the ongoing semantic cleanup work is to clarify the usage of
>>>>> the unknown type.
>>>>>
>>>>> In the Pig schema system, the user can define an output schema for a
>>>>> LoadFunc/EvalFunc, and Pig will propagate that schema through the entire
>>>>> script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user
>>>>> doesn't define one, Pig marks the fields as bytearray. At run time,
>>>>> however, the user can feed in any data type. Previously, Pig assumed the
>>>>> runtime type of a bytearray was DataByteArray, which gave rise to several
>>>>> issues (PIG-1277, PIG-999, PIG-1016).
>>>>>
>>>>> In 0.9, Pig will treat bytearray as an unknown type and inspect the
>>>>> object at runtime to figure out what the real type is. We've done that for
>>>>> all shuffle keys (PIG-1277). However, there are other cases. One case is
>>>>> adding two bytearrays. For example,
>>>>>
>>>>> -- Assume SomeLoader does not define a schema, but actually feeds Integers
>>>>> a = load '1.txt' using SomeLoader() as (a0, a1);
>>>>> b = foreach a generate a0+a1;
>>>>>
>>>>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>>>>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and
>>>>> marks the output schema for a0+a1 as double. Here is something
>>>>> interesting: SomeLoader loads Integers, yet we get a Double after adding.
>>>>> We can change this if we do the following:
>>>>> 1. Don't cast bytearray to Double (in TypeCheckingVisitor).
>>>>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>>>>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>>>>> Pig will figure out the data type at runtime and perform the addition
>>>>> according to the real type.
>>>>>
>>>>> Pro:
>>>>> 1. Consistent with the goal of the unknown type cleanup: treat all
>>>>> bytearrays as unknown type and inspect the object at runtime to find the
>>>>> real type.
>>>>>
>>>>> Cons:
>>>>> 1. Slows down processing, since we need to inspect the object type at
>>>>> runtime.
>>>>> 2. Brings some indeterminism to the schema system. Previously a0+a1 was
>>>>> double, so the downstream schema was clearer.
>>>>>
>>>>> Any comments?
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>
>>>
>
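
To make the trade-off discussed in this thread concrete, here is a minimal
Python sketch (not Pig code, and not how POAdd or TypeCheckingVisitor are
actually implemented). It models the two rules for adding untyped fields, using
the two rows from the 1.txt example above. The helper names `parse_field`,
`static_add`, and `dynamic_add` are hypothetical, as is the assumption that the
loader hands back an integer whenever the text looks like one.

```python
from typing import Union

Number = Union[int, float]


def parse_field(raw: str) -> Number:
    """Hypothetical loader step: an int when the text parses as one, else a float."""
    try:
        return int(raw)
    except ValueError:
        return float(raw)


def static_add(x: Number, y: Number) -> float:
    """Static rule: untyped operands are cast to double, so the result is always a float."""
    return float(x) + float(y)


def dynamic_add(x: Number, y: Number) -> Number:
    """Dynamic rule: the result type follows whatever types arrive at runtime."""
    return x + y


# The two rows from the 1.txt example.
rows = [("1", "2"), ("1.1", "2.2")]
parsed = [(parse_field(a), parse_field(b)) for a, b in rows]

static_types = [type(static_add(a, b)).__name__ for a, b in parsed]
dynamic_types = [type(dynamic_add(a, b)).__name__ for a, b in parsed]

print(static_types)   # ['float', 'float']
print(dynamic_types)  # ['int', 'float']
```

Under the static rule the output column has one type that can be read off the
script; under the dynamic rule the column's type varies row by row with the
data, which is the concern raised about runtime detection.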
