Maps are sometimes used to represent JSON or similar data structures.
The resulting Pig objects are Maps with String keys whose values are one of
String, Number, Map, or Bag (applied recursively).

Julien

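A minimal sketch of the shape described above, written in Python rather than Pig. It assumes a dict stands in for a Pig Map and a list of tuples stands in for a Bag; the function name `is_json_like` and the sample record are hypothetical and only illustrate the recursive rule.

```python
# Illustrative only: Python stand-ins for Pig types
# (dict for Map, list of tuples for Bag). Not Pig's actual API.
from numbers import Number


def is_json_like(value):
    """Return True if value is a String, Number, Map or Bag, checked recursively."""
    if isinstance(value, (str, Number)):
        return True
    if isinstance(value, dict):
        # Map: keys must be strings, values must satisfy the same rule.
        return all(isinstance(k, str) and is_json_like(v) for k, v in value.items())
    if isinstance(value, list):
        # Bag: a collection of tuples whose fields satisfy the same rule.
        return all(
            isinstance(t, tuple) and all(is_json_like(f) for f in t) for t in value
        )
    return False


record = {
    "name": "pig",
    "score": 0.9,
    "attrs": {"lang": "java"},
    "tags": [("etl",), ("hadoop",)],
}

print(is_json_like(record))          # True
print(is_json_like({1: "bad key"}))  # False: non-string key
```

Because the values are heterogeneous, a single declared value type cannot describe a map like `record`.
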
On 1/14/11 2:15 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

fwiw most of our maps wind up being mixes of string->double and string->string.
Sometimes string->map and string->bag.
Having non-string keys would really help us but I know that was pulled for a reason.

D

On Fri, Jan 14, 2011 at 2:00 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> I think the big win of static typing is that from examining the script
> alone you can know the output:
>
> A = load 'bla' using BinStorage();
> B = foreach A generate $0 + $1;
>
> With static typing $0 and $1 will both be viewed as bytearrays and thus
> will be cast to doubles, regardless of how BinStorage actually instantiated
> them. With dynamic types we cannot know the answers without knowing the
> data that is fed through.
>
> The downside of the static typing case is that we explicitly allow unknown
> types in maps:
>
> -- assume bla has a schema of m:map and that m has two keys, k1 and k2,
> -- both with integer values
> A = load 'bla' using AvroStorage();
> B = foreach A generate m#'k1' + m#'k2';
>
> Using static types, B.$0 will be a double, even though the underlying types
> are ints. Users will not see that as intuitive even though the semantic is
> clear. In the dynamic model proposed by Daniel, B.$0 will be an int.
>
> We are mitigating this case by allowing typed maps (where the value type of
> the map is declarable) in 0.9. But maps with heterogeneous value types will
> still suffer from this issue.
>
> I vote for static types for several reasons:
>
> 1) I like being able to know the output of the script by examining the
> script alone. It provides a clear semantic that we can explain to users.
> 2) It's less of a maintenance cost, as the need to deal with dynamic type
> discovery is confined to the cast operator. If we go full out dynamic types
> every expression operator has to be able to manage dynamism for byte arrays.
> 3) In my experience almost all maps are string->string so once we allow
> typed maps I suspect people will start using them heavily.
>
> I'm not sure there's a performance gain either way, since in both cases we
> have to manage the case where we think something is a bytearray and it turns
> out to be something else.
>
> Alan.
>
> On Jan 14, 2011, at 1:34 PM, Dmitriy Ryaboy wrote:
>
>> Agreed with what Scott said about procedurally building schemas, and what
>> Olga said about static typing.
>>
>> Daniel, I am not sure what you mean about run-time typing on a row by row
>> basis. Certainly winding up with columns that are sometimes doubles,
>> sometimes floats, and sometimes ints can only lead to unexpected bugs?
>>
>> I know Yahoo went through a lot of pain with the LoadStore rework in 0.7
>> (heck I am still dealing with it), but seems like breaking compatibility in
>> a minor way in order to clean up semantics is ok given that we had a
>> "stable" version in between. I don't think conversion would be too onerous,
>> especially if declaring schemas is simplified.
>>
>> We can just say that odd versions can break apis and even can't :).
>>
>> D
>>
>> On Fri, Jan 14, 2011 at 12:27 PM, Scott Carey <sc...@richrelevance.com> wrote:
>>
>>> On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>>>
>>>> How is runtime detection done? I worry that if 1.txt contains:
>>>> 1, 2
>>>> 1.1, 2.2
>>>>
>>>> We get into a situation where addition of the fields in the first tuple
>>>> produces integers, and adding the fields of the second tuple produces
>>>> doubles.
>>>>
>>>> A more invasive but perhaps easier to reason about solution might be to be
>>>> stricter about types, and require bytearrays to be cast to whatever type
>>>> they are supposed to be if you want to add / delete / do non-byte-things to
>>>> them.
>>>>
>>>> This is a problem if UDFs that output tuples or bags don't specify schemas
>>>> (and specifying schemas of tuples and bags is fairly onerous right now in
>>>> Java). I am not sure what the solution here is, other than finding a clean,
>>>> less onerous way of declaring schemas, fixing up everything in builtin and
>>>> piggybank to only use the new clean sparkly api and document the heck out of
>>>> it.
>>>
>>> A longer term approach would likely strive to make schema specification of
>>> inputs and outputs for UDFs declarative and restrict the scope of the
>>> unknown. Building schema data structures procedurally is NotFun(tm).
>>> All languages could support a string based schema representation, and many
>>> could use more type-safe declarations like Java annotations. I think
>>> there is a long-term opportunity to make Pig's type system easier to work
>>> with and higher performance but it's no small project. Pig certainly isn't
>>> alone with these sorts of issues.
>>>
>>>> D
>>>>
>>>> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>>>>
>>>>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>>>>> the unknown type.
>>>>>
>>>>> In the Pig schema system, users can define an output schema for
>>>>> LoadFunc/EvalFunc. Pig will propagate that schema through the entire script.
>>>>> Defining a schema for LoadFunc/EvalFunc is optional. If the user doesn't
>>>>> define a schema, Pig will mark the fields as bytearray. However, at run time,
>>>>> the user can feed any data type in. Previously, Pig assumed the runtime type
>>>>> for bytearray is DataByteArray, which gave rise to several issues
>>>>> (PIG-1277, PIG-999, PIG-1016).
>>>>>
>>>>> In 0.9, Pig will treat bytearray as unknown type. Pig will inspect the
>>>>> object to figure out what the real type is at runtime. We've done that for
>>>>> all shuffle keys (PIG-1277). However, there are other cases.
>>>>> One case is adding two bytearrays. For example,
>>>>>
>>>>> -- Assume SomeLoader does not define a schema, but actually feeds Integer
>>>>> a = load '1.txt' using SomeLoader() as (a0, a1);
>>>>> b = foreach a generate a0+a1;
>>>>>
>>>>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>>>>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor), and
>>>>> marks the output schema for a0+a1 as double. Here is something interesting:
>>>>> SomeLoader loads Integer, and we get Double after adding. We can change it
>>>>> if we do the following:
>>>>> 1. Don't cast bytearray into Double (in TypeCheckingVisitor)
>>>>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>>>>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>>>>> Pig will figure out the data type at runtime, and process the addition
>>>>> according to the real type
>>>>>
>>>>> Pro:
>>>>> 1. Consistent with the goal for unknown type cleanup: treat all bytearray
>>>>> as unknown type. At runtime, inspect the object to find the real type
>>>>>
>>>>> Cons:
>>>>> 1. Slows down processing since we need to inspect the object type at
>>>>> runtime
>>>>> 2. Brings some indeterminism to the schema system. Before, a0+a1 is double,
>>>>> so the downstream schema is more clear.
>>>>>
>>>>> Any comments?
>>>>>
>>>>> Daniel
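
To make the trade-off discussed in this thread concrete, here is a minimal Python sketch (not Pig code) that applies both models to the two rows of the 1.txt example. It is a simplification: the fields are modeled as text that gets parsed, whereas in Pig the loader would hand over already-typed objects. The function names `parse_dynamic`, `add_static` and `add_dynamic` are hypothetical and do not correspond to anything in Pig.

```python
# Illustrative only: models the two proposals in plain Python, not Pig internals.
# All function names here are hypothetical.

ROWS = [("1", "2"), ("1.1", "2.2")]  # the two rows of the 1.txt example


def parse_dynamic(raw):
    """Pick int when the text is a whole number, otherwise float."""
    try:
        return int(raw)
    except ValueError:
        return float(raw)


def add_static(a0, a1):
    """Static model: untyped fields are always cast to double before adding."""
    return float(a0) + float(a1)


def add_dynamic(a0, a1):
    """Dynamic model: inspect each value at runtime and add with its real type."""
    return parse_dynamic(a0) + parse_dynamic(a1)


static_types = [type(add_static(a0, a1)).__name__ for a0, a1 in ROWS]
dynamic_types = [type(add_dynamic(a0, a1)).__name__ for a0, a1 in ROWS]

print(static_types)   # ['float', 'float']: known from the script alone
print(dynamic_types)  # ['int', 'float']: depends on the data in each row
```

Under the static model the result type is the same for every row, which is the predictability argued for above. Under the dynamic model the first row keeps its integer type, at the cost of a column whose type varies by row.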