Agreed with what Scott said about procedurally building schemas, and what Olga said about static typing.
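
For concreteness, here is roughly what the gap looks like for a UDF's output schema today: building Schema/FieldSchema objects by hand versus a string-based declaration. This is only a sketch against the 0.8-era APIs (EvalFunc.outputSchema, Schema.FieldSchema, DataType); the UDF and field names are made up, and the string-based variant assumes Utils.getSchemaFromString is available.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class ExampleUdf extends EvalFunc<Tuple> {

    @Override
    public Tuple exec(Tuple input) throws IOException {
        return input;  // placeholder body; the schema declaration is the point here
    }

    // Procedural style: build FieldSchema objects by hand and nest them.
    @Override
    public Schema outputSchema(Schema input) {
        try {
            Schema tuple = new Schema();
            tuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
            tuple.add(new Schema.FieldSchema("count", DataType.LONG));
            return new Schema(new Schema.FieldSchema("result", tuple, DataType.TUPLE));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // A string-based declaration of the same thing would collapse the whole
    // method to something like (via org.apache.pig.impl.util.Utils):
    //   return Utils.getSchemaFromString("result:(name:chararray, count:long)");
}

Once bags of tuples get involved, the procedural version gets noticeably worse, which is presumably a big part of why so many UDFs skip outputSchema entirely.
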
Daniel, I am not sure what you mean about run-time typing on a row-by-row basis. Surely winding up with columns that are sometimes doubles, sometimes floats, and sometimes ints can only lead to unexpected bugs?

I know Yahoo went through a lot of pain with the LoadStore rework in 0.7 (heck, I am still dealing with it), but it seems like breaking compatibility in a minor way in order to clean up semantics is OK, given that we had a "stable" version in between. I don't think conversion would be too onerous, especially if declaring schemas is simplified. We can just say that odd versions can break APIs and even ones can't :).

D

On Fri, Jan 14, 2011 at 12:27 PM, Scott Carey <[email protected]> wrote:

>
>
> On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <[email protected]> wrote:
>
> > How is runtime detection done? I worry that if 1.txt contains:
> > 1, 2
> > 1.1, 2.2
> >
> > we get into a situation where addition of the fields in the first tuple
> > produces integers, and adding the fields of the second tuple produces
> > doubles.
> >
> > A more invasive, but perhaps easier to reason about, solution might be
> > to be stricter about types and require bytearrays to be cast to
> > whatever type they are supposed to be if you want to add / delete / do
> > non-byte things to them.
> >
> > This is a problem if UDFs that output tuples or bags don't specify
> > schemas (and specifying schemas of tuples and bags is fairly onerous
> > right now in Java). I am not sure what the solution here is, other than
> > finding a clean, less onerous way of declaring schemas, fixing up
> > everything in builtin and piggybank to use only the new clean sparkly
> > API, and documenting the heck out of it.
>
> A longer-term approach would likely strive to make schema specification
> of inputs and outputs for UDFs declarative and restrict the scope of the
> unknown. Building schema data structures procedurally is NotFun(tm).
> All languages could support a string-based schema representation, and
> many could use more type-safe declarations such as Java annotations. I
> think there is a long-term opportunity to make Pig's type system easier
> to work with and higher performance, but it's no small project. Pig
> certainly isn't alone with these sorts of issues.
>
> >
> > D
> >
> > On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <[email protected]>
> > wrote:
> >
> >> One goal of the semantic cleanup work under way is to clarify the
> >> usage of the unknown type.
> >>
> >> In Pig's schema system, users can define an output schema for a
> >> LoadFunc/EvalFunc, and Pig will propagate those schemas through the
> >> entire script. Defining a schema for a LoadFunc/EvalFunc is optional;
> >> if the user doesn't define one, Pig will mark the fields as bytearray.
> >> However, at run time the user can feed in any data type. Previously,
> >> Pig assumed the runtime type for bytearray was DataByteArray, which
> >> gave rise to several issues (PIG-1277, PIG-999, PIG-1016).
> >>
> >> In 0.9, Pig will treat bytearray as an unknown type: Pig will inspect
> >> the object to figure out what the real type is at runtime. We've done
> >> that for all shuffle keys (PIG-1277). However, there are other cases.
> >> One case is adding two bytearrays. For example,
> >>
> >> a = load '1.txt' using SomeLoader() as (a0, a1); -- SomeLoader does
> >> not define a schema, but actually feeds Integer
> >> b = foreach a generate a0+a1;
> >>
> >> In Pig 0.8, the schema system marks a0 and a1 as bytearray.
> >> In the case of a0+a1, Pig casts both a0 and a1 to double (in
> >> TypeCheckingVisitor) and marks the output schema for a0+a1 as double.
> >> Here is something interesting: SomeLoader loads Integer, and we get
> >> Double after adding. We can change that if we do the following:
> >> 1. Don't cast bytearray into Double (in TypeCheckingVisitor).
> >> 2. Change POAdd (and, similarly, all other ExpressionOperators:
> >> multiply, divide, etc.) to handle bytearray. When the schema for POAdd
> >> is bytearray, Pig will figure out the data type at runtime and process
> >> the addition according to the real type.
> >>
> >> Pro:
> >> 1. Consistent with the goal of the unknown-type cleanup: treat all
> >> bytearray as unknown type, and at runtime inspect the object to find
> >> the real type.
> >>
> >> Cons:
> >> 1. Slows down processing, since we need to inspect object types at
> >> runtime.
> >> 2. Brings some indeterminism to the schema system. Before, a0+a1 was
> >> double, so the downstream schema was clearer.
> >>
> >> Any comments?
> >>
> >> Daniel
> >>
> >
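
P.S. To make my worry above concrete: the run-time dispatch Daniel describes in point 2 would look roughly like the sketch below. This is only the idea, not Pig's actual POAdd code (presumably DataType.findType(Object) would stand in for the instanceof checks):

public final class UnknownTypeAdd {

    // Add two values whose declared schema is bytearray (unknown type):
    // inspect the actual runtime objects and dispatch on what they turn
    // out to be.
    public static Number add(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            return (Integer) left + (Integer) right;
        }
        if (left instanceof Long && right instanceof Long) {
            return (Long) left + (Long) right;
        }
        // Mixed or floating-point operands: fall back to double arithmetic.
        if (left instanceof Number && right instanceof Number) {
            return ((Number) left).doubleValue() + ((Number) right).doubleValue();
        }
        // A value with no usable runtime type (e.g. a raw DataByteArray)
        // has to become an error or a null, depending on the semantics chosen.
        return null;
    }
}

With the 1.txt I described above (1, 2 on one row, 1.1, 2.2 on the next), the first row's sum comes out as an Integer and the second as a Double, which is exactly the row-by-row indeterminism I'm uneasy about, and every row pays for the type inspection.
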
