Agreed with what Scott said about procedurally building schemas, and what
Olga said about static typing.

Daniel, I am not sure what you mean about run-time typing on a row by row
basis.  Certainly winding up with columns that are sometimes doubles,
sometimes floats, and sometimes ints can only lead to unexpected bugs?

I know Yahoo went through a lot of pain with the LoadStore rework in 0.7
(heck I am still dealing with it), but seems like breaking compatibility in
a minor way in order to clean up semantics is ok given that we had a
"stable" version in between. I don't think conversion would be too onerous,
especially if declaring schemas is simplified.

We can just say that odd versions can break apis and even can't :).

D

On Fri, Jan 14, 2011 at 12:27 PM, Scott Carey <[email protected]>wrote:

>
>
> On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <[email protected]> wrote:
>
> >How is runtime detection done? I worry that if 1.txt contains:
> >1, 2
> >1.1, 2.2
> >
> >We get into a situation where addition of the fields in the first tuple
> >produces integers, and adding the fields of the second tuple produces
> >doubles.
> >
> >A more invasive but perhaps easier to reason about solution might be to be
> >stricter about types, and require bytearrays to be cast to whatever type
> >they are supposed to be if you want to add / delete / do non-byte-things
> >to
> >them.
> >
> >This is a problem if UDFs that output tuples or bags don't specify schemas
> >(and specifying schemas of tuples and bags is fairly onerous right now in
> >Java). I am not sure what the solution here is, other than finding a
> >clean,
> >less onerous way of declaring schemas, fixing up everything in builtin and
> >piggybank to only use the new clean sparkly api and document the heck out
> >of
> >it.
>
> A longer term approach would likely strive to make schema specification of
> inputs and outputs for UDFs declarative and restrict the scope of the
> unknown.  Building schema data structures procedurally is NotFun(tm).
> All languages could support a string based schema representation, and many
> could use more type-safe declarations like Java annotations.  I think
> there is a long-term opportunity to make Pig's type system easier to work
> with and higher performance but its no small project.  Pig certainly isn't
> alone with these sort of issues.
>
> >
> >D
> >
> >On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <[email protected]>
> >wrote:
> >
> >> One goal of semantic cleanup work undergoing is to clarify the usage of
> >> unknown type.
> >>
> >> In Pig schema system, user can define output schema for
> >>LoadFunc/EvalFunc.
> >> Pig will propagate those schema to the entire script. Defining schema
> >>for
> >> LoadFunc/EvalFunc is optional. If user don't define schema, Pig will
> >>mark
> >> them bytearray. However, in the run time, user can feed any data type
> >>in.
> >> Before, Pig assumes the runtime type for bytearray is DataByteArray,
> >>which
> >> arose several issues (PIG-1277, PIG-999, PIG-1016).
> >>
> >> In 0.9, Pig will treat bytearray as unknown type. Pig will inspect the
> >> object to figure out what the real type is at runtime. We've done that
> >>for
> >> all shuffle keys (PIG-1277). However, there are other cases. One case is
> >> adding two bytearray. For example,
> >>
> >> a = load '1.txt' using SomeLoader() as (a0, a1);  // Assume SomeLoader
> >>does
> >> not define schema, but actually feed Integer
> >> b = foreach a generate a0+a1;
> >>
> >> In Pig 0.8, schema system marks a0 and a1 as bytearray. In the case of
> >> a0+a1, Pig cast both a0 and a1 to double (in TypeCheckingVisitor), and
> >>mark
> >> the output schema for a0+a1 as double. Here is something interesting,
> >> SomeLoader loads Integer, and we get Double after adding. We can change
> >>it
> >> if we do the following:
> >> 1. Don't cast bytearray into Double (in TypeCheckingVisitor)
> >> 2. Change POAdd(Similarly, all other ExpressionOperators, multply,
> >>divide,
> >> etc) to handle bytearray. When the schema for POAdd is bytearray, Pig
> >>will
> >> figure out the data type at runtime, and process adding according to the
> >> real type
> >>
> >> Pro:
> >> 1. Consistent with the goal for unknown type cleanup: treat all
> >>bytearray
> >> as unknown type. In the runtime, inspect the object to find the real
> >>type
> >>
> >> Cons:
> >> 1. Slow down the processing since we need to inspect object type at
> >>runtime
> >> 2. Bring some indeterminism to schema system. Before a0+a1 is double,
> >> downstream schema is more clear.
> >>
> >> Any comments?
> >>
> >> Daniel
> >>
>
>

Reply via email to