Comments at the end.
Charlie Groves wrote:
On Feb 4, 2008, at 1:42 PM, Alan Gates wrote:
Charlie Groves wrote:
On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:
Our thinking on how to provide field metadata (names and eventually
types) for pig queries was to allow several options:
1) AS in the LOAD, as you can currently do for names.
2) Using an outside metadata service, where we would tell it the
file name and it would tell us the metadata.
3) Support self-describing data formats such as JSON.
Your suggestion for a very simple schema provided in the first
line of the file falls under category 3. The trick here is that we
need to be able to read that metadata about the fields at parse
time (because we'd like to be able to do type checking and such).
So in addition to the load function itself needing to examine the
tuples, we need a way for the load function to read just enough of
the file to tell the front end (on the client box, not on the
map-reduce backend) the schema. Maybe the best way to implement
this is to have an interface that the load function would implement
that lets the parser know that the load function can discover the
metadata for it, and then the parser could call that load function
before proceeding to type checking.
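Roughly, such an interface might look like the sketch below. The
names here (SchemaDiscovery, determineSchema) are made up, and Schema
stands in for whatever schema class we end up exposing (e.g.
org.apache.pig.impl.logicalLayer.schema.Schema):

import java.io.IOException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch only -- hypothetical names, not an existing Pig interface.
// A load function over self-describing data (or data with a side
// metadata file) would implement this in addition to the regular load
// function interface.  The parser calls determineSchema() on the
// client box, before type checking and before anything is submitted
// to the map-reduce backend.
public interface SchemaDiscovery {
    // Read just enough of the named file to tell the front end the
    // schema of its fields.
    Schema determineSchema(String fileName) throws IOException;
}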
We're also interested in being able to tell the load function the
fields needed in the query. Even if you don't have field-per-file
storage (aka columnar storage), it's useful to be able to
immediately project out fields you know the query won't care about,
since you can avoid translation costs and memory overhead.
It's not clear to me that we need another interface to implement
this. We could just add a method "void neededColumns(Schema s)" to
PigLoader. As a post-parsing step, the parser would then visit the
plan, as you suggest, and submit a schema to the PigLoader
function. It would be up to the specific loader implementation to
decide whether to make use of the provided schema or not.
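As a sketch (again, names made up and Schema a stand-in), a
columnar-style loader might use that method like this, while other
loaders could simply ignore the call:

import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch only.  neededColumns() is the proposed addition to the load
// function interface; this loader just records the request.
class ColumnarLoader /* implements the load function interface */ {
    private Schema requested;   // columns the query actually needs

    public void neededColumns(Schema s) {
        // Remember the projection; at read time only these columns
        // would be fetched and turned into tuple fields.
        this.requested = s;
    }
}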
I don't see the use for the first new function in addition to the
second. If a schema is required by the query, the loader must be
able to produce data matching that schema. If the loader can figure
out an internal schema, it can make the check that you describe in
function 1 in addition to structuring its data correctly as in
function 2. If it can't determine its internal schema until it
loads data, then it can do neither and we have to wait until runtime
to see if it succeeds. What about making the call "Schema
neededColumns(Schema s) throws IOException"? The returned Schema is
the actual Schema that will be loaded which must be a superset of
the incoming Schema. If the loader is unable to create the needed
schema, an IOException is thrown.
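In code, that would be something like (sketch only; Schema is again a
stand-in for whatever schema class Pig exposes):

import java.io.IOException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch of the single proposed method.  s is the schema the query
// needs; the return value is the schema the loader will actually
// produce, which must be a superset of s.  If the loader knows it
// can't satisfy s, it throws.
public interface PigLoader {
    Schema neededColumns(Schema s) throws IOException;
}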
I'm not sure I understand what you're proposing. I was trying to say
that we need two separate things from the load function:
1) A way to discover the schema of the data at parse time for type
checking and query correctness checking (e.g. the user asked for
field 5, is there a field 5?). This is needed for metadata option 3,
where the metadata is described by the data (as in JSON) or where the
metadata is located in a file associated with the data. We want to
detect these kinds of errors before we submit to the backend (i.e.
Hadoop) so that we can give the earliest possible error feedback.
2) A way to indicate to the load function the schema it needs to
load, as a way to support columnar storage schemes (such as you
propose) or pushing projection down into the load.
Were you saying that you didn't think one of those is necessary, or
are you saying that you think we can accomplish both with one
function being added to the load function?
I'm saying that both can be accomplished with one new function on the
load func: Schema neededColumns(Schema s) throws IOException. s is
the schema derived from the query, and the load func can use it to
satisfy your first requirement. If it can check its underlying data,
it can then compare it to the schema in s and throw an IOException if
it can't satisfy that. s can also be used to satisfy your second
requirement as it indicates to the load func what it's expected to load.
The returned Schema is the form that the actual data returned by the
load func will take. It must be a superset of the passed-in Schema,
and really just exists to allow the load func to say it isn't going to
prune any of the data away at load time and will just return everything
that it finds. Load funcs that don't know the structure of their
data until they actually read it can return the * schema and
just wait until runtime to see if things blow up, as things work
currently.
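A rough sketch of what a self-describing loader might do inside that
one call (the helper methods here are made up for illustration):

import java.io.IOException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch only; discoverSchema() and covers() are hypothetical helpers.
class JsonLoader /* implements the load func interface */ {
    private Schema projection;   // columns to materialize at read time

    public Schema neededColumns(Schema needed) throws IOException {
        // Requirement 1: peek at the data (or a side metadata file) on
        // the client to learn its real schema, and fail at parse time
        // if it can't cover what the query asks for.
        Schema actual = discoverSchema();
        if (!covers(actual, needed)) {
            throw new IOException(
                "data cannot supply the fields the query needs");
        }
        // Requirement 2: remember the projection so only these fields
        // get materialized when tuples are read on the backend.
        this.projection = needed;
        return actual;   // a superset of 'needed'
    }

    private Schema discoverSchema() throws IOException {
        // Read just enough of the file to build a Schema; how is
        // loader specific and omitted in this sketch.
        throw new UnsupportedOperationException("sketch only");
    }

    private boolean covers(Schema actual, Schema needed) {
        // Field-by-field compatibility check, omitted in this sketch.
        return true;
    }
}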
I think this makes more sense as a single function because the two
requirements are essentially the same operation. Loading enough of
the data to check a given schema against what's actually in the store
is almost the same work as determining what it'll actually load for
requirement two.
Make more sense?
Let's work through a use case with the following script:
a = load 'mydata' using myloader();
b = filter a by $1 matches '.mysite.com';
c = group b by $0;
d = foreach c generate $0, SUM($1.$5);
store d into 'summeddata';
A post-process step would figure out that the data loaded from 'mydata'
needs to have at least 6 columns, column 2 needs to be a string, and
column 6 needs to be int, long, float, or double. It would then compose
a schema with those slots filled in and call neededColumns, passing in
that schema. If myloader were of a type that could push the
projection down into the load, it would store this information for use
later when actually loading data. If myloader were loading some type of
self-describing data, it would need, in this same function call, to
discover the schema of the data it is loading. It would then check this
against the passed-in schema to ensure it makes sense. In addition, it
would create an output schema that describes the data, and return that
from neededColumns. In the case where the data was not self-describing,
it would simply return a star schema (why not the schema passed in,
since the data should match that or we'll get an error?). Is that correct?
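To make that concrete, the schema handed to neededColumns might be
built along these lines (sketch only; the Schema/FieldSchema/DataType
classes are assumptions here, and there's no good way shown yet to
say "must be one of int, long, float, or double"):

import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Sketch only: the schema the post-process step might compose for the
// script above and pass to myloader's neededColumns().
class ProjectionForMydata {
    static Schema needed() {
        Schema s = new Schema();
        // $0: grouping key, no type constraint
        s.add(new Schema.FieldSchema(null, DataType.BYTEARRAY));
        // $1: used with matches, must be a string
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        // $2 - $4: not referenced by the query, just need to exist
        s.add(new Schema.FieldSchema(null, DataType.BYTEARRAY));
        s.add(new Schema.FieldSchema(null, DataType.BYTEARRAY));
        s.add(new Schema.FieldSchema(null, DataType.BYTEARRAY));
        // $5: summed, so it must end up int, long, float, or double
        s.add(new Schema.FieldSchema(null, DataType.BYTEARRAY));
        return s;
    }
}

The post-process step would then call myloader's neededColumns with
that schema and get back either the schema the loader will actually
produce or an IOException at parse time.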
It still feels to me like you have one function doing two things. One
way or another, something needs to check that the schema derived from
the post-process analysis of the script matches the schema derived by
investigating the data itself, and in your proposal that can be done in
this function.
One way or another I think we agree on the needed functionality;
interface concerns are probably secondary. I need to update my type
design doc to address how types are converted (at load time or lazily)
and how the load function exposes that functionality. I'll add this to
the doc at the same time.
Alan.
Charlie