Dmitriy V. Ryaboy commented on PIG-966:
Responses to responses:
bq. Perhaps it's best to leave this as strings but look for a scheme at the
beginning and interpret it as a URI if it has one (which is what Pig does now).
I understand the motivation more clearly now, thanks for the explanation.
Agreed with the quoted approach.
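
For illustration, the "look for a scheme" check could be as small as the sketch below. The class and method names are hypothetical and this is not Pig's actual code; it only assumes that java.net.URI is an acceptable way to detect a leading scheme.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Pig: decides whether a location string
// should be treated as a URI or left as a plain string.
final class LocationSketch {
    /** Returns the scheme if the location parses as a URI with one, otherwise null. */
    static String schemeOf(String location) {
        try {
            return new URI(location).getScheme(); // null for scheme-less paths
        } catch (URISyntaxException e) {
            return null; // not a valid URI at all, so keep it as a plain string
        }
    }
}
```

A location such as hdfs://namenode/data would be handled as a URI, while /user/data or anything that does not parse stays an opaque string.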
bq. [regarding single schema for partitioned datasets] Agreed, that is what I
was trying to say. Perhaps it wasn't clear.
Nope, it was clear, I just have a very verbose way of saying "yes".
Regarding merging the Schemas you said:
bq. No. It serves a different purpose, which is to define the content of data
flows inside the logical plan. We should not tie these two together.
I don't really understand the difference, but accept your superior knowledge of
the codebase and accept your decision :-).
bq. I'm not inclined to bend my programming style to match that of whoever
+9.3 from the Russian judge. Gleefully accepted.
bq. We need partition keys as part of this interface, as Pig will need to be
able to pass partition keys to loaders that are capable of doing partition
pruning. So we could add getPartitionKeys to the LoadMetadata interface.
That's precisely what I am suggesting -- take it out of Schema, put it in
LoadMetadata (or MetadataReader, as I like to call it....).
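
A minimal sketch of that split, assuming a no-argument getPartitionKeys that returns field names. The signature, the example loader, and the pruning check are all hypothetical; the proposal does not specify them.

```java
// Hypothetical shape of the metadata interface; only the method name
// getPartitionKeys comes from the discussion, the signature is assumed.
interface LoadMetadataSketch {
    /** Names of the fields the data is partitioned on; empty if unpartitioned. */
    String[] getPartitionKeys();
}

// Example loader whose data is stored in one directory per day.
final class DailyPartitionedLoaderSketch implements LoadMetadataSketch {
    public String[] getPartitionKeys() {
        return new String[] {"dt"};
    }
}

final class PartitionPruningSketch {
    /** True if a filter on the given field could be handed to the loader for pruning. */
    static boolean canPrune(LoadMetadataSketch loader, String filterField) {
        for (String key : loader.getPartitionKeys()) {
            if (key.equals(filterField)) {
                return true;
            }
        }
        return false;
    }
}
```

With this layout Schema describes only the fields, and the planner asks the metadata interface which of them are partition keys.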
bq. The problem with key/value set ups like this is it can be hard for people
to understand what is already there. So they end up not using what already
exists, or worse, re-inventing the wheel. My hope is that by versioning this we
can get around the need for this key/value stuff.
Hm, I see your point. I am interested in being able to augment the set of
available statistics without requiring changes to the base classes, however. I
guess that's where inheritance comes in handy. Any comments on how to handle
missing data? Primitive types still don't work for that.
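
To make the missing-data concern concrete: a primitive long cannot distinguish "0 records" from "unknown". One possible approach, sketched below with hypothetical names, is to hold nullable boxed values internally and expose them as OptionalLong. This assumes a Java version that has java.util.OptionalLong; a nullable Long getter would be the equivalent on older JVMs.

```java
import java.util.OptionalLong;

// Hypothetical statistics holder, not an existing Pig class.
final class StatsSketch {
    private final Long numRecords;  // null means "not known"
    private final Long sizeInBytes; // null means "not known"

    StatsSketch(Long numRecords, Long sizeInBytes) {
        this.numRecords = numRecords;
        this.sizeInBytes = sizeInBytes;
    }

    OptionalLong numRecords() {
        return numRecords == null ? OptionalLong.empty() : OptionalLong.of(numRecords);
    }

    OptionalLong sizeInBytes() {
        return sizeInBytes == null ? OptionalLong.empty() : OptionalLong.of(sizeInBytes);
    }
}
```

A subclass could add further statistics in the same style without changing the base class.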
bq. So what happens tomorrow when some loaders can do merge joins on sorted
data? Now we have to have another interface. I want this to be easily
I must not be clear on what pushing down to a loader does. My interpretation
was that it allows pushing down operations to the point where you don't read
unnecessary data off disk. A classic example of filter pushdown would be
filtering by a partition key (say, dt > sysdate-30, when our data is stored in
files one per day). An example of projection pushdown is when we have a column
store that simply avoids loading some of the columns.
I don't see how a loader can push down a join. That seems to require reading
and changing data. Is the idea that such a join can be performed without an MR
step? That seems like a Pig thing, not a loader thing.
In any case, yes, I think something like this would require a new interface in
the same namespace, since it's a drastically different capability.
Any thoughts on the advisability of simplifying projection pushdown to just work on
an int array? I know it's limiting, but it's going to be a heck of a lot easier
for users to implement.
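
A minimal sketch of the int-array variant, with hypothetical interface and class names. The limitation it illustrates is that flat positions can only name top-level columns.

```java
// Hypothetical interface: the loader is told which column positions are needed.
interface ProjectionPushdownSketch {
    /** Zero-based positions, in the loader's own schema, of the columns to load. */
    void setRequiredColumns(int[] columnIndexes);
}

// Toy loader that projects an in-memory row; a real loader would skip the
// unwanted columns while reading instead.
final class ArrayBackedLoaderSketch implements ProjectionPushdownSketch {
    private int[] required = null; // null means "load every column"

    public void setRequiredColumns(int[] columnIndexes) {
        this.required = columnIndexes.clone();
    }

    Object[] project(Object[] fullRow) {
        if (required == null) {
            return fullRow;
        }
        Object[] out = new Object[required.length];
        for (int i = 0; i < required.length; i++) {
            out[i] = fullRow[required[i]];
        }
        return out;
    }
}
```

An implementer only has to store the array and honour it when producing tuples.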
bq. I'm assuming that a given StoreFunc is tied to a particular metadata
instance, so it would return its implementation of StoreMetadata.
I was assuming that Pig would have a preferred metadata store (such as Owl),
and it would attempt to use it unless instructed otherwise. We could even try
some cascading thing: if the user specifies a metadata store on the command
line, use that; if not, see whether the loader suggests one; if not, use Owl;
if Owl doesn't have anything, see if it's a file in a known scheme (hdfs,
file, s3n...) and at least get some file-level metadata such as create date and
size. StoreMetadata can do the same (except for the hdfs part).
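
The cascade above amounts to trying sources in priority order and taking the first one that answers. A rough sketch follows; the interface, the class, and the use of a plain String as the "metadata" are all hypothetical simplifications.

```java
// Hypothetical abstraction over "somewhere metadata might come from".
interface MetadataSourceSketch {
    /** Returns whatever metadata this source has for the location, or null if none. */
    String lookup(String location);
}

final class MetadataCascadeSketch {
    /**
     * Tries each source in the order given (for example: command-line store,
     * loader-suggested store, Owl, file-level fallback) and returns the first hit.
     */
    static String resolve(String location, MetadataSourceSketch... sourcesInPriorityOrder) {
        for (MetadataSourceSketch source : sourcesInPriorityOrder) {
            if (source == null) {
                continue; // e.g. no store was named on the command line
            }
            String found = source.lookup(location);
            if (found != null) {
                return found;
            }
        }
        return null; // nobody knows anything about this location
    }
}
```

Keeping the order in the caller means a user-specified store always wins, and the file-level fallback is consulted only when everything else comes up empty.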
I'll take another look at PIG-967.
> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details