[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

Dmitriy V. Ryaboy (JIRA) Mon, 21 Sep 2009 14:49:40 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758037#action_12758037
 ]


Dmitriy V. Ryaboy commented on PIG-966:
---------------------------------------

Hi Alan, 
Responses to responses:

bq. Perhaps it's best to leave this as strings but look for a scheme at the 
beginning and interpret it as a URI if it has one (which is what Pig does now).

I understand the motivation more clearly now, thanks for the explanation. 
Agreed with the quoted approach.
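
For illustration, the quoted approach could look roughly like the sketch below. 
The class and method names (LocationParser, asUriOrNull) are made up for this 
example and are not Pig's actual code. The location stays a plain string unless 
it starts with something that looks like a URI scheme.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: decides whether a location string
// should be treated as a URI or left alone as a plain string.
final class LocationParser {
    // RFC 3986 scheme syntax: a letter, then letters, digits, '+', '-' or '.',
    // terminated by ':'.
    private static final Pattern SCHEME =
            Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    /** Returns the parsed URI if the location has a scheme, otherwise null. */
    static URI asUriOrNull(String location) {
        if (!SCHEME.matcher(location).find()) {
            return null; // no scheme: keep treating it as a plain string
        }
        try {
            return new URI(location);
        } catch (URISyntaxException e) {
            // Has a scheme-like prefix but is not a valid URI (for example a
            // glob with braces); this sketch falls back to the plain string.
            return null;
        }
    }
}
```

With this sketch, "hdfs://namenode/data/logs" is parsed as a URI with scheme 
"hdfs", while "data/logs" is left as a string.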

bq. [regarding single schema for partitioned datasets] Agreed, that is what I 
was trying to say. Perhaps it wasn't clear.

Nope, it was clear, I just have a very verbose way of saying "yes".

Regarding merging the Schemas you said:
bq. No. It serves a different purpose, which is to define the content of data 
flows inside the logical plan. We should not tie these two together.

I don't really understand the difference, but I defer to your superior 
knowledge of the codebase and accept your decision :-).

bq. I'm not inclined to bend my programming style to match that of whoever 
wrote findbugs.

+9.3 from the Russian judge. Gleefully accepted.

bq. We need partition keys as part of this interface, as Pig will need to be 
able to pass partition keys to loaders that are capable of doing partition 
pruning. So we could add getPartitionKeys to the LoadMetadata interface.

That's precisely what I am suggesting -- take it out of Schema, put it in 
LoadMetadata (or MetadataReader, as I like to call it....).
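
A minimal sketch of what that could look like. The interface name, the method 
signature, and the example implementation are all assumptions made for 
illustration; the actual signature is whatever the proposal settles on.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical shape of the metadata interface: partition keys are exposed
// here instead of being carried inside the schema object.
interface LoadMetadataSketch {
    /** Names of the columns the data at this location is partitioned by. */
    List<String> getPartitionKeys(String location);
}

// Example implementation for data stored in one directory per day.
final class DailyLogMetadata implements LoadMetadataSketch {
    @Override
    public List<String> getPartitionKeys(String location) {
        return Arrays.asList("dt");
    }
}
```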

bq. The problem with key/value set ups like this is it can be hard for people 
to understand what is already there. So they end up not using what already 
exists, or worse, re-inventing the wheel. My hope is that by versioning this we 
can get around the need for this key/value stuff.

Hm, I see your point. I am interested in being able to augment the set of 
available statistics without requiring changes to the base classes, however. I 
guess that's where inheritance comes in handy.  Any comments on how to handle 
missing data? Primitive types still don't work for that.
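
One possible answer to both points, sketched under assumptions: use boxed types 
so that null can mean "not known", and let a loader add statistics by 
subclassing. The class and field names below are invented for the example and 
are not from the proposal.

```java
// Hypothetical base class: boxed Long fields let null mean "not known",
// which a primitive long cannot express.
class BaseStatistics {
    private final Long recordCount;
    private final Long sizeInBytes;

    BaseStatistics(Long recordCount, Long sizeInBytes) {
        this.recordCount = recordCount;
        this.sizeInBytes = sizeInBytes;
    }

    /** Number of records, or null if the source does not know. */
    Long getRecordCount() { return recordCount; }

    /** Size in bytes, or null if the source does not know. */
    Long getSizeInBytes() { return sizeInBytes; }
}

// A loader that knows more can extend the base class without changing it.
final class ColumnStatistics extends BaseStatistics {
    private final Long distinctValues;

    ColumnStatistics(Long recordCount, Long sizeInBytes, Long distinctValues) {
        super(recordCount, sizeInBytes);
        this.distinctValues = distinctValues;
    }

    /** Number of distinct values, or null if not known. */
    Long getDistinctValues() { return distinctValues; }
}
```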

bq. So what happens tomorrow when some loaders can do merge joins on sorted 
data? Now we have to have another interface. I want this to be easily 
extensible.

I must not be clear on what pushing down to a loader does.  My interpretation 
was that it allows pushing down operations to the point where you don't read 
unnecessary data off disk.  A classic example of filter pushdown would be 
filtering by a partition key (so, dt > sysdate-30, and our data is stored in 
files one per day). An example of projection pushdown is when we have a column 
store that simply avoids loading some of the columns.
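
The partition-key case can be sketched as follows. This assumes one file per 
day named like "dt=2009-09-21"; the naming convention and the PartitionPruning 
class are hypothetical. The point is that files outside the date range are 
dropped before anything is read from them.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of filter pushdown on a partition key.
final class PartitionPruning {
    /**
     * Keeps only the daily files whose date is strictly after (today - days),
     * i.e. the files that can satisfy "dt > sysdate - days".
     */
    static List<String> prune(List<String> dailyFiles, LocalDate today, int days) {
        LocalDate cutoff = today.minusDays(days);
        List<String> kept = new ArrayList<>();
        for (String file : dailyFiles) {
            // Assumed naming: "dt=" followed by an ISO date (yyyy-MM-dd).
            LocalDate dt = LocalDate.parse(file.substring(file.indexOf('=') + 1));
            if (dt.isAfter(cutoff)) {
                kept.add(file);
            }
        }
        return kept;
    }
}
```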

I don't see how a loader can push down a join. That seems to require reading 
and changing data. Is the idea that such a join can be performed without an MR 
step? That seems like a Pig thing, not a loader thing.

In any case, yes, I think something like this would require a new interface in 
the same namespace, since it's a drastically different capability.

Any thoughts on the advisability of simplifying projection pushdown to just 
work on an int array? I know it's limiting, but it's going to be a heck of a 
lot easier for users to implement.
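
To show how little a loader would have to implement under that simplification, 
here is a sketch. ProjectingLoader, pushProjection, and parse are hypothetical 
names, and a tab-delimited text format is assumed.

```java
// Hypothetical loader that accepts a projection as an array of column indexes.
final class ProjectingLoader {
    private int[] requiredColumns; // null means "return all columns"

    /** Called once with the zero-based indexes of the columns the script uses. */
    void pushProjection(int[] columns) {
        this.requiredColumns = columns.clone();
    }

    /** Splits a tab-delimited line and returns only the requested columns. */
    String[] parse(String line) {
        // A text loader still reads the whole line; a column store could
        // skip the unrequested columns on disk instead.
        String[] all = line.split("\t", -1);
        if (requiredColumns == null) {
            return all;
        }
        String[] projected = new String[requiredColumns.length];
        for (int i = 0; i < requiredColumns.length; i++) {
            int col = requiredColumns[i];
            // Short rows yield null for missing columns.
            projected[i] = col < all.length ? all[col] : null;
        }
        return projected;
    }
}
```

The limitation is visible here too: an int array can only address top-level 
columns, not fields nested inside a tuple or map.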

bq. I'm assuming that a given StoreFunc is tied to a particular metadata 
instance, so it would return its implementation of StoreMetadata.

I was assuming that Pig would have a preferred metadata store (such as Owl), 
and it would attempt to use it unless instructed otherwise. We could even try 
some cascading thing: if the user specifies a metadata store on the command 
line, use that; if not, see whether the loader suggests one; if not, use Owl; 
if Owl doesn't have anything, see if it's a file in a known scheme (hdfs, 
file, s3n...) and at least get some file-level metadata such as create date and 
size.  StoreMetadata can do the same (except for the hdfs part).
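
The cascade could be sketched roughly like this. Everything here is an 
assumption for illustration: the class name, the string identifiers for the 
stores, and the boolean standing in for an actual Owl lookup.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical resolver for which metadata source to use for a location.
final class MetadataSourceResolver {
    private static final List<String> KNOWN_SCHEMES =
            Arrays.asList("hdfs", "file", "s3n");

    /**
     * Order of preference: command line, loader suggestion, Owl, then
     * file-level metadata for known schemes. Returns null if nothing applies.
     */
    static String resolve(String fromCommandLine, String fromLoader,
                          boolean owlHasEntry, String location) {
        if (fromCommandLine != null) {
            return fromCommandLine;
        }
        if (fromLoader != null) {
            return fromLoader;
        }
        if (owlHasEntry) {
            return "owl";
        }
        int sep = location.indexOf("://");
        if (sep > 0 && KNOWN_SCHEMES.contains(location.substring(0, sep))) {
            // Only file-level metadata (create date, size) is available here.
            return "filesystem";
        }
        return null;
    }
}
```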

I'll take another look at PIG-967.

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
>                 Key: PIG-966
>                 URL: https://issues.apache.org/jira/browse/PIG-966
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
> significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
> full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
