[
https://issues.apache.org/jira/browse/PIG-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575522#action_12575522
]
Stefan Groschupf commented on PIG-111:
--------------------------------------
Hi Alan,
unfortunately JIRA is not available, so I have to respond by mail. I will
attach the text to the issue later for reference.
<Alan>
You modified HExecutionEngine and LocalExecutionEngine to not have
updateConfiguration and getConfiguration, but you didn't change the abstract
class ExecutionEngine in the same way. Did you intend to remove it from
ExecutionEngine as well? Also, why did you want to remove these features? I
think it's fine to not implement them for now (as they weren't in Local), but
if we want to remove them we need a good reason.
</Alan>
Sorry, I messed this up at the last second during patch generation and didn't
notice it. I have rolled it back in, since it looks like you don't want to
remove it (see the next patch version). However, I had planned to remove it.
Here are my thoughts:
I personally think an interface, especially at such an early stage of a
project, should be driven by the implementations, not by what we might or
might not need some day in the distant future. Having those methods in the
interface without implementing them in all our implementations seems wrong: we
carry around code we do not need but have to maintain, and users get a wrong
impression - e.g. that updating the configuration at runtime is possible.
Hadoop does not allow changing the configuration at runtime (or only in a very
limited way); we can only reconnect to a different Hadoop cluster or the like.
The getConfiguration method does not make sense in the ExecutionEngine, since
the PigContext holds all configuration properties - that is why it is a
context. Having such a method in the ExecutionEngine again gives a wrong
impression to a user reading the Javadoc: it suggests that an ExecutionEngine
can have a configuration different from the context's, but that is not the
case.
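To make that concrete, here is a minimal sketch of the direction I have in
mind (class and method names here are illustrative assumptions, not the actual
patch): the engine receives the context at construction time and reads what it
needs from there, without exposing a configuration of its own.

    import java.util.Properties;

    // illustrative sketch only - not the real Pig classes
    class PigContext {
        private final Properties conf = new Properties();
        Properties getConf() { return conf; }
    }

    class HExecutionEngine {
        private final PigContext pigContext;

        HExecutionEngine(PigContext pigContext) {
            this.pigContext = pigContext;
        }

        void init() {
            // the engine reads from the context; it offers no
            // getConfiguration/updateConfiguration of its own
            String jobTracker = pigContext.getConf()
                    .getProperty("mapred.job.tracker", "local");
            // ... connect to the job tracker here ...
        }
    }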
<Alan>
Another couple of questions. My expectation in looking at the patch is that by
setting exectype in the config file, Main should then respond by starting
PigServer with that exectype. But that doesn't happen.
</Alan>
Sorry, that is a bug; it is fixed in my next patch version.
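For reference, what I understand by Main honoring exectype is roughly the
following sketch (it assumes the properties are loaded from a placeholder file
name, and that PigServer keeps its exectype-string constructor):

    import java.io.FileInputStream;
    import java.util.Properties;
    import org.apache.pig.PigServer;

    public class Main {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // placeholder path - wherever our config file ends up living
            props.load(new FileInputStream("pig.properties"));
            // fall back to the current default if exectype is not set
            String execType = props.getProperty("exectype", "mapreduce");
            PigServer pig = new PigServer(execType);
            // ... hand control to Grunt or execute the script ...
        }
    }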
<Alan>
Similarly if I set the cluster value to the name and port of a job tracker, I
was expecting it to attach to that cluster.
</Alan>
Also fixed in the next patch version - thanks for catching this.
<Alan>
You changed the default mode from mapreduce to local. We shouldn't make
interface changes like that without discussion.
</Alan>
I'm sorry, you are right, we should discuss this first; I wasn't aware that
the default mode is map reduce, which makes no sense from my point of view. I
think the user experience should be: download, unzip, start writing Pig Latin
- out of the box, with no extra configuration and no additional installation.
Therefore I think either local execution mode, or map reduce using the Hadoop
local job client, should be the default.
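With local as the default, the embedded case would be as simple as this
sketch (assuming PigServer keeps accepting an exectype string and its current
registerQuery/store methods; file names are placeholders):

    import org.apache.pig.PigServer;

    public class OutOfTheBox {
        public static void main(String[] args) throws Exception {
            // no cluster, no config files - local mode out of the box
            PigServer pig = new PigServer("local");
            pig.registerQuery("A = LOAD 'input.txt' AS (line);");
            pig.store("A", "output");
        }
    }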
<Alan>
Also, I was pondering how to include hadoop specific data here. Right now pig
attaches to a cluster by reading a hadoop config file. Obviously we don't want
this in our general config file. But maybe the file to be referenced should be
referred to in this config file. Or maybe it's ok to just pick the
hadoop-site.xml up off the classpath, as is done now. The modification would
then be that we only do it if we're in mapreduce mode. Thoughts on this?
</Alan>
What exactly do we need in our configuration that is Hadoop-specific? The
namenode and the jobtracker. Everything else should be irrelevant for us and
will be overwritten in the job configuration when we submit a job - or do I
misunderstand something here?
These two properties can easily live in our properties file, and I would
actually prefer to call the property mapred.job.tracker rather than cluster,
since that again is easier for the user to understand. For different execution
engines we need different sets of parameters anyway.
In case we need the Hadoop configuration files on the classpath, I would
suggest making this optional and extending the shell script with a new
parameter like -hadoopHome=/myHadoopFolder.
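To illustrate, the Hadoop-specific part of our properties file could then look
like this (host names and ports are placeholders; fs.default.name and
mapred.job.tracker are the standard Hadoop property names):

    # placeholder hosts/ports - illustration only
    exectype=mapreduce
    fs.default.name=hdfs://namenode.example.com:9000
    mapred.job.tracker=jobtracker.example.com:9001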
Thanks for taking the time to look into this patch!
Stefan
> Configuration of Pig
> --------------------
>
> Key: PIG-111
> URL: https://issues.apache.org/jira/browse/PIG-111
> Project: Pig
> Issue Type: Improvement
> Reporter: Craig Macdonald
> Assignee: Stefan Groschupf
> Attachments: after.png, before.png, config.patch.1502,
> PIG-111-v04.patch, PIG-111-v05.patch, PIG-111-v06.patch,
> PIG-111_v_3_sg.patch, PIG-111_v_7_r633244M.patch, PIG-93-v01.patch,
> PIG-93-v02.patch
>
>
> This JIRA discusses issues relating to the configuration of Pig.
> Uses cases:
>
> 1. I want to configure Pig programmatically from Java
> Motivation: Pig can be embedded in another Java program, and the
> configuration should be settable by the client code
> 2. I want to configure Pig from the command line
> 3. I want to configure Pig from the Pig shell (Grunt)
> 4. I want Pig to remember my configuration for every Pig session
> Motivation: to save me typing in some configuration stuff every time.
> 5. I want Pig to remember my configuration for this script.
> Motivation: I must use a common configuration for 50% of my Pig scripts -
> can I share this configuration between scripts?
> Current Status:
> * Pig uses System properties for some configuration
> * A configuration properties object in PigContext is not used.
> * pigrc can contain properties
> * Configuration properties cannot be set from Grunt
> Proposed solutions to use cases:
> 1. Configuration should be set in PigContext, and accessible from client code.
> 2. System properties are copied to PigContext, or can be specified on the
> command line (duplication with System properties)
> 3. Allow configuration properties to be set using the "set" command in Grunt
> 4. Pigrc can contain properties. Is this enough, or can other configuration
> stuff be set, e.g. aliases, imports, etc.?
> 5. Add an include directive to pig, to allow a shared configuration/Pig
> script to be included.
> Connections to Shell scripting:
> * The source command in Bash allows another bash script file to be included
> - this allows shared variables to be set in one file shared between a set of
> scripts.
> * Aliases can be set, according to user preferences, etc.
> * All this can be done in your .bashrc file
> Issues:
> * What happens when you change a property after the property has been read?
> * Can Grunt read a pigrc containing various statements etc before the
> PigServer is completely configured?