[jira] Commented: (HIVE-549) Parallel Execution Mechanism

Zheng Shao (JIRA) Tue, 27 Oct 2009 23:20:26 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770805#action_12770805
 ]


Zheng Shao commented on HIVE-549:
---------------------------------

Hive-549.patch: Overall the structure of the changes looks good to me.

Utilities.java:
* gWorkContainer: this variable needs to be heavily commented - why do we need 
it, why it's ThreadLocal, any alternatives (and optionally, why this is better 
than alternatives).

* Where do we serialize data to "HIVE_PLAN"? Is that thread-safe?
           InputStream in = new FileInputStream("HIVE_PLAN");
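
For reference, a minimal sketch of the per-thread isolation that a ThreadLocal
provides, which is presumably why gWorkContainer uses one. WorkContainerSketch,
setPlan, getPlan and planSeenByNewThread are hypothetical names made up for
this illustration; they are not taken from the patch.

```java
// Hypothetical analogue of gWorkContainer: every thread gets its own slot.
class WorkContainerSketch {
    private static final ThreadLocal<String> currentPlan = new ThreadLocal<>();

    static void setPlan(String plan) {
        currentPlan.set(plan);
    }

    static String getPlan() {
        return currentPlan.get();
    }

    // Sets and reads the plan on a fresh thread, then returns what that thread saw.
    static String planSeenByNewThread(String plan) {
        final String[] seen = new String[1];
        Thread worker = new Thread(() -> {
            setPlan(plan);
            seen[0] = getPlan();
        });
        worker.start();
        try {
            worker.join(); // join() makes the worker's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        setPlan("plan-main");
        System.out.println("worker saw: " + planSeenByNewThread("plan-worker"));
        // The worker's set() did not touch the value held by the main thread.
        System.out.println("main still sees: " + getPlan());
    }
}
```

A plain static field would be shared by all task threads, so concurrent tasks
could overwrite each other's value. A file with a fixed name such as
"HIVE_PLAN" is shared in the same way, hence the thread-safety question above.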

HiveInputFormat.java:
* Unnecessary white-space changes only.

Driver.java:
* It's a good idea to assign the job numbers statically - if the jobNo of the 
same job can be different in different runs, it will be harder to debug.

* Shall we immediately stop all other running jobs if one of them has failed?
+          console.printError(errorMessage);
+          taskCleanup(runnable);

* I guess you mean "if the child has already started, or is NOT runnable"
+          // Check if the child has already started, or is runnable
+          if(checkLaunch(child)) { 

* There are 2 places where we do "curJobNo++" in this function:
+  public int launchTask(Task<? extends Serializable> tsk, String queryId,

* Do we need to get a copy of "conf" object since we are modifying it in 
launchTask?
+    tsk.initialize(conf, plan);
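
To illustrate the concern about "conf", here is a rough sketch of the
defensive-copy idea. It uses java.util.Properties as a stand-in for the
configuration object so that it is self-contained; ConfCopySketch and the
"sketch.job.name" key are hypothetical and not part of the patch.

```java
import java.util.Properties;

class ConfCopySketch {
    // Stand-in for launchTask: copies the shared settings before modifying them,
    // so tasks launched in parallel do not see each other's per-task changes.
    static Properties launchTask(Properties sharedConf, String jobName) {
        Properties taskConf = new Properties();
        taskConf.putAll(sharedConf);                      // private copy for this task
        taskConf.setProperty("sketch.job.name", jobName); // hypothetical per-task setting
        return taskConf;
    }

    public static void main(String[] args) {
        Properties shared = new Properties();
        shared.setProperty("sketch.common", "value");
        Properties first = launchTask(shared, "job-1");
        Properties second = launchTask(shared, "job-2");
        System.out.println(first.getProperty("sketch.job.name"));
        System.out.println(second.getProperty("sketch.job.name"));
        System.out.println(shared.getProperty("sketch.job.name")); // shared object untouched
    }
}
```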


Can we add an option for the user to choose "sequential" or "parallel" 
execution?  The change could be simple - we just need to check the option in 
launchTask to decide whether we should call TaskRunner.start() or 
TaskRunner.run(). Please add the new option to HiveConf class, and 
conf/hive-default.xml.
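
A rough sketch of the switch I have in mind, assuming TaskRunner extends
Thread. SketchTaskRunner and LaunchSketch are hypothetical stand-ins, not the
classes in the patch, and the boolean "parallel" stands for the value of the
new option, whose name is still to be decided.

```java
// Hypothetical stand-in for the patch's TaskRunner; not the real Hive class.
class SketchTaskRunner extends Thread {
    private final Runnable work;

    SketchTaskRunner(Runnable work) {
        this.work = work;
    }

    @Override
    public void run() {
        work.run();
    }
}

class LaunchSketch {
    // "parallel" would be read from the new configuration option.
    static SketchTaskRunner launch(Runnable work, boolean parallel) {
        SketchTaskRunner runner = new SketchTaskRunner(work);
        if (parallel) {
            runner.start(); // runs the work on a new thread and returns immediately
        } else {
            runner.run();   // runs the work on the calling thread and blocks until done
        }
        return runner;
    }

    // Waits for a runner; returns at once if the runner was never started.
    static void waitFor(SketchTaskRunner runner) {
        try {
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Runnable report = () ->
            System.out.println("running on " + Thread.currentThread().getName());
        waitFor(launch(report, false)); // sequential
        waitFor(launch(report, true));  // parallel
    }
}
```

With the option set to sequential, the caller behaves as it does today, which
also gives us an easy fallback if the parallel path turns out to have bugs.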



> Parallel Execution Mechanism
> ----------------------------
>
>                 Key: HIVE-549
>                 URL: https://issues.apache.org/jira/browse/HIVE-549
>             Project: Hadoop Hive
>          Issue Type: Wish
>          Components: Query Processor
>            Reporter: Adam Kramer
>            Assignee: Chaitanya Mishra
>         Attachments: Hive-549.patch
>
>
> In a massively parallel database system, it would be awesome to also 
> parallelize some of the mapreduce phases that our data needs to go through.
> One example that just occurred to me is UNION ALL: when you union two SELECT 
> statements, effectively you could run those statements in parallel. There's 
> no situation (that I can think of, but I don't have a formal proof) in which 
> the left statement would rely on the right statement, or vice versa. So, they 
> could be run at the same time...and perhaps they should be. Or, perhaps there 
> should be a way to make this happen...PARALLEL UNION ALL? PUNION ALL?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
