[jira] [Commented] (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-21 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009426#comment-13009426
 ] 

Joydeep Sen Sarma commented on HIVE-2051:
-

committed - thanks Siying

 getInputSummary() to call FileSystem.getContentSummary() in parallel
 

 Key: HIVE-2051
 URL: https://issues.apache.org/jira/browse/HIVE-2051
 Project: Hive
  Issue Type: Improvement
Reporter: Siying Dong
Assignee: Siying Dong
Priority: Minor
 Attachments: HIVE-2051.1.patch, HIVE-2051.2.patch, HIVE-2051.3.patch, 
 HIVE-2051.4.patch, HIVE-2051.5.patch


 getInputSummary() now calls FileSystem.getContentSummary() one by one, which 
 can be extremely slow when the number of input paths is huge. By calling 
 those functions in parallel, we can cut latency in most cases.
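
 For illustration only, a minimal sketch of the idea described above: submit 
 one lookup per input path to a fixed thread pool and add up the results. This 
 is not the committed patch. The class name, the method name and the sizeOf 
 parameter are hypothetical; sizeOf stands in for a slow per-path call such as 
 FileSystem.getContentSummary(), so the sketch runs without Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

class ParallelSummary {
  /**
   * Sums a per-path size over all paths, issuing the lookups in parallel.
   * sizeOf is a hypothetical stand-in for the slow per-path file system call.
   */
  static long totalLength(List<String> paths, ToLongFunction<String> sizeOf,
                          int numThreads) {
    ExecutorService executor = Executors.newFixedThreadPool(numThreads);
    try {
      // Submit every lookup first so that they can overlap.
      List<Future<Long>> results = new ArrayList<>();
      for (String path : paths) {
        Callable<Long> task = () -> sizeOf.applyAsLong(path);
        results.add(executor.submit(task));
      }
      // Then collect; each get() blocks until that lookup has finished.
      long total = 0;
      for (Future<Long> result : results) {
        total += result.get();
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt visible to callers
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("lookup failed", e.getCause());
    } finally {
      executor.shutdown();
    }
  }
}
```

 With a sizeOf that blocks on I/O, the total wait is bounded by the slowest 
 batch of lookups rather than by the sum of all of them.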

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-20 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008903#comment-13008903
 ] 

Joydeep Sen Sarma commented on HIVE-2051:
-

based on: http://www.ibm.com/developerworks/java/library/j-jtp05236.html

it seems that the right thing to do here is to catch the InterruptedException 
and then call Thread.currentThread().interrupt() (grep for 'swallow interrupt' in 
this article).

we could also rethrow it - but then the problem will merely be punted to the 
higher layer (which will probably ignore it as well)
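
a minimal sketch of the pattern described above, for illustration: catch 
InterruptedException and set the interrupt flag again instead of swallowing 
it. the class and method names are hypothetical and this is not code from the 
patch.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

class RestoreInterrupt {
  /**
   * Waits for a result. If the waiting thread is interrupted, the interrupt
   * flag is restored so that code further up the stack can still see it,
   * and the fallback value is returned.
   */
  static <T> T getOrFallback(Future<T> future, T fallback) {
    try {
      return future.get();
    } catch (InterruptedException e) {
      // Catching the exception cleared the flag; set it again.
      Thread.currentThread().interrupt();
      return fallback;
    } catch (ExecutionException e) {
      throw new IllegalStateException("task failed", e.getCause());
    }
  }

  /** Returns true if the interrupt flag survived the catch block. */
  static boolean demo() {
    // This task is never run, so get() can only end by being interrupted.
    Future<String> never = new FutureTask<String>(() -> "done");
    Thread.currentThread().interrupt(); // simulate an interrupt request
    String value = getOrFallback(never, "fallback");
    boolean flagKept = Thread.interrupted(); // reads and clears the flag
    return flagKept && "fallback".equals(value);
  }
}
```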



[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-18 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008654#comment-13008654
 ] 

Siying Dong commented on HIVE-2051:
---

Joydeep, I'm nervous about putting a synchronized object in the context and 
having it available everywhere. I'll make this static method synchronized, so 
that there are no parallel calls to it.

I'm not sure whether I understand correctly, but it seems that 
ExecutionException indicates that the waiting thread gets the signal, instead 
of the thread being waited on. It sounds more like someone wants to kill the 
process. What can we do if we neither ignore nor throw ExecutionException? We 
only have 3 realistic choices: always throw, always ignore, or continue to wait 
as a retry. To me, always throwing sounds like the better idea: when we catch an 
exception that we don't know how to handle, throwing it sounds like the safest 
way to go. What's your suggestion for handling it?

I'll remove the awaitTermination().
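
For reference, in java.util.concurrent Future.get() throws ExecutionException 
when the task itself threw, and InterruptedException when the thread waiting 
in get() is interrupted. The following is a minimal sketch, under assumed 
names, of a synchronized static method that rethrows a task failure instead 
of ignoring it. RethrowFailure and runAndGet are hypothetical and are not 
taken from the patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RethrowFailure {
  /**
   * Static and synchronized: the lock is on the class, so at most one
   * caller executes this method at a time.
   */
  static synchronized long runAndGet(Callable<Long> task) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Long> result = executor.submit(task);
      return result.get();
    } catch (ExecutionException e) {
      // The task threw; propagate its cause rather than swallowing it.
      throw new IllegalStateException("task failed", e.getCause());
    } catch (InterruptedException e) {
      // The waiting thread was interrupted; restore the flag and propagate.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      executor.shutdown();
    }
  }

  /** Returns true if a failure inside the task reaches the caller. */
  static boolean failurePropagates() {
    try {
      runAndGet(() -> { throw new IllegalArgumentException("boom"); });
      return false;
    } catch (IllegalStateException e) {
      return e.getCause() instanceof IllegalArgumentException;
    }
  }
}
```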



[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-18 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008657#comment-13008657
 ] 

Siying Dong commented on HIVE-2051:
---

Joydeep, sorry you were talking about ExecutionException about 
InterruptedException. In that case, I'll just rethrow it.



[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-18 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008676#comment-13008676
 ] 

Siying Dong commented on HIVE-2051:
---

I still feel that it's too dangerous to ignore InterruptedException: 
http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.html#interrupt().
It sounds like a command to shut down the thread smoothly. In that case, we 
should have a special reason if we don't follow the command.



[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-17 Thread MIS (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008328#comment-13008328
 ] 

MIS commented on HIVE-2051:
---

Yes, it is necessary to shut down the executor once jobs have been 
submitted to it, even though the submitted jobs may already have completed. 

However, what we need not do here is wait for termination after the executor 
has been shut down; that is redundant, because all the jobs submitted to the 
executor will have completed by the time we shut it down. 
This is what calling result.get() ensures,
i.e., the following piece of code is not required.
+  do {
+    try {
+      executor.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS);
+      executorDone = true;
+    } catch (InterruptedException e) {
+    }
+  } while (!executorDone);
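
A minimal sketch of the argument above, with hypothetical names: once get() 
has returned for every submitted Future, each task has already completed, so 
a plain shutdown() leaves no task to wait for. The sketch checks task 
completion only; it makes no claim about when the pool's worker threads exit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ShutdownAfterGet {
  /** Returns true if every task had completed before shutdown() was called. */
  static boolean allDoneBeforeShutdown(int numTasks) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (int i = 0; i < numTasks; i++) {
        final int n = i;
        Callable<Integer> task = () -> n * n;
        results.add(executor.submit(task));
      }
      // get() returns only after the corresponding task has finished.
      for (Future<Integer> result : results) {
        result.get();
      }
      boolean allDone = true;
      for (Future<Integer> result : results) {
        allDone &= result.isDone();
      }
      return allDone;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("task failed", e.getCause());
    } finally {
      // No awaitTermination() loop: on the normal path every task is done.
      executor.shutdown();
    }
  }
}
```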



[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-17 Thread MIS (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008331#comment-13008331
 ] 

MIS commented on HIVE-2051:
---

The solution to this issue resembles that of HIVE-2026, so we can follow a 
similar approach.



[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-15 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007293#comment-13007293
 ] 

Siying Dong commented on HIVE-2051:
---

@Carl?


[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-15 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007323#comment-13007323
 ] 

Joydeep Sen Sarma commented on HIVE-2051:
-

looked at the latest patch from Carl. don't get it - why should we pay the cost of 
creating a thread when one is not required? 
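
one possible way to avoid that cost, sketched under assumptions: do the work 
on the calling thread when there is at most one path or one allowed thread, 
and only create a pool otherwise. the names are hypothetical, sizeOf stands 
in for the per-path file system call, and the thresholds are illustrative 
rather than what any version of the patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

class InlineWhenSingle {
  static long totalLength(List<String> paths, ToLongFunction<String> sizeOf,
                          int numThreads) {
    if (numThreads <= 1 || paths.size() <= 1) {
      // Nothing to overlap: run on the calling thread, no pool is created.
      long total = 0;
      for (String path : paths) {
        total += sizeOf.applyAsLong(path);
      }
      return total;
    }
    int poolSize = Math.min(numThreads, paths.size());
    ExecutorService executor = Executors.newFixedThreadPool(poolSize);
    try {
      List<Future<Long>> results = new ArrayList<>();
      for (String path : paths) {
        Callable<Long> task = () -> sizeOf.applyAsLong(path);
        results.add(executor.submit(task));
      }
      long total = 0;
      for (Future<Long> result : results) {
        total += result.get();
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("lookup failed", e.getCause());
    } finally {
      executor.shutdown();
    }
  }

  /** Returns true if a single-path lookup runs on the calling thread. */
  static boolean singlePathRunsOnCaller() {
    Thread caller = Thread.currentThread();
    boolean[] sameThread = {false};
    totalLength(Arrays.asList("/a"), path -> {
      sameThread[0] = Thread.currentThread() == caller;
      return path.length();
    }, 4);
    return sameThread[0];
  }
}
```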


[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-15 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007327#comment-13007327
 ] 

Carl Steinbach commented on HIVE-2051:
--

Just to be clear, I updated the reviewboard ticket with the latest version of 
Siying's patch. Also, the comments on reviewboard are from M IS, not me.


[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel

2011-03-11 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005883#comment-13005883
 ] 

Carl Steinbach commented on HIVE-2051:
--

Review request: https://reviews.apache.org/r/491/

