[jira] [Commented] (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009426#comment-13009426 ]

Joydeep Sen Sarma commented on HIVE-2051:

Committed. Thanks, Siying.

getInputSummary() to call FileSystem.getContentSummary() in parallel

Key: HIVE-2051
URL: https://issues.apache.org/jira/browse/HIVE-2051
Project: Hive
Issue Type: Improvement
Reporter: Siying Dong
Assignee: Siying Dong
Priority: Minor
Attachments: HIVE-2051.1.patch, HIVE-2051.2.patch, HIVE-2051.3.patch, HIVE-2051.4.patch, HIVE-2051.5.patch

getInputSummary() currently calls FileSystem.getContentSummary() for the input paths one by one, which can be extremely slow when the number of input paths is huge. By making those calls in parallel, we can cut latency in most cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
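The idea in the issue description can be illustrated with a minimal sketch. This is not the committed patch: the class name ParallelSummarySketch, the helper totalLength, and the sizeOf function are hypothetical, and sizeOf merely stands in for a slow per-path call such as FileSystem.getContentSummary(). Each path is submitted to a fixed-size thread pool and the results are summed as they are collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

class ParallelSummarySketch {

    // Hypothetical helper: sums a per-path size computed by sizeOf, which is an
    // assumed stand-in for a slow call such as FileSystem.getContentSummary().
    static long totalLength(List<String> paths, ToLongFunction<String> sizeOf, int threads)
            throws InterruptedException, ExecutionException {
        if (paths.isEmpty()) {
            return 0L; // nothing to do, so no thread pool is created
        }
        int poolSize = Math.max(1, Math.min(threads, paths.size()));
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit one task per path so the slow calls overlap.
            List<Future<Long>> results = new ArrayList<>();
            for (String path : paths) {
                Callable<Long> task = () -> sizeOf.applyAsLong(path);
                results.add(executor.submit(task));
            }
            // Collect the results; get() blocks until each task has finished.
            long total = 0L;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            executor.shutdown(); // release the worker threads
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up paths; the "size" here is just the string length.
        List<String> paths = Arrays.asList("/warehouse/t/p=1", "/warehouse/t/p=2");
        System.out.println("total = " + totalLength(paths, p -> p.length(), 2));
    }
}
```

The early return for an empty list is one way to avoid paying for a thread pool when none is needed.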
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008903#comment-13008903 ]

Joydeep Sen Sarma commented on HIVE-2051:

Based on http://www.ibm.com/developerworks/java/library/j-jtp05236.html, it seems that the right thing to do here is to catch the InterruptedException and then call Thread.currentThread().interrupt() (grep for 'swallow interrupt' in this article). We could also rethrow it, but then the problem will merely be punted to the higher layer (which will probably ignore it as well).
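The pattern described in this comment can be sketched as follows. The class InterruptRestoreSketch and the helper getOrFallback are hypothetical names, and returning a fallback value is only an assumption about what the caller might want; the point is that the catch block restores the interrupt flag instead of swallowing it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

class InterruptRestoreSketch {

    // Hypothetical helper: waits for a result. If the waiting thread is
    // interrupted, the interrupt flag is restored so that code higher up the
    // stack can still see it, and the assumed fallback value is returned.
    static long getOrFallback(Future<Long> result, long fallback) throws ExecutionException {
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // do not swallow the interrupt
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that is never run, so get() would have to wait.
        FutureTask<Long> neverRun = new FutureTask<>(() -> 1L);
        Thread.currentThread().interrupt(); // simulate an interrupt request
        long value = getOrFallback(neverRun, -1L);
        // Thread.interrupted() reports the flag and clears it.
        System.out.println("value = " + value + ", interrupted = " + Thread.interrupted());
    }
}
```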
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008654#comment-13008654 ]

Siying Dong commented on HIVE-2051:

Joydeep, I'm nervous about putting a synchronized object in the context and having it available everywhere. I'll make this static method synchronized so that there are no parallel calls to it.

I'm not sure whether I understand correctly, but it seems that ExecutionException indicates that the waiting thread gets the signal, rather than the thread being waited on. It sounds more like someone wants to kill the process. What can we do if we neither ignore nor throw ExecutionException? We only have three realistic choices: always throw, always ignore, or continue to wait as a retry. To me, always throwing sounds like the better idea: when we catch an exception that we don't know how to handle, throwing it is the safest way to go. What's your suggestion for handling it?

I'll remove the awaitTermination().
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008657#comment-13008657 ]

Siying Dong commented on HIVE-2051:

Joydeep, sorry, you were talking about ExecutionException, not InterruptedException. In that case, I'll just rethrow it.
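One way to "just rethrow" an ExecutionException is sketched below. The thread does not say which exception type the patch rethrows, so wrapping in IOException is purely an assumption for illustration, and RethrowSketch and getOrRethrow are hypothetical names. ExecutionException wraps whatever the task itself threw, so the sketch surfaces that cause to the caller.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

class RethrowSketch {

    // Hypothetical helper: surfaces a task failure to the caller instead of
    // ignoring it. Using IOException as the rethrown type is an assumption.
    static long getOrRethrow(Future<Long> result) throws IOException, InterruptedException {
        try {
            return result.get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause(); // the exception thrown inside the task
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException("summary task failed", cause);
        }
    }

    public static void main(String[] args) throws Exception {
        FutureTask<Long> failing = new FutureTask<>(() -> {
            throw new IOException("boom");
        });
        failing.run(); // run on this thread; the failure is stored in the future
        try {
            getOrRethrow(failing);
        } catch (IOException e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```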
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008676#comment-13008676 ]

Siying Dong commented on HIVE-2051:

I still feel that it's too dangerous to ignore InterruptedException: http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.html#interrupt(). It sounds like a command to shut down the thread gracefully. In that case, we should have a special reason if we don't follow the command.
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008328#comment-13008328 ]

MIS commented on HIVE-2051:

Yes, it is necessary to terminate the executor once jobs have been submitted to it, even though the submitted jobs may already have completed. However, after the executor is shut down, we need not wait for termination to finish, since this is redundant: all the jobs submitted to the executor will have completed by the time we shut it down. This is ensured when we call result.get(). In other words, the following piece of code is not required:

+ do {
+   try {
+     executor.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS);
+     executorDone = true;
+   } catch (InterruptedException e) {
+   }
+ } while (!executorDone);
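The argument in this comment can be checked with a small sketch. The class ShutdownAfterGetSketch, the helper runAll, and the pool size of 4 are hypothetical and not taken from the patch. Once get() has returned for every submitted task, all tasks have finished, so a plain shutdown() is enough and no awaitTermination() loop is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ShutdownAfterGetSketch {

    // Hypothetical helper: runs taskCount trivial tasks and returns how many
    // had completed at the moment the last get() returned.
    static int runAll(int taskCount) throws InterruptedException, ExecutionException {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(4); // assumed pool size
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                Callable<Integer> task = () -> completed.incrementAndGet();
                results.add(executor.submit(task));
            }
            // Each get() blocks until its task is done, so after this loop
            // every submitted task has completed.
            for (Future<Integer> result : results) {
                result.get();
            }
            return completed.get();
        } finally {
            // Still required so the worker threads exit, but there is nothing
            // left to wait for, so awaitTermination() is not called.
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed before shutdown: " + runAll(10));
    }
}
```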
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008331#comment-13008331 ]

MIS commented on HIVE-2051:

The solution to this issue resembles that of HIVE-2026, so we can follow a similar approach.
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007293#comment-13007293 ]

Siying Dong commented on HIVE-2051:

@Carl?
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007323#comment-13007323 ]

Joydeep Sen Sarma commented on HIVE-2051:

Looked at the latest patch from Carl. I don't get it: why should we pay the cost of creating a thread when one is not required?
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007327#comment-13007327 ]

Carl Steinbach commented on HIVE-2051:

Just to be clear, I updated the reviewboard ticket with the latest version of Siying's patch. Also, the comments on reviewboard are from M IS, not me.
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005883#comment-13005883 ]

Carl Steinbach commented on HIVE-2051:

Review request: https://reviews.apache.org/r/491/