[jira] Updated: (HIVE-2026) Parallelize UpdateInputAccessTimeHook
[ https://issues.apache.org/jira/browse/HIVE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-2026:

Attachment: HIVE-2026.patch

Running tests and will update a review board.

Parallelize UpdateInputAccessTimeHook

Key: HIVE-2026
URL: https://issues.apache.org/jira/browse/HIVE-2026
Project: Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang
Attachments: HIVE-2026.patch

UpdateInputAccessTimeHook is usually used as a pre-execution hook to update the metastore's lastAccessTime field of the input partitions/tables. If a query touches a large number of partitions, this hook takes a long time to execute. One approach is to run the hook itself in a separate thread, but then it is hard to guarantee backward-compatible semantics when exceptions are encountered during hook execution. This task takes another approach: parallelize the work inside the hook (update multiple partitions concurrently), but still execute the pre-hooks in sequential order.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
Review Request: HIVE-2026. Parallelize UpdateInputAccessTimeHook
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/460/
---

Review request for hive.

Summary
---
Define hive.hooks.parallel.degree to control the maximum number of threads used to update the metastore in parallel.

Diffs
-
trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1076459
trunk/conf/hive-default.xml 1076459
trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1076459
trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java 1076459

Diff: https://reviews.apache.org/r/460/diff

Testing
---

Thanks,
Ning
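
The summary above only names the setting, so here is a minimal sketch of how a configured degree of parallelism might be bounded before any worker is started. The class and method names are hypothetical and are not taken from the Hive patch; the upper bound is assumed to come from some server-side thread limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: bounds a configured degree of
// parallelism and spreads the inputs over that many workers.
final class ParallelDegreeSketch {

  // Clamp the configured degree to [1, maxThreads], then to the input count.
  static int effectiveThreads(int configured, int maxThreads, int numInputs) {
    int n = Math.max(1, Math.min(configured, maxThreads));
    return Math.min(n, Math.max(1, numInputs));
  }

  // Deal the inputs round-robin into one bucket per worker.
  static <T> List<List<T>> roundRobin(List<T> inputs, int nThreads) {
    List<List<T>> buckets = new ArrayList<List<T>>();
    for (int i = 0; i < nThreads; i++) {
      buckets.add(new ArrayList<T>());
    }
    for (int i = 0; i < inputs.size(); i++) {
      buckets.get(i % nThreads).add(inputs.get(i));
    }
    return buckets;
  }

  public static void main(String[] args) {
    List<String> parts = Arrays.asList("ds=1", "ds=2", "ds=3", "ds=4", "ds=5");
    int n = effectiveThreads(8, 4, parts.size());
    System.out.println(n + " workers: " + roundRobin(parts, n));
  }
}
```

Clamping to the number of inputs avoids starting workers that would have nothing to update.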
[jira] Commented: (HIVE-2026) Parallelize UpdateInputAccessTimeHook
[ https://issues.apache.org/jira/browse/HIVE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001923#comment-13001923 ]

Ning Zhang commented on HIVE-2026:

review board: https://reviews.apache.org/r/460/
[jira] Updated: (HIVE-2026) Parallelize UpdateInputAccessTimeHook
[ https://issues.apache.org/jira/browse/HIVE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-2026:

Status: Patch Available (was: Open)
RE: Regarding HIVE-1737
Hi Mohit,

Can you be more precise about how the fixed and variable row sizes are evaluated wrongly? I don't quite understand what you mean. Did I miss any context?

I guess you are running a previous version and are trying to figure out whether you need to port this patch? In that case, I think OOM is the worst possible outcome. We also care about whether one task uses more resources than it really needs and competes for resources with other tasks. I don't think there can be any other impact.

If you want to try to reproduce an OOM, you should produce a condition in which the sum of the distinct string key sizes > maximum heap size, and fixed size + aggregate parameter size is much smaller than the average key size. You can try very long distinct string keys as input and group by them. My feeling is that this is not a common case, since we never hit an OOM because of it.

For current trunk or version 0.7, the code is no longer the same as when we did HIVE-1737, since we now have HIVE-1830, which adds a memory usage check and forces a flush to disk when memory usage exceeds a threshold, so even without HIVE-1737 there won't be an OOM anyway.

Thanks,
Siying

From: Mohit [mohitsi...@huawei.com]
Sent: Tuesday, March 01, 2011 7:08 AM
To: Siying Dong
Cc: Namit Jain; chinna...@huawei.com; hive-...@hadoop.apache.org
Subject: FW: Regarding HIVE-1737

Hi Namit/Siying,

OK, I agree with your analysis: both the fixed and the variable row sizes are evaluated wrongly here. But what I am more interested in is how critical the change is. What if the hash aggregation map is not flushed even when the number of existing entries overshoots the incorrect entry estimate calculated from the configured property hive.map.aggr.map.percentmemory (whereas with your code changes the flush is triggered correctly)? Are there any issues apart from running out of memory in the child JVM, or can something else bad happen? If you can give me pointers to reproduce its side effects, that would be great.

-Mohit

***
This e-mail and attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

From: Mohit [mailto:mohitsi...@huawei.com]
Sent: Tuesday, March 01, 2011 12:39 PM
To: 'siyin...@fb.com'
Subject: Regarding HIVE-1737

Hi Siying,

Hope you are doing great. I have one request regarding this defect: I am not able to understand the issue and hence cannot reproduce it. Maybe you can help with that; I need to know which queries you ran.

-Mohit
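
To make the flush condition discussed in this thread concrete, here is a simplified model of map-side hash aggregation that estimates each entry as a fixed part plus a variable part driven by key length, and flushes when the estimate would exceed a budget. Everything in it (class name, fields, the two-bytes-per-char approximation) is an assumption for illustration and is not Hive's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a hash aggregation map with a size-based
// flush; none of these names come from Hive.
final class HashAggFlushSketch {
  private final Map<String, Long> counts = new HashMap<String, Long>();
  private final long budgetBytes;     // assumed memory budget for the map
  private final int fixedEntryBytes;  // assumed fixed per-entry overhead
  private long estimatedBytes = 0;
  private int flushes = 0;

  HashAggFlushSketch(long budgetBytes, int fixedEntryBytes) {
    this.budgetBytes = budgetBytes;
    this.fixedEntryBytes = fixedEntryBytes;
  }

  // Count one row for the given group-by key. A new key is charged
  // fixed + variable bytes; if that would exceed the budget, flush first.
  void add(String key) {
    Long old = counts.get(key);
    if (old != null) {
      counts.put(key, old + 1);
      return;
    }
    long entryBytes = fixedEntryBytes + 2L * key.length(); // approximation
    if (!counts.isEmpty() && estimatedBytes + entryBytes > budgetBytes) {
      flush();
    }
    estimatedBytes += entryBytes;
    counts.put(key, 1L);
  }

  // A real operator would forward partial aggregates; this sketch only resets.
  void flush() {
    counts.clear();
    estimatedBytes = 0;
    flushes++;
  }

  int flushes() { return flushes; }
  int size() { return counts.size(); }
}
```

If the variable part were underestimated for long string keys, the map could grow well past the budget before a flush, which is the situation the reproduction advice above aims at.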
Re: Review Request: HIVE-2026. Parallelize UpdateInputAccessTimeHook
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/460/#review283
---

trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java
https://reviews.apache.org/r/460/#comment534

How about using a centralized thread pool and submitting the tasks to that pool? This has advantages: we do not have to create threads ourselves, and we can learn the status of each submitted task through its Future object and use that Future to wait until the task is finished. We can refactor the code so that UpdateWorker implements Runnable instead of extending Thread.

- M

On 2011-03-03 00:53:49, Ning Zhang wrote:

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/460/
---
(Updated 2011-03-03 00:53:49)

Review request for hive.

Summary
---
define hive.hooks.parallel.degree to control max # of threads to update metastore in parallel.

Diffs
-
trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1076459
trunk/conf/hive-default.xml 1076459
trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1076459
trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java 1076459

Diff: https://reviews.apache.org/r/460/diff

Testing
---

Thanks,
Ning
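
The review suggestion above can be sketched roughly as follows: workers implement Runnable, are submitted to an ExecutorService, and the caller waits on the returned Futures. The UpdateWorker here is a stand-in that only counts its inputs; a real worker would talk to the metastore, and the class names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a pooled variant of "one worker per bucket of partitions".
final class PooledUpdateSketch {

  // Stand-in worker: a Runnable rather than a Thread subclass.
  static final class UpdateWorker implements Runnable {
    private final List<String> partitions;
    private final AtomicInteger updated;

    UpdateWorker(List<String> partitions, AtomicInteger updated) {
      this.partitions = partitions;
      this.updated = updated;
    }

    @Override
    public void run() {
      for (int i = 0; i < partitions.size(); i++) {
        // A real worker would update the partition's access time here.
        updated.incrementAndGet();
      }
    }
  }

  // Submits one worker per bucket, waits for all of them, returns the count.
  static int updateAll(List<List<String>> buckets) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, buckets.size()));
    AtomicInteger updated = new AtomicInteger();
    try {
      List<Future<?>> futures = new ArrayList<Future<?>>();
      for (List<String> bucket : buckets) {
        futures.add(pool.submit(new UpdateWorker(bucket, updated)));
      }
      for (Future<?> f : futures) {
        f.get(); // blocks; a worker failure surfaces as ExecutionException
      }
      return updated.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

Waiting on each Future keeps the hook synchronous from the caller's point of view, so a failure in any worker still aborts the pre-hook as before.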
[jira] Updated: (HIVE-1694) Accelerate GROUP BY execution using indexes
[ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prajakta Kalmegh updated HIVE-1694:

Attachment: HIVE-1694.2.patch.txt

Patch version 2 - includes changes for review comments from John.

Accelerate GROUP BY execution using indexes

Key: HIVE-1694
URL: https://issues.apache.org/jira/browse/HIVE-1694
Project: Hive
Issue Type: New Feature
Components: Indexing, Query Processor
Affects Versions: 0.7.0
Reporter: Nikhil Deshpande
Assignee: Prajakta Kalmegh
Attachments: HIVE-1694.1.patch.txt, HIVE-1694.2.patch.txt, HIVE-1694_2010-10-28.diff, demo_q1.hql, demo_q2.hql

The index building patch (HIVE-417) is checked into trunk. This JIRA issue tracks supporting indexes in the Hive compiler and execution engine for SELECT queries. This is in reference to John's comment at
https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
on creating a separate JIRA issue for tracking index usage in the optimizer and query execution.

The aim of this effort is to use indexes to accelerate query execution (for certain classes of queries), e.g.:
- Filters and range scans (already being worked on by He Yongqiang as part of HIVE-417?)
- Joins (index based joins)
- Group By, Order By and other misc cases

The proposal is multi-step:
1. Building index based operators, compiler and execution engine changes
2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose between index scans, full table scans etc.)

This JIRA initially focuses on the first step. This JIRA is expected to hold the information about index based plans and operator implementations for the above-mentioned cases.
[jira] Commented: (HIVE-1694) Accelerate GROUP BY execution using indexes
[ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002017#comment-13002017 ]

Prajakta Kalmegh commented on HIVE-1694:

Hi John,

We have made all the changes you suggested except for making the code pluggable (so that the rewrite expression changes depending on which index handler is used). We will submit this change along with the patch for the new index type. We have started working on the new index type creation as per your suggestion and will let you know once that is complete.
[jira] Created: (HIVE-2027) Asynchronous Hooks
Asynchronous Hooks

Key: HIVE-2027
URL: https://issues.apache.org/jira/browse/HIVE-2027
Project: Hive
Issue Type: New Feature
Reporter: Ning Zhang

PreHook and PostHook are executed sequentially, in the order in which they are defined in hive.exec.pre.hooks and hive.exec.post.hooks. In some cases the sequential semantics are mandatory, but not in all cases. It would be desirable to define an AsyncHook that extends Hook (similarly for AsyncPreHook and AsyncPostHook) so that such hooks can be executed asynchronously in a thread pool.
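
One possible shape for the proposal above, as a sketch under stated assumptions: Hook and AsyncHook below are hypothetical stand-ins (Hive's real hook interfaces take session and entity arguments), and AsyncHook is modeled as a marker sub-interface. Ordinary hooks keep their sequential semantics on the calling thread, while async hooks are handed to a pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch only: hypothetical interfaces, not Hive's actual hook API.
final class AsyncHookSketch {

  interface Hook {
    void run();
  }

  // Marker: implementations declare that they are safe to run off-thread.
  interface AsyncHook extends Hook {
  }

  // Runs ordinary hooks in order on the caller's thread and submits
  // AsyncHooks to the pool; returns the Futures of the async ones.
  static List<Future<?>> runHooks(List<Hook> hooks, ExecutorService pool) {
    List<Future<?>> pending = new ArrayList<Future<?>>();
    for (final Hook h : hooks) {
      if (h instanceof AsyncHook) {
        pending.add(pool.submit(new Runnable() {
          @Override
          public void run() {
            h.run();
          }
        }));
      } else {
        h.run();
      }
    }
    return pending;
  }

  // Optional barrier for callers that need the async hooks to have finished.
  static void awaitAll(List<Future<?>> pending) {
    for (Future<?> f : pending) {
      try {
        f.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      } catch (ExecutionException e) {
        throw new IllegalStateException(e.getCause());
      }
    }
  }
}
```

Whether and when the caller waits on the returned Futures is the design question the issue leaves open; this sketch leaves that choice to the caller.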
[jira] Updated: (HIVE-1694) Accelerate GROUP BY execution using indexes
[ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prajakta Kalmegh updated HIVE-1694:

Attachment: (was: HIVE-1694.2.patch.txt)
Re: Review Request: HIVE-2026. Parallelize UpdateInputAccessTimeHook
Hi Ning,

Just to be clear on what I was suggesting, I have created a patch for this file only. Please have a look.

Thanks,
MIS.

On Thu, Mar 3, 2011 at 5:50 PM, M IS misapa...@gmail.com wrote:

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/460/

trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java
https://reviews.apache.org/r/460/diff/1/?file=13550#file13550line82 (Diff revision 1)

public void run(SessionState sess, Set<ReadEntity> inputs,
77 Thread[] threads = new Thread[nThreads];

How about using a centralized thread pool and submitting the tasks to that pool? This has advantages: we do not have to create threads ourselves, and we can learn the status of each submitted task through its Future object and use that Future to wait until the task is finished. We can refactor the code so that UpdateWorker implements Runnable instead of extending Thread.

- M

On March 3rd, 2011, 12:53 a.m., Ning Zhang wrote:

Review request for hive. By Ning Zhang. *Updated 2011-03-03 00:53:49*

Description
define hive.hooks.parallel.degree to control max # of threads to update metastore in parallel.
Diffs
- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (1076459)
- trunk/conf/hive-default.xml (1076459)
- trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java (1076459)
- trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java (1076459)

View Diff: https://reviews.apache.org/r/460/diff/

### Eclipse Workspace Patch 1.0
#P hive
Index: ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java
===
--- ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java (revision 1076702)
+++ ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java (working copy)
@@ -17,18 +17,26 @@
  */
 package org.apache.hadoop.hive.ql.hooks;
 
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
 import java.util.Set;
-import java.util.LinkedHashMap;
-import java.util.Map;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.ThreadPoolExecutor;
+import java.util.concurrent.TimeUnit;
 
 import org.apache.hadoop.hive.conf.HiveConf;
-import org.apache.hadoop.hive.ql.session.SessionState;
-import org.apache.hadoop.security.UserGroupInformation;
 import org.apache.hadoop.hive.ql.metadata.Hive;
 import org.apache.hadoop.hive.ql.metadata.HiveException;
 import org.apache.hadoop.hive.ql.metadata.Partition;
 import org.apache.hadoop.hive.ql.metadata.Table;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.hadoop.security.UserGroupInformation;
+
 
 /**
  * Implementation of a pre execute hook that updates the access
  * times for all the inputs.
@@ -39,7 +47,6 @@
   public static class PreExec implements PreExecute {
     Hive db;
 
-
     public void run(SessionState sess, Set<ReadEntity> inputs,
                     Set<WriteEntity> outputs, UserGroupInformation ugi)
       throws Exception {
@@ -54,35 +61,122 @@
         }
       }
 
+      if (inputs.size() == 0) {
+        return;
+      }
+
       int lastAccessTime = (int) (System.currentTimeMillis()/1000);
 
-      for(ReadEntity re: inputs) {
-        // Set the last query time
+      int nThreads = HiveConf.getIntVar(sess.getConf(), HiveConf.ConfVars.HOOKS_PARALLEL_DEGREE);
+      int maxThreads = HiveConf.getIntVar(sess.getConf(), HiveConf.ConfVars.METASTORESERVERMAXTHREADS);
+
+      if (nThreads < 1) {
+        nThreads = 1;
+      } else if (nThreads > maxThreads) {
+        nThreads = maxThreads;
+      }
+      if (nThreads > inputs.size()) {
+        nThreads = inputs.size();
+      }
+
+      // This can be a rather common/centrally used thread pool.
+      ExecutorService exeService = new ThreadPoolExecutor(nThreads, nThreads, 5000, TimeUnit.MILLISECONDS,
+          new LinkedBlockingQueue<Runnable>());
+      List<Future<?>> futures = new ArrayList<Future<?>>(nThreads);
+
+      List<ReadEntity>[] threadInputs = new List[nThreads];
+
+      // assign ReadEntities to threads
+      int i = 0;
+      for (i = 0; i < nThreads; ++i) {
+        threadInputs[i] = new ArrayList<ReadEntity>();
+      }
+
+      i = 0;
+      for (ReadEntity re: inputs) {
+        threadInputs[i % nThreads].add(re);
+        ++i;
+      }
+
+      try {
+        // launch all threads
+        Runnable updateWorker;
+        Future<?> futureTask;
+        for (i = 0; i < nThreads; ++i) {
+          updateWorker = new UpdateWorker(sess.getConf(), threadInputs[i], lastAccessTime);
+          futureTask = exeService.submit(updateWorker);
+          futures.add(futureTask);
+        }
+
+        // wait
[jira] Updated: (HIVE-1694) Accelerate GROUP BY execution using indexes
[ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prajakta Kalmegh updated HIVE-1694:

Attachment: HIVE-1694.2.patch.txt

Patch version 2 - includes changes for review comments from John. Re-attaching the appropriate file.
Re: Review Request: HIVE-2026. Parallelize UpdateInputAccessTimeHook
Hi MIS,

Thanks for the contribution! To allow a broader audience to review it, can you upload your patch to the JIRA and the review board? (I can help you with the review board if it doesn't allow you to change the request.)

A couple of comments before you upload your patch:

1) The 5 sec keepAliveTime seems low. If the number of threads is greater than the number of cores, does that mean a thread will be terminated after waiting 5 secs to get scheduled?
2) Do you need to call exeService.shutdown() in case a Throwable is caught?

On Mar 3, 2011, at 10:09 AM, MIS wrote:

Hi Ning,

Just to be clear on what I was suggesting, I have created a patch for this file only. Please have a look.

Thanks,
MIS.

On Thu, Mar 3, 2011 at 5:50 PM, M IS misapa...@gmail.com wrote:

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/460/

trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java
https://reviews.apache.org/r/460/diff/1/?file=13550#file13550line82 (Diff revision 1)

public void run(SessionState sess, Set<ReadEntity> inputs,
77 Thread[] threads = new Thread[nThreads];

How about using a centralized thread pool and submitting the tasks to that pool? This has advantages: we do not have to create threads ourselves, and we can learn the status of each submitted task through its Future object and use that Future to wait until the task is finished. We can refactor the code so that UpdateWorker implements Runnable instead of extending Thread.

- M

On March 3rd, 2011, 12:53 a.m., Ning Zhang wrote:

Review request for hive. By Ning Zhang. Updated 2011-03-03 00:53:49

Description
define hive.hooks.parallel.degree to control max # of threads to update metastore in parallel.

Diffs
* trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (1076459)
* trunk/conf/hive-default.xml (1076459)
* trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java (1076459)
* trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java (1076459)

View Diff: https://reviews.apache.org/r/460/diff/

HIVE-2026_1.patch
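
The two questions in the reply above can be illustrated with a small sketch; the class and method names are hypothetical. As documented for java.util.concurrent.ThreadPoolExecutor, keepAliveTime bounds how long idle threads in excess of corePoolSize wait for new work (core threads are affected only after allowCoreThreadTimeOut(true)); it is not a limit on how long a runnable thread waits for a CPU. Putting shutdown() in a finally block covers the case where a task fails.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: pool lifecycle around a single task.
final class PoolLifecycleSketch {

  // Runs the task on a fixed-size pool and always shuts the pool down.
  // Returns true if the pool has been shut down when the method exits.
  static boolean runAndShutdown(Runnable task, int nThreads) {
    // corePoolSize == maximumPoolSize, so no thread is ever "in excess"
    // and the 5000 ms keep-alive has no effect with default settings.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        nThreads, nThreads, 5000, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
    try {
      pool.submit(task).get();
    } catch (ExecutionException e) {
      // The task threw; a real hook would log or rethrow the cause.
      System.err.println("task failed: " + e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdown(); // runs on success and on failure alike
    }
    return pool.isShutdown();
  }
}
```

Without the shutdown call, the pool's non-daemon core threads would stay alive after the hook returns, since core threads do not time out by default.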
[jira] Commented: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002165#comment-13002165 ]

Joydeep Sen Sarma commented on HIVE-1833:

committed - thanks Scott!

Task-cleanup task should be disabled

Key: HIVE-1833
URL: https://issues.apache.org/jira/browse/HIVE-1833
Project: Hive
Issue Type: Improvement
Components: Server Infrastructure
Reporter: Scott Chen
Assignee: Scott Chen
Attachments: HIVE-1833.1.txt, HIVE-1833.txt

Currently, when a task fails, a cleanup attempt is scheduled right afterwards. This is unnecessary and increases latency. MapReduce will allow disabling this (see MAPREDUCE-2206). After that patch is committed, we should set the JobConf in Hive to disable the cleanup task.
[jira] Resolved: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma resolved HIVE-1833.

Resolution: Fixed
Fix Version/s: 0.8.0
[jira] Commented: (HIVE-2022) Making JDO thread-safe by default
[ https://issues.apache.org/jira/browse/HIVE-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002184#comment-13002184 ]

Paul Yang commented on HIVE-2022:

Apologies for the build break - Ning and I are looking into fixing some issues with my build environment.

Making JDO thread-safe by default

Key: HIVE-2022
URL: https://issues.apache.org/jira/browse/HIVE-2022
Project: Hive
Issue Type: Bug
Components: Configuration, Metastore
Reporter: Ning Zhang
Assignee: Ning Zhang
Fix For: 0.8.0
Attachments: HIVE-2022.patch

If multiple threads access the metastore concurrently, there are cases in which JDO throws exceptions because of concurrent access to a HashMap inside JDO. Setting javax.jdo.option.Multithreaded to true solves this issue.
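
As a small illustration of the setting named above, the sketch below adds javax.jdo.option.Multithreaded to a set of JDO properties. The helper class is hypothetical; in a real setup such properties would be handed to the JDO PersistenceManagerFactory (for example through javax.jdo.JDOHelper), which this sketch deliberately does not call so that it stays free of external dependencies.

```java
import java.util.Properties;

// Sketch only: builds JDO properties with the multithreaded option enabled.
final class JdoMultithreadedSketch {

  // Returns a copy of the given properties with Multithreaded set to true.
  static Properties withMultithreaded(Properties base) {
    Properties p = new Properties();
    p.putAll(base);
    p.setProperty("javax.jdo.option.Multithreaded", "true");
    return p;
  }

  public static void main(String[] args) {
    Properties p = withMultithreaded(new Properties());
    System.out.println(p.getProperty("javax.jdo.option.Multithreaded"));
  }
}
```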
[jira] Commented: (HIVE-2025) Fix TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore broken by HIVE-2022
[ https://issues.apache.org/jira/browse/HIVE-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002259#comment-13002259 ]

Carl Steinbach commented on HIVE-2025:

Review request: https://reviews.apache.org/r/464/

Fix TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore broken by HIVE-2022

Key: HIVE-2025
URL: https://issues.apache.org/jira/browse/HIVE-2025
Project: Hive
Issue Type: Bug
Components: Metastore
Reporter: Carl Steinbach
Assignee: Ning Zhang
Priority: Critical
Attachments: HIVE-2025.patch

The patch for HIVE-2022 broke TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore:
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/590/

@Paul: Assigning this to you.
Re: Review Request: HIVE-2025: Fix TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore broken by HIVE-2022
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/464/#review297 --- Ship it! Looks good to me - will test and commit. - Paul On 2011-03-03 13:46:55, Carl Steinbach wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/464/ --- (Updated 2011-03-03 13:46:55) Review request for hive. Summary --- Review request for HIVE-2025. This addresses bugs HIVE-2022 and HIVE-2025. https://issues.apache.org/jira/browse/HIVE-2022 https://issues.apache.org/jira/browse/HIVE-2025 Diffs - trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1076530 trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1076530 Diff: https://reviews.apache.org/r/464/diff Testing --- Thanks, Carl
[jira] Commented: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002293#comment-13002293 ]

Scott Chen commented on HIVE-1833:

Thanks for the help, Joy :)
[jira] Commented: (HIVE-2025) Fix TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore broken by HIVE-2022
[ https://issues.apache.org/jira/browse/HIVE-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002296#comment-13002296 ]

Carl Steinbach commented on HIVE-2025:

+1. Will commit if tests pass.
Build failed in Jenkins: Hive-trunk-h0.20 #592
See https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/592/changes

Changes:

[jssarma] HIVE-1833: Task Cleanup task should be disabled (Scott Chen via jssarma)

--
[...truncated 27596 lines...]
    [junit] PREHOOK: query: drop table testhivedrivertable
    [junit] PREHOOK: type: DROPTABLE
    [junit] POSTHOOK: query: drop table testhivedrivertable
    [junit] POSTHOOK: type: DROPTABLE
    [junit] OK
    [junit] PREHOOK: query: create table testhivedrivertable (num int)
    [junit] PREHOOK: type: CREATETABLE
    [junit] POSTHOOK: query: create table testhivedrivertable (num int)
    [junit] POSTHOOK: type: CREATETABLE
    [junit] POSTHOOK: Output: default@testhivedrivertable
    [junit] OK
    [junit] PREHOOK: query: load data local inpath 'https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt' into table testhivedrivertable
    [junit] PREHOOK: type: LOAD
    [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt
    [junit] Loading data to table default.testhivedrivertable
    [junit] POSTHOOK: query: load data local inpath 'https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt' into table testhivedrivertable
    [junit] POSTHOOK: type: LOAD
    [junit] POSTHOOK: Output: default@testhivedrivertable
    [junit] OK
    [junit] PREHOOK: query: select count(1) as cnt from testhivedrivertable
    [junit] PREHOOK: type: QUERY
    [junit] PREHOOK: Input: default@testhivedrivertable
    [junit] PREHOOK: Output: file:/tmp/hudson/hive_2011-03-03_15-28-14_662_975814063074649674/-mr-1
    [junit] Total MapReduce jobs = 1
    [junit] Launching Job 1 out of 1
    [junit] Number of reduce tasks determined at compile time: 1
    [junit] In order to change the average load for a reducer (in bytes):
    [junit] set hive.exec.reducers.bytes.per.reducer=<number>
    [junit] In order to limit the maximum number of reducers:
    [junit] set hive.exec.reducers.max=<number>
    [junit] In order to set a constant number of reducers:
    [junit] set mapred.reduce.tasks=<number>
    [junit] Job running in-process (local Hadoop)
    [junit] 2011-03-03 15:28:17,738 null map = 100%, reduce = 100%
    [junit] Ended Job = job_local_0001
    [junit] POSTHOOK: query: select count(1) as cnt from testhivedrivertable
    [junit] POSTHOOK: type: QUERY
    [junit] POSTHOOK: Input: default@testhivedrivertable
    [junit] POSTHOOK: Output: file:/tmp/hudson/hive_2011-03-03_15-28-14_662_975814063074649674/-mr-1
    [junit] OK
    [junit] PREHOOK: query: drop table testhivedrivertable
    [junit] PREHOOK: type: DROPTABLE
    [junit] PREHOOK: Input: default@testhivedrivertable
    [junit] PREHOOK: Output: default@testhivedrivertable
    [junit] POSTHOOK: query: drop table testhivedrivertable
    [junit] POSTHOOK: type: DROPTABLE
    [junit] POSTHOOK: Input: default@testhivedrivertable
    [junit] POSTHOOK: Output: default@testhivedrivertable
    [junit] OK
    [junit] Hive history file=https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/service/tmp/hive_job_log_hudson_201103031528_1528160580.txt
    [junit] PREHOOK: query: drop table testhivedrivertable
    [junit] PREHOOK: type: DROPTABLE
    [junit] POSTHOOK: query: drop table testhivedrivertable
    [junit] POSTHOOK: type: DROPTABLE
    [junit] OK
    [junit] PREHOOK: query: create table testhivedrivertable (num int)
    [junit] PREHOOK: type: CREATETABLE
    [junit] POSTHOOK: query: create table testhivedrivertable (num int)
    [junit] POSTHOOK: type: CREATETABLE
    [junit] POSTHOOK: Output: default@testhivedrivertable
    [junit] OK
    [junit] PREHOOK: query: load data local inpath 'https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt' into table testhivedrivertable
    [junit] PREHOOK: type: LOAD
    [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt
    [junit] Loading data to table default.testhivedrivertable
    [junit] POSTHOOK: query: load data local inpath 'https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt' into table testhivedrivertable
    [junit] POSTHOOK: type: LOAD
    [junit] POSTHOOK: Output: default@testhivedrivertable
    [junit] OK
    [junit] PREHOOK: query: select * from testhivedrivertable limit 10
    [junit] PREHOOK: type: QUERY
    [junit] PREHOOK: Input: default@testhivedrivertable
    [junit] PREHOOK: Output: file:/tmp/hudson/hive_2011-03-03_15-28-19_161_1275930092982649220/-mr-1
    [junit] POSTHOOK: query: select * from testhivedrivertable limit 10
    [junit] POSTHOOK: type: QUERY
    [junit] POSTHOOK: Input: default@testhivedrivertable
    [junit] POSTHOOK: Output: file:/tmp/hudson/hive_2011-03-03_15-28-19_161_1275930092982649220/-mr-1
    [junit] OK
    [junit] PREHOOK: query: drop table testhivedrivertable
    [junit] PREHOOK: type: DROPTABLE
Re: Review Request: HIVE-2026. Parallelize UpdateInputAccessTimeHook
Hi Ning,

1) You are right on this. But here the keepAliveTime does not matter much, because the core pool size and the max pool size are both nThreads. So even if the threads are idle, nThreads threads will always remain in the pool that is created to process the tasks. Also, since in this scenario the thread pool is created for one specific purpose, this thread pool configuration is fine. The same thing can be achieved more simply as below:

ExecutorService exeService = Executors.newFixedThreadPool(nThreads);

I'll use this in the new patch. Further, if we are going to use a common or centralized thread pool (which we will need to in the future), then its configuration needs to be worked out carefully, taking into account the number of cores available on a particular machine and profiling results, but this is for later.

2) In the current scenario, we need to call execService.shutdown() in any case, whether or not an exception is thrown, as it is a local thread pool and we won't be using it any further. If the thread pool were a common/centralized one, we would not need to call shutdown().

Please let me know if this is fine; then I'll upload the patch for this file to Jira.

Thanks,
ಕರಿಯ

On Fri, Mar 4, 2011 at 12:44 AM, Ning Zhang nzh...@fb.com wrote:

Hi MIS,

Thanks for the contribution! To allow a broader audience to review, can you upload your patch to the JIRA and the review board (I can help you with the review board if it doesn't allow you to change the request)?

A couple of comments before uploading your patch:

1) The 5 sec keepAliveTime seems low. If the # of threads is more than the # of cores, does it mean a thread will be terminated after 5 secs while it is waiting to get scheduled?
2) Do you need to call execService.shutdown() in case a Throwable is caught?
On Mar 3, 2011, at 10:09 AM, MIS wrote:

Hi Ning,

Just to be clear on what I was suggesting, I have created a patch only for this file. Please have a look.

Thanks,
MIS.

On Thu, Mar 3, 2011 at 5:50 PM, M IS misapa...@gmail.com wrote:

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/460/

trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java
https://reviews.apache.org/r/460/diff/1/?file=13550#file13550line82 (Diff revision 1)

public void run(SessionState sess, Set<ReadEntity> inputs,
77 Thread[] threads = new Thread[nThreads];

How about going for a centralized thread pool and submitting the tasks to that pool? This has some advantages: we need not create the threads ourselves, and we can learn the status of each submitted task through its Future object and use that Future to wait until the task is finished. We can refactor the code to make UpdateWorker implement Runnable instead of extending Thread.

- M

On March 3rd, 2011, 12:53 a.m., Ning Zhang wrote:

Review request for hive. By Ning Zhang. Updated 2011-03-03 00:53:49

Description
define hive.hooks.parallel.degree to control the max # of threads used to update the metastore in parallel.
Diffs
- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (1076459)
- trunk/conf/hive-default.xml (1076459)
- trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java (1076459)
- trunk/ql/src/java/org/apache/hadoop/hive/ql/hooks/UpdateInputAccessTimeHook.java (1076459)

View Diff https://reviews.apache.org/r/460/diff/

HIVE-2026_1.patch

### Eclipse Workspace Patch 1.0
#P hive
Index: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
===
--- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (revision 1076715)
+++ common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (working copy)
@@ -111,6 +111,7 @@
     DEFAULT_ZOOKEEPER_PARTITION_NAME("hive.lockmgr.zookeeper.default.partition.name", "__HIVE_DEFAULT_ZOOKEEPER_PARTITION__"),
     // Whether to show a link to the most failed task + debugging tips
     SHOW_JOB_FAIL_DEBUG_INFO("hive.exec.show.job.failure.debug.info", true),
+    HOOKS_PARALLEL_DEGREE("hive.hooks.parallel.degree", 1),
     // should hive determine whether to run in local mode automatically ?
     LOCALMODEAUTO("hive.exec.mode.local.auto", false),
Index: metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
===
--- metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java (revision 1076715)
+++ metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java (working copy)
@@ -26,9 +26,9 @@
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
-import java.util.Set;
 import java.util.Map.Entry;
 import java.util.Properties;
+import java.util.Set;
 import java.util.concurrent.locks.Lock;
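
For readers following the thread, the approach discussed in this message (a fixed-size local pool, workers as Runnable, waiting on Future objects, and an unconditional shutdown) might look roughly like the sketch below. It is not the code from either patch: the class name ParallelUpdateSketch, the method updateAll, and the in-memory map standing in for the metastore are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelUpdateSketch {

  // Hypothetical stand-in for the hook's UpdateWorker: it implements
  // Runnable instead of extending Thread, as suggested in the review.
  static class UpdateWorker implements Runnable {
    private final String partition;
    private final Map<String, Long> fakeMetastore;

    UpdateWorker(String partition, Map<String, Long> fakeMetastore) {
      this.partition = partition;
      this.fakeMetastore = fakeMetastore;
    }

    @Override
    public void run() {
      // Placeholder for the real metastore call that sets lastAccessTime.
      fakeMetastore.put(partition, System.currentTimeMillis());
    }
  }

  // Updates every partition using at most nThreads threads and returns
  // the number of partitions recorded in the stand-in metastore.
  static int updateAll(List<String> partitions, int nThreads)
      throws InterruptedException, ExecutionException {
    Map<String, Long> fakeMetastore = new ConcurrentHashMap<>();
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (String p : partitions) {
        futures.add(pool.submit(new UpdateWorker(p, fakeMetastore)));
      }
      // Future.get() blocks until the task finishes and rethrows any
      // failure wrapped in an ExecutionException.
      for (Future<?> f : futures) {
        f.get();
      }
    } finally {
      // The pool is local to this call, so it is shut down whether or
      // not an exception was thrown.
      pool.shutdown();
    }
    return fakeMetastore.size();
  }

  public static void main(String[] args) throws Exception {
    List<String> parts = Arrays.asList("ds=2011-03-01", "ds=2011-03-02", "ds=2011-03-03");
    System.out.println(updateAll(parts, 2));
  }
}
```

Because the wait happens inside updateAll, the caller still sees the hook as a single sequential step, which matches the goal stated in the JIRA description of parallelizing inside the hook while running each pre-hook in order.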
Re: Review Request: HIVE-2026. Parallelize UpdateInputAccessTimeHook
It looks good to me. Thanks!

HIVE-2026_1.patch HIVE-2026_2.patch
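
On the keepAliveTime question raised earlier in this thread: in java.util.concurrent.ThreadPoolExecutor, keepAliveTime applies to idle threads in excess of the core pool size (or to core threads only if allowCoreThreadTimeOut(true) is set), not to threads that are merely waiting for a CPU. The small check below is a sketch of that behaviour with core size equal to max size; the class name KeepAliveSketch and the method idlePoolSize are made up for this illustration and are not part of the patch.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class KeepAliveSketch {

  // Builds a pool whose core size equals its max size, runs one trivial
  // task per thread, waits well past keepAliveMillis, and reports how
  // many threads are still in the pool.
  static int idlePoolSize(int nThreads, long keepAliveMillis) throws InterruptedException {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        nThreads, nThreads,
        keepAliveMillis, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
    try {
      // While the pool is below its core size, each execute() starts a
      // new thread, so this brings the pool up to nThreads threads.
      for (int i = 0; i < nThreads; i++) {
        pool.execute(() -> { });
      }
      // Idle for several multiples of keepAliveTime.
      Thread.sleep(keepAliveMillis * 5);
      // Core threads are not timed out by default, so the pool size is
      // unchanged even though every thread has been idle.
      return pool.getPoolSize();
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(idlePoolSize(4, 50L));
  }
}
```

This is the same configuration that Executors.newFixedThreadPool(nThreads) produces, which is why the keepAliveTime value has no practical effect in the local-pool version of the hook.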
[jira] Updated: (HIVE-2025) Fix TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore broken by HIVE-2022
[ https://issues.apache.org/jira/browse/HIVE-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2025:
Resolution: Fixed
Fix Version/s: 0.8.0
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

Committed to trunk. Thanks Ning!

Fix TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore broken by HIVE-2022

Key: HIVE-2025
URL: https://issues.apache.org/jira/browse/HIVE-2025
Project: Hive
Issue Type: Bug
Components: Metastore
Reporter: Carl Steinbach
Assignee: Ning Zhang
Priority: Critical
Fix For: 0.8.0
Attachments: HIVE-2025.patch

The patch for HIVE-2022 broke TestEmbeddedHiveMetaStore and TestRemoteHiveMetaStore:
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/590/

@Paul: Assigning this to you.
[jira] Resolved: (HIVE-2023) Add javax.jdo.option.Multithreaded configuration property to HiveConf
[ https://issues.apache.org/jira/browse/HIVE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang resolved HIVE-2023.
Resolution: Fixed

Fixed as part of HIVE-2025.

Add javax.jdo.option.Multithreaded configuration property to HiveConf

Key: HIVE-2023
URL: https://issues.apache.org/jira/browse/HIVE-2023
Project: Hive
Issue Type: Bug
Components: Configuration, Metastore
Reporter: Carl Steinbach
Assignee: Ning Zhang

The configuration property javax.jdo.option.Multithreaded was added to hive-default.xml in HIVE-2022. This property also needs to be added to HiveConf.java.
[jira] Updated: (HIVE-2026) Parallelize UpdateInputAccessTimeHook
[ https://issues.apache.org/jira/browse/HIVE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ಕರಿಯ updated HIVE-2026:
Attachment: HIVE-2026_2.patch

Patch incorporating the review comments.

Parallelize UpdateInputAccessTimeHook

Key: HIVE-2026
URL: https://issues.apache.org/jira/browse/HIVE-2026
Project: Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang
Attachments: HIVE-2026.patch, HIVE-2026_2.patch

UpdateInputAccessTimeHook is usually used as a pre-execution hook to update the metastore's lastAccessTime field of the input partition/table. If a query touches a large number of partitions, this hook takes a long time to execute. One approach is to make the hook itself run in a separate thread, but it is hard to guarantee backward-compatible semantics when exceptions are encountered during hook execution. This task takes another approach: parallelize the hook internally (update multiple partitions concurrently), but still execute each pre-hook in sequential order.
[jira] Commented: (HIVE-2026) Parallelize UpdateInputAccessTimeHook
[ https://issues.apache.org/jira/browse/HIVE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002488#comment-13002488 ]

Ning Zhang commented on HIVE-2026:

The review board has also been updated with the new HIVE-2026_2.patch.

Parallelize UpdateInputAccessTimeHook

Key: HIVE-2026
URL: https://issues.apache.org/jira/browse/HIVE-2026
Project: Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang
Attachments: HIVE-2026.patch, HIVE-2026_2.patch

UpdateInputAccessTimeHook is usually used as a pre-execution hook to update the metastore's lastAccessTime field of the input partition/table. If a query touches a large number of partitions, this hook takes a long time to execute. One approach is to make the hook itself run in a separate thread, but it is hard to guarantee backward-compatible semantics when exceptions are encountered during hook execution. This task takes another approach: parallelize the hook internally (update multiple partitions concurrently), but still execute each pre-hook in sequential order.