[jira] [Commented] (HIVE-1451) Creating a table stores the full address of namenode in the metadata. This leads to problems when the namenode address changes.
[ https://issues.apache.org/jira/browse/HIVE-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094706#comment-13094706 ]

MIS commented on HIVE-1451:
---------------------------

+1 for the issue. This is one of those features that many assume exists by default, but doesn't. I too have run into this, and resolved it by changing the DB_LOCATION_URI column in the DBS table and the LOCATION column in the SDS table to point to the latest namenode URI (my metastore was on MySQL). This issue would save us from manually changing the namenode URI in the database should the address of the namenode change.

Creating a table stores the full address of namenode in the metadata. This leads to problems when the namenode address changes.
---

Key: HIVE-1451
URL: https://issues.apache.org/jira/browse/HIVE-1451
Project: Hive
Issue Type: Bug
Components: Metastore, Query Processor
Affects Versions: 0.5.0
Environment: Any
Reporter: Arvind Prabhakar

Here is an excerpt from the table metadata for an arbitrary table {{table1}}:

{noformat}
hive> describe extended table1;
OK
...
Detailed Table Information ... location:hdfs://localhost:9000/user/arvind/hive/warehouse/table1, ...
{noformat}

As can be seen, the full address of the namenode is captured in the location information for the table. This information is later used to run any queries on the table, thus making it impossible to change the namenode location once the table has been created. For example, for the above table, a query will fail if the namenode is migrated from port 9000 to 8020:

{noformat}
hive> select * from table1;
OK
Failed with exception java.io.IOException: java.net.ConnectException: Call to localhost/127.0.0.1:9000 failed on connection exception: java.net.ConnectException: Connection refused
Time taken: 10.78 seconds
hive>
{noformat}

It should be possible to change the namenode location regardless of when the tables were created. Also, any query execution should work with the namenode configured at that point in time, rather than requiring the configuration to be exactly the same as when the tables were created.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
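For reference, the manual workaround described in the comment above might look like the following against a MySQL metastore. This is only a sketch, not an official procedure: the old and new URIs are taken from the example in the description, and any real metastore should be backed up before touching these tables.

```sql
-- Hypothetical one-off fix on a MySQL metastore, per the comment above:
-- repoint every stored database and table location at the new namenode URI.
UPDATE DBS SET DB_LOCATION_URI =
  REPLACE(DB_LOCATION_URI, 'hdfs://localhost:9000', 'hdfs://localhost:8020');
UPDATE SDS SET LOCATION =
  REPLACE(LOCATION, 'hdfs://localhost:9000', 'hdfs://localhost:8020');
```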
[jira] [Commented] (HIVE-2181) Clean up the scratch.dir (tmp/hive-root) while restarting Hive server.
[ https://issues.apache.org/jira/browse/HIVE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083306#comment-13083306 ]

MIS commented on HIVE-2181:
---------------------------

-1 for the issue. What if I'm running multiple Hive servers on different ports on the same machine (with my metastore DB on a MySQL server)? If one of the server instances restarts, it would end up deleting the scratch dir, which would affect the other running instances as well. Even if we specify a different scratch dir for each of the instances, I doubt the value added by this property.

Clean up the scratch.dir (tmp/hive-root) while restarting Hive server.
---

Key: HIVE-2181
URL: https://issues.apache.org/jira/browse/HIVE-2181
Project: Hive
Issue Type: Bug
Components: Server Infrastructure
Affects Versions: 0.8.0
Environment: Suse Linux, Hadoop 20.1, Hive 0.8
Reporter: sanoj mathew
Assignee: Chinna Rao Lalam
Priority: Minor
Labels: patch
Fix For: 0.8.0
Attachments: HIVE-2181.patch
Original Estimate: 48h
Remaining Estimate: 48h

Queries currently leave their map outputs under scratch.dir after execution. If the Hive server is stopped, we need not keep the stopped server's map outputs, so while starting the server we can clear the scratch.dir. This can improve disk usage.
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008328#comment-13008328 ]

MIS commented on HIVE-2051:
---------------------------

Yes, it is necessary to shut down the executor once the jobs have been submitted to it, even though the submitted jobs may already have completed. However, after the executor is shut down, we need not await termination, since that is redundant: all the jobs submitted to the executor will have completed by the time we shut it down, which is what calling result.get() ensures. That is, the following piece of code is not required:

{noformat}
+    do {
+      try {
+        executor.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS);
+        executorDone = true;
+      } catch (InterruptedException e) {
+      }
+    } while (!executorDone);
{noformat}

getInputSummary() to call FileSystem.getContentSummary() in parallel
---

Key: HIVE-2051
URL: https://issues.apache.org/jira/browse/HIVE-2051
Project: Hive
Issue Type: Improvement
Reporter: Siying Dong
Assignee: Siying Dong
Priority: Minor
Attachments: HIVE-2051.1.patch, HIVE-2051.2.patch, HIVE-2051.3.patch, HIVE-2051.4.patch

getInputSummary() currently calls FileSystem.getContentSummary() one path at a time, which can be extremely slow when the number of input paths is huge. By making those calls in parallel, we can cut latency in most cases.
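The submit-then-get pattern the comment relies on can be sketched in isolation as follows. This is an illustration only: the Hadoop FileSystem.getContentSummary() call is replaced by a stand-in Callable, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSummarySketch {
    // Sums "content sizes" in parallel. Each Future.get() blocks until its
    // task finishes, so by the time the loop ends every submitted job is
    // done and shutdown() alone suffices -- no awaitTermination() loop.
    static long totalSize(List<Long> sizes, int threads) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        for (long size : sizes) {
            // In Hive this would call fs.getContentSummary(path) instead.
            results.add(executor.submit(() -> size));
        }
        long total = 0;
        for (Future<Long> result : results) {
            total += result.get(); // blocks until this task completes
        }
        executor.shutdown();       // all tasks have already finished here
        return total;
    }
}
```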
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008331#comment-13008331 ]

MIS commented on HIVE-2051:
---------------------------

The solution to this issue resembles that of HIVE-2026, so we can follow a similar approach.

getInputSummary() to call FileSystem.getContentSummary() in parallel
---

Key: HIVE-2051
URL: https://issues.apache.org/jira/browse/HIVE-2051
Project: Hive
Issue Type: Improvement
Reporter: Siying Dong
Assignee: Siying Dong
Priority: Minor
Attachments: HIVE-2051.1.patch, HIVE-2051.2.patch, HIVE-2051.3.patch, HIVE-2051.4.patch

getInputSummary() currently calls FileSystem.getContentSummary() one path at a time, which can be extremely slow when the number of input paths is huge. By making those calls in parallel, we can cut latency in most cases.
[jira] Commented: (HIVE-1959) Potential memory leak when same connection used for long time. TaskInfo and QueryInfo objects are getting accumulated on executing more queries on the same connection.
[ https://issues.apache.org/jira/browse/HIVE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000970#comment-13000970 ]

MIS commented on HIVE-1959:
---------------------------

How about using a WeakHashMap in place of a HashMap, instead of explicitly removing entries from the map? A WeakHashMap can be used for both fields, queryInfoMap and taskInfoMap, of the HiveHistory.java class.

Potential memory leak when same connection used for long time. TaskInfo and QueryInfo objects are getting accumulated on executing more queries on the same connection.
---

Key: HIVE-1959
URL: https://issues.apache.org/jira/browse/HIVE-1959
Project: Hive
Issue Type: Bug
Components: Server Infrastructure
Affects Versions: 0.8.0
Environment: Hadoop 0.20.1, Hive 0.5.0 and SUSE Linux Enterprise Server 10 SP2 (i586) - Kernel 2.6.16.60-0.21-smp (5)
Reporter: Chinna Rao Lalam
Assignee: Chinna Rao Lalam
Attachments: HIVE-1959.patch

Instances of *org.apache.hadoop.hive.ql.history.HiveHistory$TaskInfo* and *org.apache.hadoop.hive.ql.history.HiveHistory$QueryInfo* accumulate as more queries are executed on the same connection. These objects are released only when the connection is closed.
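As a rough illustration of the suggestion above (the names here are invented for the sketch; this is not Hive's actual HiveHistory code): in a WeakHashMap, an entry becomes eligible for collection once nothing else holds a strong reference to its key, so no explicit remove() call is needed.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakHistorySketch {
    // Returns true if the strongly-referenced key is still present after a
    // GC hint. The entry whose key became unreachable may be reclaimed by
    // the JVM at any time; the retained key's entry is guaranteed to stay.
    static boolean retainedKeySurvives() throws InterruptedException {
        Map<Object, String> queryInfoMap = new WeakHashMap<>();
        Object retained = new Object();
        queryInfoMap.put(retained, "query-1");     // key kept alive below
        queryInfoMap.put(new Object(), "query-2"); // key already unreachable
        System.gc();                               // only a hint to the JVM
        Thread.sleep(50);
        return queryInfoMap.containsKey(retained);
    }
}
```

The trade-off is that entries can disappear whenever the garbage collector runs, which is acceptable only if the maps are pure caches of per-query bookkeeping.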
[jira] Commented: (HIVE-1883) Periodic cleanup of Hive History log files.
[ https://issues.apache.org/jira/browse/HIVE-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995934#comment-12995934 ]

MIS commented on HIVE-1883:
---------------------------

Carl is right on this. There is no need for a 'scheduled' timer task to take care of the log files; there are enough handles already available in the log4j library used by Hive to manage them. As far as the current issue is concerned, RollingFileAppender can be used with a max size limit set. If it is desired that no data be lost, DailyRollingFileAppender can be used instead, with a cron job to archive a week's (or whatever time frame is chosen) worth of log files. Further, running a 'scheduled' timer task to handle log files creates more problems than it solves. Though ScheduledThreadPoolExecutor could be an answer, it is just not worth the effort.

Periodic cleanup of Hive History log files.
---

Key: HIVE-1883
URL: https://issues.apache.org/jira/browse/HIVE-1883
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.6.0
Environment: Hive 0.6.0, Hadoop 0.20.1, SUSE Linux Enterprise Server 11 (i586) VERSION = 11 PATCHLEVEL = 0
Reporter: Mohit Sikri

After starting Hive and running queries, transaction history files are getting created in the /tmp/root folder. We should periodically remove those files that are too old to represent any significant information (not all of them). Solution: a scheduled timer task which cleans up the log files older than a configured age.
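The RollingFileAppender approach suggested above would amount to a log4j 1.x configuration along these lines. The appender name, file path, and limits are illustrative, not Hive's shipped defaults:

```properties
# Size-capped rolling log: at most 5 backups of 10MB each are kept,
# so old history is discarded automatically without any timer task.
log4j.rootLogger=INFO, hivelog
log4j.appender.hivelog=org.apache.log4j.RollingFileAppender
log4j.appender.hivelog.File=/tmp/root/hive.log
log4j.appender.hivelog.MaxFileSize=10MB
log4j.appender.hivelog.MaxBackupIndex=5
log4j.appender.hivelog.layout=org.apache.log4j.PatternLayout
log4j.appender.hivelog.layout.ConversionPattern=%d{ISO8601} %-5p %c: %m%n
```

Swapping in org.apache.log4j.DailyRollingFileAppender (with a DatePattern instead of the size limits) gives the lossless variant mentioned in the comment, leaving retention to an external cron job.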