[jira] [Updated] (HIVE-1095) Hive in Maven
[ https://issues.apache.org/jira/browse/HIVE-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amareshwari Sriramadasu updated HIVE-1095:
    Resolution: Fixed
    Fix Version/s: 0.8.0
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)
I just committed this to trunk and branch 0.7. Thanks Gerrit and Carl!
Hive in Maven
    Key: HIVE-1095
    URL: https://issues.apache.org/jira/browse/HIVE-1095
    Project: Hive
    Issue Type: Task
    Components: Build Infrastructure
    Affects Versions: 0.6.0
    Reporter: Gerrit Jansen van Vuuren
    Priority: Minor
    Fix For: 0.7.1, 0.8.0
    Attachments: HIVE-1095-trunk.patch, HIVE-1095.7.patch.txt, HIVE-1095.v2.PATCH, HIVE-1095.v3.PATCH, HIVE-1095.v4.PATCH, HIVE-1095.v5.PATCH, HIVE-1095.v6.patch, hiveReleasedToMaven.tar.gz, make-maven.log
Getting hive into maven main repositories. Documentation on how to do this is on: http://maven.apache.org/guides/mini/guide-central-repository-upload.html
--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-1095) Hive in Maven
[ https://issues.apache.org/jira/browse/HIVE-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035278#comment-13035278 ]
Gerrit Jansen van Vuuren commented on HIVE-1095:
Great! Thanks Carl, Amareshwari for seeing this through.
Jenkins build is still unstable: Hive-trunk-h0.21 #736
See https://builds.apache.org/hudson/job/Hive-trunk-h0.21/changes
[jira] [Commented] (HIVE-1095) Hive in Maven
[ https://issues.apache.org/jira/browse/HIVE-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035417#comment-13035417 ]
Hudson commented on HIVE-1095:
Integrated in Hive-trunk-h0.21 #736 (See [https://builds.apache.org/hudson/job/Hive-trunk-h0.21/736/])
    HIVE-1095. Hive in Maven. Contributed by Gerrit Jansen van Vuuren, Amareshwari Sriramadasu and Carl Steinbach.
amareshwari : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1124164
Files :
* /hive/trunk/ant/ivy.xml
* /hive/trunk/ivy.xml
* /hive/trunk/jdbc/ivy.xml
* /hive/trunk/ql/ivy.xml
* /hive/trunk/build.xml
* /hive/trunk/service/ivy.xml
* /hive/trunk/hbase-handler/ivy.xml
* /hive/trunk/contrib/ivy.xml
* /hive/trunk/shims/ivy.xml
* /hive/trunk/hwi/ivy.xml
* /hive/trunk/ivy/libraries.properties
* /hive/trunk/metastore/ivy.xml
* /hive/trunk/cli/ivy.xml
* /hive/trunk/serde/ivy.xml
* /hive/trunk/common/ivy.xml
* /hive/trunk/build-common.xml
[jira] [Commented] (HIVE-2161) Remaining patch for HIVE-2148
[ https://issues.apache.org/jira/browse/HIVE-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035462#comment-13035462 ]
Ashutosh Chauhan commented on HIVE-2161:
Can someone commit this one? It has already been discussed at HIVE-2148.
Remaining patch for HIVE-2148
    Key: HIVE-2161
    URL: https://issues.apache.org/jira/browse/HIVE-2161
    Project: Hive
    Issue Type: Task
    Reporter: Ashutosh Chauhan
    Assignee: Ashutosh Chauhan
    Fix For: 0.8.0
    Attachments: hive_2161.patch
Follow-up jira for HIVE-2148.
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035493#comment-13035493 ]
Tomasz Nykiel commented on HIVE-2144:
Currently the schema of the stat table is the following:
    PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT )
and it does not have any integrity constraints declared. We can amend it to:
    PARTITION_STAT_TABLE ( ID VARCHAR(255) UNIQUE, ROW_COUNT BIGINT )
Then, instead of executing two queries per row inserted, we can execute one INSERT query, as we do currently. When the integrity constraint is violated (via the unique index), which surfaces as an exception we can catch, we perform a single UPDATE query. The UPDATE query needs to check whether the currently inserted stats are newer than the ones already in the table:
    UPDATE PARTITION_STAT_TBL
    SET ROW_COUNT = new_value
    WHERE ID = rowID
      AND (0) new_value > (1) (SELECT TEMP.ROW_COUNT FROM (2) (SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = rowID) TEMP )
-- (0) is a condition that checks whether the newly inserted value is greater than the one we already have.
-- (1) and (2) are a workaround for MySQL, which does not allow a subquery to refer to the table that is being updated. Here, we basically materialize the value that we need for the comparison.
-- (1) should theoretically have LIMIT 1 to choose exactly one tuple; however, Derby does not support it, and by the unique constraint and the fact that the insert failed, there exists exactly one tuple matching the ID predicate.
To summarize, for non-existing rows, only one INSERT query will be executed instead of two. For existing rows, which seem to occur very infrequently, two queries will be executed instead of three.
reduce workload generated by JDBCStatsPublisher
    Key: HIVE-2144
    URL: https://issues.apache.org/jira/browse/HIVE-2144
    Project: Hive
    Issue Type: Improvement
    Reporter: Ning Zhang
    Assignee: Tomasz Nykiel
In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is there, an INSERT or UPDATE query is issued. So there are basically 2x queries per row inserted into the intermediate stats table. This workload could be halved if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefit is that even though the aggregation query is more expensive, it is only run once per query.
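The insert-first scheme proposed in the comment above can be sketched as follows. This is a minimal illustration and not the JDBCStatsPublisher code: it uses Python's built-in sqlite3 module as a stand-in for Derby or MySQL, and the helper names make_stats_table and publish_stat are hypothetical. The nested subquery is kept only to mirror the MySQL workaround described above (SQLite does not need it), and its alias is PREV rather than TEMP to avoid a keyword clash in SQLite.

```python
import sqlite3


def make_stats_table(conn):
    """Create the stats table with the UNIQUE constraint proposed above."""
    conn.execute(
        "CREATE TABLE PARTITION_STAT_TBL "
        "(ID VARCHAR(255) UNIQUE, ROW_COUNT BIGINT)"
    )


def publish_stat(conn, row_id, new_value):
    """Try a single INSERT; fall back to a guarded UPDATE on a duplicate ID.

    Returns "inserted", "updated", or "kept" (existing value was not smaller).
    """
    try:
        conn.execute(
            "INSERT INTO PARTITION_STAT_TBL (ID, ROW_COUNT) VALUES (?, ?)",
            (row_id, new_value),
        )
        return "inserted"
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected the INSERT, so exactly one row
        # with this ID exists. Overwrite it only if the new value is greater.
        cur = conn.execute(
            "UPDATE PARTITION_STAT_TBL SET ROW_COUNT = ? "
            "WHERE ID = ? AND ? > (SELECT PREV.ROW_COUNT FROM "
            "(SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = ?) PREV)",
            (new_value, row_id, new_value, row_id),
        )
        return "updated" if cur.rowcount == 1 else "kept"
```

In the common case (no duplicate ID) only the INSERT runs; the UPDATE is issued only after a constraint violation, which matches the one-query versus two-query accounting in the comment.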
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035612#comment-13035612 ]
Ning Zhang commented on HIVE-2144:
Great! I like the idea. One comment about the primary key constraint: I'm not sure if UNIQUE is the standard way to specify a primary key constraint. There are people using Oracle/MS SQL Server/Postgres as the metastore, so we should use a standard way. I think 'id varchar(255) PRIMARY KEY' is more widely supported. Can you double-check with MySQL and Derby?
Jenkins build is still unstable: Hive-trunk-h0.21 #737
See https://builds.apache.org/hudson/job/Hive-trunk-h0.21/changes
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035634#comment-13035634 ]
Tomasz Nykiel commented on HIVE-2144:
Yes, I agree. There are some subtle differences between UNIQUE and PK in Derby and MySQL (e.g., in MySQL the unique index allows null values, and in Derby it does not). So in general, a PK constraint will be more suitable.
    CREATE TABLE PARTITION_STAT_TBL ( ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT )
works for both Derby and MySQL. After a quick check, it seems that it's supported by Oracle/MSSQL as well.
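For comparison, the alternative raised in the HIVE-2144 issue description (insert unconditionally into a table with no key constraint, then deduplicate IDs at aggregation time with GROUP BY and MAX()) might look like the sketch below. It again uses Python's sqlite3 module as a stand-in database. The function names are hypothetical, and it assumes the aggregation simply sums row counts, which the thread does not spell out.

```python
import sqlite3


def make_unconstrained_table(conn):
    """Create the stats table without any integrity constraint."""
    conn.execute(
        "CREATE TABLE PARTITION_STAT_TBL (ID VARCHAR(255), ROW_COUNT BIGINT)"
    )


def publish_stat_blind(conn, row_id, row_count):
    """Always INSERT; duplicate IDs from retried tasks are tolerated."""
    conn.execute(
        "INSERT INTO PARTITION_STAT_TBL (ID, ROW_COUNT) VALUES (?, ?)",
        (row_id, row_count),
    )


def total_row_count(conn):
    """Sum row counts, counting each ID once via its maximum reported value."""
    row = conn.execute(
        "SELECT COALESCE(SUM(MAX_COUNT), 0) FROM "
        "(SELECT ID, MAX(ROW_COUNT) AS MAX_COUNT "
        "FROM PARTITION_STAT_TBL GROUP BY ID) AS DEDUP"
    ).fetchone()
    return row[0]
```

Here every publish is a single INSERT, and the cost of deduplication moves into the one aggregation query, which is the trade-off the issue description points out.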
[jira] [Commented] (HIVE-2096) throw a error if the input is larger than a threshold for index input format
[ https://issues.apache.org/jira/browse/HIVE-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035965#comment-13035965 ]
He Yongqiang commented on HIVE-2096:
Will commit after tests pass.
throw a error if the input is larger than a threshold for index input format
    Key: HIVE-2096
    URL: https://issues.apache.org/jira/browse/HIVE-2096
    Project: Hive
    Issue Type: Bug
    Affects Versions: 0.8.0
    Reporter: Namit Jain
    Attachments: HIVE-2096.1.patch.txt, HIVE-2096.2.patch.txt, HIVE-2096.3.patch.txt, HIVE-2096.4.patch.txt
This can hang forever.