[GitHub] [hive] shawnweeks commented on issue #575: HIVE-21409 Add Jars to Session Conf ClassLoader
shawnweeks commented on issue #575: HIVE-21409 Add Jars to Session Conf ClassLoader URL: https://github.com/apache/hive/pull/575#issuecomment-475092978 I suspect I should be doing the same thing for unregisterJar as it has the risk of overwriting the classloader in SessionState.get().getConf() as well. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] shawnweeks opened a new pull request #575: HIVE-21409 Add Jars to Session Conf ClassLoader
shawnweeks opened a new pull request #575: HIVE-21409 Add Jars to Session Conf ClassLoader URL: https://github.com/apache/hive/pull/575 It is possible that the current thread's classloader may be modified after the SessionState HiveConf has been attached to the current thread. This ensures we are always adding jars to the correct classloader.
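The fix described above can be sketched in plain Java. This is a hypothetical stand-in, not Hive's actual code: `AddJarSketch`, `CONF_LOADER`, and `addJar` are illustrative names, with `CONF_LOADER` playing the role of the classloader held by `SessionState.get().getConf()`.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative sketch: add jars to the classloader held by the session's
// conf rather than the thread's current loader, which may have been swapped
// after the conf was attached to the thread.
public class AddJarSketch {

    // Hypothetical stand-in for SessionState.get().getConf().getClassLoader().
    static final ClassLoader CONF_LOADER =
            new URLClassLoader(new URL[0], AddJarSketch.class.getClassLoader());

    // Wrap the conf's loader with one that also sees the new jar, so classes
    // already visible through the conf's loader and the jar's classes both resolve.
    static URLClassLoader addJar(ClassLoader confLoader, URL jarUrl) {
        return new URLClassLoader(new URL[] { jarUrl }, confLoader);
    }
}
```

The key point is that the base loader is taken from the conf, never from `Thread.currentThread().getContextClassLoader()`, so a later swap of the thread's loader cannot hide the registered jars.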
[GitHub] [hive] rmsmani commented on issue #540: HIVE-21283 Synonyms for the existing functions
rmsmani commented on issue #540: HIVE-21283 Synonyms for the existing functions URL: https://github.com/apache/hive/pull/540#issuecomment-475115627 Hi @sankarh, at last the unit tests came back green. Please review and merge the code to master.
[GitHub] [hive] rmsmani closed pull request #540: HIVE-21283 Synonyms for the existing functions
rmsmani closed pull request #540: HIVE-21283 Synonyms for the existing functions URL: https://github.com/apache/hive/pull/540
[GitHub] [hive] rmsmani commented on issue #540: HIVE-21283 Synonyms for the existing functions
rmsmani commented on issue #540: HIVE-21283 Synonyms for the existing functions URL: https://github.com/apache/hive/pull/540#issuecomment-476459266 Source committed to master, so closing.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269425412 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1247,17 +1244,37 @@ private static void createReplImportTasks( } else if (!replicationSpec.isMetadataOnly() && !shouldSkipDataCopyInReplScope(tblDesc, replicationSpec)) { x.getLOG().debug("adding dependent CopyWork/MoveWork for table"); -t.addDependentTask(loadTable(fromURI, table, replicationSpec.isReplace(), -new Path(tblDesc.getLocation()), replicationSpec, x, writeId, stmtId)); +dependentTasks = new ArrayList<>(1); +dependentTasks.add(loadTable(fromURI, table, replicationSpec.isReplace(), + new Path(tblDesc.getLocation()), replicationSpec, + x, writeId, stmtId)); } - if (dropTblTask != null) { -// Drop first and then create -dropTblTask.addDependentTask(t); -x.getTasks().add(dropTblTask); + // During replication, by the time we reply a commit transaction event, the table should + // have been already created when replaying previous events. So no need to create table + // again. For some reason we need create table task for partitioned table though. Review comment: Corrected. The partition case is already fixed, but the comment wasn't corrected.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269468871 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -2950,21 +2956,33 @@ public Partition createPartition(Table tbl, Map partSpec) throws int size = addPartitionDesc.getPartitionCount(); List in = new ArrayList(size); -AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl, true); long writeId; String validWriteIdList; -if (tableSnapshot != null && tableSnapshot.getWriteId() > 0) { - writeId = tableSnapshot.getWriteId(); - validWriteIdList = tableSnapshot.getValidWriteIdList(); + +// In case of replication, get the writeId from the source and use valid write Id list +// for replication. +if (addPartitionDesc.getReplicationSpec() != null && Review comment: Done.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269476213 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ImportTableDesc.java ## @@ -381,4 +382,11 @@ public void setOwnerName(String ownerName) { throw new RuntimeException("Invalid table type : " + getDescType()); } } + + public Long getReplWriteId() { +if (this.createTblDesc != null) { + return this.createTblDesc.getReplWriteId(); Review comment: In getBaseAddPartitionDescFromPartition() where we use this function, we don't have access to the event message. Instead we are passing the writeId through ImportTableDesc by calling setReplWriteId(). This function just introduces the missing getReplWriteId() method, symmetric to setReplWriteId(). If we used a local variable and passed it around, the local writeId could go out of sync with the one in ImportTableDesc.
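The pattern argued for above, carrying the event's writeId through the descriptor via a symmetric getter/setter pair instead of a parallel local copy, can be sketched minimally. `ImportDescSketch` is a hypothetical stand-in, not Hive's `ImportTableDesc`:

```java
// Minimal mirror of carrying the event's writeId through the descriptor:
// the setter stores it once and the symmetric getter returns the same value
// later, so there is no second local copy that could drift out of sync.
public class ImportDescSketch {
    private Long replWriteId; // null until the replication flow sets it

    public void setReplWriteId(long writeId) {
        this.replWriteId = writeId;
    }

    public Long getReplWriteId() {
        return replWriteId;
    }
}
```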
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269463941 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -1894,6 +1898,16 @@ private void create_table_core(final RawStore ms, final Table tbl, List checkConstraints) throws AlreadyExistsException, MetaException, InvalidObjectException, NoSuchObjectException, InvalidInputException { + + ColumnStatistics colStats = null; + // If the given table has column statistics, save it here. We will update it later. + // We don't want it to be part of the Table object being created, lest the create table Review comment: " and also shouldn't be persisted". That's not true. We will persist the table stats but later. If you let me know which part of the comment is complex (needs simplification), will come up with alternate wording reflecting the same meaning.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269467626 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -2130,11 +2144,18 @@ private void create_table_core(final RawStore ms, final Table tbl, // If the table has column statistics, update it into the metastore. This feature is used // by replication to replicate table level statistics. - if (tbl.isSetColStats()) { -// We do not replicate statistics for a transactional table right now and hence we do not -// expect a transactional table to have column statistics here. So passing null -// validWriteIds is fine for now. -updateTableColumnStatsInternal(tbl.getColStats(), null, tbl.getWriteId()); + if (colStats != null) { +// On replica craft a valid snapshot out of the writeId in the table. +long writeId = tbl.getWriteId(); +String validWriteIds = null; +if (writeId > 0) { + ValidWriteIdList vwil = Review comment: Done.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269467699 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -2130,11 +2144,18 @@ private void create_table_core(final RawStore ms, final Table tbl, // If the table has column statistics, update it into the metastore. This feature is used // by replication to replicate table level statistics. - if (tbl.isSetColStats()) { -// We do not replicate statistics for a transactional table right now and hence we do not -// expect a transactional table to have column statistics here. So passing null -// validWriteIds is fine for now. -updateTableColumnStatsInternal(tbl.getColStats(), null, tbl.getWriteId()); + if (colStats != null) { +// On replica craft a valid snapshot out of the writeId in the table. +long writeId = tbl.getWriteId(); +String validWriteIds = null; +if (writeId > 0) { + ValidWriteIdList vwil = + new ValidReaderWriteIdList(TableName.getDbTable(tbl.getDbName(), Review comment: Done. Please check.
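The idea in the diff above, "crafting a valid snapshot out of the writeId" on the replica, can be illustrated standalone. `SnapshotSketch` is a hypothetical stand-in for `ValidReaderWriteIdList`, capturing only the semantics under discussion: the replicated writeId becomes the high-water mark and everything at or below it is treated as committed.

```java
// Illustrative snapshot crafted on the replica from a single replicated
// writeId: no open or aborted write ids exist below the high-water mark.
// This is NOT Hive's ValidReaderWriteIdList, just a sketch of its role here.
public class SnapshotSketch {
    final String fullTableName; // e.g. "db.tbl", as produced by TableName.getDbTable
    final long highWatermark;   // the writeId carried over from the source

    SnapshotSketch(String fullTableName, long highWatermark) {
        this.fullTableName = fullTableName;
        this.highWatermark = highWatermark;
    }

    // Every write id at or below the replicated one is visible on the replica.
    boolean isWriteIdValid(long writeId) {
        return writeId > 0 && writeId <= highWatermark;
    }
}
```

This also motivates the sibling review comment asking to document why a hardcoded valid-write-id list is used here instead of the current transaction state: on the replica the source's writeId is authoritative.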
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269467769 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -3539,10 +3573,19 @@ public boolean equals(Object obj) { } // Update partition column statistics if available -for (Partition newPart : newParts) { - if (newPart.isSetColStats()) { -updatePartitonColStatsInternal(tbl, newPart.getColStats(), null, newPart.getWriteId()); +int cnt = 0; +for (ColumnStatistics partColStats: partsColStats) { + long writeId = partsWriteIds.get(cnt++); + // On replica craft a valid snapshot out of the writeId in the partition + String validWriteIds = null; + if (writeId > 0) { +ValidWriteIdList vwil = Review comment: Done. Please check.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269452642 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/events/filesystem/FSTableEvent.java ## @@ -199,12 +199,15 @@ private AddPartitionDesc partitionDesc(Path fromPath, // Right now, we do not have a way of associating a writeId with statistics for a table // converted to a transactional table if it was non-transactional on the source. So, do not Review comment: Done. Looks like I missed pushing an entire commit fixing the comments. Done now.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269508934 ## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/CreateTableDesc.java ## @@ -118,7 +118,8 @@ List notNullConstraints; List defaultConstraints; List checkConstraints; - private ColumnStatistics colStats; + private ColumnStatistics colStats; // For the sake of replication + private long writeId = -1; // For the sake of replication Review comment: I was initially afraid that there could be other side-effects of this change. Your suggestion will bring all writeId replication through replWriteId, which is good. Done.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269523732 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -2689,7 +2689,19 @@ private int alterTable(Hive db, AlterTableDesc alterTbl) throws HiveException { } else { // Note: this is necessary for UPDATE_STATISTICS command, that operates via ADDPROPS (why?). // For any other updates, we don't want to do txn check on partitions when altering table. -boolean isTxn = alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS; +boolean isTxn = false; +if (alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS) { + // ADDPROPS is used to add repl.last.id during replication. That's not a transactional + // change. + Map props = alterTbl.getProps(); + if (props.size() <= 1 && props.get(ReplicationSpec.KEY.CURR_STATE_ID.toString()) != null) { Review comment: The comment // Note: this is necessary for UPDATE_STATISTICS command, that operates via ADDPROPS (why?). // For any other updates, we don't want to do txn check on partitions when altering table. itself looks wrong. I do not see any ADDPROPS usage which is updating statistics properties. All those seem to come through AddPartition and not alterTable for partitioned table. So, maybe we can safely mark this as non-transactional always. Does that look right?
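The `isTxn` decision in the diff above can be sketched as a standalone predicate: an ADDPROPS whose only property is `repl.last.id` (the literal behind `ReplicationSpec.KEY.CURR_STATE_ID`, assumed here) is replication bookkeeping, not a transactional change. `AddPropsTxnCheck` is a stand-in, not Hive's `DDLTask`.

```java
import java.util.Map;

// Sketch of the txn-check decision for an ADDPROPS alter on a partition:
// skip the transactional check when the only property being added is the
// replication state id. The key literal is an assumption for illustration.
public class AddPropsTxnCheck {
    static final String CURR_STATE_ID = "repl.last.id"; // assumed literal

    static boolean isTxn(Map<String, String> props) {
        // Only repl.last.id present: replication bookkeeping, not a txn change.
        return !(props.size() <= 1 && props.get(CURR_STATE_ID) != null);
    }
}
```

Extracting the predicate like this also makes the reviewer's question testable: if no caller ever passes statistics properties through ADDPROPS, the `true` branch is dead and the check could indeed be hardcoded to non-transactional.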
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269424107 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -828,6 +828,8 @@ public void alterPartitions(String tblName, List newParts, new ArrayList(); try { AcidUtils.TableSnapshot tableSnapshot = null; + // TODO: In case of replication use the writeId and valid write id list constructed for Review comment: I have addressed this comment and removed it as well, but I didn't commit the change, so it wasn't part of the PR. I have updated the PR; this TODO is no longer there. Sorry.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269423978 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -2689,7 +2689,19 @@ private int alterTable(Hive db, AlterTableDesc alterTbl) throws HiveException { } else { // Note: this is necessary for UPDATE_STATISTICS command, that operates via ADDPROPS (why?). // For any other updates, we don't want to do txn check on partitions when altering table. -boolean isTxn = alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS; +boolean isTxn = false; +if (alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS) { + // ADDPROPS is used to add repl.last.id during replication. That's not a transactional + // change. + Map props = alterTbl.getProps(); + if (props.size() <= 1 && props.get(ReplicationSpec.KEY.CURR_STATE_ID.toString()) != null) { +isTxn = false; + } else { +isTxn = true; + } +} +// TODO: Somehow we have to signal alterPartitions that it's part of replication and +// should use replication's valid writeid list instead of creating one. Review comment: I have addressed this comment and removed it as well, but I didn't commit the change, so it wasn't part of the PR. I have updated the PR; this TODO is no longer there. Sorry.
[GitHub] [hive] miklosgergely opened a new pull request #580: HIVE-21231 HiveJoinAddNotNullRule support for range predicates
miklosgergely opened a new pull request #580: HIVE-21231 HiveJoinAddNotNullRule support for range predicates URL: https://github.com/apache/hive/pull/580
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269262756 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ImportTableDesc.java ## @@ -381,4 +382,11 @@ public void setOwnerName(String ownerName) { throw new RuntimeException("Invalid table type : " + getDescType()); } } + + public Long getReplWriteId() { +if (this.createTblDesc != null) { + return this.createTblDesc.getReplWriteId(); Review comment: This replWriteId is just a placeholder for the writeId from the event message. It need not be in CreateTableDesc; it can be maintained in a local variable and passed around.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269110947 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -828,6 +828,8 @@ public void alterPartitions(String tblName, List newParts, new ArrayList(); try { AcidUtils.TableSnapshot tableSnapshot = null; + // TODO: In case of replication use the writeId and valid write id list constructed for Review comment: Is it done or still TODO?
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269247183 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestStatsReplicationScenarios.java ## @@ -359,17 +383,20 @@ private void testStatsReplicationCommon(boolean parallelBootstrap, boolean metad } @Test - public void testForNonAcidTables() throws Throwable { + public void testNonParallelBootstrapLoad() throws Throwable { +LOG.info("Testing " + testName.getClass().getName() + "." + testName.getMethodName()); testStatsReplicationCommon(false, false); } @Test - public void testForNonAcidTablesParallelBootstrapLoad() throws Throwable { -testStatsReplicationCommon(true, false); + public void testForParallelBootstrapLoad() throws Throwable { +LOG.info("Testing " + testName.getClass().getName() + "." + testName.getMethodName()); +testStatsReplicationCommon(true, false ); } @Test - public void testNonAcidMetadataOnlyDump() throws Throwable { + public void testMetadataOnlyDump() throws Throwable { Review comment: Add more tests for the following scenarios. 1. REPL LOAD fails after replicating table or partition objects with stats but before setting last replId. Now, retry which takes alter table/partition replace flows and stats should be valid after successful replication. Need this for all non-transactional, transactional and migration cases. 2. Parallel inserts with autogather enabled. Now, we will have events such that multiple txns open when updating stats event. Also, try to simulate that one stats update was successful and the other one invalidates it due to concurrent writes.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269223302 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -2950,21 +2956,33 @@ public Partition createPartition(Table tbl, Map partSpec) throws int size = addPartitionDesc.getPartitionCount(); List in = new ArrayList(size); -AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl, true); long writeId; String validWriteIdList; -if (tableSnapshot != null && tableSnapshot.getWriteId() > 0) { - writeId = tableSnapshot.getWriteId(); - validWriteIdList = tableSnapshot.getValidWriteIdList(); + +// In case of replication, get the writeId from the source and use valid write Id list +// for replication. +if (addPartitionDesc.getReplicationSpec() != null && Review comment: addPartitionDesc.getReplicationSpec() will never be null. Can remove this check.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269161871 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -2130,11 +2144,18 @@ private void create_table_core(final RawStore ms, final Table tbl, // If the table has column statistics, update it into the metastore. This feature is used // by replication to replicate table level statistics. - if (tbl.isSetColStats()) { -// We do not replicate statistics for a transactional table right now and hence we do not -// expect a transactional table to have column statistics here. So passing null -// validWriteIds is fine for now. -updateTableColumnStatsInternal(tbl.getColStats(), null, tbl.getWriteId()); + if (colStats != null) { +// On replica craft a valid snapshot out of the writeId in the table. +long writeId = tbl.getWriteId(); +String validWriteIds = null; +if (writeId > 0) { + ValidWriteIdList vwil = Review comment: Shall use meaningful names instead of "vwil".
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269257547 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -987,10 +989,14 @@ public void createTable(Table tbl, boolean ifNotExists, tTbl.setPrivileges(principalPrivs); } } - // Set table snapshot to api.Table to make it persistent. - TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl, true); - if (tableSnapshot != null) { -tbl.getTTable().setWriteId(tableSnapshot.getWriteId()); + // Set table snapshot to api.Table to make it persistent. A transactional table being + // replicated may have a valid write Id copied from the source. Use that instead of + // crafting one on the replica. + if (tTbl.getWriteId() <= 0) { Review comment: DO_NOT_UPDATE_STATS flag should be set in createTable flow as well. Or else in autogather mode at target, it will be updated automatically.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269172695 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -3539,10 +3573,19 @@ public boolean equals(Object obj) { } // Update partition column statistics if available -for (Partition newPart : newParts) { - if (newPart.isSetColStats()) { -updatePartitonColStatsInternal(tbl, newPart.getColStats(), null, newPart.getWriteId()); +int cnt = 0; +for (ColumnStatistics partColStats: partsColStats) { + long writeId = partsWriteIds.get(cnt++); + // On replica craft a valid snapshot out of the writeId in the partition + String validWriteIds = null; + if (writeId > 0) { +ValidWriteIdList vwil = Review comment: Same as above.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269156935 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -1894,6 +1898,16 @@ private void create_table_core(final RawStore ms, final Table tbl, List checkConstraints) throws AlreadyExistsException, MetaException, InvalidObjectException, NoSuchObjectException, InvalidInputException { + + ColumnStatistics colStats = null; + // If the given table has column statistics, save it here. We will update it later. + // We don't want it to be part of the Table object being created, lest the create table Review comment: Shall simplify the comment. "Column stats are not expected to be part of Create table event and also shouldn't be persisted. So remove it from Table object."
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269169210 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -2130,11 +2144,18 @@ private void create_table_core(final RawStore ms, final Table tbl, // If the table has column statistics, update it into the metastore. This feature is used // by replication to replicate table level statistics. - if (tbl.isSetColStats()) { -// We do not replicate statistics for a transactional table right now and hence we do not -// expect a transactional table to have column statistics here. So passing null -// validWriteIds is fine for now. -updateTableColumnStatsInternal(tbl.getColStats(), null, tbl.getWriteId()); + if (colStats != null) { +// On replica craft a valid snapshot out of the writeId in the table. +long writeId = tbl.getWriteId(); +String validWriteIds = null; +if (writeId > 0) { + ValidWriteIdList vwil = + new ValidReaderWriteIdList(TableName.getDbTable(tbl.getDbName(), Review comment: Shall add a comment on why the hardcoded validWriteList is used in this flow instead of taking current state of txns.
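The "hardcoded snapshot" being discussed above can be illustrated with a simplified, self-contained stand-in for Hive's `ValidReaderWriteIdList` (the class and method names below are illustrative only, not Hive's actual API): on the replica the high-water mark is pinned to the writeId received from the event and no writeIds are excepted, so everything at or below the replicated writeId reads as valid.

```java
// Simplified stand-in for Hive's ValidReaderWriteIdList (illustrative only):
// the high-water mark is pinned to the writeId replicated from the source,
// and the exception list is empty on the replica, so every writeId at or
// below the replicated one is treated as valid.
class SimpleValidWriteIdList {
    private final String tableName;   // "db.table" the snapshot belongs to
    private final long[] exceptions;  // open/aborted writeIds (empty on replica)
    private final long highWatermark; // the writeId copied from the source event

    SimpleValidWriteIdList(String tableName, long[] exceptions, long highWatermark) {
        this.tableName = tableName;
        this.exceptions = exceptions;
        this.highWatermark = highWatermark;
    }

    boolean isWriteIdValid(long writeId) {
        if (writeId > highWatermark) {
            return false; // allocated after this snapshot was crafted
        }
        for (long e : exceptions) {
            if (e == writeId) {
                return false; // was open or aborted at snapshot time
            }
        }
        return true;
    }
}

public class CraftedSnapshotDemo {
    public static void main(String[] args) {
        // The replicated event carried writeId 7; craft the snapshot from it.
        SimpleValidWriteIdList vwil = new SimpleValidWriteIdList("db.tbl", new long[0], 7L);
        System.out.println(vwil.isWriteIdValid(7)); // true
        System.out.println(vwil.isWriteIdValid(8)); // false
    }
}
```

This mirrors why a comment is requested in the review: the snapshot is deliberately crafted from the event's writeId rather than from the current state of transactions on the replica.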
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269154738 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnCommonUtils.java ## @@ -84,6 +86,73 @@ public static ValidTxnList createValidReadTxnList(GetOpenTxnsResponse txns, long return new ValidReadTxnList(exceptions, outAbortedBits, highWaterMark, minOpenTxnId); } + /** + * Transform a {@link org.apache.hadoop.hive.metastore.api.GetOpenTxnsResponse} to a + * {@link org.apache.hadoop.hive.common.ValidTxnList}. This assumes that the caller intends to + * read the files, and thus treats both open and aborted transactions as invalid. + * + * This API is used by Hive replication which may have multiple transactions open at a time. + * + * @param txns open txn list from the metastore + * @param currentTxns Current transactions that the replication has opened. If any of the + *transactions is greater than 0 it will be removed from the exceptions + *list so that the replication sees its own transaction as valid. + * @return a valid txn list. + */ + public static ValidTxnList createValidReadTxnList(GetOpenTxnsResponse txns, Review comment: The complete logic of considering all txns opened in a batch by open txn event as current txns is incorrect. Multiple txns are opened by repl task only for replicating Hive Streaming case where we allocate txns batch but use one at a time. Also, we don't update stats in that case. Even if we update stats, it should refer to one txn as current txn and rest of the txns are left open. Shall remove replTxnIds cache in TxnManager as well. All callers shall create a hardcoded ValidWriteIdList using the writeId received from event msg.
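The mechanism under discussion — removing the replication task's own open transactions from the exception list so a replayed write sees itself as valid — can be sketched as follows. This is a hedged, self-contained stand-in: `removeCurrentTxns` is a hypothetical helper, not the actual `createValidReadTxnList` implementation.

```java
import java.util.Arrays;

// Sketch of the exception-list filtering described in the review thread:
// transactions opened by the replication task itself are dropped from the
// open-txn exception list, so the replayed write is considered valid while
// all other open/aborted txns stay invalid.
public class ReplTxnFilter {
    static long[] removeCurrentTxns(long[] openTxns, long[] currentReplTxns) {
        return Arrays.stream(openTxns)
                .filter(t -> Arrays.stream(currentReplTxns).noneMatch(c -> c == t))
                .toArray();
    }

    public static void main(String[] args) {
        long[] open = {3, 5, 9}; // open txns reported by the metastore
        long[] mine = {5};       // txn opened by this replication task
        System.out.println(Arrays.toString(removeCurrentTxns(open, mine))); // [3, 9]
    }
}
```

The reviewer's objection is to treating a whole batch of repl-opened txns as "current"; the suggested alternative is to skip this filtering entirely and have callers build a hardcoded ValidWriteIdList from the event's writeId instead.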
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269081532 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -2689,7 +2689,19 @@ private int alterTable(Hive db, AlterTableDesc alterTbl) throws HiveException { } else { // Note: this is necessary for UPDATE_STATISTICS command, that operates via ADDPROPS (why?). // For any other updates, we don't want to do txn check on partitions when altering table. -boolean isTxn = alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS; +boolean isTxn = false; +if (alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS) { + // ADDPROPS is used to add repl.last.id during replication. That's not a transactional + // change. + Map props = alterTbl.getProps(); + if (props.size() <= 1 && props.get(ReplicationSpec.KEY.CURR_STATE_ID.toString()) != null) { Review comment: ReplUtils.REPL_CHECKPOINT_KEY is another prop we set in the repl flow which is not transactional. This check doesn't seem clean, as in future we might add more such alters in the repl flow. Can we check replicationSpec.isReplicationScope instead, or another flag in AlterTableDesc to skip this?
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269103325 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/events/filesystem/FSTableEvent.java ## @@ -199,12 +199,15 @@ private AddPartitionDesc partitionDesc(Path fromPath, // Right now, we do not have a way of associating a writeId with statistics for a table // converted to a transactional table if it was non-transactional on the source. So, do not Review comment: Comment needs to be corrected.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269220469 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -2950,21 +2956,33 @@ public Partition createPartition(Table tbl, Map partSpec) throws int size = addPartitionDesc.getPartitionCount(); List in = new ArrayList(size); -AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl, true); long writeId; String validWriteIdList; +// In case of replication, get the writeId from the source and use valid write Id list +// for replication. +if (addPartitionDesc.getReplicationSpec() != null && +addPartitionDesc.getReplicationSpec().isInReplicationScope() && +addPartitionDesc.getPartition(0).getWriteId() > 0) { + writeId = addPartitionDesc.getPartition(0).getWriteId(); + validWriteIdList = Review comment: In the replication flow, it is fine to use a hardcoded ValidWriteIdList as we want to forcefully set this writeId into table or partition objects. Getting it from the current state might be wrong, as we don't update ValidTxnList in conf for repl-created txns. ValidWriteIdList is just used to check if the writeId in metastore objects was updated by any concurrent inserts. In the repl load flow, that is not possible, as we replicate one event at a time, and in bootstrap no two threads write into the same table.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269136269 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1247,17 +1244,37 @@ private static void createReplImportTasks( } else if (!replicationSpec.isMetadataOnly() && !shouldSkipDataCopyInReplScope(tblDesc, replicationSpec)) { x.getLOG().debug("adding dependent CopyWork/MoveWork for table"); -t.addDependentTask(loadTable(fromURI, table, replicationSpec.isReplace(), -new Path(tblDesc.getLocation()), replicationSpec, x, writeId, stmtId)); +dependentTasks = new ArrayList<>(1); +dependentTasks.add(loadTable(fromURI, table, replicationSpec.isReplace(), + new Path(tblDesc.getLocation()), replicationSpec, + x, writeId, stmtId)); } - if (dropTblTask != null) { -// Drop first and then create -dropTblTask.addDependentTask(t); -x.getTasks().add(dropTblTask); + // During replication, by the time we replay a commit transaction event, the table should + // have been already created when replaying previous events. So no need to create table + // again. For some reason we need create table task for partitioned table though. Review comment: The comment says a create table task is needed for partitioned tables, but in the code it is always skipped for the commit txn event. Which one is correct?
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269098036 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -2689,7 +2689,19 @@ private int alterTable(Hive db, AlterTableDesc alterTbl) throws HiveException { } else { // Note: this is necessary for UPDATE_STATISTICS command, that operates via ADDPROPS (why?). // For any other updates, we don't want to do txn check on partitions when altering table. -boolean isTxn = alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS; +boolean isTxn = false; +if (alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS) { + // ADDPROPS is used to add repl.last.id during replication. That's not a transactional + // change. + Map props = alterTbl.getProps(); + if (props.size() <= 1 && props.get(ReplicationSpec.KEY.CURR_STATE_ID.toString()) != null) { +isTxn = false; + } else { +isTxn = true; + } +} +// TODO: Somehow we have to signal alterPartitions that it's part of replication and +// should use replication's valid writeid list instead of creating one. Review comment: What do you mean by replication's valid writeid list in this comment? Even in repl flow, we get validWriteIdList from HMS based on incoming writeId in the event msg. Are you suggesting to cache this ValidWriteIdList somewhere and use it instead of invoking HMS API?
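The two alternatives in this exchange can be sketched side by side. This is a minimal, hedged stand-in (the `isTxnAlter` helper, its parameters, and `inReplicationScope` are illustrative names, not the actual DDLTask code): the current patch keys off individual property names like repl.last.id, while the reviewer suggests a single replication-scope flag.

```java
import java.util.Map;

// Illustrative sketch of the ADDPROPS txn-check decision discussed above.
public class AlterPropsCheck {
    static final String CURR_STATE_ID = "repl.last.id"; // as quoted in the diff

    // Returns whether the ALTER should be treated as transactional.
    static boolean isTxnAlter(Map<String, String> props, boolean hasPartSpec,
                              boolean isAddProps, boolean inReplicationScope) {
        if (!hasPartSpec || !isAddProps) {
            return false; // only partition-level ADDPROPS is txn-checked
        }
        // Reviewer's suggestion: one replication-scope flag instead of
        // enumerating per-key exceptions (repl.last.id, checkpoint key, ...).
        if (inReplicationScope) {
            return false;
        }
        // Current approach under review: non-transactional only when the sole
        // property being set is repl.last.id.
        return !(props.size() <= 1 && props.containsKey(CURR_STATE_ID));
    }

    public static void main(String[] args) {
        System.out.println(isTxnAlter(Map.of(CURR_STATE_ID, "42"), true, true, false)); // false
        System.out.println(isTxnAlter(Map.of("some.other.prop", "v"), true, true, false)); // true
    }
}
```

The scope-flag variant avoids the fragility the reviewer points out: each new non-transactional property added in the repl flow would otherwise require extending the per-key check.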
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269060256 ## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/CreateTableDesc.java ## @@ -118,7 +118,8 @@ List notNullConstraints; List defaultConstraints; List checkConstraints; - private ColumnStatistics colStats; + private ColumnStatistics colStats; // For the sake of replication + private long writeId = -1; // For the sake of replication Review comment: Can we re-use the replWriteId variable that we already have?
[GitHub] [hive] Fokko opened a new pull request #576: FLINK-11992 Update Apache Parquet to 1.10.1
Fokko opened a new pull request #576: FLINK-11992 Update Apache Parquet to 1.10.1 URL: https://github.com/apache/hive/pull/576 Fixes two bugs which were discovered in Apache Spark: https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1101
[GitHub] [hive] maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268475619 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/common/ReplConst.java ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.common; + +/** + * A class that defines the constant strings used by the replication implementation. + */ + +public class ReplConst { Review comment: The name can be ReplCommon, as it's in the common folder.
[GitHub] [hive] maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268474981 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java ## @@ -673,6 +674,59 @@ public void retryIncBootstrapExternalTablesFromDifferentDumpWithoutCleanTablesCo ErrorMsg.REPL_BOOTSTRAP_LOAD_PATH_NOT_VALID.getErrorCode()); } + @Test + public void dynamicallyConvertManagedToExternalTable() throws Throwable { +List dumpWithClause = Collections.singletonList( +"'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'" +); +List loadWithClause = externalTableBasePathWithClause(); + +WarehouseInstance.Tuple tupleBootstrapManagedTable = primary.run("use " + primaryDbName) +.run("create table t1 (id int)") +.run("insert into table t1 values (1)") +.run("create table t2 (id int) partitioned by (key int)") Review comment: The case of a managed ACID table being converted to external does not exist; it should always be via migration that an ACID table gets converted to external.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268483462 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java ## @@ -673,6 +674,59 @@ public void retryIncBootstrapExternalTablesFromDifferentDumpWithoutCleanTablesCo ErrorMsg.REPL_BOOTSTRAP_LOAD_PATH_NOT_VALID.getErrorCode()); } + @Test + public void dynamicallyConvertManagedToExternalTable() throws Throwable { +List dumpWithClause = Collections.singletonList( +"'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'" +); +List loadWithClause = externalTableBasePathWithClause(); + +WarehouseInstance.Tuple tupleBootstrapManagedTable = primary.run("use " + primaryDbName) +.run("create table t1 (id int)") +.run("insert into table t1 values (1)") +.run("create table t2 (id int) partitioned by (key int)") Review comment: Did you mean, a non-acid table is created as acid table in target (due to migration) and later in source they change to external?
[GitHub] [hive] ashutosh-bapat opened a new pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat opened a new pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579 During bootstrap we use a method similar to non-ACID tables to transfer statistics of an ACID table from source to replica. However, installing statistics of an ACID table requires a valid writeId and writeId list. We use the table/partition's latest writeId and a valid transaction list containing only that writeId to install the statistics in the metastore. During incremental replication, the writeId is obtained from the UpdateStats event, and a valid writeId list with that writeId marked as valid is used to install the column statistics. Table level statistics is replicated by replaying the corresponding ALTER_TABLE/ALTER_PARTITION event. Further, this commit has the following related changes. 1. The table or the partition associated with the commit transaction event should have been created when replaying corresponding events before the commit transaction event. Thus there is no need to add tasks for creating the table or the partition. 2. Maintain a list of open replicated transactions and use that to create a valid transactions list when replaying a replicated event.
[GitHub] [hive] maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268474795 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -99,9 +100,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam dbname = dbname.toLowerCase(); final boolean cascade = environmentContext != null -&& environmentContext.isSetProperties() -&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get( -StatsSetupConst.CASCADE)); +&& environmentContext.isSetProperties() +&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get(StatsSetupConst.CASCADE)); +final boolean replDataLocationChanged = environmentContext != null Review comment: Yes, I think the flag sent from HiveServer to the metastore using the environment context is not required; the code changes are redundant.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268480500 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -400,7 +405,26 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam "Unable to change partition or table. Object " + e.getMessage() + " does not exist." + " Check metastore logs for detailed stack."); } finally { - if (!success) { + if (success) { +// Txn was committed successfully. +// If data location is changed in replication flow, then need to delete the old path. +if (replDataLocationChanged) { + assert(olddb != null); + assert(oldt != null); + Path deleteOldDataLoc = new Path(oldt.getSd().getLocation()); + boolean isAutoPurge = "true".equalsIgnoreCase(oldt.getParameters().get("auto.purge")); + try { +wh.deleteDir(deleteOldDataLoc, true, isAutoPurge, olddb); +LOG.info("Deleted the old data location: {} for the table: {}", +deleteOldDataLoc, dbname + "." + name); + } catch (MetaException ex) { +// Eat the exception as it doesn't affect the state of existing tables. +// Expect, user to manually drop this path when exception and so logging a warning. +LOG.warn("Unable to delete the old data location: {} for the table: {}", Review comment: Doesn't matter, we ignore that exception here. REPL LOAD will succeed. Also, previous run already archived in CM dir.
[GitHub] [hive] sankarh closed pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh closed pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268480328 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -192,12 +197,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam // 2) the table is not an external table, and // 3) the user didn't change the default location (or new location is empty), and // 4) the table was not initially created with a specified location - if (rename - && !oldt.getTableType().equals(TableType.VIRTUAL_VIEW.toString()) - && (oldt.getSd().getLocation().compareTo(newt.getSd().getLocation()) == 0 -|| StringUtils.isEmpty(newt.getSd().getLocation())) - && !MetaStoreUtils.isExternalTable(oldt)) { -Database olddb = msdb.getDatabase(catName, dbname); + if (replDataLocationChanged + || (rename Review comment: It cannot be same because the behaviour is different. If location is set for normal flow, we shouldn't update for partitions. Also, it is risky to modify any of current behaviour. It is not in scope of this ticket.
[GitHub] [hive] maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268476396 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -400,7 +405,26 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam "Unable to change partition or table. Object " + e.getMessage() + " does not exist." + " Check metastore logs for detailed stack."); } finally { - if (!success) { + if (success) { +// Txn was committed successfully. +// If data location is changed in replication flow, then need to delete the old path. +if (replDataLocationChanged) { + assert(olddb != null); + assert(oldt != null); + Path deleteOldDataLoc = new Path(oldt.getSd().getLocation()); + boolean isAutoPurge = "true".equalsIgnoreCase(oldt.getParameters().get("auto.purge")); + try { +wh.deleteDir(deleteOldDataLoc, true, isAutoPurge, olddb); +LOG.info("Deleted the old data location: {} for the table: {}", +deleteOldDataLoc, dbname + "." + name); + } catch (MetaException ex) { +// Eat the exception as it doesn't affect the state of existing tables. +// Expect, user to manually drop this path when exception and so logging a warning. +LOG.warn("Unable to delete the old data location: {} for the table: {}", Review comment: If the delete directory succeeds and the event replay fails for some other reason, then the event will be replayed again. In the next replay, if it does not find the directory, the CM copy might fail.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268480242 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/common/ReplConst.java ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.common; + +/** + * A class that defines the constant strings used by the replication implementation. + */ + +public class ReplConst { Review comment: I will keep it ReplConst for now. We have ReplUtils for common methods and if needed we can add another class.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268480382 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -209,7 +214,13 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam boolean tableInSpecifiedLoc = !oldtRelativePath.equalsIgnoreCase(name) && !oldtRelativePath.equalsIgnoreCase(name + Path.SEPARATOR); -if (!tableInSpecifiedLoc) { +if (replDataLocationChanged) { + // If data location is changed in replication flow, then new path was already set in + // the newt. Also, it is as good as the data is moved and set dataWasMoved=true so that + // location in partitions are also updated accordingly. + destPath = new Path(newt.getSd().getLocation()); Review comment: Yes. it is already like that.
[GitHub] [hive] maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268474981 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java ## @@ -673,6 +674,59 @@ public void retryIncBootstrapExternalTablesFromDifferentDumpWithoutCleanTablesCo ErrorMsg.REPL_BOOTSTRAP_LOAD_PATH_NOT_VALID.getErrorCode()); } + @Test + public void dynamicallyConvertManagedToExternalTable() throws Throwable { +List dumpWithClause = Collections.singletonList( +"'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'" +); +List loadWithClause = externalTableBasePathWithClause(); + +WarehouseInstance.Tuple tupleBootstrapManagedTable = primary.run("use " + primaryDbName) +.run("create table t1 (id int)") +.run("insert into table t1 values (1)") +.run("create table t2 (id int) partitioned by (key int)") Review comment: The case where a managed ACID table is converted to external does not exist; an ACID table can be converted to external only via migration.
[GitHub] [hive] maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268475152 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -192,12 +197,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam // 2) the table is not an external table, and // 3) the user didn't change the default location (or new location is empty), and // 4) the table was not initially created with a specified location - if (rename - && !oldt.getTableType().equals(TableType.VIRTUAL_VIEW.toString()) - && (oldt.getSd().getLocation().compareTo(newt.getSd().getLocation()) == 0 -|| StringUtils.isEmpty(newt.getSd().getLocation())) - && !MetaStoreUtils.isExternalTable(oldt)) { -Database olddb = msdb.getDatabase(catName, dbname); + if (replDataLocationChanged + || (rename Review comment: What I meant is that the flow can be kept the same for the replication and normal flows, to avoid adding extra complexity.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268483462 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java ## @@ -673,6 +674,59 @@ public void retryIncBootstrapExternalTablesFromDifferentDumpWithoutCleanTablesCo ErrorMsg.REPL_BOOTSTRAP_LOAD_PATH_NOT_VALID.getErrorCode()); } + @Test + public void dynamicallyConvertManagedToExternalTable() throws Throwable { +List dumpWithClause = Collections.singletonList( +"'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'" +); +List loadWithClause = externalTableBasePathWithClause(); + +WarehouseInstance.Tuple tupleBootstrapManagedTable = primary.run("use " + primaryDbName) +.run("create table t1 (id int)") +.run("insert into table t1 values (1)") +.run("create table t2 (id int) partitioned by (key int)") Review comment: Did you mean that a non-acid table is created as an ACID table at target (due to migration) and later changed to external at source? I think this case needs a lot of changes, to avoid distcp for external tables and so on. I will create another ticket for this, as there might be other cases too.
[GitHub] [hive] maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
maheshk114 commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268475390 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -209,7 +214,13 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam boolean tableInSpecifiedLoc = !oldtRelativePath.equalsIgnoreCase(name) && !oldtRelativePath.equalsIgnoreCase(name + Path.SEPARATOR); -if (!tableInSpecifiedLoc) { +if (replDataLocationChanged) { + // If data location is changed in replication flow, then new path was already set in + // the newt. Also, it is as good as the data is moved and set dataWasMoved=true so that + // location in partitions are also updated accordingly. + destPath = new Path(newt.getSd().getLocation()); Review comment: if for a partition the location is not within the table location and that table is altered to a external table ..the partition location need not be changed to base path at target ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268480113 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java ## @@ -673,6 +674,59 @@ public void retryIncBootstrapExternalTablesFromDifferentDumpWithoutCleanTablesCo ErrorMsg.REPL_BOOTSTRAP_LOAD_PATH_NOT_VALID.getErrorCode()); } + @Test + public void dynamicallyConvertManagedToExternalTable() throws Throwable { +List dumpWithClause = Collections.singletonList( +"'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'" +); +List loadWithClause = externalTableBasePathWithClause(); + +WarehouseInstance.Tuple tupleBootstrapManagedTable = primary.run("use " + primaryDbName) +.run("create table t1 (id int)") +.run("insert into table t1 values (1)") +.run("create table t2 (id int) partitioned by (key int)") Review comment: It is not acid table. It is non-acid managed table. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268479992 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -99,9 +100,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam dbname = dbname.toLowerCase(); final boolean cascade = environmentContext != null -&& environmentContext.isSetProperties() -&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get( -StatsSetupConst.CASCADE)); +&& environmentContext.isSetProperties() +&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get(StatsSetupConst.CASCADE)); +final boolean replDataLocationChanged = environmentContext != null Review comment: I think the flag is needed, as we shouldn't update partition locations when the user sets a location for a table. That is the current behaviour and shouldn't be changed. This is needed only for the repl flow.
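The repl-only flag discussed above is derived from the EnvironmentContext property map. A minimal sketch of that derivation, using a plain Map as a stand-in for Hive's EnvironmentContext and the DATA_LOCATION_CHANGED constant introduced by the patch's ReplConst:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for Hive's EnvironmentContext properties.
public class ReplFlagSketch {
  // Constant name taken from the patch's ReplConst.DATA_LOCATION_CHANGED.
  static final String DATA_LOCATION_CHANGED = "DATA_LOCATION_CHANGED";

  // True only when the replication flow explicitly set the flag; the normal
  // ALTER TABLE flow never sets it, so user-specified table locations do not
  // trigger partition-location rewrites.
  static boolean replDataLocationChanged(Map<String, String> envCtxProps) {
    return envCtxProps != null
        && "true".equalsIgnoreCase(envCtxProps.get(DATA_LOCATION_CHANGED));
  }

  public static void main(String[] args) {
    Map<String, String> replCtx = new HashMap<>();
    replCtx.put(DATA_LOCATION_CHANGED, "true");
    System.out.println(replDataLocationChanged(replCtx)); // repl flow
    System.out.println(replDataLocationChanged(null));    // normal flow
  }
}
```

This keeps the repl-specific behaviour opt-in: any caller that does not set the property gets the unchanged, existing semantics.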
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268480500 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -400,7 +405,26 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam "Unable to change partition or table. Object " + e.getMessage() + " does not exist." + " Check metastore logs for detailed stack."); } finally { - if (!success) { + if (success) { +// Txn was committed successfully. +// If data location is changed in replication flow, then need to delete the old path. +if (replDataLocationChanged) { + assert(olddb != null); + assert(oldt != null); + Path deleteOldDataLoc = new Path(oldt.getSd().getLocation()); + boolean isAutoPurge = "true".equalsIgnoreCase(oldt.getParameters().get("auto.purge")); + try { +wh.deleteDir(deleteOldDataLoc, true, isAutoPurge, olddb); +LOG.info("Deleted the old data location: {} for the table: {}", +deleteOldDataLoc, dbname + "." + name); + } catch (MetaException ex) { +// Eat the exception as it doesn't affect the state of existing tables. +// Expect, user to manually drop this path when exception and so logging a warning. +LOG.warn("Unable to delete the old data location: {} for the table: {}", Review comment: I think if the old dir is deleted, then the metadata update was already successful. In that case, replaying the event in the next cycle would not set the flag, as the table/partition locations were already pointing to the new location under the base dir.
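The ordering agreed on in this thread, delete the old directory only after the transaction committed and swallow any deletion failure, can be sketched independently of Hive. This is a simplified stand-in: java.nio replaces Hive's Warehouse helper, and the method name is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of post-commit, best-effort cleanup; java.nio replaces Hive's Warehouse.
public class PostCommitCleanupSketch {
  // Delete the old table location only when the metadata txn succeeded.
  // Deletion failures are logged and swallowed so they can never roll back
  // metadata that is already committed.
  static boolean deleteOldDataLocation(Path oldLoc, boolean txnSucceeded) {
    if (!txnSucceeded) {
      return false; // rollback path: leave the old data untouched
    }
    try {
      return Files.deleteIfExists(oldLoc);
    } catch (IOException ex) {
      // Matches the "log a warning and eat the exception" decision above.
      System.err.println("WARN: could not delete old data location " + oldLoc
          + "; please drop it manually: " + ex.getMessage());
      return false;
    }
  }
}
```

The key design point is that filesystem cleanup is strictly downstream of the commit: a failed delete degrades to a leaked directory plus a warning, never to inconsistent metadata.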
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381295 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -192,12 +197,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam // 2) the table is not an external table, and // 3) the user didn't change the default location (or new location is empty), and // 4) the table was not initially created with a specified location - if (rename - && !oldt.getTableType().equals(TableType.VIRTUAL_VIEW.toString()) - && (oldt.getSd().getLocation().compareTo(newt.getSd().getLocation()) == 0 -|| StringUtils.isEmpty(newt.getSd().getLocation())) - && !MetaStoreUtils.isExternalTable(oldt)) { -Database olddb = msdb.getDatabase(catName, dbname); + if (replDataLocationChanged + || (rename Review comment: This patch is not dealing with set location at source. It deals with changing table type to external at source via ALTER table EXTERNAL=true in table properties. So, this scenario is not valid and not in scope. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381343 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -357,6 +368,13 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam } } + // If data location is changed in replication flow, then need to delete the old path. + if (replDataLocationChanged) { +Path deleteOldDataLoc = new Path(oldt.getSd().getLocation()); +boolean isAutoPurge = "true".equalsIgnoreCase(oldt.getParameters().get("auto.purge")); +wh.deleteDir(deleteOldDataLoc, true, isAutoPurge, olddb); Review comment: Good catch. Will move it to finally block only when success=true and eat any exception by DeleteDir and just log a warn instead. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381321 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/common/ReplConst.java ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.common; + +/** + * A class that defines the constant strings used by the replication implementation. + */ + +public class ReplConst { Review comment: ReplUtils was already taken. I used similar naming as StatsSetupConst. I think ReplConst is good enough for now. Will change if any utility methods comes up for metastore common. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381165 ## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/CreateTableOperation.java ## @@ -54,17 +55,24 @@ public int execute() throws HiveException { Table tbl = desc.toTable(context.getConf()); LOG.debug("creating table {} on {}", tbl.getFullyQualifiedName(), tbl.getDataLocation()); -if (desc.getReplicationSpec().isInReplicationScope() && (!desc.getReplaceMode())){ - // if this is a replication spec, then replace-mode semantics might apply. - // if we're already asking for a table replacement, then we can skip this check. - // however, otherwise, if in replication scope, and we've not been explicitly asked - // to replace, we should check if the object we're looking at exists, and if so, +boolean dataLocationChanged = false; +if (desc.getReplicationSpec().isInReplicationScope()) { + // If in replication scope, we should check if the object we're looking at exists, and if so, // trigger replace-mode semantics. Table existingTable = context.getDb().getTable(tbl.getDbName(), tbl.getTableName(), false); - if (existingTable != null){ + if (existingTable != null) { if (desc.getReplicationSpec().allowEventReplacementInto(existingTable.getParameters())) { desc.setReplaceMode(true); // we replace existing table. ReplicationSpec.copyLastReplId(existingTable.getParameters(), tbl.getParameters()); + + // If location of an existing managed table is changed, then need to delete the old location if exists. + // This scenario occurs when a managed table is converted into external table at source. In this case, + // at target, the table data would be moved to different location under base directory for external tables. 
+ if (existingTable.getTableType().equals(TableType.MANAGED_TABLE) + && tbl.getTableType().equals(TableType.EXTERNAL_TABLE) + && (!existingTable.getDataLocation().equals(tbl.getDataLocation()))) { Review comment: It's not possible in the replication flow, but I kept the check explicit for better readability. There is no harm in keeping it. If you insist, let me change it to an assert.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381082 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -99,9 +100,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam dbname = dbname.toLowerCase(); final boolean cascade = environmentContext != null -&& environmentContext.isSetProperties() -&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get( -StatsSetupConst.CASCADE)); +&& environmentContext.isSetProperties() +&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get(StatsSetupConst.CASCADE)); +final boolean replDataLocationChanged = environmentContext != null Review comment: It is not possible in normal flow to have table type changed from managed to external and location set in same ALTER query. Also, it won't reach here as it is CreateTableTask flow. Only repl flow reaches here for this scenario. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r26838 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/common/ReplConst.java ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.common; + +/** + * A class that defines the constant strings used by the replication implementation. + */ + +public class ReplConst { + + /** + * The constant that denotes the table data location is changed to different path. This indicates + * Metastore to update corresponding path in Partitions and also need to delete old path. + */ + public static final String DATA_LOCATION_CHANGED = "DATA_LOCATION_CHANGED"; Review comment: I thought ReplConst is self explanatory that it is Repl specific. Anyways, I will prefix with REPL_ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381147 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java ## @@ -673,6 +674,59 @@ public void retryIncBootstrapExternalTablesFromDifferentDumpWithoutCleanTablesCo ErrorMsg.REPL_BOOTSTRAP_LOAD_PATH_NOT_VALID.getErrorCode()); } + @Test + public void dynamicallyConvertManagedToExternalTable() throws Throwable { +List dumpWithClause = Collections.singletonList( +"'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'" +); +List loadWithClause = externalTableBasePathWithClause(); + +WarehouseInstance.Tuple tupleBootstrapManagedTable = primary.run("use " + primaryDbName) +.run("create table t1 (id int)") +.run("insert into table t1 values (1)") +.run("create table t2 (id int) partitioned by (key int)") Review comment: Nope. It is not a case of migration. It is external table replication specific where the table is converted to external at source not at target. So, here location is changed only at target not in source. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269880083 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -2950,21 +2956,33 @@ public Partition createPartition(Table tbl, Map partSpec) throws int size = addPartitionDesc.getPartitionCount(); List in = new ArrayList(size); -AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl, true); long writeId; String validWriteIdList; -if (tableSnapshot != null && tableSnapshot.getWriteId() > 0) { - writeId = tableSnapshot.getWriteId(); - validWriteIdList = tableSnapshot.getValidWriteIdList(); + +// In case of replication, get the writeId from the source and use valid write Id list +// for replication. +if (addPartitionDesc.getReplicationSpec() != null && +addPartitionDesc.getReplicationSpec().isInReplicationScope() && +addPartitionDesc.getPartition(0).getWriteId() > 0) { + writeId = addPartitionDesc.getPartition(0).getWriteId(); + validWriteIdList = Review comment: Even the logic that creates the ValidWriteIdList from all repl-opened txns isn't right, as it claims the stats are valid for all those open txns, which isn't true. Also, it sets the high water mark based on the 0th index in the replTxnsList map, which might point to a wrong writeId compared to the current txn's writeId. So I think this logic should anyway be removed.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269879748 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -2689,7 +2689,19 @@ private int alterTable(Hive db, AlterTableDesc alterTbl) throws HiveException { } else { // Note: this is necessary for UPDATE_STATISTICS command, that operates via ADDPROPS (why?). // For any other updates, we don't want to do txn check on partitions when altering table. -boolean isTxn = alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS; +boolean isTxn = false; +if (alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS) { + // ADDPROPS is used to add repl.last.id during replication. That's not a transactional + // change. + Map props = alterTbl.getProps(); + if (props.size() <= 1 && props.get(ReplicationSpec.KEY.CURR_STATE_ID.toString()) != null) { Review comment: Done. Instead of last.repl.id, I am explicitly checking if the property is related to stats and then set isTxn only in case of replication. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
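The check described in the reply above ("is the property stats-related, and are we in replication?") can be sketched as follows. The stats key names here are illustrative stand-ins, not necessarily Hive's exact StatsSetupConst values:

```java
import java.util.Map;
import java.util.Set;

// Sketch of deciding isTxn from an ALTER ... ADDPROPS payload.
public class StatsTxnCheckSketch {
  // Illustrative stats-related property keys (not Hive's exact constants).
  static final Set<String> STATS_KEYS =
      Set.of("COLUMN_STATS_ACCURATE", "numRows", "rawDataSize", "totalSize");

  // Transactional only when every property being added is stats-related
  // and the ALTER comes from the replication flow; adding repl.last.id
  // alone therefore never makes the operation transactional.
  static boolean isTxnStatsUpdate(Map<String, String> props, boolean inReplScope) {
    return inReplScope && props != null && !props.isEmpty()
        && STATS_KEYS.containsAll(props.keySet());
  }
}
```

Checking for stats keys positively, rather than special-casing repl.last.id, keeps the predicate correct even if other non-stats properties are added later.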
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269880708 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnCommonUtils.java ## @@ -84,6 +86,73 @@ public static ValidTxnList createValidReadTxnList(GetOpenTxnsResponse txns, long return new ValidReadTxnList(exceptions, outAbortedBits, highWaterMark, minOpenTxnId); } + /** + * Transform a {@link org.apache.hadoop.hive.metastore.api.GetOpenTxnsResponse} to a + * {@link org.apache.hadoop.hive.common.ValidTxnList}. This assumes that the caller intends to + * read the files, and thus treats both open and aborted transactions as invalid. + * + * This API is used by Hive replication which may have multiple transactions open at a time. + * + * @param txns open txn list from the metastore + * @param currentTxns Current transactions that the replication has opened. If any of the + *transactions is greater than 0 it will be removed from the exceptions + *list so that the replication sees its own transaction as valid. + * @return a valid txn list. + */ + public static ValidTxnList createValidReadTxnList(GetOpenTxnsResponse txns, Review comment: Yes, even I think, for REPL LOAD, we should always hardcode the ValidWriteIdList using current writeId so that stats are always valid while applying current event. Even if it is invalid, the subsequent alterTable/partition event would set it so in the table/partition parameters. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269861070 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ImportTableDesc.java ## @@ -381,4 +382,11 @@ public void setOwnerName(String ownerName) { throw new RuntimeException("Invalid table type : " + getDescType()); } } + + public Long getReplWriteId() { +if (this.createTblDesc != null) { + return this.createTblDesc.getReplWriteId(); Review comment: If we unify writeId and replWriteId in CreateTableDesc into one, then it's fine. In fact, they are one and the same. So, no point in having 2 members for same value.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269872622 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnCommonUtils.java ## @@ -84,6 +86,73 @@ public static ValidTxnList createValidReadTxnList(GetOpenTxnsResponse txns, long return new ValidReadTxnList(exceptions, outAbortedBits, highWaterMark, minOpenTxnId); } + /** + * Transform a {@link org.apache.hadoop.hive.metastore.api.GetOpenTxnsResponse} to a + * {@link org.apache.hadoop.hive.common.ValidTxnList}. This assumes that the caller intends to + * read the files, and thus treats both open and aborted transactions as invalid. + * + * This API is used by Hive replication which may have multiple transactions open at a time. + * + * @param txns open txn list from the metastore + * @param currentTxns Current transactions that the replication has opened. If any of the + *transactions is greater than 0 it will be removed from the exceptions + *list so that the replication sees its own transaction as valid. + * @return a valid txn list. + */ + public static ValidTxnList createValidReadTxnList(GetOpenTxnsResponse txns, Review comment: If there were multiple transactions running concurrently on the source, there will be that many open transaction events in the dump, which when replicated will leave that many transactions open on the target while replaying those events. So, there could be multiple open transactions on the target even during repl load. The only link between CreateTableOperation#createTableReplaceMode and Hive#alterTable is EnvironmentContext, so we could use this to pass a flag indicating that the valid writeId list should be created using the given writeId. But we are using EnvironmentContext to pass information only to the metastore, not to consume it in between.
We could construct the valid writeId list in the metastore directly, like what we are doing for create table and partition, using that kind of flag. Does that look good?
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269874353 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -2950,21 +2956,33 @@ public Partition createPartition(Table tbl, Map partSpec) throws int size = addPartitionDesc.getPartitionCount(); List in = new ArrayList(size); -AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl, true); long writeId; String validWriteIdList; + +// In case of replication, get the writeId from the source and use valid write Id list +// for replication. +if (addPartitionDesc.getReplicationSpec() != null && +addPartitionDesc.getReplicationSpec().isInReplicationScope() && +addPartitionDesc.getPartition(0).getWriteId() > 0) { + writeId = addPartitionDesc.getPartition(0).getWriteId(); + validWriteIdList = Review comment: Ok. Underneath, this code is using the valid write id list created from repl's open transaction list, so this isn't wrong. But this may change, subject to the changes arising from your other comment.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
ashutosh-bapat commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269856891 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ImportTableDesc.java ## @@ -381,4 +382,11 @@ public void setOwnerName(String ownerName) { throw new RuntimeException("Invalid table type : " + getDescType()); } } + + public Long getReplWriteId() { +if (this.createTblDesc != null) { + return this.createTblDesc.getReplWriteId(); Review comment: AFAIU, the reason we set replWriteId in CreateTableDesc is that it can then be passed everywhere CreateTableDesc is used. It's better not to create two paths for passing the same writeId, with the risk of those going out of sync with each other.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269843944 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ## @@ -1894,6 +1898,16 @@ private void create_table_core(final RawStore ms, final Table tbl, List checkConstraints) throws AlreadyExistsException, MetaException, InvalidObjectException, NoSuchObjectException, InvalidInputException { + + ColumnStatistics colStats = null; + // If the given table has column statistics, save it here. We will update it later. + // We don't want it to be part of the Table object being created, lest the create table Review comment: I think, it's fine. Ignore this comment.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269257547 ## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ## @@ -987,10 +989,14 @@ public void createTable(Table tbl, boolean ifNotExists, tTbl.setPrivileges(principalPrivs); } } - // Set table snapshot to api.Table to make it persistent. - TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl, true); - if (tableSnapshot != null) { -tbl.getTTable().setWriteId(tableSnapshot.getWriteId()); + // Set table snapshot to api.Table to make it persistent. A transactional table being + // replicated may have a valid write Id copied from the source. Use that instead of + // crafting one on the replica. + if (tTbl.getWriteId() <= 0) { Review comment: The DO_NOT_UPDATE_STATS flag should be set in the create table flow as well. Otherwise, in autogather mode at the target, stats will be updated automatically. Not sure if it is needed, as the table itself is not yet in the metastore. Anyway, please check whether it is needed.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269843519 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ImportTableDesc.java ## @@ -381,4 +382,11 @@ public void setOwnerName(String ownerName) { throw new RuntimeException("Invalid table type : " + getDescType()); } } + + public Long getReplWriteId() { +if (this.createTblDesc != null) { + return this.createTblDesc.getReplWriteId(); Review comment: I meant, prepareImport already takes writeId (which comes from the event message) as an input parameter, which is being set in CreateTableDesc and later read back by getBaseAddPartitionDescFromPartition. Instead, writeId itself can be passed to getBaseAddPartitionDescFromPartition.
[GitHub] [hive] sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables.
sankarh commented on a change in pull request #579: HIVE-21109 : Support stats replication for ACID tables. URL: https://github.com/apache/hive/pull/579#discussion_r269844369 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -2689,7 +2689,19 @@ private int alterTable(Hive db, AlterTableDesc alterTbl) throws HiveException { } else { // Note: this is necessary for UPDATE_STATISTICS command, that operates via ADDPROPS (why?). // For any other updates, we don't want to do txn check on partitions when altering table. -boolean isTxn = alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS; +boolean isTxn = false; +if (alterTbl.getPartSpec() != null && alterTbl.getOp() == AlterTableTypes.ADDPROPS) { + // ADDPROPS is used to add repl.last.id during replication. That's not a transactional + // change. + Map props = alterTbl.getProps(); + if (props.size() <= 1 && props.get(ReplicationSpec.KEY.CURR_STATE_ID.toString()) != null) { Review comment: I don't know. I would suggest keeping it non-transactional only in the repl flow to avoid any impact.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381242 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -192,12 +197,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam // 2) the table is not an external table, and Review comment: OK
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381258 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -209,7 +214,13 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam boolean tableInSpecifiedLoc = !oldtRelativePath.equalsIgnoreCase(name) && !oldtRelativePath.equalsIgnoreCase(name + Path.SEPARATOR); -if (!tableInSpecifiedLoc) { +if (replDataLocationChanged) { + // If data location is changed in replication flow, then new path was already set in + // the newt. Also, it is as good as the data is moved and set dataWasMoved=true so that + // location in partitions are also updated accordingly. + destPath = new Path(newt.getSd().getLocation()); Review comment: Already handled. If you notice line 279, the partition location is changed only if the old location is a sub-directory of the old table location.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381212 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -99,9 +100,12 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam dbname = dbname.toLowerCase(); final boolean cascade = environmentContext != null -&& environmentContext.isSetProperties() -&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get( -StatsSetupConst.CASCADE)); +&& environmentContext.isSetProperties() +&& StatsSetupConst.TRUE.equals(environmentContext.getProperties().get(StatsSetupConst.CASCADE)); +final boolean replDataLocationChanged = environmentContext != null Review comment: OK.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381204 ## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/CreateTableOperation.java ## @@ -108,6 +116,12 @@ private void createTableReplaceMode(Table tbl) throws HiveException { } } +// If table's data location is moved, then set the corresponding flag in environment context to Review comment: Nope. The scenario is that the location is changed at the target, not at the source. At the source, only the table type is changed from managed to external, not the location. I will update the comment to say that it applies in the repl flow.
[GitHub] [hive] sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target.
sankarh commented on a change in pull request #578: HIVE-21471: Replicating conversion of managed to external table leaks HDFS files at target. URL: https://github.com/apache/hive/pull/578#discussion_r268381269 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java ## @@ -209,7 +214,13 @@ public void alterTable(RawStore msdb, Warehouse wh, String catName, String dbnam boolean tableInSpecifiedLoc = !oldtRelativePath.equalsIgnoreCase(name) Review comment: OK
[GitHub] [hive] rmsmani commented on issue #576: HIVE-21488 Update Apache Parquet to 1.10.1
rmsmani commented on issue #576: HIVE-21488 Update Apache Parquet to 1.10.1 URL: https://github.com/apache/hive/pull/576#issuecomment-475909207 Thanks, please submit the patch file in Jira.
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606331 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606337 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java ## @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationEx { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { Review comment: no
[GitHub] ZhangXinJason opened a new pull request #548: HIVE-21313: Use faster method to prevent copying bytes twice
ZhangXinJason opened a new pull request #548: HIVE-21313: Use faster method to prevent copying bytes twice URL: https://github.com/apache/hive/pull/548 In file ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorAssignRow.java we may find code like this:

```
Text text = (Text) convertTargetWritable;
if (text == null) {
  text = new Text();
}
text.set(string);
((BytesColumnVector) columnVector).setVal(
    batchIndex, text.getBytes(), 0, text.getLength());
```

Using the `setVal` method copies the bytes array generated by `text.getBytes()`. This is entirely unnecessary: since the bytes array is immutable here, we can just use the `setRef` method to point at the same byte array, which will also lower memory usage.
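The copy-vs-reference distinction behind this PR can be illustrated with a minimal sketch. `BytesColumnVector` lives in Hive's storage-api, so the `ByteColumn` class below is only a stand-in modeling the idea: `setVal` makes a private copy of the caller's bytes, while `setRef` merely stores a reference (safe only as long as the caller does not mutate the array afterwards).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative stand-in for the column-vector assignment semantics; not Hive's API.
final class ByteColumn {
    byte[][] vector = new byte[8][];
    int[] start = new int[8];
    int[] length = new int[8];

    // setVal: copies the source bytes into a fresh array (extra allocation + copy).
    void setVal(int row, byte[] src, int off, int len) {
        vector[row] = Arrays.copyOfRange(src, off, off + len);
        start[row] = 0;
        length[row] = len;
    }

    // setRef: stores a reference to the caller's array (no copy at all).
    void setRef(int row, byte[] src, int off, int len) {
        vector[row] = src;
        start[row] = off;
        length[row] = len;
    }
}

public class Main {
    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        ByteColumn col = new ByteColumn();

        col.setVal(0, bytes, 0, bytes.length);
        col.setRef(1, bytes, 0, bytes.length);

        // setVal made a private copy; setRef aliases the caller's array.
        System.out.println(col.vector[0] != bytes); // true
        System.out.println(col.vector[1] == bytes); // true
    }
}
```

The reference-only variant is the win the PR claims: one fewer `byte[]` allocation and copy per assigned row, at the cost of requiring the source buffer to stay immutable for the batch's lifetime.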
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606409 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); +return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag); Review comment: I think it's better to have this check, as null and empty have special meaning in this case.
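The flag semantics being defended above can be sketched in isolation: a missing parameter map, an absent flag, an empty value, or an explicit "false" all mean the first incremental load is done, while any other value (e.g. "true") marks it pending. The property key string below is a hypothetical stand-in for `ReplUtils.REPL_FIRST_INC_PENDING_FLAG`.

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Hypothetical key name; the real constant is ReplUtils.REPL_FIRST_INC_PENDING_FLAG.
    static final String FLAG = "repl.first.inc.pending";

    // Mirrors the check under review: null/empty/"false" all mean "done".
    static boolean isFirstIncDone(Map<String, String> params) {
        if (params == null) {
            return true;
        }
        String compFlag = params.get(FLAG);
        return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag);
    }

    public static void main(String[] args) {
        Map<String, String> p = new HashMap<>();
        System.out.println(isFirstIncDone(null)); // true: no parameters at all
        System.out.println(isFirstIncDone(p));    // true: flag absent
        p.put(FLAG, "");
        System.out.println(isFirstIncDone(p));    // true: empty value means done
        p.put(FLAG, "TRUE");
        System.out.println(isFirstIncDone(p));    // false: explicitly pending
    }
}
```

Treating absence and emptiness as "done" keeps tables written before this feature existed (which carry no flag at all) behaving as if the first incremental has already completed.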
[GitHub] sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
sankarh commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606013 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +299,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); Review comment: As per discussion, we need to skip the duplicate check for the replace case, as we may otherwise end up skipping the copy of files with the same name but different content. Also, we would need to move those matching files to a new base path, which is complicated. So, we just skip the duplicate check for the replace flow.
[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By
miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By URL: https://github.com/apache/hive/pull/544#discussion_r259953166 ## File path: ql/src/test/results/clientpositive/distinct_groupby.q.out ## @@ -0,0 +1,1530 @@ +PREHOOK: query: explain select distinct count(*) from src1 where key in (128,146,150) +PREHOOK: type: QUERY +PREHOOK: Input: default@src1 + A masked pattern was here +POSTHOOK: query: explain select distinct count(*) from src1 where key in (128,146,150) +POSTHOOK: type: QUERY +POSTHOOK: Input: default@src1 + A masked pattern was here +STAGE DEPENDENCIES: + Stage-1 is a root stage + Stage-0 depends on stages: Stage-1 + +STAGE PLANS: + Stage: Stage-1 +Map Reduce + Map Operator Tree: + TableScan +alias: src1 +filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean) +Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE Column stats: COMPLETE +Filter Operator + predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean) + Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE + Select Operator +Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE +Group By Operator + aggregations: count() + mode: hash + outputColumnNames: _col0 + Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE + Reduce Output Operator +sort order: +Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE +value expressions: _col0 (type: bigint) + Execution mode: vectorized + Reduce Operator Tree: +Group By Operator + aggregations: count(VALUE._col0) + mode: mergepartial + outputColumnNames: _col0 + Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE + File Output Operator +compressed: false +Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE +table: +input format: org.apache.hadoop.mapred.SequenceFileInputFormat 
+            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+    Fetch Operator
+      limit: -1
+      Processor Tree:
+        ListSink
+
+PREHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here
+POSTHOOK: query: select distinct count(*) from src1 where key in (128,146,150)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here
+3
+PREHOOK: query: explain select distinct * from (select distinct count(*) from src1 where key in (128,146,150)) as T
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src1
+ A masked pattern was here
+POSTHOOK: query: explain select distinct * from (select distinct count(*) from src1 where key in (128,146,150)) as T
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src1
+ A masked pattern was here
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+    Map Reduce
+      Map Operator Tree:
+          TableScan
+            alias: src1
+            filterExpr: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+            Statistics: Num rows: 25 Data size: 2150 Basic stats: COMPLETE Column stats: COMPLETE
+            Filter Operator
+              predicate: (UDFToDouble(key)) IN (128.0D, 146.0D, 150.0D) (type: boolean)
+              Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+              Select Operator
+                Statistics: Num rows: 12 Data size: 1032 Basic stats: COMPLETE Column stats: COMPLETE
+                Group By Operator
+                  aggregations: count()
+                  mode: hash
+                  outputColumnNames: _col0
+                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                  Reduce Output Operator
+                    sort order:
+                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
+                    value expressions: _col0 (type: bigint)
+      Execution mode: vectorized
+      Reduce Operator Tree:
+        Group By Operator
+          aggregations: count(VALUE._col0)
+          mode: mergepartial
+          outputColumnNames: _col0
+          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By
miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259953429

File path: ql/src/test/queries/clientpositive/distinct_groupby.q

@@ -0,0 +1,57 @@
+--! qt:dataset:src1

Review comment: Adding q tests for non-CBO as well, good idea!

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By
miklosgergely commented on a change in pull request #544: HIVE-16924 Support distinct in presence of Group By
URL: https://github.com/apache/hive/pull/544#discussion_r259958209

File path: ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

@@ -4230,6 +4229,34 @@ public static long unsetBit(long bitmap, int bitIdx) {
     }
   }
+
+  protected boolean isGroupBy(ASTNode expr) {
+    boolean isGroupBy = false;
+    if (expr.getParent() != null && expr.getParent() instanceof Node)
+      for (Node sibling : ((Node) expr.getParent()).getChildren()) {
+        isGroupBy |= sibling instanceof ASTNode && ((ASTNode) sibling).getType() == HiveParser.TOK_GROUPBY;
+      }
+
+    return isGroupBy;
+  }
+
+  protected boolean isSelectDistinct(ASTNode expr) {
+    return expr.getType() == HiveParser.TOK_SELECTDI;
+  }
+
+  protected boolean isAggregateInSelect(Node node, Collection aggregateFunction) {
+    if (node.getChildren() == null) {
+      return false;
+    }
+
+    for (Node child : node.getChildren()) {

Review comment: I doubt there is any. The above example is not valid; it says: "Unsupported SubQuery Expression Invalid subquery. Subquery with DISTINCT clause is not supported!"
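For illustration, the sibling-scan idea behind `isGroupBy` in the patch above can be sketched as a self-contained toy. The `ToyNode` class and its token constants below are hypothetical stand-ins, not Hive's actual `ASTNode`/`Node`/`HiveParser` classes: to decide whether a SELECT node belongs to a grouped query, walk its parent's children and look for a GROUP BY sibling.

```java
import java.util.Arrays;
import java.util.List;

// Toy AST node; a hypothetical stand-in for Hive's ASTNode/Node, for illustration only.
class ToyNode {
    static final int TOK_SELECTDI = 1; // stand-in token type for SELECT DISTINCT
    static final int TOK_GROUPBY = 2;  // stand-in token type for GROUP BY

    final int type;
    ToyNode parent;
    final List<ToyNode> children;

    ToyNode(int type, ToyNode... children) {
        this.type = type;
        this.children = Arrays.asList(children);
        for (ToyNode c : this.children) {
            c.parent = this;
        }
    }

    // Mirrors the patch's idea: a node is "in a group by" if any sibling is a TOK_GROUPBY.
    static boolean isGroupBy(ToyNode expr) {
        boolean isGroupBy = false;
        if (expr.parent != null) {
            for (ToyNode sibling : expr.parent.children) {
                isGroupBy |= sibling.type == TOK_GROUPBY;
            }
        }
        return isGroupBy;
    }

    static boolean isSelectDistinct(ToyNode expr) {
        return expr.type == TOK_SELECTDI;
    }
}

public class SiblingScanDemo {
    public static void main(String[] args) {
        ToyNode select = new ToyNode(ToyNode.TOK_SELECTDI);
        ToyNode groupBy = new ToyNode(ToyNode.TOK_GROUPBY);
        new ToyNode(0, select, groupBy); // a query node holding both clauses

        System.out.println(ToyNode.isGroupBy(select));        // true: has a GROUP BY sibling
        System.out.println(ToyNode.isSelectDistinct(select)); // true
        System.out.println(ToyNode.isGroupBy(new ToyNode(ToyNode.TOK_SELECTDI))); // false: no parent
    }
}
```

The toy keeps only the shape of the check; the real patch additionally guards the parent with `instanceof Node` and filters siblings with `instanceof ASTNode`, which the single-class toy does not need.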
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259622274

File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java

@@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259622357

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java

@@ -135,7 +135,8 @@ private boolean isDbEmpty(String dbName) throws HiveException {
   }

   private Task alterDbTask(Database dbObj) {
-    return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory), context.hiveConf);
+    return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory, false),

Review comment: done
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259622413

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java

@@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws HiveException {
   }

     // Add the checkpoint key to the Database binding it to current dump directory.
     // So, if retry using same dump, we shall skip Database object update.
     parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory);
+
+    if (needSetIncFlag) {

Review comment: done
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259622407

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java

@@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String tableName, String loadP
         lastEventid);
       }
     }
+
+    ReplSetFirstIncLoadFlagDesc desc = new ReplSetFirstIncLoadFlagDesc(dbName, tableName);

Review comment: A new DDL task is created to keep this simpler, with both table-level and warehouse-level support.
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259617900

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java

@@ -61,6 +62,21 @@ public ReplCopyTask(){
     super();
   }

+  // If file is already present in base directory, then remove it from the list.
+  // Check HIVE-21197 for more detail
+  private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles,
+      long writeId, int stmtId) throws IOException {
+    ListIterator iter = srcFiles.listIterator();
+    Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId));
+    while (iter.hasNext()) {
+      Path filePath = new Path(basePath, iter.next().getSourcePath().getName());
+      if (dstFs.exists(filePath)) {

Review comment: The I/O exception retry case is handled specifically in only two places; there are many other I/O failure scenarios which are not handled. I think it is not required here.
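Independently of Hadoop's `FileSystem` API, the idea in `updateSrcFileListForDupCopy` above can be sketched in a few lines: before copying, drop every source file whose name is already present under the target base directory, so a retried load does not duplicate data. In this minimal stand-alone sketch the `existsAtTarget` set is a hypothetical stand-in for the `dstFs.exists(...)` check, and the file names are made up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class DupCopyFilter {
    // Remove from srcFiles every file name already present at the target,
    // so a retried bootstrap load does not copy (and duplicate) the same data.
    static void dropAlreadyCopied(List<String> srcFiles, Set<String> existsAtTarget) {
        Iterator<String> iter = srcFiles.iterator();
        while (iter.hasNext()) {
            if (existsAtTarget.contains(iter.next())) {
                iter.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<String> srcFiles = new ArrayList<>(List.of("000000_0", "000001_0", "000002_0"));
        // "000001_0" was already copied on a previous, partially failed attempt.
        Set<String> target = new HashSet<>(Set.of("000001_0"));
        dropAlreadyCopied(srcFiles, target);
        System.out.println(srcFiles); // only the files still missing at the target remain
    }
}
```

The real method iterates `ReplChangeManager`-style file descriptors and resolves the ACID base/delta subdirectory via `AcidUtils.baseOrDeltaSubdir`; the sketch keeps only the list-pruning logic.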
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259622991

File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java

@@ -112,6 +118,12 @@ public void run() {
           continue;
         }

+        if (replIsCompactionDisabledForTable(t)) {

Review comment: not done
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259622969

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java

@@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) {
       if (work.getAlterMaterializedViewDesc() != null) {
         return alterMaterializedView(db, work.getAlterMaterializedViewDesc());
       }
+
+      if (work.getReplSetFirstIncLoadFlagDesc() != null) {

Review comment: It is done in a separate task to keep it simpler.
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259618842

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java

@@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
       }
     };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {
+    if (parameter == null) {
+      return true;
+    }
+    String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG);

Review comment: done
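The null-safe flag check being reviewed above can be sketched as a stand-alone toy. The key string and the "pending means not done" semantics below are assumptions modeled on the `REPL_FIRST_INC_PENDING_FLAG` name, not Hive's actual constants or behavior.

```java
import java.util.Map;

public class IncFlagCheck {
    // Hypothetical property key, standing in for ReplUtils.REPL_FIRST_INC_PENDING_FLAG.
    static final String FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending";

    // Assumed semantics: the first incremental load counts as done unless the
    // object explicitly carries the pending flag set to "true". A null parameter
    // map (no repl metadata at all) therefore also counts as done.
    static boolean isFirstIncDone(Map<String, String> parameters) {
        if (parameters == null) {
            return true;
        }
        String flag = parameters.get(FIRST_INC_PENDING_FLAG);
        return flag == null || !flag.equalsIgnoreCase("true");
    }

    public static void main(String[] args) {
        System.out.println(isFirstIncDone(null));                                   // true
        System.out.println(isFirstIncDone(Map.of()));                               // true
        System.out.println(isFirstIncDone(Map.of(FIRST_INC_PENDING_FLAG, "true"))); // false
    }
}
```

Defaulting to "done" when the map or key is absent keeps the check backward compatible: objects replicated before the flag existed are not suddenly treated as pending.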
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259618838

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java

@@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) {
       }
     };
   }
+
+  public static boolean isFirstIncDone(Map parameter) {

Review comment: done
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
URL: https://github.com/apache/hive/pull/541#discussion_r259622190

File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java

@@ -271,12 +299,13 @@ public String getName() {
     LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath);
     if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){
       ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false);
-      if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) {
+      if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) {
        rcwork.setDeleteDestIfExist(true);

Review comment: done