[GitHub] [incubator-hudi] thesuperzapper commented on issue #780: Cleanup Maven POM/Classpath
thesuperzapper commented on issue #780: Cleanup Maven POM/Classpath URL: https://github.com/apache/incubator-hudi/pull/780#issuecomment-511091893

> @thesuperzapper as for testing, if you can run the demo steps once and confirm there are no NoClassDefFound errors and such, it would be a good start.

After our discussion, I just ran through the demo, and it works properly. If you want, we can just finalise and merge yours, then I can rebase mine? (Unless you really want to make both changes at once)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] eisig commented on issue #779: HoodieDeltaStreamer may insert duplicate record?
eisig commented on issue #779: HoodieDeltaStreamer may insert duplicate record? URL: https://github.com/apache/incubator-hudi/issues/779#issuecomment-511091534

I am testing the master branch.

```
     0 Jul 12 20:01 20190712120106.deltacommit.inflight
   913 Jul 12 20:01 20190712120106.rollback
  1218 Jul 12 19:47 20190712114645.clean
 73378 Jul 12 19:47 20190712114645.deltacommit
  1218 Jul 12 19:46 20190712114551.clean
 62036 Jul 12 19:46 20190712114551.deltacommit
  1218 Jul 12 19:45 20190712114453.clean
 69016 Jul 12 19:45 20190712114453.deltacommit
  1218 Jul 12 19:44 20190712114357.clean
 75986 Jul 12 19:44 20190712114357.deltacommit
  1218 Jul 12 19:43 20190712114302.clean
 65526 Jul 12 19:43 20190712114302.deltacommit
  1218 Jul 12 19:42 20190712114214.clean
 69863 Jul 12 19:42 20190712114214.deltacommit
  1218 Jul 12 19:42 20190712114128.clean
 66374 Jul 12 19:42 20190712114128.deltacommit
  1218 Jul 12 19:41 20190712114049.clean
 60268 Jul 12 19:41 20190712114049.deltacommit
  1218 Jul 12 19:40 20190712114008.clean
 63753 Jul 12 19:40 20190712114008.deltacommit
  1218 Jul 12 19:40 20190712113920.clean
 64645 Jul 12 19:40 20190712113920.deltacommit
  1218 Jul 12 19:39 20190712113835.clean
 64632 Jul 12 19:39 20190712113835.deltacommit
  1218 Jul 12 19:38 20190712113748.clean
 69859 Jul 12 19:38 20190712113748.deltacommit
  1218 Jul 12 19:37 20190712113659.clean
 56778 Jul 12 19:37 20190712113659.deltacommit
  1218 Jul 12 19:36 20190712113616.clean
 67234 Jul 12 19:36 20190712113616.deltacommit
  1218 Jul 12 19:36 20190712113520.clean
 69874 Jul 12 19:36 20190712113520.deltacommit
  1218 Jul 12 19:35 20190712113427.clean
 68984 Jul 12 19:35 20190712113427.deltacommit
  1218 Jul 12 19:34 20190712113340.clean
 65494 Jul 12 19:34 20190712113340.deltacommit
  1218 Jul 12 19:33 20190712113220.clean
105746 Jul 12 19:33 20190712113220.deltacommit
  1218 Jul 12 19:32 20190712113129.clean
 69853 Jul 12 19:32 20190712113129.deltacommit
  1218 Jul 12 19:31 20190712113031.clean
 75100 Jul 12 19:31 20190712113031.deltacommit
  1218 Jul 12 19:30 20190712112927.clean
 70739 Jul 12 19:30 20190712112927.deltacommit
  1218 Jul 12 19:29 20190712112829.clean
 65504 Jul 12 19:29 20190712112829.deltacommit
  1218 Jul 12 19:28 20190712112737.clean
 67232 Jul 12 19:28 20190712112737.deltacommit
  1218 Jul 12 19:27 20190712112638.clean
 64629 Jul 12 19:27 20190712112638.deltacommit
  1218 Jul 12 19:26 20190712112547.clean
 67225 Jul 12 19:26 20190712112547.deltacommit
 61138 Jul 12 19:25 20190712112456.deltacommit
 64626 Jul 12 19:24 20190712112407.deltacommit
   913 Jul 12 16:54 20190712085450.rollback
   913 Jul 12 16:51 20190712085153.rollback
   173 Jul 11 11:36 hoodie.properties
```

order by time desc

```
   4442 Jul 12 19:47 .349fb959-6762-41dc-b657-c3ac2cb0581f-0_20190712082340.log.116_27-5532-38216
   7067 Jul 12 19:47 .927ac226-15f5-49f6-916c-c7789a59d722-0_20190712082937.log.110_24-5532-38213
   5292 Jul 12 19:47 .43d29590-5309-4fa5-9c00-b85fd8e2f23d-0_20190712084151.log.110_23-5532-38212
   5309 Jul 12 19:47 .00b2332d-ee2e-4b13-9081-338afe0688dd-0_20190712080919.log.101_15-5532-38204
   7999 Jul 12 19:47 .2d00af36-27b7-4312-955d-c86cc81291f7-0_20190712092457.log.86_14-5532-38203
   4422 Jul 12 19:47 .3fae4203-735c-462c-99e2-975d1e223bf0-0_20190712080712.log.103_12-5532-38201
   5304 Jul 12 19:47 .4ed26f22-7660-4c03-8c73-ccf9cdf5f35d-0_20190712082340.log.106_9-5532-38198
   4447 Jul 12 19:47 .b17cba40-e664-4e17-b3b0-b3e7e74b005a-0_20190712080712.log.98_10-5532-38199
   6188 Jul 12 19:47 .34cf303a-a4d7-4d8f-8a5b-3d11540535ed-0_20190712082340.log.115_7-5532-38196
   5318 Jul 12 19:47 .f2b1e4f2-4032-40fc-bd48-287a1f0b5b77-0_20190712081407.log.110_8-5532-38197
   4452 Jul 12 19:47 .eb9621b0-59c6-42d9-8b20-1c2a8d15b12b-0_20190712083508.log.102_5-5532-38194
1348740 Jul 12 19:47 5084ee47-9a81-4a21-8557-d1b250f7e16b-0_2-5532-38191_20190712114645.parquet
   5282 Jul 12 19:47 .c87d3580-86fe-40f9-8f6c-7c95cc91caa6-0_20190712084720.log.118_1-5532-38190
   4423 Jul 12 19:46 .00b2332d-ee2e-4b13-9081-338afe0688dd-0_20190712080919.log.100_15-5496-38025
   4423 Jul 12 19:46 .349fb959-6762-41dc-b657-c3ac2cb0581f-0_20190712082340.log.115_13-5496-38023
   4424 Jul 12 19:46 .4ed26f22-7660-4c03-8c73-ccf9cdf5f35d-0_20190712082340.log.105_10-5496-38020
   4442 Jul 12 19:46 .43d29590-5309-4fa5-9c00-b85fd8e2f23d-0_20190712084151.log.109_8-5496-38018
   5286 Jul 12 19:46 .927ac226-15f5-49f6-916c-c7789a59d722-0_20190712082937.log.109_9-5496-38019
   4422 Jul 12 19:46 .eb9621b0-59c6-42d9-8b20-1c2a8d15b12b-0_20190712083508.log.101_5-5496-38015
1348239 Jul 12 19:46 5084ee47-9a81-4a21-8557-d1b250f7e16b-0_2-5496-38012_20190712114551.parquet
   4421
```
[incubator-hudi] branch asf-site updated: Remove --key-generator-class CLI arg for DeltaStreamer.
This is an automated email from the ASF dual-hosted git repository. nagarwal pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new e55816b  Remove --key-generator-class CLI arg for DeltaStreamer.

e55816b is described below

commit e55816bcf48ccb26612c978b0abc38e4c8df1063
Author: Ethan Guo
AuthorDate: Fri Jul 12 16:39:53 2019 -0700

    Remove --key-generator-class CLI arg for DeltaStreamer.
---
 docs/writing_data.md | 6 --
 1 file changed, 6 deletions(-)

```
diff --git a/docs/writing_data.md b/docs/writing_data.md
index c2d1df8..9f5eb2b 100644
--- a/docs/writing_data.md
+++ b/docs/writing_data.md
@@ -42,12 +42,6 @@ Usage: [options]
       parameter "--propsFilePath") can also be passed command line using this
       parameter
       Default: []
---key-generator-class
-      Subclass of com.uber.hoodie.KeyGenerator to generate a HoodieKey from
-      the given avro record. Built in: SimpleKeyGenerator (uses provided field
-      names as recordkey & partitionpath. Nested fields specified via dot
-      notation, e.g: a.b.c)
-      Default: com.uber.hoodie.SimpleKeyGenerator
 --op
       Takes one of these values : UPSERT (default), INSERT (use when input is
       purely new data/inserts to gain speed)
```
[GitHub] [incubator-hudi] n3nash merged pull request #785: Remove --key-generator-class CLI arg for DeltaStreamer
n3nash merged pull request #785: Remove --key-generator-class CLI arg for DeltaStreamer URL: https://github.com/apache/incubator-hudi/pull/785
[GitHub] [incubator-hudi] yihua commented on issue #781: [HUDI-161] Remove --key-generator-class CLI arg in HoodieDeltaStreamer and use key generator class specified in datasource properties
yihua commented on issue #781: [HUDI-161] Remove --key-generator-class CLI arg in HoodieDeltaStreamer and use key generator class specified in datasource properties URL: https://github.com/apache/incubator-hudi/pull/781#issuecomment-511049228 @vinothchandar yes, I'll send another PR on the docs change.
[GitHub] [incubator-hudi] vinothchandar commented on issue #764: Hoodie 0.4.7: Error upserting bucketType UPDATE for partition #, No value present
vinothchandar commented on issue #764: Hoodie 0.4.7: Error upserting bucketType UPDATE for partition #, No value present URL: https://github.com/apache/incubator-hudi/issues/764#issuecomment-511030769 To summarize: @n3nash is looking into the Avro issue, and @bhasudha is going to try to reproduce the empty path exception as a ramp-up task.
[incubator-hudi] branch master updated: [HUDI-161] Remove --key-generator-class CLI arg in HoodieDeltaStreamer and use key generator class specified in datasource properties. (#781)
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 621c246  [HUDI-161] Remove --key-generator-class CLI arg in HoodieDeltaStreamer and use key generator class specified in datasource properties. (#781)

621c246 is described below

commit 621c246fa9ea607a3cd8f33fdc3d8b528315e327
Author: Yihua Guo
AuthorDate: Fri Jul 12 13:45:49 2019 -0700

    [HUDI-161] Remove --key-generator-class CLI arg in HoodieDeltaStreamer and use key generator class specified in datasource properties. (#781)
---
 .../main/java/com/uber/hoodie/DataSourceUtils.java | 13 +++--
 .../com/uber/hoodie/HoodieSparkSqlWriter.scala     |  5 +-
 .../hoodie/utilities/deltastreamer/DeltaSync.java  |  2 +-
 .../deltastreamer/HoodieDeltaStreamer.java         |  6 ---
 .../hoodie/utilities/TestHoodieDeltaStreamer.java  | 58 +++---
 5 files changed, 62 insertions(+), 22 deletions(-)

```
diff --git a/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java b/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java
index e7b9494..d700ff6 100644
--- a/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java
+++ b/hoodie-spark/src/main/java/com/uber/hoodie/DataSourceUtils.java
@@ -90,10 +90,17 @@ public class DataSourceUtils {
   }

   /**
-   * Create a key generator class via reflection, passing in any configs needed
+   * Create a key generator class via reflection, passing in any configs needed.
+   *
+   * If the class name of key generator is configured through the properties file, i.e., {@code
+   * props}, use the corresponding key generator class; otherwise, use the default key generator
+   * class specified in {@code DataSourceWriteOptions}.
    */
-  public static KeyGenerator createKeyGenerator(String keyGeneratorClass,
-      TypedProperties props) throws IOException {
+  public static KeyGenerator createKeyGenerator(TypedProperties props) throws IOException {
+    String keyGeneratorClass = props.getString(
+        DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(),
+        DataSourceWriteOptions.DEFAULT_KEYGENERATOR_CLASS_OPT_VAL()
+    );
     try {
       return (KeyGenerator) ReflectionUtils.loadClass(keyGeneratorClass, props);
     } catch (Throwable e) {
diff --git a/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala b/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala
index cf44e09..414cad4 100644
--- a/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala
+++ b/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala
@@ -84,10 +84,7 @@
     log.info(s"Registered avro schema : ${schema.toString(true)}")

     // Convert to RDD[HoodieRecord]
-    val keyGenerator = DataSourceUtils.createKeyGenerator(
-      parameters(KEYGENERATOR_CLASS_OPT_KEY),
-      toProperties(parameters)
-    )
+    val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
     val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
     val hoodieAllIncomingRecords = genericRecords.map(gr => {
       val orderingVal = DataSourceUtils.getNestedFieldValAsString(
diff --git a/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/DeltaSync.java b/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/DeltaSync.java
index 89e5c73..00d270b 100644
--- a/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/DeltaSync.java
+++ b/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/DeltaSync.java
@@ -171,7 +171,7 @@ public class DeltaSync implements Serializable {
     refreshTimeline();

     this.transformer = UtilHelpers.createTransformer(cfg.transformerClassName);
-    this.keyGenerator = DataSourceUtils.createKeyGenerator(cfg.keyGeneratorClass, props);
+    this.keyGenerator = DataSourceUtils.createKeyGenerator(props);

     this.formatAdapter = new SourceFormatAdapter(UtilHelpers.createSource(cfg.sourceClassName,
         props, jssc, sparkSession, schemaProvider));
diff --git a/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/HoodieDeltaStreamer.java b/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/HoodieDeltaStreamer.java
index c49f3f8..1951546 100644
--- a/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/HoodieDeltaStreamer.java
+++ b/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/deltastreamer/HoodieDeltaStreamer.java
@@ -27,7 +27,6 @@
 import com.beust.jcommander.ParameterException;
 import com.google.common.base.Preconditions;
 import com.uber.hoodie.HoodieWriteClient;
 import com.uber.hoodie.OverwriteWithLatestAvroPayload;
-import com.uber.hoodie.SimpleKeyGenerator;
```
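The pattern this commit adopts — read a class name from the properties with a default fallback, then instantiate it via reflection — can be sketched in a few lines. The following is a minimal Python illustration of that pattern, not Hudi code; the property key and the stand-in classes from `collections` are illustrative only:

```python
import importlib

# Illustrative stand-ins for KEYGENERATOR_CLASS_OPT_KEY and its default value.
KEYGEN_CLASS_KEY = "hoodie.datasource.write.keygenerator.class"
DEFAULT_KEYGEN_CLASS = "collections.OrderedDict"

def create_key_generator(props: dict):
    """Load the configured class reflectively, falling back to a default
    when the property is absent (mirroring props.getString(key, default))."""
    class_name = props.get(KEYGEN_CLASS_KEY, DEFAULT_KEYGEN_CLASS)
    module_name, _, cls_name = class_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, cls_name)()

# With an empty props dict, the default class is instantiated.
gen = create_key_generator({})
```

The benefit over the removed CLI flag is that the same properties file now drives both the Spark datasource path and DeltaStreamer, so the key generator cannot be configured inconsistently between the two.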
[GitHub] [incubator-hudi] n3nash commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself
n3nash commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-510983462 @cdmikechen Have you tried running the demo steps to ensure these changes work fine?
[GitHub] [incubator-hudi] n3nash commented on a change in pull request #771: fix error: java.lang.IllegalArgumentException: Can not create a Path from an empty string
n3nash commented on a change in pull request #771: fix error: java.lang.IllegalArgumentException: Can not create a Path from an empty string URL: https://github.com/apache/incubator-hudi/pull/771#discussion_r303093241

File path: hoodie-common/src/main/java/com/uber/hoodie/common/table/view/AbstractTableFileSystemView.java

```
@@ -216,7 +218,9 @@ private void ensurePartitionLoadedCorrectly(String partition) {
     log.info("Building file system view for partition (" + partitionPathStr + ")");

     // Create the path if it does not exist already
-    Path partitionPath = FSUtils.getPartitionPath(metaClient.getBasePath(), partitionPathStr);
```

Review comment: +1
[GitHub] [incubator-hudi] vinothchandar commented on issue #779: HoodieDeltaStreamer may insert duplicate record?
vinothchandar commented on issue #779: HoodieDeltaStreamer may insert duplicate record? URL: https://github.com/apache/incubator-hudi/issues/779#issuecomment-510967073 @eisig Are you testing on master? What version are you using? Given you are disabling compaction, I'm not sure how the ro and rt views could match, since records would not have made their way from the log to the parquet files without compaction. I suspect you somehow end up using `--op INSERT` or `--op BULK_INSERT` instead of UPSERT; I don't see this option in your command above. Can you list the .hoodie folder and a partition for me, so I can look at what files you have underneath?
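The suspicion above rests on the core difference between the two write operations: an upsert routes each record through an index lookup by record key and overwrites the existing copy, while a plain insert appends records without any key lookup, so re-delivered records accumulate as duplicates. A toy Python model of that difference (not Hudi code, just the key/append semantics):

```python
def upsert(table: dict, records):
    # Upsert: look each record up by key; existing keys are overwritten,
    # so replaying the same key never creates a second row.
    for key, value in records:
        table[key] = value
    return table

def insert(table: list, records):
    # Insert: records are appended as-is with no key lookup,
    # so a replayed key shows up as a duplicate row.
    table.extend(records)
    return table

batch = [("id1", "v1"), ("id2", "v1")]
replay = [("id1", "v2")]   # id1 delivered a second time

upserted = upsert(upsert({}, batch), replay)   # 2 rows, id1 now maps to v2
inserted = insert(insert([], batch), replay)   # 3 rows, id1 appears twice
```

This is why `select count(*)` diverging from `select count(distinct id)` on the same table is a strong hint that records bypassed the key-based upsert path.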
[GitHub] [incubator-hudi] vinothchandar commented on issue #779: HoodieDeltaStreamer may insert duplicate record?
vinothchandar commented on issue #779: HoodieDeltaStreamer may insert duplicate record? URL: https://github.com/apache/incubator-hudi/issues/779#issuecomment-510960909 @bvaradar is this related to #775 ?
[GitHub] [incubator-hudi] vinothchandar closed issue #784: Can Hudi delete records?
vinothchandar closed issue #784: Can Hudi delete records? URL: https://github.com/apache/incubator-hudi/issues/784
[GitHub] [incubator-hudi] vinothchandar commented on issue #784: Can Hudi delete records?
vinothchandar commented on issue #784: Can Hudi delete records? URL: https://github.com/apache/incubator-hudi/issues/784#issuecomment-510960314 Yes, you can use the `EmptyRecordPayload` as in #635 to perform hard deletes, or upsert with null for soft deletes.
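Both delete styles mentioned above come down to what the upserted payload merges into the table: a hard delete upserts an empty payload so the key disappears entirely, while a soft delete upserts a record whose fields are null so the row survives with nulled values. A schematic Python model of the distinction (Hudi's actual payload classes and merge logic differ; `TOMBSTONE` is an illustrative stand-in for an empty payload):

```python
TOMBSTONE = object()  # stand-in for an empty payload (hard delete)

def apply_upsert(table: dict, key, payload):
    """Merge one upserted payload into the table: an empty payload
    removes the key; any other payload replaces the stored record."""
    if payload is TOMBSTONE:
        table.pop(key, None)   # hard delete: the record is gone entirely
    else:
        table[key] = payload   # soft delete: row remains, fields nulled
    return table

table = {"id1": {"name": "a"}, "id2": {"name": "b"}}
apply_upsert(table, "id1", TOMBSTONE)         # hard delete id1
apply_upsert(table, "id2", {"name": None})    # soft delete id2
```

Soft deletes keep the key queryable (useful when downstream consumers need to observe the deletion), while hard deletes physically remove the record once the delete is merged.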
[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI
vinothchandar edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-510959002

This indicates general Spark shuffle failures. I'd suggest first running it on a larger cluster, say 20 executors, and then start shrinking.

> Stage 2 is showing that the input size is 1888.8 MB while stage 21 is showing 6.6 GB

That expansion is just for the index checking operation. The output table will not be 6.6 GB.
[GitHub] [incubator-hudi] vinothchandar merged pull request #778: Fixed TableNotFoundException when write with structured streaming
vinothchandar merged pull request #778: Fixed TableNotFoundException when write with structured streaming URL: https://github.com/apache/incubator-hudi/pull/778
[incubator-hudi] branch master updated: Fixed TableNotFoundException when write with structured streaming (#778)
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 11c4121  Fixed TableNotFoundException when write with structured streaming (#778)

11c4121 is described below

commit 11c4121f739d1d00a4ec66b4e243e47602d6ffb4
Author: Ho Tien Vu
AuthorDate: Sat Jul 13 00:17:16 2019 +0800

    Fixed TableNotFoundException when write with structured streaming (#778)

    - When write to a new hoodie table, if checkpoint dir is under target path, Spark will create the base path and thus skip initializing .hoodie which result in error
    - apply .hoodie existent check for all save mode
---
 .../src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

```
diff --git a/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala b/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala
index 35c19aa..cf44e09 100644
--- a/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala
+++ b/hoodie-spark/src/main/scala/com/uber/hoodie/HoodieSparkSqlWriter.scala
@@ -100,23 +100,23 @@ private[hoodie] object HoodieSparkSqlWriter {
     val basePath = new Path(parameters("path"))
     val fs = basePath.getFileSystem(sparkContext.hadoopConfiguration)
-    var exists = fs.exists(basePath)
+    var exists = fs.exists(new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME))

     // Handle various save modes
     if (mode == SaveMode.ErrorIfExists && exists) {
-      throw new HoodieException(s"basePath ${basePath} already exists.")
+      throw new HoodieException(s"hoodie dataset at $basePath already exists.")
     }
     if (mode == SaveMode.Ignore && exists) {
-      log.warn(s" basePath ${basePath} already exists. Ignoring & not performing actual writes.")
+      log.warn(s"hoodie dataset at $basePath already exists. Ignoring & not performing actual writes.")
       return (true, None)
     }
     if (mode == SaveMode.Overwrite && exists) {
-      log.warn(s" basePath ${basePath} already exists. Deleting existing data & overwriting with new data.")
+      log.warn(s"hoodie dataset at $basePath already exists. Deleting existing data & overwriting with new data.")
       fs.delete(basePath, true)
       exists = false
     }

-    // Create the dataset if not present (APPEND mode)
+    // Create the dataset if not present
     if (!exists) {
       HoodieTableMetaClient.initTableType(sparkContext.hadoopConfiguration, path.get, storageType,
         tblName.get, "archived")
```
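The essence of the fix is that "does the dataset exist?" now means "does the `.hoodie` metadata folder exist?" rather than "does the base path exist?", so a base path pre-created by Spark (e.g. to hold a checkpoint dir) no longer masks an uninitialized table. The save-mode decision tree can be sketched as follows (a simplified Python model, not the actual Scala; the returned action strings are illustrative):

```python
METAFOLDER_NAME = ".hoodie"  # the dataset "exists" only if this folder does

def handle_save_mode(mode: str, metafolder_exists: bool) -> str:
    """Decide the write action from the save mode and whether the
    .hoodie metadata folder (not merely the base path) is present."""
    if mode == "ErrorIfExists" and metafolder_exists:
        raise RuntimeError("hoodie dataset already exists.")
    if mode == "Ignore" and metafolder_exists:
        return "skip"        # ignore: perform no actual writes
    if mode == "Overwrite" and metafolder_exists:
        return "overwrite"   # delete existing data, then re-initialize
    if not metafolder_exists:
        return "init"        # initialize the dataset before writing
    return "append"
```

Note that `init` is reached for any mode when the metadata folder is absent, which is exactly the behavior the commit generalizes from the Append-only case.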
[GitHub] [incubator-hudi] eisig edited a comment on issue #779: HoodieDeltaStreamer may insert duplicate record?
eisig edited a comment on issue #779: HoodieDeltaStreamer may insert duplicate record? URL: https://github.com/apache/incubator-hudi/issues/779#issuecomment-510828768

I have restarted the job several times, and added --disable-compaction. Other results seem wrong.

```
select count(*) count1, count(distinct id) count2 from hive200.test.t_order_mor03_rt
select count(*) count3, count(distinct id) count4 from hive200.test.t_order_mor03
```

count1 == count3, count2 == count4, but count1 != count2 and count3 != count4

```
select (select max(_hoodie_commit_time) from hive200.test.t_order_mor03),
       (select max(_hoodie_commit_time) from hive200.test.t_order_mor03_rt)
```

The _hoodie_commit_time values are always the same.

```
select count(*) count from hive200.test.t_order_mor03_rt rt
join hive200.test.t_order_mor03 ro on ro.id = rt.id
where rt.modify_date != ro.modify_date
```

The count keeps going up.
[GitHub] [incubator-hudi] hotienvu commented on issue #778: Fixed TableNotFoundException when write with structured streaming
hotienvu commented on issue #778: Fixed TableNotFoundException when write with structured streaming URL: https://github.com/apache/incubator-hudi/pull/778#issuecomment-510830795 @vinothchandar commits squashed. thanks for looking into this.
[GitHub] [incubator-hudi] NetsanetGeb edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI
NetsanetGeb edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-510818215 The failures are: ``` org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3 at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:882) at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:878) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:878) at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:691) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:148) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ``` In addition, stage 2 is showing that the input size is 1888.8 MB while stage 21 is showing 6.6 GB. Is this showing that a total of 6.6 GB is written as a hoodie modeled table? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] NetsanetGeb commented on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI
NetsanetGeb commented on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-510818215 The failures are: ``` org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3 at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:882) at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:878) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:878) at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:691) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:148) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)``` In addition, stage 2 is showing that the input size is 1888.8 MB while stage 21 is showing 6.6 GB. Is this showing that a total of 6.6 GB is written as a hoodie modeled table? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
[GitHub] [incubator-hudi] zhangxinjian123 opened a new issue #784: Can Hudi delete records?
zhangxinjian123 opened a new issue #784: Can Hudi delete records? URL: https://github.com/apache/incubator-hudi/issues/784 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar opened a new pull request #783: Updating site with latest content from docs folder
vinothchandar opened a new pull request #783: Updating site with latest content from docs folder URL: https://github.com/apache/incubator-hudi/pull/783 - yotpo usage - hoodie-utilities-bundle jar replacement in deltastreamer commands This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services