[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #626: Adding documentation for 
hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#discussion_r344022879
 
 

 ##
 File path: docs/test_suite.md
 ##
 @@ -0,0 +1,155 @@
+---
+title: Test Suite
+keywords: test suite
+sidebar: mydoc_sidebar
+permalink: test_suite.html
+toc: false
+summary: In this page, we will discuss the Hudi test suite used to perform end-to-end tests
+
+This page describes in detail how to run end-to-end tests on a Hudi dataset. These tests help improve 
+our confidence in a release and also enable large-scale performance benchmarks.  
+
+### Objectives
+
+1. Test with different versions of core libraries and components such as 
`hdfs`, `parquet`, `spark`, 
+`hive` and `avro`.
+2. Generate different types of workloads across different dimensions such as 
`payload size`, `number of updates`, 
+`number of inserts`, `number of partitions`
+3. Perform multiple types of operations such as `insert`, `bulk_insert`, 
`upsert`, `compact`, `query`
+4. Support custom post process actions and validations
+
+### High Level Design
+
+{% include image.html file="hudi_test_suite_design.png" 
alt="hudi_test_suite_design.png" %}
+
+The Hudi test suite runs as a long-running Spark job. The suite is divided into the following high-level components: 
+
+# Workload Generation
+
+This component does the work of generating the workload: `inserts`, `upserts`, etc.
+
+# Workload Scheduling
+
+Depending on the type of workload generated, data is either ingested into the target Hudi 
+dataset or the corresponding workload operation is executed. For example, compaction does not necessarily need a workload 
+to be generated/ingested, but it does require an execution.
+
+### Usage instructions
+
+
+# Entry class to the test suite
+
+```
+org.apache.hudi.bench.job.HudiTestSuiteJob.java - Entry point of the Hudi test suite job. This 
+class wraps all the functionality required to run a configurable integration suite.
+```
+
+# Configurations required to run the job
+```
+org.apache.hudi.bench.job.HudiTestSuiteConfig - Config class that drives the behavior of the 
+integration test suite. This class extends com.uber.hoodie.utilities.DeltaStreamerConfig. Look at the 
+link#HudiDeltaStreamer page to learn about all the available configs applicable to your test suite.
+```
+
+# Generating a custom Workload Pattern
+```
+There are 2 ways to generate a workload pattern:
+
+1. Programmatically
+Choose to write up the entire DAG of operations programmatically; take a look at the WorkflowDagGenerator class.
+Once you're ready with the DAG you want to execute, simply pass the class name as follows:
+spark-submit
+...
+...
+--class org.apache.hudi.bench.job.HudiTestSuiteJob 
+--workload-generator-classname 
org.apache.hudi.bench.dag.scheduler.
+...
+2. YAML file
+Choose to write up the entire DAG of operations in YAML; take a look at complex-workload-dag-cow.yaml or 
+complex-workload-dag-mor.yaml.
+Once you're ready with the DAG you want to execute, simply pass the YAML file path as follows:
+spark-submit
+...
+...
+--class org.apache.hudi.bench.job.HudiTestSuiteJob 
+--workload-yaml-path /path/to/your-workflow-dag.yaml
+...
+```
+
+ Building the test suite
+
+The test suite can be found in the `hudi-bench` module. Use the `prepare_integration_suite.sh` script to build 
+the test suite; you can provide different parameters to the script.
+
+```
+shell$ ./prepare_integration_suite.sh --help
 
 Review comment:
   Great !!
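
For readers following this thread, here is a minimal, self-contained sketch of what "writing up the entire DAG of operations programmatically" (as described in the doc above) can look like. The `Operation` and `DagNode` types below are made up for this illustration; they are not the actual `hudi-bench` `WorkflowDagGenerator` API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy workload DAG: insert -> upsert -> compact. Purely illustrative; the real
// hudi-bench DAG node classes and their configuration options differ.
public class ToyWorkflowDag {

  enum Operation { INSERT, UPSERT, COMPACT }

  static final class DagNode {
    final Operation op;
    final long numRecords;                    // records to generate (0 for compact)
    final List<DagNode> children = new ArrayList<>();

    DagNode(Operation op, long numRecords) {
      this.op = op;
      this.numRecords = numRecords;
    }

    DagNode then(DagNode next) {              // wire a downstream dependency
      children.add(next);
      return next;
    }
  }

  static DagNode buildDag() {
    DagNode root = new DagNode(Operation.INSERT, 1_000);
    root.then(new DagNode(Operation.UPSERT, 200))
        .then(new DagNode(Operation.COMPACT, 0));
    return root;
  }

  // Walk the DAG depth-first and print the schedule a runner would execute.
  static void print(DagNode node, int depth) {
    StringBuilder indent = new StringBuilder();
    for (int i = 0; i < depth; i++) {
      indent.append("  ");
    }
    System.out.println(indent + node.op.name() + " (" + node.numRecords + " records)");
    for (DagNode child : node.children) {
      print(child, depth + 1);
    }
  }

  public static void main(String[] args) {
    print(buildDag(), 0);
  }
}
```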




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #626: Adding documentation for 
hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#discussion_r344022479
 
 

 ##
 File path: docs/docker_demo.md
 ##
 @@ -1081,6 +1081,34 @@ presto:default>
 
 This brings the demo to an end.
 
+## Running an end-to-end test suite in a local Docker environment
+
+```
+docker exec -it adhoc-2 /bin/bash
+
+# COPY_ON_WRITE tables
+=
+## Run the following command to start the test suite
+spark-submit  --packages com.databricks:spark-avro_2.11:4.0.0  --conf 
spark.task.cpus=1  --conf spark.executor.cores=1  --conf 
spark.task.maxFailures=100  --conf spark.memory.fraction=0.4  --conf 
spark.rdd.compress=true  --conf spark.kryoserializer.buffer.max=2000m  --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf 
spark.memory.storageFraction=0.1  --conf spark.shuffle.service.enabled=true  
--conf spark.sql.hive.convertMetastoreParquet=false  --conf spark.ui.port=  
--conf spark.driver.maxResultSize=12g  --conf 
spark.executor.heartbeatInterval=120s  --conf spark.network.timeout=600s  
--conf spark.eventLog.overwrite=true  --conf spark.eventLog.enabled=true  
--conf spark.yarn.max.executor.failures=10  --conf 
spark.sql.catalogImplementation=hive  --conf spark.sql.shuffle.partitions=1000  
--class org.apache.hudi.bench.job.HudiTestSuiteJob $HUDI_BENCH_BUNDLE 
--source-ordering-field timestamp  --target-base-path 
/user/hive/warehouse/hudi-bench/output  --input-base-path 
/user/hive/warehouse/hudi-bench/input  --target-table test_table  --props 
/var/hoodie/ws/docker/demo/config/bench/test-source.properties  
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider 
 --source-limit 30  --source-class 
org.apache.hudi.utilities.sources.AvroDFSSource  --input-file-size 125829120  
--workload-yaml-path 
/var/hoodie/ws/docker/demo/config/bench/complex-workflow-dag-cow.yaml  
--storage-type COPY_ON_WRITE  --compact-scheduling-minshare 1  --hoodie-conf 
"hoodie.deltastreamer.source.test.num_partitions=100"  --hoodie-conf 
"hoodie.deltastreamer.source.test.datagen.use_rocksdb_for_storing_existing_keys=false"
  --hoodie-conf "hoodie.deltastreamer.source.test.max_unique_records=1" 
 --hoodie-conf "hoodie.embed.timeline.server=false"  --hoodie-conf 
"hoodie.datasource.write.recordkey.field=_row_key"  --hoodie-conf 
"hoodie.deltastreamer.source.dfs.root=/user/hive/warehouse/hudi-bench/input"  
--hoodie-conf 
"hoodie.datasource.write.keygenerator.class=org.apache.hudi.ComplexKeyGenerator"
  --hoodie-conf "hoodie.datasource.write.partitionpath.field=timestamp"  
--hoodie-conf 
"hoodie.deltastreamer.schemaprovider.source.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=false"  
--hoodie-conf 
"hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1/"  
--hoodie-conf "hoodie.datasource.hive_sync.database=testdb"  --hoodie-conf 
"hoodie.datasource.hive_sync.table=test_table"  --hoodie-conf 
"hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=true"  
--hoodie-conf 
"hoodie.datasource.write.keytranslator.class=org.apache.hudi.DayBasedPartitionPathKeyTranslator"
  --hoodie-conf 
"hoodie.deltastreamer.schemaprovider.target.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
+...
+...
+2019-11-03 05:44:47 INFO  DagScheduler:69 - --- Finished workloads 
--
+2019-11-03 05:44:47 INFO  HudiTestSuiteJob:138 - Finished scheduling all tasks
+...
+2019-11-03 05:44:48 INFO  SparkContext:54 - Successfully stopped SparkContext
+
+# MERGE_ON_READ tables
+=
+## Run the following command to start the test suite
+spark-submit  --packages com.databricks:spark-avro_2.11:4.0.0  --conf 
spark.task.cpus=1  --conf spark.executor.cores=1  --conf 
spark.task.maxFailures=100  --conf spark.memory.fraction=0.4  --conf 
spark.rdd.compress=true  --conf spark.kryoserializer.buffer.max=2000m  --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf 
spark.memory.storageFraction=0.1  --conf spark.shuffle.service.enabled=true  
--conf spark.sql.hive.convertMetastoreParquet=false  --conf spark.ui.port=  
--conf spark.driver.maxResultSize=12g  --conf 
spark.executor.heartbeatInterval=120s  --conf spark.network.timeout=600s  
--conf spark.eventLog.overwrite=true  --conf spark.eventLog.enabled=true  
--conf spark.yarn.max.executor.failures=10  --conf 
spark.sql.catalogImplementation=hive  --conf spark.sql.shuffle.partitions=1000  
--class org.apache.hudi.bench.job.HudiTestSuiteJob $HUDI_BENCH_BUNDLE 
--source-ordering-field timestamp  --target-base-path 
/user/hive/warehouse/hudi-bench/output  --input-base-path 
/user/hive/warehouse/hudi-bench/input  --target-table test_table  --props 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #626: Adding documentation for 
hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#discussion_r344021615
 
 

 ##
 File path: docs/test_suite.md
 ##
 @@ -0,0 +1,155 @@
+---
+title: Test Suite
+keywords: test suite
+sidebar: mydoc_sidebar
+permalink: test_suite.html
+toc: false
+summary: In this page, we will discuss the Hudi test suite used to perform end-to-end tests
+
+This page describes in detail how to run end-to-end tests on a Hudi dataset. These tests help improve 
+our confidence in a release and also enable large-scale performance benchmarks.  
+
+### Objectives
+
+1. Test with different versions of core libraries and components such as 
`hdfs`, `parquet`, `spark`, 
+`hive` and `avro`.
+2. Generate different types of workloads across different dimensions such as 
`payload size`, `number of updates`, 
+`number of inserts`, `number of partitions`
+3. Perform multiple types of operations such as `insert`, `bulk_insert`, 
`upsert`, `compact`, `query`
+4. Support custom post process actions and validations
+
+### High Level Design
+
+{% include image.html file="hudi_test_suite_design.png" 
alt="hudi_test_suite_design.png" %}
 
 Review comment:
   Some terminology, like Operation and Action, is defined in the image. Trying 
to understand: is this still applicable after your refactor? 
   
   Also, can you elaborate on the terminology in this readme?




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #626: Adding documentation for 
hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#discussion_r344020491
 
 

 ##
 File path: docs/docker_demo.md
 ##
 @@ -1081,6 +1081,34 @@ presto:default>
 
 This brings the demo to an end.
 
+## Running an end-to-end test suite in a local Docker environment
+
+```
+docker exec -it adhoc-2 /bin/bash
+
+# COPY_ON_WRITE tables
+=
+## Run the following command to start the test suite
+spark-submit  --packages com.databricks:spark-avro_2.11:4.0.0  --conf 
spark.task.cpus=1  --conf spark.executor.cores=1  --conf 
spark.task.maxFailures=100  --conf spark.memory.fraction=0.4  --conf 
spark.rdd.compress=true  --conf spark.kryoserializer.buffer.max=2000m  --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf 
spark.memory.storageFraction=0.1  --conf spark.shuffle.service.enabled=true  
--conf spark.sql.hive.convertMetastoreParquet=false  --conf spark.ui.port=  
--conf spark.driver.maxResultSize=12g  --conf 
spark.executor.heartbeatInterval=120s  --conf spark.network.timeout=600s  
--conf spark.eventLog.overwrite=true  --conf spark.eventLog.enabled=true  
--conf spark.yarn.max.executor.failures=10  --conf 
spark.sql.catalogImplementation=hive  --conf spark.sql.shuffle.partitions=1000  
--class org.apache.hudi.bench.job.HudiTestSuiteJob $HUDI_BENCH_BUNDLE 
--source-ordering-field timestamp  --target-base-path 
/user/hive/warehouse/hudi-bench/output  --input-base-path 
/user/hive/warehouse/hudi-bench/input  --target-table test_table  --props 
/var/hoodie/ws/docker/demo/config/bench/test-source.properties  
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider 
 --source-limit 30  --source-class 
org.apache.hudi.utilities.sources.AvroDFSSource  --input-file-size 125829120  
--workload-yaml-path 
/var/hoodie/ws/docker/demo/config/bench/complex-workflow-dag-cow.yaml  
--storage-type COPY_ON_WRITE  --compact-scheduling-minshare 1  --hoodie-conf 
"hoodie.deltastreamer.source.test.num_partitions=100"  --hoodie-conf 
"hoodie.deltastreamer.source.test.datagen.use_rocksdb_for_storing_existing_keys=false"
  --hoodie-conf "hoodie.deltastreamer.source.test.max_unique_records=1" 
 --hoodie-conf "hoodie.embed.timeline.server=false"  --hoodie-conf 
"hoodie.datasource.write.recordkey.field=_row_key"  --hoodie-conf 
"hoodie.deltastreamer.source.dfs.root=/user/hive/warehouse/hudi-bench/input"  
--hoodie-conf 
"hoodie.datasource.write.keygenerator.class=org.apache.hudi.ComplexKeyGenerator"
  --hoodie-conf "hoodie.datasource.write.partitionpath.field=timestamp"  
--hoodie-conf 
"hoodie.deltastreamer.schemaprovider.source.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=false"  
--hoodie-conf 
"hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1/"  
--hoodie-conf "hoodie.datasource.hive_sync.database=testdb"  --hoodie-conf 
"hoodie.datasource.hive_sync.table=test_table"  --hoodie-conf 
"hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=true"  
--hoodie-conf 
"hoodie.datasource.write.keytranslator.class=org.apache.hudi.DayBasedPartitionPathKeyTranslator"
  --hoodie-conf 
"hoodie.deltastreamer.schemaprovider.target.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
 
 Review comment:
   Do we need to create any input or output directories beforehand? Is it just 
enough to directly run this command?
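
(If pre-creating the paths does turn out to be necessary, a minimal sketch along the following lines would do it from the adhoc container; whether it is required at all is exactly the question being asked here. The paths match the ones used in the command above.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: pre-create the input/output base paths referenced by the spark-submit command above.
public class PrepareBenchDirs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path input = new Path("/user/hive/warehouse/hudi-bench/input");
    Path output = new Path("/user/hive/warehouse/hudi-bench/output");

    // mkdirs is a no-op if the directory already exists
    System.out.println("input dir ready: " + fs.mkdirs(input));
    System.out.println("output dir ready: " + fs.mkdirs(output));
  }
}
```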




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #626: Adding documentation for 
hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#discussion_r344023332
 
 

 ##
 File path: docs/test_suite.md
 ##
 @@ -0,0 +1,155 @@
+---
+title: Test Suite
+keywords: test suite
+sidebar: mydoc_sidebar
+permalink: test_suite.html
+toc: false
+summary: In this page, we will discuss the Hudi test suite used to perform end-to-end tests
+
+This page describes in detail how to run end-to-end tests on a Hudi dataset. These tests help improve 
+our confidence in a release and also enable large-scale performance benchmarks.  
+
+### Objectives
+
+1. Test with different versions of core libraries and components such as 
`hdfs`, `parquet`, `spark`, 
+`hive` and `avro`.
+2. Generate different types of workloads across different dimensions such as 
`payload size`, `number of updates`, 
+`number of inserts`, `number of partitions`
+3. Perform multiple types of operations such as `insert`, `bulk_insert`, 
`upsert`, `compact`, `query`
+4. Support custom post process actions and validations
+
+### High Level Design
+
+{% include image.html file="hudi_test_suite_design.png" 
alt="hudi_test_suite_design.png" %}
+
+The Hudi test suite runs as a long-running Spark job. The suite is divided into the following high-level components: 
+
+# Workload Generation
+
+This component does the work of generating the workload: `inserts`, `upserts`, etc.
+
+# Workload Scheduling
+
+Depending on the type of workload generated, data is either ingested into the target Hudi 
+dataset or the corresponding workload operation is executed. For example, compaction does not necessarily need a workload 
+to be generated/ingested, but it does require an execution.
+
+### Usage instructions
+
+
+# Entry class to the test suite
+
+```
+org.apache.hudi.bench.job.HudiTestSuiteJob.java - Entry point of the Hudi test suite job. This 
+class wraps all the functionality required to run a configurable integration suite.
+```
+
+# Configurations required to run the job
+```
+org.apache.hudi.bench.job.HudiTestSuiteConfig - Config class that drives the behavior of the 
+integration test suite. This class extends com.uber.hoodie.utilities.DeltaStreamerConfig. Look at the 
+link#HudiDeltaStreamer page to learn about all the available configs applicable to your test suite.
+```
+
+# Generating a custom Workload Pattern
+```
+There are 2 ways to generate a workload pattern:
+
+1. Programmatically
+Choose to write up the entire DAG of operations programmatically; take a look at the WorkflowDagGenerator class.
+Once you're ready with the DAG you want to execute, simply pass the class name as follows:
+spark-submit
+...
+...
+--class org.apache.hudi.bench.job.HudiTestSuiteJob 
+--workload-generator-classname 
org.apache.hudi.bench.dag.scheduler.
+...
+2. YAML file
+Choose to write up the entire DAG of operations in YAML; take a look at complex-workload-dag-cow.yaml or 
+complex-workload-dag-mor.yaml.
+Once you're ready with the DAG you want to execute, simply pass the YAML file path as follows:
+spark-submit
+...
+...
+--class org.apache.hudi.bench.job.HudiTestSuiteJob 
+--workload-yaml-path /path/to/your-workflow-dag.yaml
+...
+```
+
+ Building the test suite
+
+The test suite can be found in the `hudi-bench` module. Use the `prepare_integration_suite.sh` script to build 
+the test suite; you can provide different parameters to the script.
+
+```
+shell$ ./prepare_integration_suite.sh --help
+Usage: prepare_integration_suite.sh
+   --spark-command, prints the spark command
+   -h, hdfs-version
+   -s, spark version
+   -p, parquet version
+   -a, avro version
+   -s, hive version
+```
+
+```
+shell$ ./prepare_integration_suite.sh
+
+
+Final command : mvn clean install -DskipTests
+[INFO] 
+[INFO] Reactor Summary:
+[INFO]
+[INFO] Hudi ... SUCCESS [  2.749 s]
+[INFO] hudi-common  SUCCESS [ 12.711 s]
+[INFO] hudi-timeline-service .. SUCCESS [  1.924 s]
+[INFO] hudi-hadoop-mr . SUCCESS [  7.203 s]
+[INFO] hudi-client  SUCCESS [ 10.486 s]
+[INFO] hudi-hive .. SUCCESS [  5.159 s]
+[INFO] hudi-spark . SUCCESS [ 34.499 s]
+[INFO] hudi-utilities . SUCCESS [  8.626 s]
+[INFO] hudi-cli ... SUCCESS [ 14.921 s]
+[INFO] hudi-bench . SUCCESS [  7.706 s]
+[INFO] hudi-hadoop-mr-bundle .. SUCCESS [  1.873 s]
+[INFO] hudi-hive-bundle ... SUCCESS [  1.508 s]
+[INFO] hudi-spark-bundle .. SUCCESS [ 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #626: Adding documentation for 
hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#discussion_r344020142
 
 

 ##
 File path: docs/docker_demo.md
 ##
 @@ -1081,6 +1081,34 @@ presto:default>
 
 This brings the demo to an end.
 
+## Running an end-to-end test suite in a local Docker environment
+
+```
+docker exec -it adhoc-2 /bin/bash
+
+# COPY_ON_WRITE tables
+=
+## Run the following command to start the test suite
+spark-submit  --packages com.databricks:spark-avro_2.11:4.0.0  --conf 
spark.task.cpus=1  --conf spark.executor.cores=1  --conf 
spark.task.maxFailures=100  --conf spark.memory.fraction=0.4  --conf 
spark.rdd.compress=true  --conf spark.kryoserializer.buffer.max=2000m  --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf 
spark.memory.storageFraction=0.1  --conf spark.shuffle.service.enabled=true  
--conf spark.sql.hive.convertMetastoreParquet=false  --conf spark.ui.port=  
--conf spark.driver.maxResultSize=12g  --conf 
spark.executor.heartbeatInterval=120s  --conf spark.network.timeout=600s  
--conf spark.eventLog.overwrite=true  --conf spark.eventLog.enabled=true  
--conf spark.yarn.max.executor.failures=10  --conf 
spark.sql.catalogImplementation=hive  --conf spark.sql.shuffle.partitions=1000  
--class org.apache.hudi.bench.job.HudiTestSuiteJob $HUDI_BENCH_BUNDLE 
--source-ordering-field timestamp  --target-base-path 
/user/hive/warehouse/hudi-bench/output  --input-base-path 
/user/hive/warehouse/hudi-bench/input  --target-table test_table  --props 
/var/hoodie/ws/docker/demo/config/bench/test-source.properties  
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider 
 --source-limit 30  --source-class 
org.apache.hudi.utilities.sources.AvroDFSSource  --input-file-size 125829120  
--workload-yaml-path 
/var/hoodie/ws/docker/demo/config/bench/complex-workflow-dag-cow.yaml  
--storage-type COPY_ON_WRITE  --compact-scheduling-minshare 1  --hoodie-conf 
"hoodie.deltastreamer.source.test.num_partitions=100"  --hoodie-conf 
"hoodie.deltastreamer.source.test.datagen.use_rocksdb_for_storing_existing_keys=false"
  --hoodie-conf "hoodie.deltastreamer.source.test.max_unique_records=1" 
 --hoodie-conf "hoodie.embed.timeline.server=false"  --hoodie-conf 
"hoodie.datasource.write.recordkey.field=_row_key"  --hoodie-conf 
"hoodie.deltastreamer.source.dfs.root=/user/hive/warehouse/hudi-bench/input"  
--hoodie-conf 
"hoodie.datasource.write.keygenerator.class=org.apache.hudi.ComplexKeyGenerator"
  --hoodie-conf "hoodie.datasource.write.partitionpath.field=timestamp"  
--hoodie-conf 
"hoodie.deltastreamer.schemaprovider.source.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=false"  
--hoodie-conf 
"hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1/"  
--hoodie-conf "hoodie.datasource.hive_sync.database=testdb"  --hoodie-conf 
"hoodie.datasource.hive_sync.table=test_table"  --hoodie-conf 
"hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=true"  
--hoodie-conf 
"hoodie.datasource.write.keytranslator.class=org.apache.hudi.DayBasedPartitionPathKeyTranslator"
  --hoodie-conf 
"hoodie.deltastreamer.schemaprovider.target.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
+...
+...
+2019-11-03 05:44:47 INFO  DagScheduler:69 - --- Finished workloads 
--
+2019-11-03 05:44:47 INFO  HudiTestSuiteJob:138 - Finished scheduling all tasks
+...
+2019-11-03 05:44:48 INFO  SparkContext:54 - Successfully stopped SparkContext
+
+# MERGE_ON_READ tables
+=
+## Run the following command to start the test suite
+spark-submit  --packages com.databricks:spark-avro_2.11:4.0.0  --conf 
spark.task.cpus=1  --conf spark.executor.cores=1  --conf 
spark.task.maxFailures=100  --conf spark.memory.fraction=0.4  --conf 
spark.rdd.compress=true  --conf spark.kryoserializer.buffer.max=2000m  --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf 
spark.memory.storageFraction=0.1  --conf spark.shuffle.service.enabled=true  
--conf spark.sql.hive.convertMetastoreParquet=false  --conf spark.ui.port=  
--conf spark.driver.maxResultSize=12g  --conf 
spark.executor.heartbeatInterval=120s  --conf spark.network.timeout=600s  
--conf spark.eventLog.overwrite=true  --conf spark.eventLog.enabled=true  
--conf spark.yarn.max.executor.failures=10  --conf 
spark.sql.catalogImplementation=hive  --conf spark.sql.shuffle.partitions=1000  
--class org.apache.hudi.bench.job.HudiTestSuiteJob $HUDI_BENCH_BUNDLE 
--source-ordering-field timestamp  --target-base-path 
/user/hive/warehouse/hudi-bench/output  --input-base-path 
/user/hive/warehouse/hudi-bench/input  --target-table test_table  --props 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #626: Adding documentation for 
hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#discussion_r344021879
 
 

 ##
 File path: docs/docker_demo.md
 ##
 @@ -1081,6 +1081,34 @@ presto:default>
 
 This brings the demo to an end.
 
+## Running an end-to-end test suite in a local Docker environment
+
+```
+docker exec -it adhoc-2 /bin/bash
+
+# COPY_ON_WRITE tables
+=
+## Run the following command to start the test suite
+spark-submit  --packages com.databricks:spark-avro_2.11:4.0.0  --conf 
spark.task.cpus=1  --conf spark.executor.cores=1  --conf 
spark.task.maxFailures=100  --conf spark.memory.fraction=0.4  --conf 
spark.rdd.compress=true  --conf spark.kryoserializer.buffer.max=2000m  --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf 
spark.memory.storageFraction=0.1  --conf spark.shuffle.service.enabled=true  
--conf spark.sql.hive.convertMetastoreParquet=false  --conf spark.ui.port=  
--conf spark.driver.maxResultSize=12g  --conf 
spark.executor.heartbeatInterval=120s  --conf spark.network.timeout=600s  
--conf spark.eventLog.overwrite=true  --conf spark.eventLog.enabled=true  
--conf spark.yarn.max.executor.failures=10  --conf 
spark.sql.catalogImplementation=hive  --conf spark.sql.shuffle.partitions=1000  
--class org.apache.hudi.bench.job.HudiTestSuiteJob $HUDI_BENCH_BUNDLE 
--source-ordering-field timestamp  --target-base-path 
/user/hive/warehouse/hudi-bench/output  --input-base-path 
/user/hive/warehouse/hudi-bench/input  --target-table test_table  --props 
/var/hoodie/ws/docker/demo/config/bench/test-source.properties  
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider 
 --source-limit 30  --source-class 
org.apache.hudi.utilities.sources.AvroDFSSource  --input-file-size 125829120  
--workload-yaml-path 
/var/hoodie/ws/docker/demo/config/bench/complex-workflow-dag-cow.yaml  
--storage-type COPY_ON_WRITE  --compact-scheduling-minshare 1  --hoodie-conf 
"hoodie.deltastreamer.source.test.num_partitions=100"  --hoodie-conf 
"hoodie.deltastreamer.source.test.datagen.use_rocksdb_for_storing_existing_keys=false"
  --hoodie-conf "hoodie.deltastreamer.source.test.max_unique_records=1" 
 --hoodie-conf "hoodie.embed.timeline.server=false"  --hoodie-conf 
"hoodie.datasource.write.recordkey.field=_row_key"  --hoodie-conf 
"hoodie.deltastreamer.source.dfs.root=/user/hive/warehouse/hudi-bench/input"  
--hoodie-conf 
"hoodie.datasource.write.keygenerator.class=org.apache.hudi.ComplexKeyGenerator"
  --hoodie-conf "hoodie.datasource.write.partitionpath.field=timestamp"  
--hoodie-conf 
"hoodie.deltastreamer.schemaprovider.source.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=false"  
--hoodie-conf 
"hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1/"  
--hoodie-conf "hoodie.datasource.hive_sync.database=testdb"  --hoodie-conf 
"hoodie.datasource.hive_sync.table=test_table"  --hoodie-conf 
"hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor"
  --hoodie-conf "hoodie.datasource.hive_sync.assume_date_partitioning=true"  
--hoodie-conf 
"hoodie.datasource.write.keytranslator.class=org.apache.hudi.DayBasedPartitionPathKeyTranslator"
  --hoodie-conf 
"hoodie.deltastreamer.schemaprovider.target.schema.file=/var/hoodie/ws/docker/demo/config/bench/source.avsc"
 
 Review comment:
   Do we need to do any setup before we run this command?




[jira] [Issue Comment Deleted] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties

2019-11-07 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-114:
--
Comment: was deleted

(was: [~nishith29] yeah, I would like to have some more clarification before 
starting to work on it. Precisely, I want to get more context on why one may 
need to pass a new payload class name. It is already possible to configure it at 
run time using the HoodieDeltaStreamer.Config class.

Also, which datasource API are you talking about? )

> Allow for clients to overwrite the payload implementation in hoodie.properties
> --
>
> Key: HUDI-114
> URL: https://issues.apache.org/jira/browse/HUDI-114
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Reporter: Nishith Agarwal
>Assignee: Pratyaksh Sharma
>Priority: Minor
>
> Right now, once the payload class is set in hoodie.properties, it cannot 
> be changed. In some cases, if a code refactor is done and the jar updated, 
> one may need to pass the new payload class name. 
> Also, fix picking up the payload name for the datasource API. By default 
> HoodieAvroPayload is written, whereas for the datasource API the default is 
> OverwriteLatestAvroPayload
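
As a rough illustration of what "overwriting the payload implementation in hoodie.properties" amounts to, here is a minimal sketch using plain java.util.Properties against a locally accessible copy of the file. The property key is an assumption; check HoodieTableConfig for the exact name, and for a table on HDFS you would read/write the file through the Hadoop FileSystem API instead.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Sketch: rewrite the payload class recorded in <basePath>/.hoodie/hoodie.properties.
public class OverwritePayloadClass {
  public static void main(String[] args) throws Exception {
    Path propsFile = Paths.get(args[0]);             // path to hoodie.properties
    String newPayloadClass = args[1];                // e.g. the refactored payload class name

    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(propsFile)) {
      props.load(in);
    }

    // Assumed key; confirm against HoodieTableConfig for your Hudi version.
    String key = "hoodie.compaction.payload.class";
    System.out.println("old payload class: " + props.getProperty(key));
    props.setProperty(key, newPayloadClass);

    try (OutputStream out = Files.newOutputStream(propsFile)) {
      props.store(out, "payload class overwritten");
    }
  }
}
```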





[jira] [Commented] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2019-11-07 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969874#comment-16969874
 ] 

vinoyang commented on HUDI-64:
--

Starting to think through and understand the details of the implementation. Any ideas will 
be presented here.

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management, Write Client
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem, to estimate amount of bytes it 
> has written for a given parquet file and closes the parquet file once the 
> configured size has reached. DFSOutputStream level we only know bytes written 
> before compression. Once enough data has been written, it should be possible 
> to replace this by a simple estimate of what the avg record size would be 
> (commit metadata would give you size and number of records in each file)
>  # Very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to same record in 
> parquet multiple times. We need to estimate again how large the log file(s) 
> can grow to, and still we would be able to produce a parquet file of 
> configured size during compaction. (hope I conveyed this clearly)
>  # WorkloadProfile : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark Caching and computes the shape of the 
> workload, i.e how many records per partition, how many inserts vs updates 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess input, 
> which is not always possible? 
>  # Within partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized.  (default : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
>  
>  # 
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and estimate ROI 
> before actually implementing. 
>  
>  
>  
>  
>  
>  
> Roughly along the likes of. [https://github.com/uber/hudi/issues/270] 
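
To make the "average size per record" heuristic above concrete, here is a small, self-contained illustration of the kind of estimate being discussed: derive an average record size from the bytes and records written in previous commits, and fall back to a configured default when there is not enough history. It mirrors the idea referenced in the links, not the actual Hudi code.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the heuristic discussed above: estimate average record size from prior
// commit stats (total bytes written / total records written), falling back to a
// configured default when the history is too thin to trust.
public class AvgRecordSizeEstimator {

  static final class CommitStats {
    final long bytesWritten;
    final long recordsWritten;
    CommitStats(long bytesWritten, long recordsWritten) {
      this.bytesWritten = bytesWritten;
      this.recordsWritten = recordsWritten;
    }
  }

  static long estimateAvgRecordSize(List<CommitStats> history, long defaultSizeBytes, long minRecordsForEstimate) {
    long totalBytes = 0;
    long totalRecords = 0;
    for (CommitStats c : history) {
      totalBytes += c.bytesWritten;
      totalRecords += c.recordsWritten;
    }
    if (totalRecords < minRecordsForEstimate) {
      return defaultSizeBytes;               // not enough history, use the configured default
    }
    return totalBytes / totalRecords;        // weighted average across commits
  }

  public static void main(String[] args) {
    List<CommitStats> history = Arrays.asList(
        new CommitStats(120_000_000L, 1_000_000L),   // ~120 bytes per record
        new CommitStats(250_000_000L, 2_000_000L));  // ~125 bytes per record
    // Prints 123: (120M + 250M) bytes over 3M records.
    System.out.println(estimateAvgRecordSize(history, 1024, 10_000));
  }
}
```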





[GitHub] [incubator-hudi] n3nash commented on issue #998: Incremental view not implemented yet, for merge-on-read datasets

2019-11-07 Thread GitBox
n3nash commented on issue #998: Incremental view not implemented yet, for 
merge-on-read datasets
URL: https://github.com/apache/incubator-hudi/issues/998#issuecomment-551393282
 
 
   @HariprasadAllaka1612 This is still not implemented for the Spark datasource, but 
we plan to implement it soon. @bhasudha please find the ticket here: 
https://issues.apache.org/jira/browse/HUDI-58




[GitHub] [incubator-hudi] n3nash commented on issue #626: Adding documentation for hudi test suite

2019-11-07 Thread GitBox
n3nash commented on issue #626: Adding documentation for hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#issuecomment-551392836
 
 
   @bvaradar please review this as well




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #92

2019-11-07 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.23 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle

[jira] [Closed] (HUDI-245) Refactor code references that call HoodieTimeline.getInstants() and reverse to directly use method HoodieTimeline.getReverseOrderedInstants

2019-11-07 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-245.
--
Fix Version/s: 0.5.1
   Resolution: Fixed

Fixed via master: 0863b1cfd947402c66221afa8d1f18fd2bd8273b

> Refactor code references that call HoodieTimeline.getInstants() and reverse 
> to directly use method HoodieTimeline.getReverseOrderedInstants 
> 
>
> Key: HUDI-245
> URL: https://issues.apache.org/jira/browse/HUDI-245
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: Bhavani Sudha Saktheeswaran
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
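
For context, the refactor is essentially: stop materializing the forward-ordered instant stream and reversing it at every call site, and consume the timeline's reverse-order accessor instead. A language-level sketch of the two shapes, using a stand-in String timestamp rather than Hudi's HoodieInstant so it compiles on its own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Conceptual before/after for this refactor (stand-in types, not Hudi classes).
public class ReverseInstantsSketch {

  // BEFORE: each call site collects the forward-ordered stream and reverses it itself.
  static List<String> reversedTheOldWay(Stream<String> instants) {
    List<String> list = instants.collect(Collectors.toList());
    Collections.reverse(list);
    return list;
  }

  // AFTER: a single reverse-order accessor (here simulated) that call sites just consume.
  static Stream<String> reverseOrderedInstants(List<String> forwardOrdered) {
    List<String> copy = new ArrayList<>(forwardOrdered);
    Collections.reverse(copy);
    return copy.stream();
  }

  public static void main(String[] args) {
    List<String> commits = Arrays.asList("20191105", "20191106", "20191107");
    System.out.println(reversedTheOldWay(commits.stream()));      // [20191107, 20191106, 20191105]
    reverseOrderedInstants(commits).forEach(System.out::println); // same order, no per-call-site reversal
  }
}
```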






[GitHub] [incubator-hudi] leesf commented on a change in pull request #999: Update to align with original Uber whitepaper

2019-11-07 Thread GitBox
leesf commented on a change in pull request #999: Update to align with original 
Uber whitepaper
URL: https://github.com/apache/incubator-hudi/pull/999#discussion_r343955616
 
 

 ##
 File path: README.md
 ##
 @@ -16,7 +16,7 @@
 -->
 
 # Hudi
-Apache Hudi (Incubating) (pronounced Hoodie) stands for `Hadoop Upserts anD 
Incrementals`. 
+Apache Hudi (Incubating) (pronounced Hoodie) stands for `Hadoop Upsert Delete 
and Incremental`. 
 
 Review comment:
   > nit: `Upserts Deletes and Incrementals`?
   
   Paste from whitepaper
   >In short, Hudi (Hadoop Upsert Delete and Incremental) is an analytical, 
scan-optimized data storage abstraction which enables applying mutations to 
data in HDFS on the order of few minutes and chaining of incremental processing
   
   I think `Hadoop Upsert Delete and Incremental` is ok. @vinothchandar 




[jira] [Created] (HUDI-327) Introduce "null" supporting ComplexKeyGenerator

2019-11-07 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-327:
-

 Summary: Introduce "null" supporting ComplexKeyGenerator
 Key: HUDI-327
 URL: https://issues.apache.org/jira/browse/HUDI-327
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Brandon Scheller


Customers have been running into issues where they would like to use a 
record_key from columns that can contain null values. Currently, this will 
cause Hudi to crash and throw a cryptic exception. (Improving error messaging is 
a separate but related issue.)

We would like to propose a new KeyGenerator based on ComplexKeyGenerator that 
allows for null record_keys.

At a basic level, using the key generator without any options would essentially 
allow a null record_key to be accepted. (It can be replaced with an empty 
string, null, or some predefined "null" string representation)

This comes with the negative side effect that all records with a null 
record_key would then be associated together. To work around this, you would be 
able to specify a secondary record_key to be used in the case that the first 
one is null. You would specify this in the same way that you do for the 
ComplexKeyGenerator as a comma separated list of record_keys. In this case, 
when the first key is seen as null then the second key will be used instead. We 
could support any arbitrary limit of record_keys here.

While we are aware there are many alternatives to avoid using a null 
record_key, we believe this will act as a usability improvement so that new 
users are not forced to clean/update their data in order to use Hudi.

We are hoping to get some feedback on this idea.
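
To make the proposal concrete, here is a minimal sketch of the key-resolution rule described above: walk an ordered list of candidate record_key fields and use the first non-null value, falling back to a predefined "null" token. This only illustrates the idea; a real implementation would extend Hudi's KeyGenerator and operate on Avro GenericRecords, like ComplexKeyGenerator does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed fallback behavior: the first non-null candidate field wins,
// otherwise a predefined "null" token is used as the record key.
public class NullTolerantKeySketch {

  static String resolveRecordKey(Map<String, Object> record, List<String> candidateFields, String nullToken) {
    for (String field : candidateFields) {
      Object value = record.get(field);
      if (value != null) {
        return field + ":" + value;   // qualify with the field name, in the spirit of ComplexKeyGenerator
      }
    }
    return nullToken;                 // every candidate was null
  }

  public static void main(String[] args) {
    Map<String, Object> row = new HashMap<>();
    row.put("primary_id", null);      // first choice is null...
    row.put("secondary_id", "abc-123");

    // ...so the secondary key is used instead; prints "secondary_id:abc-123"
    System.out.println(resolveRecordKey(row, Arrays.asList("primary_id", "secondary_id"), "__HUDI_NULL_KEY__"));
  }
}
```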

 

 

 

 





[jira] [Updated] (HUDI-327) Introduce "null" supporting KeyGenerator

2019-11-07 Thread Brandon Scheller (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Scheller updated HUDI-327:
--
Summary: Introduce "null" supporting KeyGenerator  (was: Introduce "null" 
supporting ComplexKeyGenerator)

> Introduce "null" supporting KeyGenerator
> 
>
> Key: HUDI-327
> URL: https://issues.apache.org/jira/browse/HUDI-327
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Brandon Scheller
>Priority: Major
>
> Customers have been running into issues where they would like to use a 
> record_key from columns that can contain null values. Currently, this will 
> cause Hudi to crash and throw a cryptic exception.(improving error messaging 
> is a separate but related issue)
> We would like to propose a new KeyGenerator based on ComplexKeyGenerator that 
> allows for null record_keys.
> At a basic level, using the key generator without any options would 
> essentially allow a null record_key to be accepted. (It can be replaced with 
> an empty string, null, or some predefined "null" string representation)
> This comes with the negative side effect that all records with a null 
> record_key would then be associated together. To work around this, you would 
> be able to specify a secondary record_key to be used in the case that the 
> first one is null. You would specify this in the same way that you do for the 
> ComplexKeyGenerator as a comma separated list of record_keys. In this case, 
> when the first key is seen as null then the second key will be used instead. 
> We could support any arbitrary limit of record_keys here.
> While we are aware there are many alternatives to avoid using a null 
> record_key. We believe this will act as a usability improvement so that new 
> users are not forced to clean/update their data in order to use Hudi.
> We are hoping to get some feedback on the idea
>  
>  
>  
>  





[jira] [Commented] (HUDI-326) Support deleting records with only record_key

2019-11-07 Thread Brandon Scheller (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969684#comment-16969684
 ] 

Brandon Scheller commented on HUDI-326:
---

Additionally, does anyone have some context on how difficult this 
implementation would be? It seems like Hudi doesn't really have any way to 
track its own partitions, so it looks like we'd have to scan for all the 
partitions within the table if we wanted to implement something like this.
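
(As an aside, "scan for all the partitions within the table" would look roughly like the sketch below: recursively list data files under the table base path with Hadoop's FileSystem API and collect their parent directories. This is only meant to illustrate the cost being discussed, not code from Hudi.)

```java
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch: discover partition directories by listing data files under the table base
// path and collecting each file's parent directory, relative to the base path.
public class ListPartitionsSketch {
  public static void main(String[] args) throws Exception {
    Path basePath = new Path(args[0]);               // e.g. hdfs:///user/hive/warehouse/my_table
    FileSystem fs = basePath.getFileSystem(new Configuration());
    String base = fs.makeQualified(basePath).toUri().getPath();

    Set<String> partitions = new TreeSet<>();
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(basePath, true);
    while (files.hasNext()) {
      LocatedFileStatus status = files.next();
      if (!status.getPath().getName().endsWith(".parquet")) {
        continue;                                    // only consider data files
      }
      String parent = status.getPath().getParent().toUri().getPath();
      // empty string means a non-partitioned table (files directly under the base path)
      partitions.add(parent.equals(base) ? "" : parent.substring(base.length() + 1));
    }
    partitions.forEach(System.out::println);
  }
}
```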

> Support deleting records with only record_key
> -
>
> Key: HUDI-326
> URL: https://issues.apache.org/jira/browse/HUDI-326
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Brandon Scheller
>Priority: Major
>
> Currently Hudi requires 3 things to issue a hard delete using 
> EmptyHoodieRecordPayload. It requires (record_key, partition_key, 
> precombine_key).
> This means that in many real use scenarios, you are required to issue a 
> select query to find the partition_key and possibly precombine_key for a 
> certain record before deleting it.
> We would like to avoid this extra step by being allowed to issue a delete 
> based on only the record_key of a record.
> This means that it would blanket delete all records with that specific 
> record_key across all partitions.





[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1001: [HUDI-325] Fix Hive partition error for updated HDFS Hudi table

2019-11-07 Thread GitBox
umehrot2 commented on a change in pull request #1001: [HUDI-325] Fix Hive 
partition error for updated HDFS Hudi table
URL: https://github.com/apache/incubator-hudi/pull/1001#discussion_r343943582
 
 

 ##
 File path: hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
 ##
 @@ -192,7 +192,7 @@ private String getPartitionClause(String partition) {
 String alterTable = "ALTER TABLE " + syncConfig.tableName;
 for (String partition : partitions) {
   String partitionClause = getPartitionClause(partition);
-  String fullPartitionPath = FSUtils.getPartitionPath(syncConfig.basePath, 
partition).toString();
+  String fullPartitionPath = FSUtils.getURIPath(fs, 
FSUtils.getPartitionPath(syncConfig.basePath, partition));
 
 Review comment:
   I would add a check on `fs.uri` here and do this only if it is `hdfs`, 
not in the case of other file systems. It also makes it explicitly clear why we 
are doing this.
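
(A sketch of the suggested guard, for illustration only. `FSUtils.getURIPath` is the helper introduced in this PR, so its exact behavior is not reproduced here; the qualified-path branch below is an assumption standing in for it.)

```java
import java.net.URI;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative shape of the check: only rewrite the partition location when the
// underlying filesystem is HDFS, and keep the plain toString() for everything else.
public class PartitionPathGuardSketch {

  static String fullPartitionPath(FileSystem fs, Path partitionPath) {
    URI fsUri = fs.getUri();
    if ("hdfs".equals(fsUri.getScheme())) {
      // Assumption: a fully qualified URI so the registered location matches what
      // Hive later sees for the data files (see the HUDI-325 stack trace).
      return fs.makeQualified(partitionPath).toUri().toString();
    }
    return partitionPath.toString();
  }
}
```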




[jira] [Created] (HUDI-326) Support deleting records with only record_key

2019-11-07 Thread Brandon Scheller (Jira)
Brandon Scheller created HUDI-326:
-

 Summary: Support deleting records with only record_key
 Key: HUDI-326
 URL: https://issues.apache.org/jira/browse/HUDI-326
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Brandon Scheller


Currently Hudi requires 3 things to issue a hard delete using 
EmptyHoodieRecordPayload. It requires (record_key, partition_key, 
precombine_key).

This means that in many real use scenarios, you are required to issue a select 
query to find the partition_key and possibly precombine_key for a certain 
record before deleting it.

We would like to avoid this extra step by being allowed to issue a delete based 
on only the record_key of a record.
This means that it would blanket delete all records with that specific 
record_key across all partitions.





[jira] [Updated] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table

2019-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-325:

Labels: pull-request-available  (was: )

> Unable to query by Hive after updating HDFS Hudi table
> --
>
> Key: HUDI-325
> URL: https://issues.apache.org/jira/browse/HUDI-325
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
>
> h3. Description
> While doing internal testing in EMR, we found that if the Hudi table path follows 
> this kind of format: hdfs:///user/... or hdfs:/user/..., then the Hudi table 
> cannot be queried by Hive after updating.
> h3. Reproduction
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_test"
> var tablePath = "hdfs:///user/hadoop/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option("hoodie.upsert.shuffle.parallelism", "2")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update hudi dataset
> val df2 = Seq(
>   (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
>   (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> df2.write.format("org.apache.hudi")
>.option("hoodie.upsert.shuffle.parallelism", "2")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Then do query in Hive:
> {code:java}
> select count(*) from hudi_test;
> {code}
> It returns: 
> {code:java}
> java.io.IOException: cannot find dir = 
> hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet
>  in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
> at 
> org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
> at 
> org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
> at 
> org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
> at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
> at 
> 

[GitHub] [incubator-hudi] zhedoubushishi opened a new pull request #1001: [HUDI-325] Fix Hive partition error for updated HDFS Hudi table

2019-11-07 Thread GitBox
zhedoubushishi opened a new pull request #1001: [HUDI-325] Fix Hive partition 
error for updated HDFS Hudi table
URL: https://github.com/apache/incubator-hudi/pull/1001
 
 
   Jira: https://jira.apache.org/jira/browse/HUDI-325




[jira] [Updated] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table

2019-11-07 Thread Wenning Ding (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding updated HUDI-325:
--
Description: 
h3. Description

While doing internal testing in EMR, we found that if the Hudi table path follows 
this kind of format: hdfs:///user/... or hdfs:/user/..., then the Hudi table 
cannot be queried by Hive after updating.
h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
  ).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
  ).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
   .option("hoodie.upsert.shuffle.parallelism", "2")
   .option(HoodieWriteConfig.TABLE_NAME, tableName)
   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor")
   .mode(SaveMode.Append)
   .save(tablePath)
{code}
Then do query in Hive:
{code:java}
select count(*) from hudi_test;
{code}
It returns: 
{code:java}
java.io.IOException: cannot find dir = 
hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet
 in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at 
org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at 
org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at 
org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at 
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at ...
{code}
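
For context on the stack trace above: the split path handed to Hive is fully 
qualified (hdfs://namenode:8020/...), while the entry in pathToPartitionInfo only 
carries the scheme (hdfs:/...), so the lookup in HiveFileFormatUtils never matches. 
The sketch below is illustrative only (the host name and table paths are made up, 
and the actual fix lives in PR #1001); it just shows the mismatch and one way to 
compare paths scheme- and authority-free with Hadoop's Path API.

{code:java}
import org.apache.hadoop.fs.Path;

public class PartitionPathMismatchDemo {
  public static void main(String[] args) {
    // Partition directory the way it ends up in pathToPartitionInfo: scheme, no authority.
    Path registered = new Path("hdfs:/user/hadoop/hudi_test/type1");

    // Parent directory of a split at query time: fully qualified with the namenode authority.
    Path splitDir = new Path("hdfs://namenode:8020/user/hadoop/hudi_test/type1");

    // Plain equality fails because the authority differs; this is essentially the
    // "cannot find dir ... in pathToPartitionInfo" failure in the stack trace above.
    System.out.println(splitDir.equals(registered)); // false

    // Stripping scheme and authority from both sides makes them comparable again.
    Path a = Path.getPathWithoutSchemeAndAuthority(registered);
    Path b = Path.getPathWithoutSchemeAndAuthority(splitDir);
    System.out.println(b.equals(a)); // true
  }
}
{code}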

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #996: Fixes to ensure MOR incr pull provides consistent results

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #996: Fixes to ensure MOR incr 
pull provides consistent results
URL: https://github.com/apache/incubator-hudi/pull/996#discussion_r343814975
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -58,19 +59,28 @@
 
   public static final SimpleDateFormat COMMIT_FORMATTER = new 
SimpleDateFormat("yyyyMMddHHmmss");
 
-  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new 
HashSet<>(Arrays.asList(new String[] {
+  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new 
HashSet<>(Arrays.asList(new String[]{
   COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, DELTA_COMMIT_EXTENSION, 
INFLIGHT_DELTA_COMMIT_EXTENSION,
   SAVEPOINT_EXTENSION, INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, 
INFLIGHT_CLEAN_EXTENSION,
   INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION}));
 
   private static final transient Logger log = 
LogManager.getLogger(HoodieActiveTimeline.class);
   protected HoodieTableMetaClient metaClient;
+  private static AtomicReference<String> lastInstantTime = new 
AtomicReference<>();
 
   /**
* Returns next commit time in the {@link #COMMIT_FORMATTER} format.
+   * Ensures each commit time is at least 1 second apart since we create COMMIT 
times at second granularity
*/
   public static String createNewCommitTime() {
-return HoodieActiveTimeline.COMMIT_FORMATTER.format(new Date());
+lastInstantTime.updateAndGet((oldVal) -> {
+  String newCommitTime;
+  do {
+newCommitTime = HoodieActiveTimeline.COMMIT_FORMATTER.format(new 
Date());
+  } while (oldVal == newCommitTime);
 
 Review comment:
   To be defensive, oldVal <= newCommitTime ? And sleep() to avoid busy-wait ? 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
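
Purely as an illustration of the two suggestions in the review comment above 
(compare with an ordering check rather than equality, and back off instead of 
busy-waiting), a standalone sketch follows. It is not the code that was merged, 
and the 100 ms sleep is an arbitrary choice.

{code:java}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicReference;

public class DefensiveCommitTimeSketch {

  // Second-granularity commit time format, mirroring the snippet under review.
  // (SimpleDateFormat is not thread-safe; kept here only for parity with that snippet.)
  private static final SimpleDateFormat COMMIT_FORMATTER = new SimpleDateFormat("yyyyMMddHHmmss");
  private static final AtomicReference<String> LAST_INSTANT_TIME = new AtomicReference<>();

  public static String createNewCommitTime() {
    return LAST_INSTANT_TIME.updateAndGet(oldVal -> {
      while (true) {
        String newCommitTime = COMMIT_FORMATTER.format(new Date());
        // Defensive ordering check: only accept a time strictly greater than the
        // last one handed out, which also covers a clock that briefly moves backwards.
        if (oldVal == null || newCommitTime.compareTo(oldVal) > 0) {
          return newCommitTime;
        }
        try {
          // Short sleep instead of a busy-wait until the next second ticks over.
          Thread.sleep(100L);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new RuntimeException("Interrupted while generating a new commit time", e);
        }
      }
    });
  }

  public static void main(String[] args) {
    System.out.println(createNewCommitTime());
    System.out.println(createNewCommitTime()); // always differs from the first
  }
}
{code}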


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #996: Fixes to ensure MOR incr pull provides consistent results

2019-11-07 Thread GitBox
bvaradar commented on a change in pull request #996: Fixes to ensure MOR incr 
pull provides consistent results
URL: https://github.com/apache/incubator-hudi/pull/996#discussion_r343815022
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -58,19 +59,28 @@
 
   public static final SimpleDateFormat COMMIT_FORMATTER = new 
SimpleDateFormat("yyyyMMddHHmmss");
 
-  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new 
HashSet<>(Arrays.asList(new String[] {
+  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new 
HashSet<>(Arrays.asList(new String[]{
   COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, DELTA_COMMIT_EXTENSION, 
INFLIGHT_DELTA_COMMIT_EXTENSION,
   SAVEPOINT_EXTENSION, INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, 
INFLIGHT_CLEAN_EXTENSION,
   INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION}));
 
   private static final transient Logger log = 
LogManager.getLogger(HoodieActiveTimeline.class);
   protected HoodieTableMetaClient metaClient;
+  private static AtomicReference<String> lastInstantTime = new 
AtomicReference<>();
 
   /**
* Returns next commit time in the {@link #COMMIT_FORMATTER} format.
+   * Ensures each commit time is at least 1 second apart since we create COMMIT 
times at second granularity
*/
   public static String createNewCommitTime() {
-return HoodieActiveTimeline.COMMIT_FORMATTER.format(new Date());
+lastInstantTime.updateAndGet((oldVal) -> {
 
 Review comment:
   Along with this change, can we also make the commit timestamp millisecond 
granularity? At a high level, I don't see any backwards-compatibility concern in 
moving to finer granularity, as long as all commit timestamps (for all actions) 
use the same granularity? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
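
On the millisecond-granularity idea floated above, the formatting side of such a 
change would just be a longer pattern. The sketch below is for illustration only; 
the millisecond pattern is hypothetical and not what Hudi currently ships, and the 
mixed-length caveat in the comments is exactly the compatibility question raised 
in the review.

{code:java}
import java.text.SimpleDateFormat;
import java.util.Date;

public class CommitTimeGranularitySketch {
  public static void main(String[] args) {
    Date now = new Date();

    // Current second-level pattern (14 digits) vs. a hypothetical millisecond-level one (17 digits).
    SimpleDateFormat seconds = new SimpleDateFormat("yyyyMMddHHmmss");
    SimpleDateFormat millis = new SimpleDateFormat("yyyyMMddHHmmssSSS");

    // Within a single pattern, lexicographic order still matches chronological order,
    // so timeline comparisons keep working; mixing 14- and 17-digit instants on the
    // same timeline is what would need care.
    System.out.println(seconds.format(now)); // e.g. 20191107221530
    System.out.println(millis.format(now));  // e.g. 20191107221530123
  }
}
{code}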


[GitHub] [incubator-hudi] vinothchandar merged pull request #1000: [HUDI-245]: replaced instances of getInstants() and reverse() with getReverseOrderedInstants()

2019-11-07 Thread GitBox
vinothchandar merged pull request #1000: [HUDI-245]: replaced instances of 
getInstants() and reverse() with getReverseOrderedInstants()
URL: https://github.com/apache/incubator-hudi/pull/1000
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-245]: replaced instances of getInstants() and reverse() with getReverseOrderedInstants() (#1000)

2019-11-07 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0863b1c  [HUDI-245]: replaced instances of getInstants() and reverse() 
with getReverseOrderedInstants() (#1000)
0863b1c is described below

commit 0863b1cfd947402c66221afa8d1f18fd2bd8273b
Author: pratyakshsharma <30863489+pratyakshsha...@users.noreply.github.com>
AuthorDate: Thu Nov 7 22:12:48 2019 +0530

[HUDI-245]: replaced instances of getInstants() and reverse() with 
getReverseOrderedInstants() (#1000)
---
 .../main/java/org/apache/hudi/cli/commands/CleansCommand.java  |  4 +---
 .../main/java/org/apache/hudi/cli/commands/CommitsCommand.java |  4 +---
 .../java/org/apache/hudi/cli/commands/CompactionCommand.java   |  4 +---
 .../java/org/apache/hudi/cli/commands/SavepointsCommand.java   |  4 +---
 .../src/main/java/org/apache/hudi/HoodieWriteClient.java   | 10 --
 5 files changed, 8 insertions(+), 18 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java
index 0143c54..1257b17 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java
@@ -20,7 +20,6 @@ package org.apache.hudi.cli.commands;
 
 import java.io.IOException;
 import java.util.ArrayList;
-import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
@@ -69,9 +68,8 @@ public class CleansCommand implements CommandMarker {
 
 HoodieActiveTimeline activeTimeline = 
HoodieCLI.tableMetadata.getActiveTimeline();
 HoodieTimeline timeline = 
activeTimeline.getCleanerTimeline().filterCompletedInstants();
-List<HoodieInstant> cleans = 
timeline.getInstants().collect(Collectors.toList());
+List<HoodieInstant> cleans = 
timeline.getReverseOrderedInstants().collect(Collectors.toList());
 List rows = new ArrayList<>();
-Collections.reverse(cleans);
 for (int i = 0; i < cleans.size(); i++) {
   HoodieInstant clean = cleans.get(i);
   HoodieCleanMetadata cleanMetadata =
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
index 448e995..ad94865 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
@@ -20,7 +20,6 @@ package org.apache.hudi.cli.commands;
 
 import java.io.IOException;
 import java.util.ArrayList;
-import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
@@ -80,9 +79,8 @@ public class CommitsCommand implements CommandMarker {
 
 HoodieActiveTimeline activeTimeline = 
HoodieCLI.tableMetadata.getActiveTimeline();
 HoodieTimeline timeline = 
activeTimeline.getCommitsTimeline().filterCompletedInstants();
-List<HoodieInstant> commits = 
timeline.getInstants().collect(Collectors.toList());
+List<HoodieInstant> commits = 
timeline.getReverseOrderedInstants().collect(Collectors.toList());
 List rows = new ArrayList<>();
-Collections.reverse(commits);
 for (int i = 0; i < commits.size(); i++) {
   HoodieInstant commit = commits.get(i);
   HoodieCommitMetadata commitMetadata =
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
index 86f3fe8..4a9f4b7 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
@@ -21,7 +21,6 @@ package org.apache.hudi.cli.commands;
 import java.io.IOException;
 import java.io.ObjectInputStream;
 import java.util.ArrayList;
-import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
@@ -89,9 +88,8 @@ public class CompactionCommand implements CommandMarker {
 HoodieTimeline commitTimeline = 
activeTimeline.getCommitTimeline().filterCompletedInstants();
 Set<String> committed = 
commitTimeline.getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toSet());
 
-List<HoodieInstant> instants = 
timeline.getInstants().collect(Collectors.toList());
+List<HoodieInstant> instants = 
timeline.getReverseOrderedInstants().collect(Collectors.toList());
 List rows = new ArrayList<>();
-Collections.reverse(instants);
 for (int i = 0; i < instants.size(); i++) {
   HoodieInstant instant = instants.get(i);
   HoodieCompactionPlan workload = null;
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SavepointsCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SavepointsCommand.java
index ae451ef..bcbaad8 100644
--- 
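
The commit above is a mechanical substitution of getInstants() plus 
Collections.reverse() with getReverseOrderedInstants(). A self-contained sketch of 
the pattern is shown below; plain strings stand in for HoodieInstant and the two 
static methods stand in for the timeline API, so none of the names here are the 
real Hudi signatures.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReverseOrderedInstantsSketch {

  // Stand-in for timeline.getInstants(): instants in ascending commit-time order.
  static Stream<String> getInstants() {
    return Stream.of("20191107100000", "20191107110000", "20191107120000");
  }

  // Stand-in for timeline.getReverseOrderedInstants(): the same instants, newest first.
  static Stream<String> getReverseOrderedInstants() {
    return getInstants().sorted(Comparator.reverseOrder());
  }

  public static void main(String[] args) {
    // Before the refactor: collect in ascending order, then reverse in place.
    List<String> before = new ArrayList<>(getInstants().collect(Collectors.toList()));
    Collections.reverse(before);

    // After the refactor: ask for the reverse order directly.
    List<String> after = getReverseOrderedInstants().collect(Collectors.toList());

    System.out.println(before.equals(after)); // true: same ordering, less boilerplate
  }
}
{code}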

[GitHub] [incubator-hudi] bhasudha commented on issue #1000: [HUDI-245]: replaced instances of getInstants() and reverse() with getReverseOrderedInstants()

2019-11-07 Thread GitBox
bhasudha commented on issue #1000: [HUDI-245]: replaced instances of 
getInstants() and reverse() with getReverseOrderedInstants()
URL: https://github.com/apache/incubator-hudi/pull/1000#issuecomment-551068937
 
 
   LGTM. Thanks @pratyakshsharma 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties

2019-11-07 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969206#comment-16969206
 ] 

Pratyaksh Sharma commented on HUDI-114:
---

[~nishith29] yeah, I would like some more clarification before starting to work 
on it. Specifically, I want more context on why one may need to pass a new 
payload class name. It is already possible to configure it at run time using the 
HoodieDeltaStreamer.Config class.

Also, which datasource API are you talking about? 

> Allow for clients to overwrite the payload implementation in hoodie.properties
> --
>
> Key: HUDI-114
> URL: https://issues.apache.org/jira/browse/HUDI-114
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Reporter: Nishith Agarwal
>Assignee: Pratyaksh Sharma
>Priority: Minor
>
> Right now, once the payload class is set in hoodie.properties, it cannot 
> be changed. In some cases, if a code refactor is done and the jar is updated, 
> one may need to pass the new payload class name.
> Also, fix picking up the payload name for the datasource API. By default 
> HoodieAvroPayload is written, whereas for the datasource API the default is 
> OverwriteLatestAvroPayload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-245) Refactor code references that call HoodieTimeline.getInstants() and reverse to directly use method HoodieTimeline.getReverseOrderedInstants

2019-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-245:

Labels: pull-request-available  (was: )

> Refactor code references that call HoodieTimeline.getInstants() and reverse 
> to directly use method HoodieTimeline.getReverseOrderedInstants 
> 
>
> Key: HUDI-245
> URL: https://issues.apache.org/jira/browse/HUDI-245
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: Bhavani Sudha Saktheeswaran
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma opened a new pull request #1000: [HUDI-245]: replaced instances of getInstants() and reverse() with getReverseOrderedInstants()

2019-11-07 Thread GitBox
pratyakshsharma opened a new pull request #1000: [HUDI-245]: replaced instances 
of getInstants() and reverse() with getReverseOrderedInstants()
URL: https://github.com/apache/incubator-hudi/pull/1000
 
 
   Small code refactoring to remove redundant code.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services