[jira] [Created] (HUDI-762) change the pom.xml to supportmaven version to 3.x

2020-04-05 Thread yaojingyi (Jira)
yaojingyi created HUDI-762:
--

 Summary: change the pom.xml to supportmaven version to 3.x
 Key: HUDI-762
 URL: https://issues.apache.org/jira/browse/HUDI-762
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: yaojingyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy 
default values of fields if not present when rewriting incoming record with new 
schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r403715629
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java
 ##
 @@ -214,6 +218,21 @@ private static GenericRecord rewrite(GenericRecord 
record, Schema schemaWithFiel
 return newRecord;
   }
 
+  /*
+  This function takes the union of all the fields except hoodie metadata fields
+   */
 +  private static List<Schema.Field> getAllFieldsToWrite(Schema oldSchema, Schema newSchema) {
 +    Set<Schema.Field> allFields = new HashSet<>(oldSchema.getFields());
 +    List<Schema.Field> fields = new ArrayList<>(oldSchema.getFields());
 +    for (Schema.Field f : newSchema.getFields()) {
 +      if (!allFields.contains(f) && !isMetadataField(f.name())) {
 +        fields.add(f);
 +      }
 +    }
+
 
 Review comment:
   fixed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy 
default values of fields if not present when rewriting incoming record with new 
schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r403715964
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java
 ##
 @@ -214,6 +218,21 @@ private static GenericRecord rewrite(GenericRecord 
record, Schema schemaWithFiel
 return newRecord;
   }
 
+  /*
+  This function takes the union of all the fields except hoodie metadata fields
+   */
 +  private static List<Schema.Field> getAllFieldsToWrite(Schema oldSchema, Schema newSchema) {
 
 Review comment:
   Actually, here it is a union of the old and new fields; that is why I kept this 
name. If you feel strongly about changing the name, let me know. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] jvaesteves opened a new issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary

2020-04-05 Thread GitBox
jvaesteves opened a new issue #1488: [SUPPORT] Hudi table has only five rows 
when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488
 
 
   I was trying Hudi on some ORC backup files from my Kafka broker, to see if 
it would be a nice deduplication process for the messages.
   
   The source has 16767835 rows with 4049589 unique keys, but when I first 
create the Hudi table from it, it contains only 5 rows. I tried casting the key 
to string and that worked, dropping the duplicate keys, but I want to know 
whether this cast is a requirement for Hudi to work properly.
   
   Here is a sample from the data, and the snippet that I used.
   
   ```scala
   import org.apache.spark.sql.SaveMode
   import org.apache.spark.sql.functions._
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   
   //Set up various input values as variables
   val inputDataPath = ""
   val hudiTableName = "test"
   val hudiTablePath = "/tmp/kafka-hudi"
   
   // Set up our Hudi Data Source Options
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "key",
   DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "dt", 
   HoodieWriteConfig.TABLE_NAME -> hudiTableName,
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp"
   )
   
   // Read data from S3 and create a DataFrame with Partition and Record Key
   val inputDF = spark.read.orc(inputDataPath).withColumn("dt", 
to_date($"timestamp"))
   
   // Write data into the Hudi dataset
   
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiTablePath)
   ```
   ```
   
+-+---+--+
   |key  
|timestamp  |dt|
   
+-+---+--+
   |[38 33 39 36 35 34 32 38 30 7C 33 32 7C 31 31 37 36 31 35 37 34 35 32 
37]|2019-11-28 13:24:07.792|2019-11-28|
   |[38 34 34 39 34 33 34 37 32 7C 34 34 7C 31 31 37 36 33 37 39 31 34 36 
38]|2019-11-28 13:24:07.792|2019-11-28|
   |[38 34 32 37 39 38 34 37 38 7C 32 7C 31 31 38 33 31 39 33 37 30 31 35]   
|2019-11-28 13:24:07.793|2019-11-28|
   |[38 36 36 32 30 31 37 33 36 7C 32 7C 31 31 38 33 32 37 30 39 34 38 35]   
|2019-11-28 13:24:07.793|2019-11-28|
   |[38 36 34 34 30 33 36 36 38 7C 31 7C 31 31 37 37 30 37 37 35 30 39 39]   
|2019-11-28 13:24:07.798|2019-11-28|
   |[38 36 34 34 32 31 33 38 33 7C 31 7C 31 31 37 37 31 30 36 38 30 30 30]   
|2019-11-28 13:24:07.798|2019-11-28|
   |[38 36 33 33 38 38 32 34 32 7C 35 7C 31 31 37 39 34 37 36 39 31 38 38]   
|2019-11-28 13:24:07.809|2019-11-28|
   |[38 35 34 34 35 31 31 31 31 7C 33 7C 31 31 38 33 32 30 31 33 33 32 31]   
|2019-11-28 13:24:07.81 |2019-11-28|
   |[38 36 35 36 37 38 38 33 35 7C 31 7C 31 31 37 39 33 34 36 38 31 31 32]   
|2019-11-28 13:24:07.823|2019-11-28|
   |[38 35 34 38 34 39 38 30 36 7C 32 7C 31 31 38 33 32 32 35 32 38 38 36]   
|2019-11-28 13:24:07.823|2019-11-28|
   |[36 39 35 36 32 37 32 32 33 7C 38 39 7C 31 31 37 36 37 39 30 32 31 38 
38]|2019-11-28 13:24:05.651|2019-11-28|
   |[38 36 36 33 31 30 39 36 36 7C 31 7C 31 31 38 30 38 38 36 31 31 37 33]   
|2019-11-28 13:24:05.653|2019-11-28|
   |[37 39 39 30 35 32 31 36 31 7C 34 7C 31 31 37 36 37 37 30 35 30 32 33]   
|2019-11-28 13:24:05.653|2019-11-28|
   |[36 39 35 36 31 38 32 36 38 7C 38 36 7C 31 31 37 36 37 38 39 30 37 31 
37]|2019-11-28 13:24:05.655|2019-11-28|
   |[38 33 35 33 34 34 38 38 38 7C 35 39 7C 31 31 37 36 31 36 37 30 37 35 
30]|2019-11-28 13:24:05.949|2019-11-28|
   |[38 36 34 38 38 39 31 33 39 7C 31 7C 31 31 37 38 30 30 35 30 35 31 33]   
|2019-11-28 13:24:05.951|2019-11-28|
   |[38 36 33 36 38 37 34 33 37 7C 31 7C 31 31 37 35 31 32 37 35 33 31 34]   
|2019-11-28 13:24:05.951|2019-11-28|
   |[38 34 36 35 36 30 33 39 33 7C 34 32 7C 31 31 37 36 31 33 30 30 38 36 
37]|2019-11-28 13:24:05.952|2019-11-28|
   |[38 36 34 34 35 39 38 35 33 7C 31 7C 31 31 37 37 31 35 32 30 39 37 39]   
|2019-11-28 13:24:05.952|2019-11-28|
   |[38 36 30 39 36 33 37 37 33 7C 31 7C 31 31 36 38 38 30 33 36 33 33 30]   
|2019-11-28 13:24:05.952|2019-11-28|
   
+-+---+--+
   ```
   
   Also, if I use the **dt** column as the **PARTITIONPATH_FIELD_OPT_KEY** and 
then `ls` the output directory, the partition name is /18228. Is this the 
expected behaviour?
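   
   For reference, a minimal Java sketch of the cast-to-string workaround mentioned above, assuming the same `key`/`timestamp`/`dt` columns (Java used here only for illustration; it mirrors the Scala snippet):
   
   ```java
   import static org.apache.spark.sql.functions.col;
   import static org.apache.spark.sql.functions.to_date;
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   class CastBinaryKeySketch {
     // Cast the binary Kafka key to a string column before handing the frame to
     // Hudi, so the record key is compared as text rather than as a byte array.
     static Dataset<Row> prepare(SparkSession spark, String inputDataPath) {
       return spark.read().orc(inputDataPath)
           .withColumn("key", col("key").cast("string"))
           .withColumn("dt", to_date(col("timestamp")));
     }
   }
   ```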
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary

2020-04-05 Thread GitBox
lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows 
when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488#issuecomment-609446861
 
 
   Hi, trying to reproduce it. What is the original type of the key?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
lamber-ken commented on issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487#issuecomment-609472838
 
 
   Hi, it works fine in my local env. Steps:
   
   1. Add the `spark-hive` dependency
   ```
   
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-hive_${scala.binary.version}</artifactId>
   <version>2.4.4</version>
   <scope>compile</scope>
 </dependency>
   
   SparkSession spark = SparkSession
   .builder()
   .appName("FeatureExtractor")
   .config("spark.master", "local")
   .config("spark.sql.hive.convertMetastoreParquet", false)
   .config("spark.submit.deployMode", "client")
   .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.4")
   .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
   .config("hive.metastore.uris", "thrift://hivemetastore:9083")
   .enableHiveSupport()
   .getOrCreate();
   
   spark.sql("select * from stock_ticks_cow").show(100);
   ```
   
   2. Build the fat jar
   ```
   mvn package
   ```
   
   3. Copy it to the `adhoc-1` docker container
   ```
   docker cp test.jar  adhoc-1:/opt/spark
   ```
   
   4. Run test.jar
   ```
   bin/spark-submit \
   --class com.TestExample \
   --executor-memory 1G \
   --total-executor-cores 2 \
   test.jar
   ```
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] jvaesteves commented on issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary

2020-04-05 Thread GitBox
jvaesteves commented on issue #1488: [SUPPORT] Hudi table has only five rows 
when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488#issuecomment-609461079
 
 
   Binary (array of bytes)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1486: WIP[HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-05 Thread GitBox
pratyakshsharma commented on issue #1486: WIP[HUDI-759] Integrate checkpoint 
provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-609480755
 
 
   LGTM


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r403746299
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -155,6 +181,19 @@ public static TestRawTripPayload generateRandomValue(
 return new TestRawTripPayload(rec.toString(), key.getRecordKey(), 
key.getPartitionPath(), TRIP_EXAMPLE_SCHEMA);
   }
 
+  /**
+   * Generates a new avro record with TRIP_UBER_EXAMPLE_SCHEMA, retaining the 
key if optionally provided.
+   */
+  public TestRawTripPayload generatePayloadForUberSchema(HoodieKey key, String 
commitTime) throws IOException {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r403746251
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -208,6 +247,31 @@ public static GenericRecord generateGenericRecord(String 
rowKey, String riderNam
 return rec;
   }
 
+  /*
+  Generate random record using TRIP_UBER_EXAMPLE_SCHEMA
+   */
+  public GenericRecord generateRecordForUberSchema(String rowKey, String 
riderName, String driverName, double timestamp) {
 
 Review comment:
   Done. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r403746266
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -155,6 +181,19 @@ public static TestRawTripPayload generateRandomValue(
 return new TestRawTripPayload(rec.toString(), key.getRecordKey(), 
key.getPartitionPath(), TRIP_EXAMPLE_SCHEMA);
   }
 
+  /**
+   * Generates a new avro record with TRIP_UBER_EXAMPLE_SCHEMA, retaining the 
key if optionally provided.
+   */
 +  public TestRawTripPayload generatePayloadForUberSchema(HoodieKey key, String commitTime) throws IOException {
 +    GenericRecord rec = generateRecordForUberSchema(key.getRecordKey(), "rider-" + commitTime, "driver-" + commitTime, 0.0);
 +    return new TestRawTripPayload(rec.toString(), key.getRecordKey(), key.getPartitionPath(), TRIP_UBER_SCHEMA);
 +  }
 +
 +  public TestRawTripPayload generatePayloadForFgSchema(HoodieKey key, String commitTime) throws IOException {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy 
default values of fields if not present when rewriting incoming record with new 
schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r403739002
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java
 ##
 @@ -191,21 +191,25 @@ public static GenericRecord 
addCommitMetadataToRecord(GenericRecord record, Stri
* schema.
*/
   public static GenericRecord rewriteRecord(GenericRecord record, Schema newSchema) {
 -    return rewrite(record, record.getSchema(), newSchema);
 +    return rewrite(record, getAllFieldsToWrite(record.getSchema(), newSchema), newSchema);
   }
 
   /**
    * Given a avro record with a given schema, rewrites it into the new schema while setting fields only from the new
    * schema.
    */
   public static GenericRecord rewriteRecordWithOnlyNewSchemaFields(GenericRecord record, Schema newSchema) {
 -    return rewrite(record, newSchema, newSchema);
 +    return rewrite(record, newSchema.getFields(), newSchema);
   }
 
 -  private static GenericRecord rewrite(GenericRecord record, Schema schemaWithFields, Schema newSchema) {
 +  private static GenericRecord rewrite(GenericRecord record, List<Schema.Field> fieldsToWrite, Schema newSchema) {
     GenericRecord newRecord = new GenericData.Record(newSchema);
 -    for (Schema.Field f : schemaWithFields.getFields()) {
 -      newRecord.put(f.name(), record.get(f.name()));
 +    for (Schema.Field f : fieldsToWrite) {
 +      if (record.get(f.name()) == null) {
 
 Review comment:
   Actually, in Avro the actual field value is maintained by the 
GenericData.Record class, while the defaultValue is maintained by Schema.Field. 
So I guess Avro expects users to fetch them separately. 
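   
   For context, a minimal standalone sketch of the point above (plain Avro, not Hudi code; the schema and names are made up for illustration): the value put into a record and the default declared on its field are fetched through different APIs.
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaBuilder;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   
   public class AvroDefaultValueSketch {
     public static void main(String[] args) {
       // "flag" declares a default of false in the schema.
       Schema schema = SchemaBuilder.record("Example").fields()
           .requiredString("id")
           .name("flag").type().booleanType().booleanDefault(false)
           .endRecord();
   
       GenericRecord rec = new GenericData.Record(schema);
       rec.put("id", "abc");
   
       // The record only knows what was put(); "flag" comes back null here.
       System.out.println(rec.get("flag"));
   
       // The declared default lives on Schema.Field and is fetched separately.
       Schema.Field flag = schema.getField("flag");
       System.out.println(GenericData.get().getDefaultValue(flag));
     }
   }
   ```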


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r403740566
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -84,26 +87,35 @@
   + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},";
   public static final String FARE_FLATTENED_SCHEMA = "{\"name\": \"fare\", 
\"type\": \"double\"},"
   + "{\"name\": \"currency\", \"type\": \"string\"},";
-
   public static final String TRIP_EXAMPLE_SCHEMA =
   TRIP_SCHEMA_PREFIX + FARE_NESTED_SCHEMA + TRIP_SCHEMA_SUFFIX;
   public static final String TRIP_FLATTENED_SCHEMA =
   TRIP_SCHEMA_PREFIX + FARE_FLATTENED_SCHEMA + TRIP_SCHEMA_SUFFIX;
 
+  public static final String TRIP_UBER_SCHEMA = 
"{\"type\":\"record\",\"name\":\"tripUberRec\",\"fields\":["
 
 Review comment:
   Sure. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

2020-04-05 Thread GitBox
xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema 
Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-609466504
 
 
   @vinothchandar Yes, the exporter tool can be used for this purpose, with some 
changes. It currently supports copying a Hudi dataset as is. For this migration 
use case, we could extend the feature to apply a transformation when 
`--output-format hudi`, using a custom `Transformer`.
   
   Though MOR is a bit troublesome because of log file conversions, we could start 
with COW table support? Does this work for your case? @symfrog 
   
   The splitting/merging use cases are feasible as well; the exporter would need 
some more logic to take multiple source/target paths, plus some effort to support 
multiple datasets in the `Transformer` interface.
   
   @vinothchandar Are my thoughts above aligned with yours?
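   
   For illustration, a rough sketch of what such a custom `Transformer` could look like; the interface signature is assumed from hudi-utilities of that time, and `legacy_col` is a made-up column dropped by the new, backwards-incompatible schema:
   
   ```java
   import org.apache.hudi.common.util.TypedProperties;
   import org.apache.hudi.utilities.transform.Transformer;
   
   import org.apache.spark.api.java.JavaSparkContext;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   public class DropLegacyColumnTransformer implements Transformer {
   
     @Override
     public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
         Dataset<Row> rowDataset, TypedProperties properties) {
       // Rewrite each exported row set into the new shape, e.g. by dropping a
       // column that the new schema removed.
       return rowDataset.drop("legacy_col");
     }
   }
   ```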


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

2020-04-05 Thread GitBox
symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema 
Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-609489074
 
 
   @xushiyan Yes, thanks, that would work. I am using COW for the tables.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-05 Thread GitBox
garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint provider 
with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-609493045
 
 
   Test added. Thanks for the review


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 commented on issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
malanb5 commented on issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487#issuecomment-609493874
 
 
   I was running this directly through the JVM rather than through the 
spark-submit script, which is what loads the Spark classes.  Thanks for the help 
again @lamber-ken 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 closed issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
malanb5 closed issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r403740013
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,269 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List tableExecutionContexts;
+  private transient JavaSparkContext jssc;
+  private Set successTables;
+  private Set failedTables;
+
 +  public HoodieMultiTableDeltaStreamer(Config config, JavaSparkContext jssc) throws IOException {
 +    this.tableExecutionContexts = new ArrayList<>();
 +    this.successTables = new HashSet<>();
 +    this.failedTables = new HashSet<>();
 +    this.jssc = jssc;
 +    String commonPropsFile = config.propsFilePath;
 +    String configFolder = config.configFolder;
 +    FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
 +    configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? configFolder.substring(0, configFolder.length() - 1) : configFolder;
 +    checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
 +    TypedProperties properties = UtilHelpers.readConfig(fs, new Path(commonPropsFile), new ArrayList<>()).getConfig();
 +    //get the tables to be ingested and their corresponding config files from this properties instance
 +    populateTableExecutionContextList(properties, configFolder, fs, config);
 +  }
 +
 +  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, String configFolder, FileSystem fs) throws IOException {
 +    if (!fs.exists(new Path(commonPropsFile))) {
 +      throw new IllegalArgumentException("Please provide valid common config file path!");
 +    }
 +
 +    if (!fs.exists(new Path(configFolder))) {
 +      fs.mkdirs(new Path(configFolder));
 +    }
 +  }
 +
 +  private void checkIfTableConfigFileExists(String configFolder, FileSystem fs, String configFilePath) throws IOException {
 +    if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new Path(configFilePath))) {
 +      throw new IllegalArgumentException("Please provide valid table config file path!");
 +    }
 +
 +    Path path = new Path(configFilePath);
 +    Path filePathInConfigFolder = new Path(configFolder, path.getName());
 +    if (!fs.exists(filePathInConfigFolder)) {
 +      FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
 +    }
 +  }
 +
 +  //commonProps are passed as parameter which contain table to config file mapping
 +  private void populateTableExecutionContextList(TypedProperties properties, String configFolder, FileSystem fs, Config config) throws IOException {
 +    List tablesToBeIngested = getTablesToBeIngested(properties);
 +    logger.info("tables to be ingested via MultiTableDeltaStreamer : " + tablesToBeIngested);
 +

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-04-05 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r403746232
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -208,6 +247,31 @@ public static GenericRecord generateGenericRecord(String 
rowKey, String riderNam
 return rec;
   }
 
+  /*
+  Generate random record using TRIP_UBER_EXAMPLE_SCHEMA
+   */
 +  public GenericRecord generateRecordForUberSchema(String rowKey, String riderName, String driverName, double timestamp) {
 +    GenericRecord rec = new GenericData.Record(AVRO_TRIP_SCHEMA);
 +    rec.put("_row_key", rowKey);
 +    rec.put("timestamp", timestamp);
 +    rec.put("rider", riderName);
 +    rec.put("driver", driverName);
 +    rec.put("fare", RAND.nextDouble() * 100);
 +    rec.put("_hoodie_is_deleted", false);
 +    return rec;
 +  }
 +
 +  public GenericRecord generateRecordForFgSchema(String rowKey, String riderName, String driverName, double timestamp) {
 Review comment:
   Done. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 commented on issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
malanb5 commented on issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487#issuecomment-609480782
 
 
   @lamber-ken Thank you for the help.  I updated the version of Hive.  Now I'm 
getting the following:
   
   ```
   Exception in thread "main" java.lang.IllegalArgumentException: Unable to 
instantiate SparkSession with Hive support because Hive classes are not found.
   at 
org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:869)
   at 
com.github.malanb5.mlpipelines.FeatureExtractor.main(FeatureExtractor.java:59
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 opened a new issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
malanb5 opened a new issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487
 
 
   Receiving the following exception when querying a Hive table through a 
SparkSession; the table was set up via the steps outlined in the demo: 
   
   https://hudi.apache.org/docs/docker_demo.html  
   
**Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
hdfs**
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.  Insert data following the steps outlined in the demo.
   2.  Start a SparkSession using the following:
   ```
   private static final String HUDI_SPARK_BUNDLE_JAR_FP = 
"/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar";
   private static final String HADOOP_CONF_DIR = "/etc/hadoop";
   private static final String STOCK_TICKS_COW="stock_ticks_cow";
   
   SparkSession spark = SparkSession
   .builder()
   .appName("FeatureExtractor")
   .config("spark.master", "local")
   .config("spark.jars", 
FeatureExtractor.HUDI_SPARK_BUNDLE_JAR_FP)
   .config("spark.driver.extraClassPath", 
FeatureExtractor.HADOOP_CONF_DIR)
   .config("spark.sql.hive.convertMetastoreParquet", false)
   .config("spark.submit.deployMode", "client")
   .config("spark.jars.packages", 
"org.apache.spark:spark-avro_2.11:2.4.4")
   .config("spark.sql.warehouse.dir", "/user/hive/warehouse")   
   // on HDFS
   .config("hive.metastore.uris", 
"thrift://hivemetastore:9083")// hive metastore uri
   .enableHiveSupport()
   .getOrCreate();
   ```
   3.  Make a query and try to display the results: 
   ```
   Dataset<Row> all_stock_cow = spark.sql(String.format("select * from %s", 
STOCK_TICKS_COW));
   all_stock_cow.show();
   ```
   **Expected behavior**
   The contents of the table to be rendered.  The filesystem to recognize the 
scheme hdfs.
   
   **Environment Description**
   
   * Hudi version : 0.5.2-incubating
   
   * Spark version : 2.4.5
   
   * Hive version : 2.4.1
   
   * Hadoop version : 2.7.3
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : yes
   
   **Additional context**
   Maven pom file:
   ```
   
   
   <build>
       <plugins>
           <plugin>
               <artifactId>maven-assembly-plugin</artifactId>
               <configuration>
                   <archive>
                       <manifest>
                           <!-- a boolean manifest option (value: true) whose tag name did not survive the archive -->
                           <mainClass>com.mlpipelines.FeatureExtractor</mainClass>
                       </manifest>
                   </archive>
                   <descriptorRefs>
                       <descriptorRef>jar-with-dependencies</descriptorRef>
                   </descriptorRefs>
               </configuration>
           </plugin>
           <plugin>
               <groupId>org.apache.maven.plugins</groupId>
               <artifactId>maven-jar-plugin</artifactId>
           </plugin>
       </plugins>
   </build>

   <dependencies>
       <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-hdfs</artifactId>
           <version>2.7.3</version>
           <scope>provided</scope>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-sql_2.11</artifactId>
           <version>2.4.5</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-mllib_2.11</artifactId>
           <version>2.4.5</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-hive_2.11</artifactId>
           <version>2.4.1</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hudi</groupId>
           <artifactId>hudi-hive</artifactId>
           <version>0.5.2-incubating</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hudi</groupId>
           <artifactId>hudi-spark_2.12</artifactId>
           <version>0.5.2-incubating</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hudi</groupId>
           <artifactId>hudi</artifactId>
           <version>0.5.2-incubating</version>
           <type>pom</type>
       </dependency>
   </dependencies>
   
   
   ```
   
   **Stacktrace**
   
   ```
   Exception in thread "main" java.io.IOException: No FileSystem for scheme: 
hdfs
   at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
   at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:98)
   at 
org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:58)
   at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:73)
   at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
   

[jira] [Updated] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-04-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-724:
---
Fix Version/s: 0.6.0

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 40m
>  Remaining Estimate: 47h 20m
>
> When writing data, a gap was observed between Spark stages. Tracking down 
> where the time was spent on the Spark driver showed it was the get-small-files 
> operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a plain for-loop to get the list of small files for all the partitions 
> that the job is going to write data to, and the process is very slow when 
> there are a lot of partitions to go through. While that operation runs on the 
> Spark driver process, all the worker nodes sit idle waiting for tasks.
> The partitions don't affect each other, so the get-small-files operations can 
> be parallelized. The change I made is to pass the JavaSparkContext to the 
> UpsertPartitioner, create an RDD over the partitions, and thereby spread the 
> get-small-files operations across multiple tasks.
>  
> Screenshots attached:
> the gap without the improvement
> the Spark stage with the improvement (no gap)
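
A rough sketch of the idea described above (hypothetical helper names, not the actual patch): fan the per-partition lookups out as Spark tasks instead of looping on the driver.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

class ParallelSmallFilesSketch {

  // Hypothetical per-partition lookup, standing in for UpsertPartitioner#getSmallFiles.
  static List<String> getSmallFiles(String partitionPath) {
    return Collections.emptyList();
  }

  // Distribute the lookups as Spark tasks and collect the results on the driver.
  static Map<String, List<String>> smallFilesByPartition(JavaSparkContext jsc, List<String> partitionPaths) {
    int parallelism = Math.max(1, Math.min(partitionPaths.size(), 100));
    return jsc.parallelize(partitionPaths, parallelism)
        .mapToPair(p -> new Tuple2<>(p, getSmallFiles(p)))
        .collectAsMap();
  }
}
{code}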



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-04-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-742:
---
Fix Version/s: 0.6.0

> Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> -
>
> Key: HUDI-742
> URL: https://issues.apache.org/jira/browse/HUDI-742
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: lamber-ken
>Assignee: edwinguo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
> {code:java}
> at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
> at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ... 49 elided
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
> stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
> java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> at 
> org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
> at 
> 

[jira] [Updated] (HUDI-762) modify the pom.xml to support maven 3.x

2020-04-05 Thread yaojingyi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yaojingyi updated HUDI-762:
---
Summary: modify the pom.xml to support maven 3.x  (was: change the pom.xml 
to supportmaven version to 3.x)

> modify the pom.xml to support maven 3.x
> ---
>
> Key: HUDI-762
> URL: https://issues.apache.org/jira/browse/HUDI-762
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: yaojingyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-713) Datasource Writer throws error on resolving array of struct fields

2020-04-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-713:
---
Status: Open  (was: New)

> Datasource Writer throws error on resolving array of struct fields
> --
>
> Key: HUDI-713
> URL: https://issues.apache.org/jira/browse/HUDI-713
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Similar to [https://issues.apache.org/jira/browse/HUDI-530]. With migration 
> of Hudi to spark 2.4.4 and using Spark's native spark-avro module, this issue 
> now exists in Hudi master.
> Reproduce steps:
> Run following script
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> import spark.implicits._
> val sample = """
> [{
>   "partition": 0,
>   "offset": 5,
>   "timestamp": "1581508884",
>   "value": {
> "prop1": "val1",
> "prop2": [{"withinProp1": "val2", "withinProp2": 1}]
>   }
> }, {
>   "partition": 1,
>   "offset": 10,
>   "timestamp": "1581108884",
>   "value": {
> "prop1": "val4",
> "prop2": [{"withinProp1": "val5", "withinProp2": 2}]
>   }
> }]
> """
> val df = spark.read.option("dropFieldIfAllNull", 
> "true").json(Seq(sample).toDS)
> val dfcol1 = df.withColumn("op_ts", from_unixtime(col("timestamp")))
> val dfcol2 = dfcol1.withColumn("year_partition", 
> year(col("op_ts"))).withColumn("id", concat($"partition", lit("-"), 
> $"offset"))
> val dfcol3 = dfcol2.drop("timestamp")
> val hudiOptions: Map[String, String] =
> Map[String, String](
> DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "test",
> DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL,
> DataSourceWriteOptions.OPERATION_OPT_KEY -> 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
> DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "op_ts",
> DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
> DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
> classOf[MultiPartKeysValueExtractor].getName,
> "hoodie.parquet.max.file.size" -> String.valueOf(1024 * 1024 * 
> 1024),
> "hoodie.parquet.compression.ratio" -> String.valueOf(0.5),
> "hoodie.insert.shuffle.parallelism" -> String.valueOf(2)
>   )
> dfcol3.write.format("org.apache.hudi")
>   .options(hudiOptions)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "year_partition")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "year_partition")
>   .option(HoodieWriteConfig.TABLE_NAME, "AWS_TEST")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "AWS_TEST")
>   .mode(SaveMode.Append).save("s3://xxx/AWS_TEST/")
> {code}
> Will throw not in union exception:
> {code:java}
> Caused by: org.apache.avro.UnresolvedUnionException: Not in union 
> [{"type":"record","name":"prop2","namespace":"hoodie.AWS_TEST.AWS_TEST_record.value","fields":[{"name":"withinProp1","type":["string","null"]},{"name":"withinProp2","type":["long","null"]}]},"null"]:
>  {"withinProp1": "val2", "withinProp2": 1}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-713) Datasource Writer throws error on resolving array of struct fields

2020-04-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-713.

Fix Version/s: 0.6.0
   Resolution: Fixed

Fixed via master: ce0a4c64d07d6eea926d1bfb92b69ae387b88f50

> Datasource Writer throws error on resolving array of struct fields
> --
>
> Key: HUDI-713
> URL: https://issues.apache.org/jira/browse/HUDI-713
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Similar to [https://issues.apache.org/jira/browse/HUDI-530]. With migration 
> of Hudi to spark 2.4.4 and using Spark's native spark-avro module, this issue 
> now exists in Hudi master.
> Reproduce steps:
> Run following script
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> import spark.implicits._
> val sample = """
> [{
>   "partition": 0,
>   "offset": 5,
>   "timestamp": "1581508884",
>   "value": {
> "prop1": "val1",
> "prop2": [{"withinProp1": "val2", "withinProp2": 1}]
>   }
> }, {
>   "partition": 1,
>   "offset": 10,
>   "timestamp": "1581108884",
>   "value": {
> "prop1": "val4",
> "prop2": [{"withinProp1": "val5", "withinProp2": 2}]
>   }
> }]
> """
> val df = spark.read.option("dropFieldIfAllNull", 
> "true").json(Seq(sample).toDS)
> val dfcol1 = df.withColumn("op_ts", from_unixtime(col("timestamp")))
> val dfcol2 = dfcol1.withColumn("year_partition", 
> year(col("op_ts"))).withColumn("id", concat($"partition", lit("-"), 
> $"offset"))
> val dfcol3 = dfcol2.drop("timestamp")
> val hudiOptions: Map[String, String] =
> Map[String, String](
> DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "test",
> DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL,
> DataSourceWriteOptions.OPERATION_OPT_KEY -> 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
> DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "op_ts",
> DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
> DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
> classOf[MultiPartKeysValueExtractor].getName,
> "hoodie.parquet.max.file.size" -> String.valueOf(1024 * 1024 * 
> 1024),
> "hoodie.parquet.compression.ratio" -> String.valueOf(0.5),
> "hoodie.insert.shuffle.parallelism" -> String.valueOf(2)
>   )
> dfcol3.write.format("org.apache.hudi")
>   .options(hudiOptions)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "year_partition")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "year_partition")
>   .option(HoodieWriteConfig.TABLE_NAME, "AWS_TEST")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "AWS_TEST")
>   .mode(SaveMode.Append).save("s3://xxx/AWS_TEST/")
> {code}
> Will throw not in union exception:
> {code:java}
> Caused by: org.apache.avro.UnresolvedUnionException: Not in union 
> [{"type":"record","name":"prop2","namespace":"hoodie.AWS_TEST.AWS_TEST_record.value","fields":[{"name":"withinProp1","type":["string","null"]},{"name":"withinProp2","type":["long","null"]}]},"null"]:
>  {"withinProp1": "val2", "withinProp2": 1}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-762) modify the pom.xml to support maven 3.x

2020-04-05 Thread yaojingyi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yaojingyi updated HUDI-762:
---
Priority: Trivial  (was: Major)

> modify the pom.xml to support maven 3.x
> ---
>
> Key: HUDI-762
> URL: https://issues.apache.org/jira/browse/HUDI-762
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: yaojingyi
>Priority: Trivial
>
> I met
> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
> [WARNING] 'artifactId' contains an expression but should be a constant. @ 
> org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 
> /Users/yaojingyi/Documents/workspace_root/workspace_idea_01/incubator-hudi/hudi-spark/pom.xml,
>  line 26, column 15
>  
> when execute 
> {{mvn clean package -DskipTests -DskipITs}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-762) modify the pom.xml to support maven 3.x

2020-04-05 Thread yaojingyi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yaojingyi updated HUDI-762:
---
Description: 
I met

[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 
/Users/yaojingyi/Documents/workspace_root/workspace_idea_01/incubator-hudi/hudi-spark/pom.xml,
 line 26, column 15

 

when execute 

{{mvn clean package -DskipTests -DskipITs}}

> modify the pom.xml to support maven 3.x
> ---
>
> Key: HUDI-762
> URL: https://issues.apache.org/jira/browse/HUDI-762
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: yaojingyi
>Priority: Major
>
> I met
> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
> [WARNING] 'artifactId' contains an expression but should be a constant. @ 
> org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 
> /Users/yaojingyi/Documents/workspace_root/workspace_idea_01/incubator-hudi/hudi-spark/pom.xml,
>  line 26, column 15
>  
> when execute 
> {{mvn clean package -DskipTests -DskipITs}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-756) Organize Cleaning Action execution into a single package in hudi-client

2020-04-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-756.

Resolution: Fixed

Fixed via master: eaf6cc2d90bf27c0d9414a4ea18dbd1b61f58e50

> Organize Cleaning Action execution into a single package in hudi-client
> ---
>
> Key: HUDI-756
> URL: https://issues.apache.org/jira/browse/HUDI-756
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-717) Fix HudiHiveClient for Hive 2.x

2020-04-05 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-717:
---
Fix Version/s: 0.6.0

> Fix HudiHiveClient for Hive 2.x
> ---
>
> Key: HUDI-717
> URL: https://issues.apache.org/jira/browse/HUDI-717
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>   Original Estimate: 4h
>  Time Spent: 20m
>  Remaining Estimate: 3h 40m
>
> When using the HiveDriver mode in HudiHiveClient, Hive 2.x DDL operations 
> like ALTER may fail. This is because Hive 2.x doesn't like `db`.`table_name` 
> for operations.
> There are two ways to fix this:
> 1. Precede all DDL statements with "USE <db_name>;"
> 2. Set the name of the database in the SessionState created for the Driver.
>  
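
A minimal sketch of option 1, using the plain Hive Driver API with made-up database/table names (not the actual HoodieHiveClient code):

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class HiveDriverUseDbSketch {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    SessionState.start(conf);   // the Driver needs an active session
    Driver driver = new Driver(conf);
    driver.run("USE testdb");   // switch database first, instead of `db`.`table` references
    driver.run("ALTER TABLE test_table ADD COLUMNS (new_col STRING)");
  }
}
{code}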



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] malanb5 edited a comment on issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
malanb5 edited a comment on issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487#issuecomment-609496962
 
 
   Posted this on Stack Overflow, hopefully this will help others:
   
   https://stackoverflow.com/a/61050495/8366477


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

2020-04-05 Thread GitBox
nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through 
deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-609498911
 
 
   @venkee14 : can you try setting a default value for the new field?
   {
     "name" : "_hoodie_is_deleted",
     "type" : "boolean",
     "default" : false
   }
   Let me know if this works. 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-145) Limit the amount of partitions considered for GlobalBloomIndex

2020-04-05 Thread jerry (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jerry reassigned HUDI-145:
--

Assignee: jerry

> Limit the amount of partitions considered for GlobalBloomIndex
> --
>
> Key: HUDI-145
> URL: https://issues.apache.org/jira/browse/HUDI-145
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, newbie
>Reporter: Vinoth Chandar
>Assignee: jerry
>Priority: Major
>
> Currently, the global bloom index checks inputs against files in all 
> partitions. In a lot of cases, the user clearly knows the range of partitions 
> actually impacted by updates (e.g. an upstream system drops updates 
> older than a year, ...). In such a scenario, it may make sense to support 
> an option for global bloom to control how many partitions to match 
> against, to gain performance. 
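
To make the idea concrete, a rough, purely hypothetical sketch of such an option: prune date-style partitions older than a configured lookback before the global bloom lookup. None of these names exist in Hudi today.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.stream.Collectors;

public class GlobalBloomPartitionPruneSketch {
  // Hypothetical helper: keep only partitions (laid out as yyyy/MM/dd) newer than the
  // configured lookback, so the global bloom index skips partitions the user knows
  // cannot contain updates.
  static List<String> prunePartitions(List<String> allPartitions, int lookbackDays) {
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd");
    LocalDate cutoff = LocalDate.now().minusDays(lookbackDays);
    return allPartitions.stream()
        .filter(p -> !LocalDate.parse(p, fmt).isBefore(cutoff))
        .collect(Collectors.toList());
  }
}
```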



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #239

2020-04-05 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-05 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076023#comment-17076023
 ] 

Yanjia Gary Li commented on HUDI-69:


Hello [~bhasudha], I found your commit 
[https://github.com/apache/incubator-hudi/commit/d09eacdc13b9f19f69a317c8d08bda69a43678bc]
 could be related to this ticket.

Is InputPathHandler able to provide MOR snapshot paths (avro + parquet)? If 
not, I could probably start from the path selector. 

To add Spark Datasource support for RealtimeUnmergedRecordReader, we may simply 
use the Spark SQL API to read the two formats separately and then union them 
together. Does that make sense? 

To merge them, I might need to dig deeper. 
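
As a rough sketch of that union approach (assumptions: a SparkSession is available, the paths are placeholders, the log data is readable as avro with the spark-avro package on the classpath, and both sides expose the same columns):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnmergedSnapshotSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("rt-unmerged-sketch").getOrCreate();

    // Hypothetical paths: compacted base files plus uncompacted updates exported as avro.
    Dataset<Row> baseFiles = spark.read().format("parquet").load("/tmp/hudi/base/*");
    Dataset<Row> logRecords = spark.read().format("avro").load("/tmp/hudi/logs/*");

    // Unmerged "realtime" view: concatenate the two sides, with no de-duplication by record key.
    Dataset<Row> unmerged = baseFiles.unionByName(logRecords);
    unmerged.show();
  }
}
```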

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-05 Thread GitBox
codecov-io edited a comment on issue #1486: [HUDI-759] Integrate checkpoint 
privoder with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-609364046
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=h1) 
Report
   > Merging 
[#1486](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/eaf6cc2d90bf27c0d9414a4ea18dbd1b61f58e50=desc)
 will **increase** coverage by `0.04%`.
   > The diff coverage is `80.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1486/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1486  +/-   ##
   
   + Coverage 71.54%   71.59%   +0.04% 
   - Complexity  261  265   +4 
   
 Files   336  337   +1 
 Lines 1574415756  +12 
 Branches   1610 1611   +1 
   
   + Hits  1126411280  +16 
   + Misses 3759 3754   -5 
   - Partials721  722   +1 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | `64.70% <33.33%> (-0.71%)` | `22.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `78.84% <85.71%> (+0.23%)` | `10.00 <1.00> (+2.00)` | |
   | 
[...ities/checkpointing/InitialCheckPointProvider.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvSW5pdGlhbENoZWNrUG9pbnRQcm92aWRlci5qYXZh)
 | `100.00% <100.00%> (ø)` | `1.00 <1.00> (?)` | |
   | 
[...lities/checkpointing/KafkaConnectHdfsProvider.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvS2Fma2FDb25uZWN0SGRmc1Byb3ZpZGVyLmphdmE=)
 | `92.00% <100.00%> (-0.31%)` | `12.00 <1.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `72.22% <0.00%> (+13.88%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `80.00% <0.00%> (+40.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=footer).
 Last update 
[eaf6cc2...c904218](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 commented on issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
malanb5 commented on issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487#issuecomment-609496962
 
 
   https://stackoverflow.com/a/59823742/8366477


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1487: [SUPPORT] Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

2020-04-05 Thread GitBox
lamber-ken commented on issue #1487: [SUPPORT] Exception in thread "main" 
java.io.IOException: No FileSystem for scheme: hdfs
URL: https://github.com/apache/incubator-hudi/issues/1487#issuecomment-609550970
 
 
   You're always welcome : )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-05 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403839376
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -186,6 +168,96 @@ public void testSimpleInsertAndUpdate() throws Exception {
 }
   }
 
+  // test incremental read does not go past compaction instant for RO views
+  // For RT views, incremental read can go past compaction
+  @Test
+  public void testIncrementalReadsWithCompaction() throws Exception {
 
 Review comment:
   This actually includes RT views too.
   
   Example line 195: getRTIncrementalFiles(partitionPath); 
validateIncrementalFiles()


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-05 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403838755
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -1311,4 +1383,111 @@ private void assertNoWriteErrors(List 
statuses) {
   assertFalse("Errors found in write of " + status.getFileId(), 
status.hasErrors());
 }
   }
+  
+  private FileStatus[] insertAndGetFilePaths(List records, 
HoodieWriteClient client,
 
 Review comment:
   This helper code is refactored from the existing test 
'TestMergeOnReadTable#testSimpleInsertAndUpdate'. I don't know if this setup is 
needed in other places. If you have any suggestions for a new location, let me 
know.
   
   There are also no tests so far with end-to-end coverage of incremental reads 
on MOR tables (as far as I could tell), so I had to add some more utility test 
methods for that. Please suggest if there is a better location for those.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-05 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403836768
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##
 @@ -118,6 +119,34 @@
 return returns.toArray(new FileStatus[returns.size()]);
   }
 
+  /**
+   * Filter any specific instants that we do not want to process.
+   * example timeline:
+   *
+   * t0 -> create bucket1.parquet
+   * t1 -> create and append updates bucket1.log
+   * t2 -> request compaction
+   * t3 -> create bucket2.parquet
+   *
+   * if compaction at t2 takes a long time, incremental readers on RO tables 
can move to t3 and would skip updates in t1
+   *
+   * To workaround this problem, we want to stop returning data belonging to 
commits > t2.
+   * After compaction is complete, incremental reader would see updates in t2, 
t3, so on.
+   */
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline 
timeline) {
+Option pendingCompactionInstant = 
timeline.filterPendingCompactionTimeline().firstInstant();
+if (pendingCompactionInstant.isPresent()) {
 
 Review comment:
   Yes, this is the crux of the change. My understanding is that this bug caused 
data loss on derived ETL tables multiple times. These ETL tables are generated 
using incremental reads on "RO" views. As you suggested, that is the core issue, and 
switching to RT views is likely to get rid of the problem. 
   
   Also, given that the "getting started" and other demo examples include incremental 
reads on RO views, I think this new safeguard is useful to have, especially 
given that finding the root cause for this took a while. I am fine with 
abandoning this change if we can remove the incremental read examples on RO views 
from the documentation.
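
   For context, a method-level sketch of what the safeguard could look like. This is
   illustrative only, not necessarily the PR's actual implementation, and it assumes the
   timeline exposes a findInstantsBefore(...) accessor that preserves the concrete type.

   ```java
   // Sketch only: hide commits at or after the earliest pending compaction instant,
   // so incremental RO readers cannot move past t2 until the compaction finishes.
   protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) {
     Option<HoodieInstant> pendingCompactionInstant =
         timeline.filterPendingCompactionTimeline().firstInstant();
     if (pendingCompactionInstant.isPresent()) {
       String compactionTime = pendingCompactionInstant.get().getTimestamp();
       // Cast assumes the filtered timeline keeps the concrete type (an assumption).
       return (HoodieDefaultTimeline) timeline.findInstantsBefore(compactionTime);
     }
     return timeline;
   }
   ```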


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-05 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403838755
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -1311,4 +1383,111 @@ private void assertNoWriteErrors(List 
statuses) {
   assertFalse("Errors found in write of " + status.getFileId(), 
status.hasErrors());
 }
   }
+  
+  private FileStatus[] insertAndGetFilePaths(List records, 
HoodieWriteClient client,
 
 Review comment:
   This helper code is refactored from the existing test 
'TestMergeOnReadTable#testSimpleInsertAndUpdate'. I don't know if this setup is 
needed in other places. If you have any suggestions for a new location, let me 
know.
   
   There are also no tests so far with end-to-end coverage of incremental reads 
on MOR tables (as far as I could tell), so I had to add some more utility test 
methods for that. Please suggest if there is a better location for those.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-05 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403836768
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##
 @@ -118,6 +119,34 @@
 return returns.toArray(new FileStatus[returns.size()]);
   }
 
+  /**
+   * Filter any specific instants that we do not want to process.
+   * example timeline:
+   *
+   * t0 -> create bucket1.parquet
+   * t1 -> create and append updates bucket1.log
+   * t2 -> request compaction
+   * t3 -> create bucket2.parquet
+   *
+   * if compaction at t2 takes a long time, incremental readers on RO tables 
can move to t3 and would skip updates in t1
+   *
+   * To workaround this problem, we want to stop returning data belonging to 
commits > t2.
+   * After compaction is complete, incremental reader would see updates in t2, 
t3, so on.
+   */
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline 
timeline) {
+Option pendingCompactionInstant = 
timeline.filterPendingCompactionTimeline().firstInstant();
+if (pendingCompactionInstant.isPresent()) {
 
 Review comment:
   Yes, this is the crux of the change. My understanding is that this bug caused 
data loss on derived ETL tables multiple times. These ETL tables are generated 
using incremental reads on "RO" views. As you suggested, that is the core issue, and 
switching to RT views is likely to get rid of the problem. 
   
   Also, given that the "getting started" and other demo examples include incremental 
reads on RO views, I think this is useful to have, especially given that 
finding the root cause for this took a while.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services