[jira] [Commented] (HUDI-146) Impala Support

2019-11-05 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968043#comment-16968043
 ] 

Yanjia Gary Li commented on HUDI-146:
-

Hello [~vinoth], Yuanbin finished his internship a few months ago. Please 
assign this ticket to me and I will give it a try. 

> Impala Support
> --
>
> Key: HUDI-146
> URL: https://issues.apache.org/jira/browse/HUDI-146
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Vinoth Chandar
>Assignee: Yuanbin Cheng
>Priority: Major
>
> [https://github.com/apache/incubator-hudi/issues/179] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-146) Impala Support

2019-11-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968638#comment-16968638
 ] 

Yanjia Gary Li commented on HUDI-146:
-

[~vinoth] is there any Hudi-related code in the Hive code base? 

> Impala Support
> --
>
> Key: HUDI-146
> URL: https://issues.apache.org/jira/browse/HUDI-146
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> [https://github.com/apache/incubator-hudi/issues/179] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-318) Update Migration Guide to Include Delta Streamer

2019-10-31 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-318:
---

 Summary: Update Migration Guide to Include Delta Streamer
 Key: HUDI-318
 URL: https://issues.apache.org/jira/browse/HUDI-318
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


[http://hudi.apache.org/migration_guide.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2019-12-16 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-415:
---

 Summary: HoodieSparkSqlWriter Commit time not representing the 
Spark job starting time
 Key: HUDI-415
 URL: https://issues.apache.org/jira/browse/HUDI-415
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


Hudi records the commit time after the first action completes. If there is a 
heavy transformation before isEmpty(), the commit time could be inaccurate.
{code:java}
if (hoodieRecords.isEmpty()) { 
log.info("new batch has no new records, skipping...") 
return (true, common.util.Option.empty()) 
} 
commitTime = client.startCommit() 
writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
commitTime, operation)
{code}
For example, if I start the Spark job at 20190101 but *isEmpty()* runs for 2 
hours, then the commit time in the .hoodie folder will be 201901010*2*00. If I 
use that commit time to ingest data starting from 201901010200 (from HDFS, not 
using DeltaStreamer), then I will miss 2 hours of data.

Is this setup intended? Can we move the commit time before isEmpty()?
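
A minimal sketch of the proposed reordering, reusing the pseudocode above 
(editor's illustration only; it assumes an unused commit can be abandoned or 
rolled back when the batch turns out to be empty):
{code:java}
// Start the commit first so the instant time reflects when the job began.
commitTime = client.startCommit()
if (hoodieRecords.isEmpty()) {
  log.info("new batch has no new records, skipping...")
  // assumption: the unused commit is abandoned / rolled back here
  return (true, common.util.Option.empty())
}
writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords,
commitTime, operation)
{code}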



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing

2019-12-12 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995153#comment-16995153
 ] 

Yanjia Gary Li commented on HUDI-259:
-

Hello, I recently started using Hadoop 3 and Spark 2.4. 
[https://github.com/apache/incubator-hudi/commit/7bc08cbfdce337ad980bb544ec9fc3dbdf9c#diff-832156391e3edd5b0ceb86007ce6ae41]
 enabled me to compile Hudi with Hadoop 3, but some tests fail. 

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> Sample issues
>  
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568] 
> [https://github.com/apache/incubator-hudi/issues/898]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2019-12-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-415:

Status: Closed  (was: Patch Available)

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time after the first action complete. If there is a 
> heavy transformation before isEmpty(), then the commit time could be 
> inaccurate.
> {code:java}
> if (hoodieRecords.isEmpty()) { 
> log.info("new batch has no new records, skipping...") 
> return (true, common.util.Option.empty()) 
> } 
> commitTime = client.startCommit() 
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
> commitTime, operation)
> {code}
> For example, I start the spark job at 20190101, but *isEmpty()* ran for 2 
> hours, then the commit time in the .hoodie folder will be 201901010*2*00. If 
> I use the commit time to ingest data starting from 201901010200(from HDFS, 
> not using deltastreamer), then I will miss 2 hours of data.
> Is this set up intended? Can we move the commit time before isEmpty()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2019-12-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-415:

Status: Patch Available  (was: In Progress)

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time after the first action complete. If there is a 
> heavy transformation before isEmpty(), then the commit time could be 
> inaccurate.
> {code:java}
> if (hoodieRecords.isEmpty()) { 
> log.info("new batch has no new records, skipping...") 
> return (true, common.util.Option.empty()) 
> } 
> commitTime = client.startCommit() 
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
> commitTime, operation)
> {code}
> For example, I start the spark job at 20190101, but *isEmpty()* ran for 2 
> hours, then the commit time in the .hoodie folder will be 201901010*2*00. If 
> I use the commit time to ingest data starting from 201901010200(from HDFS, 
> not using deltastreamer), then I will miss 2 hours of data.
> Is this set up intended? Can we move the commit time before isEmpty()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2019-12-20 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001084#comment-17001084
 ] 

Yanjia Gary Li commented on HUDI-415:
-

PR merged. Issue resolved.

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time after the first action complete. If there is a 
> heavy transformation before isEmpty(), then the commit time could be 
> inaccurate.
> {code:java}
> if (hoodieRecords.isEmpty()) { 
> log.info("new batch has no new records, skipping...") 
> return (true, common.util.Option.empty()) 
> } 
> commitTime = client.startCommit() 
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
> commitTime, operation)
> {code}
> For example, I start the spark job at 20190101, but *isEmpty()* ran for 2 
> hours, then the commit time in the .hoodie folder will be 201901010*2*00. If 
> I use the commit time to ingest data starting from 201901010200(from HDFS, 
> not using deltastreamer), then I will miss 2 hours of data.
> Is this set up intended? Can we move the commit time before isEmpty()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-610) Impala nea real time table support

2020-02-13 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-610:
---

 Summary: Impala near real time table support
 Key: HUDI-610
 URL: https://issues.apache.org/jira/browse/HUDI-610
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


Impala uses the Java-based module called "frontend" to list all the files to 
scan and lets the C++-based "backend" do all the file scanning. 

Merging Avro and Parquet could be difficult because it might require custom 
merging logic like RealtimeCompactedRecordReader to be implemented in the 
backend in C++, but I think it will be doable to have something like 
RealtimeUnmergedRecordReader, which only needs some changes in the frontend. 
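
A rough Scala sketch of the unmerged read path described above (editor's 
illustration only, not Impala or Hudi code; the Record type is made up):
{code:scala}
// Hypothetical record type, for illustration only.
case class Record(key: String, payload: String)

// An "unmerged" realtime view returns every record from the base parquet file
// and from the log files without applying update/delete merges, so the scan
// side needs no custom merging logic.
def unmergedScan(baseFileRecords: Iterator[Record], logRecords: Iterator[Record]): Iterator[Record] =
  baseFileRecords ++ logRecords
{code}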



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-611) Impala sync tool

2020-02-13 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-611:
---

 Summary: Impala sync tool
 Key: HUDI-611
 URL: https://issues.apache.org/jira/browse/HUDI-611
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


Like the Hive sync, we need a tool to sync with Impala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer

2020-02-26 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-644:
---

 Summary: Enable to retrieve checkpoint from previous commits in 
Delta Streamer
 Key: HUDI-644
 URL: https://issues.apache.org/jira/browse/HUDI-644
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


This ticket is to resolve the following problem:

The user is using a homebrew Spark data source to read new data and write to a 
Hudi table.

The user would like to migrate to Delta Streamer.

But Delta Streamer only checks the last commit's metadata; if there is no 
checkpoint info there, Delta Streamer falls back to the default, which for the 
Kafka source is LATEST. 

The user would like to run the homebrew Spark data source reader and Delta 
Streamer in parallel to prevent data loss, but the Spark data source writer 
will commit without checkpoint info, which resets the Delta Streamer. 

So an option that allows the user to retrieve the checkpoint from previous 
commits instead of only the latest commit would be helpful for the migration. 
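
A rough Scala sketch of the requested behavior (editor's illustration; the 
CommitMetadata type and its checkpoint field are assumptions, not the actual 
Hudi API):
{code:scala}
// Commit metadata as seen on the timeline, ordered oldest to newest.
case class CommitMetadata(checkpoint: Option[String])

// Instead of inspecting only the latest commit, walk back through previous
// commits and return the most recent checkpoint that was recorded.
def findLatestCheckpoint(commitsOldestFirst: Seq[CommitMetadata]): Option[String] =
  commitsOldestFirst.reverse.flatMap(_.checkpoint).headOption
{code}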



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer

2020-03-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-644:

Status: Open  (was: New)

> Enable to retrieve checkpoint from previous commits in Delta Streamer
> -
>
> Key: HUDI-644
> URL: https://issues.apache.org/jira/browse/HUDI-644
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket is to resolve the following problem:
> The user is using a homebrew Spark data source to read new data and write to 
> Hudi table
> The user would like to migrate to Delta Streamer
> But the Delta Streamer only checks the last commit metadata, if there is no 
> checkpoint info, then the Delta Streamer will use the default. For Kafka 
> source, it is LATEST. 
> The user would like to run the homebrew Spark data source reader and Delta 
> Streamer in parallel to prevent data loss, but the Spark data source writer 
> will make commit without checkpoint info, which will reset the delta 
> streamer. 
> So if we have an option to allow the user to retrieve the checkpoint from 
> previous commits instead of the latest commit would be helpful for the 
> migration. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer

2020-03-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-644:

Status: In Progress  (was: Open)

> Enable to retrieve checkpoint from previous commits in Delta Streamer
> -
>
> Key: HUDI-644
> URL: https://issues.apache.org/jira/browse/HUDI-644
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket is to resolve the following problem:
> The user is using a homebrew Spark data source to read new data and write to 
> Hudi table
> The user would like to migrate to Delta Streamer
> But the Delta Streamer only checks the last commit metadata, if there is no 
> checkpoint info, then the Delta Streamer will use the default. For Kafka 
> source, it is LATEST. 
> The user would like to run the homebrew Spark data source reader and Delta 
> Streamer in parallel to prevent data loss, but the Spark data source writer 
> will make commit without checkpoint info, which will reset the delta 
> streamer. 
> So if we have an option to allow the user to retrieve the checkpoint from 
> previous commits instead of the latest commit would be helpful for the 
> migration. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-644) Enable to retrieve checkpoint from previous commits in Delta Streamer

2020-03-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-644:

Fix Version/s: 0.6.0

> Enable to retrieve checkpoint from previous commits in Delta Streamer
> -
>
> Key: HUDI-644
> URL: https://issues.apache.org/jira/browse/HUDI-644
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket is to resolve the following problem:
> The user is using a homebrew Spark data source to read new data and write to 
> Hudi table
> The user would like to migrate to Delta Streamer
> But the Delta Streamer only checks the last commit metadata, if there is no 
> checkpoint info, then the Delta Streamer will use the default. For Kafka 
> source, it is LATEST. 
> The user would like to run the homebrew Spark data source reader and Delta 
> Streamer in parallel to prevent data loss, but the Spark data source writer 
> will make commit without checkpoint info, which will reset the delta 
> streamer. 
> So if we have an option to allow the user to retrieve the checkpoint from 
> previous commits instead of the latest commit would be helpful for the 
> migration. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-315.
---
Resolution: Won't Fix

> Reimplement statistics/workload profile collected during writes using Spark 
> 2.x custom accumulators
> ---
>
> Key: HUDI-315
> URL: https://issues.apache.org/jira/browse/HUDI-315
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
>  
> In Hudi, there are two places where we need to obtain statistics on the input 
> data 
> - HoodieBloomIndex  : for knowing what partitions need to be loaded and 
> checked against (is this still needed with the timeline server enabled is a 
> separate question) 
> - Workload profile to get a sense of number of updates, inserts to each 
> partition/file group
> Both of them issue their own groupBy or shuffle computation today. This can 
> be avoided using an accumulator
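
A minimal Scala sketch of the accumulator approach described above (editor's 
illustration built on Spark's AccumulatorV2 API; not existing Hudi code):
{code:scala}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Counts records per partition path as a side effect of an existing pass over
// the data, so the workload profile does not need its own groupBy/shuffle.
class PartitionCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
  private val counts = mutable.Map[String, Long]().withDefaultValue(0L)

  override def isZero: Boolean = counts.isEmpty
  override def copy(): PartitionCountAccumulator = {
    val acc = new PartitionCountAccumulator
    counts.foreach { case (k, v) => acc.counts(k) += v }
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(partitionPath: String): Unit = counts(partitionPath) += 1L
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (k, v) => counts(k) += v }
  override def value: Map[String, Long] = counts.toMap
}
// Usage sketch: register with sc.register(acc), then call acc.add(partitionPath)
// for each record inside a pass that already exists (e.g. the tagging step).
{code}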



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators

2020-02-27 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047137#comment-17047137
 ] 

Yanjia Gary Li commented on HUDI-315:
-

Agree. Closing this ticket. 

> Reimplement statistics/workload profile collected during writes using Spark 
> 2.x custom accumulators
> ---
>
> Key: HUDI-315
> URL: https://issues.apache.org/jira/browse/HUDI-315
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
>  
> In Hudi, there are two places where we need to obtain statistics on the input 
> data 
> - HoodieBloomIndex  : for knowing what partitions need to be loaded and 
> checked against (is this still needed with the timeline server enabled is a 
> separate question) 
> - Workload profile to get a sense of number of updates, inserts to each 
> partition/file group
> Both of them issue their own groupBy or shuffle computation today. This can 
> be avoided using an accumulator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-415.
-
Resolution: Fixed

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time after the first action complete. If there is a 
> heavy transformation before isEmpty(), then the commit time could be 
> inaccurate.
> {code:java}
> if (hoodieRecords.isEmpty()) { 
> log.info("new batch has no new records, skipping...") 
> return (true, common.util.Option.empty()) 
> } 
> commitTime = client.startCommit() 
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
> commitTime, operation)
> {code}
> For example, I start the spark job at 20190101, but *isEmpty()* ran for 2 
> hours, then the commit time in the .hoodie folder will be 201901010*2*00. If 
> I use the commit time to ingest data starting from 201901010200(from HDFS, 
> not using deltastreamer), then I will miss 2 hours of data.
> Is this set up intended? Can we move the commit time before isEmpty()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reopened HUDI-415:
-

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time after the first action complete. If there is a 
> heavy transformation before isEmpty(), then the commit time could be 
> inaccurate.
> {code:java}
> if (hoodieRecords.isEmpty()) { 
> log.info("new batch has no new records, skipping...") 
> return (true, common.util.Option.empty()) 
> } 
> commitTime = client.startCommit() 
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, 
> commitTime, operation)
> {code}
> For example, I start the spark job at 20190101, but *isEmpty()* ran for 2 
> hours, then the commit time in the .hoodie folder will be 201901010*2*00. If 
> I use the commit time to ingest data starting from 201901010200(from HDFS, 
> not using deltastreamer), then I will miss 2 hours of data.
> Is this set up intended? Can we move the commit time before isEmpty()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators

2020-02-26 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045986#comment-17045986
 ] 

Yanjia Gary Li commented on HUDI-315:
-

I will take a look at this ticket

> Reimplement statistics/workload profile collected during writes using Spark 
> 2.x custom accumulators
> ---
>
> Key: HUDI-315
> URL: https://issues.apache.org/jira/browse/HUDI-315
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
>  
> In Hudi, there are two places where we need to obtain statistics on the input 
> data 
> - HoodieBloomIndex  : for knowing what partitions need to be loaded and 
> checked against (is this still needed with the timeline server enabled is a 
> separate question) 
> - Workload profile to get a sense of number of updates, inserts to each 
> partition/file group
> Both of them issue their own groupBy or shuffle computation today. This can 
> be avoided using an accumulator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators

2020-02-26 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-315:
---

Assignee: Yanjia Gary Li

> Reimplement statistics/workload profile collected during writes using Spark 
> 2.x custom accumulators
> ---
>
> Key: HUDI-315
> URL: https://issues.apache.org/jira/browse/HUDI-315
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
>  
> In Hudi, there are two places where we need to obtain statistics on the input 
> data 
> - HoodieBloomIndex  : for knowing what partitions need to be loaded and 
> checked against (is this still needed with the timeline server enabled is a 
> separate question) 
> - Workload profile to get a sense of number of updates, inserts to each 
> partition/file group
> Both of them issue their own groupBy or shuffle computation today. This can 
> be avoided using an accumulator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-03-01 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-597:

Description: 
For the use case where I only need to pull the incremental part of certain 
partitions, I currently need to do the incremental pull on the entire dataset 
first and then filter in Spark.

If we can use the folder partitions directly as part of the input path, it 
could run faster by loading only the relevant parquet files.

Example:

 
{code:java}
spark.read.format("org.apache.hudi")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
.option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/year=2016/*/*/*")
.load(path)
 
{code}
 

  was:
For the use case that I only need to pull the incremental part of certain 
partitions, I need to do the incremental pulling from the entire dataset first 
then filtering in Spark.

If we can use the folder partitions directly as part of the input path, it 
could run faster by only load relevant parquet files.

Example:

 
{code:java}
spark.read.format("org.apache.hudi")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
.load(path, "year=2020/*/*/*")
 
{code}
 


> Enable incremental pulling from defined partitions
> --
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For the use case that I only need to pull the incremental part of certain 
> partitions, I need to do the incremental pulling from the entire dataset 
> first then filtering in Spark.
> If we can use the folder partitions directly as part of the input path, it 
> could run faster by only load relevant parquet files.
> Example:
>  
> {code:java}
> spark.read.format("org.apache.hudi")
> .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
> .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
> .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/year=2016/*/*/*")
> .load(path)
>  
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-597.
-
Resolution: Fixed

PR merged. Will update the docs after the 0.5.2 release.

> Enable incremental pulling from defined partitions
> --
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For the use case that I only need to pull the incremental part of certain 
> partitions, I need to do the incremental pulling from the entire dataset 
> first then filtering in Spark.
> If we can use the folder partitions directly as part of the input path, it 
> could run faster by only load relevant parquet files.
> Example:
>  
> {code:java}
> spark.read.format("org.apache.hudi")
> .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
> .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
> .load(path, "year=2020/*/*/*")
>  
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-597:

Fix Version/s: 0.5.2

> Enable incremental pulling from defined partitions
> --
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For the use case that I only need to pull the incremental part of certain 
> partitions, I need to do the incremental pulling from the entire dataset 
> first then filtering in Spark.
> If we can use the folder partitions directly as part of the input path, it 
> could run faster by only load relevant parquet files.
> Example:
>  
> {code:java}
> spark.read.format("org.apache.hudi")
> .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
> .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
> .load(path, "year=2020/*/*/*")
>  
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-597:

Status: Open  (was: New)

> Enable incremental pulling from defined partitions
> --
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For the use case that I only need to pull the incremental part of certain 
> partitions, I need to do the incremental pulling from the entire dataset 
> first then filtering in Spark.
> If we can use the folder partitions directly as part of the input path, it 
> could run faster by only load relevant parquet files.
> Example:
>  
> {code:java}
> spark.read.format("org.apache.hudi")
> .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
> .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
> .load(path, "year=2020/*/*/*")
>  
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-611) Add Impala Guide to Doc

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-611:

Status: Open  (was: New)

> Add Impala Guide to Doc
> ---
>
> Key: HUDI-611
> URL: https://issues.apache.org/jira/browse/HUDI-611
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Like sync to Hive. We need a tool to sync with Impala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-611) Add Impala Guide to Doc

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-611.
-
Resolution: Fixed

> Add Impala Guide to Doc
> ---
>
> Key: HUDI-611
> URL: https://issues.apache.org/jira/browse/HUDI-611
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Like sync to Hive. We need a tool to sync with Impala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-611) Add Impala Guide to Doc

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-611:

Status: In Progress  (was: Open)

> Add Impala Guide to Doc
> ---
>
> Key: HUDI-611
> URL: https://issues.apache.org/jira/browse/HUDI-611
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Like sync to Hive. We need a tool to sync with Impala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-597:

Status: In Progress  (was: Open)

> Enable incremental pulling from defined partitions
> --
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For the use case that I only need to pull the incremental part of certain 
> partitions, I need to do the incremental pulling from the entire dataset 
> first then filtering in Spark.
> If we can use the folder partitions directly as part of the input path, it 
> could run faster by only load relevant parquet files.
> Example:
>  
> {code:java}
> spark.read.format("org.apache.hudi")
> .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
> .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
> .load(path, "year=2020/*/*/*")
>  
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-03 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-597:
---

 Summary: Enable incremental pulling from defined partitions
 Key: HUDI-597
 URL: https://issues.apache.org/jira/browse/HUDI-597
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


For the use case where I only need to pull the incremental part of certain 
partitions, I currently need to do the incremental pull on the entire dataset 
first and then filter in Spark.

If we can use the folder partitions directly as part of the input path, it 
could run faster by loading only the relevant parquet files.

Example:

 
{code:java}
spark.read.format("org.apache.hudi")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY,DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
.load(path, "year=2020/*/*/*")
 
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-146) Impala Support

2020-02-11 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-146.
-
Resolution: Done

The read optimized table is now supported by Impala. Fixed by: 
[https://github.com/apache/impala/commit/ea0e1def6160d596082b01365fcbbb6e24afb21d]

Sample query to create Hudi table: 
[https://github.com/apache/impala/blob/ea0e1def6160d596082b01365fcbbb6e24afb21d/testdata/datasets/functional/functional_schema_template.sql#L2758]

 

> Impala Support
> --
>
> Key: HUDI-146
> URL: https://issues.apache.org/jira/browse/HUDI-146
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> [https://github.com/apache/incubator-hudi/issues/179] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-611) Add Impala Guide to Doc

2020-02-21 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-611:

Summary: Add Impala Guide to Doc  (was: Impala sync tool)

> Add Impala Guide to Doc
> ---
>
> Key: HUDI-611
> URL: https://issues.apache.org/jira/browse/HUDI-611
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>
> Like sync to Hive. We need a tool to sync with Impala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-611) Add Impala Guide to Doc

2020-02-21 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-611:

Priority: Minor  (was: Major)

> Add Impala Guide to Doc
> ---
>
> Key: HUDI-611
> URL: https://issues.apache.org/jira/browse/HUDI-611
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Like sync to Hive. We need a tool to sync with Impala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-01-02 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Description: 
I am using a manually built master after the 
[https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
 commit. EDIT: tried with the latest master but got the same result.

I am seeing 3 million tasks when the Hudi Spark job writes the files into 
HDFS. It seems related to the input size: with 7.7 GB input it was 3.2 
million tasks, with 9 GB input it was 3.7 million, both with a parallelism of 10. 

I am seeing a huge amount of 0-byte files being written into the .hoodie/.temp/ 
folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in
{code:java}
count at HoodieSparkSqlWriter{code}
 All the stages before this seem normal. Any idea what happened here? My first 
guess would be something related to the bloom filter index. Maybe something 
triggers a repartitioning with the bloom filter index? But I am not really 
familiar with that part of the code. 

Thanks

 

  was:
I am using the manual build master after 
[https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
 commit. 

I am seeing 3 million tasks when the Hudi Spark job writing the files into 
HDFS. 

I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
folder in my HDFS. In the Spark UI, each task only writes less than 10 records 
in
{code:java}
count at HoodieSparkSqlWriter{code}
 All the stages before this seems normal. Any idea what happened here?

 


> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-01-02 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Attachment: Screen Shot 2020-01-02 at 8.53.44 PM.png

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. 
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seems normal. Any idea what happened here?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-01-02 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Attachment: Screen Shot 2020-01-02 at 8.53.24 PM.png

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. 
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seems normal. Any idea what happened here?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-01-02 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-494:
---

 Summary: [DEBUGGING] Huge amount of tasks when writing files into 
HDFS
 Key: HUDI-494
 URL: https://issues.apache.org/jira/browse/HUDI-494
 Project: Apache Hudi (incubating)
  Issue Type: Test
Reporter: Yanjia Gary Li
Assignee: Vinoth Chandar
 Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
2020-01-02 at 8.53.44 PM.png

I am using a manually built master after the 
[https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
 commit. 

I am seeing 3 million tasks when the Hudi Spark job writes the files into 
HDFS. 

I am seeing a huge amount of 0-byte files being written into the .hoodie/.temp/ 
folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in
{code:java}
count at HoodieSparkSqlWriter{code}
 All the stages before this seem normal. Any idea what happened here?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-01-04 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008199#comment-17008199
 ] 

Yanjia Gary Li commented on HUDI-494:
-

Hello [~lamber-ken],

Thanks for trying this out. This behavior is very strange and I haven't seen 
it happen before with the older version of Hudi (0.4.7). I recently upgraded 
my cluster to Hadoop 3 and Spark 2.4 with the latest Hudi snapshot.

More details about my scenario:
 * My dataset was partitioned by year/month/day and the total number of parquet 
files was a few thousand. The total size of the dataset was a few TBs.
 * When the upsert job was running (halfway through the 3 million tasks), 
there was only one partition under 
.hoodie/.temp/20200102/year=2020/month=01/day=01/, but in that partition 
there were tons of parquet.marker files. 
 * I also checked the delta input; it should be under the same partition.
 * I used hoodie.index.bloom.num_entries = 2,000,000 based on the number of 
records in each parquet file.
 * Max parquet size was set to 128MB and min was 100MB.

My guesses about the cause:
 * In the initial bulkInsert, I set the bulkInsertParallelism too high, which 
caused the average size of the parquet files to be about 30MB, below the 
min value I set. But I guess this might not be related. I am rerunning the 
initial bulkInsert job with lower parallelism and will then run an upsert job 
to see what happens.
 * 3 million tasks looks like some sort of overflow. I need to dig into the code 
for this.

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-01-07 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010031#comment-17010031
 ] 

Yanjia Gary Li commented on HUDI-494:
-

[~vinoth] Thanks for the feedback. The code snippets were prepared by 
[~lamber-ken], and my dataset that has this issue was partitioned by 
year/month/day/hour. The behavior I observed was that the path 
*/.hoodie/.temp/20200101/year=2020/month=1/day=1/hour=00* had a ton of 
files. 

For my dataset, I calculate the parallelism based on the input data size. I set 
*bulkInsertParallelism = inputSizeInMB / 100*, which was 6 for my 6TB 
dataset. 

The *upsertParallelism = 10* was based on the input size when I ran this upsert 
job. 

I will reproduce this once I get the chance and provide more details. 

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-644) checkpoint generator tool for delta streamer

2020-03-11 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-644:

Summary: checkpoint generator tool for delta streamer  (was: Enable to 
retrieve checkpoint from previous commits in Delta Streamer)

> checkpoint generator tool for delta streamer
> 
>
> Key: HUDI-644
> URL: https://issues.apache.org/jira/browse/HUDI-644
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket is to resolve the following problem:
> The user is using a homebrew Spark data source to read new data and write to 
> Hudi table
> The user would like to migrate to Delta Streamer
> But the Delta Streamer only checks the last commit metadata, if there is no 
> checkpoint info, then the Delta Streamer will use the default. For Kafka 
> source, it is LATEST. 
> The user would like to run the homebrew Spark data source reader and Delta 
> Streamer in parallel to prevent data loss, but the Spark data source writer 
> will make commit without checkpoint info, which will reset the delta 
> streamer. 
> So if we have an option to allow the user to retrieve the checkpoint from 
> previous commits instead of the latest commit would be helpful for the 
> migration. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-773:

Summary: Hudi On Azure Data Lake Storage V2  (was: Hudi On Azure Data Lake 
Storage)

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-773) Hudi On Azure Data Lake Storage

2020-04-08 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-773:
---

 Summary: Hudi On Azure Data Lake Storage
 Key: HUDI-773
 URL: https://issues.apache.org/jira/browse/HUDI-773
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: Usability
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-759) Integrate checkpoint provider

2020-04-14 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-759.
-
Resolution: Fixed

> Integrate checkpoint provider
> -
>
> Key: HUDI-759
> URL: https://issues.apache.org/jira/browse/HUDI-759
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-14 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082773#comment-17082773
 ] 

Yanjia Gary Li edited comment on HUDI-69 at 4/14/20, 10:11 PM:
---

After a closer look, I think Spark datasource support for the realtime table needs:
 * Support for the hadoop.mapreduce.xxx APIs. We use hadoop.mapred.RecordReader, but 
Spark SQL uses hadoop.mapreduce.RecordReader. We need to figure out how to support 
both APIs, or upgrade to mapreduce (see the sketch below).
 * An extension of ParquetInputFormat from Spark, or a custom data 
source reader, to handle the merge. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala]
 * Datasource V2 as the default data source. 

Please let me know what you guys think. 
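
For context, a simplified Scala sketch of the two reader styles mentioned in the 
first point above (editor's illustration; these are not the actual Hadoop interfaces):
{code:scala}
// Old-style "mapred" reader, roughly what Hudi's realtime record readers implement:
// the caller supplies reusable key/value objects.
trait OldMapredStyleReader[K, V] {
  def next(key: K, value: V): Boolean
}

// New-style "mapreduce" reader, roughly what Spark SQL's Parquet support builds on:
// the reader owns the current key/value pair.
trait NewMapreduceStyleReader[K, V] {
  def nextKeyValue(): Boolean
  def getCurrentKey: K
  def getCurrentValue: V
}
{code}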


was (Author: garyli1019):
After a closer look, I think Spark datasource support for realtime table needs:
 * Refactoring HoodieRealtimeFormat and (file split, record reader). Decouple 
Hudi logic from the MapredParquetInputFormat. I think we can maintain the Hudi 
file split and path filtering in a central place, and able to be adopted by 
different query engines. With bootstrap support, the file format maintenance 
could be more complicated. I think this is very essential. 
 * Implement the extension of ParquetInputFormat from Spark or a custom data 
source reader to handle merge. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala]
 * Use Datasource V2 to be the default data source. 

Please let me know what you guys think. 

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator

2020-04-15 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-765:
---

Assignee: Yanjia Gary Li

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-791) Replace null by Option in Delta Streamer

2020-04-17 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085942#comment-17085942
 ] 

Yanjia Gary Li commented on HUDI-791:
-

[~tison] Thanks for looking into this ticket!

The initiative here is to make the code look cleaner and more robust. If you 
are interested in improving the Delta Streamer, please feel free to claim this 
ticket :)

> Replace null by Option in Delta Streamer
> 
>
> Key: HUDI-791
> URL: https://issues.apache.org/jira/browse/HUDI-791
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> There is a lot of null in Delta Streamer. That will be great if we can 
> replace those null by Option. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-17 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086087#comment-17086087
 ] 

Yanjia Gary Li commented on HUDI-773:
-

Hello [~sasikumar.venkat], I am very new to Azure.

How is your cluster set up? Are you using HDInsight or Databricks? Is your Spark 
cluster attached to the storage account, or does it access the storage through an API?

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-805) Verify which types of Azure storage support Hudi

2020-04-17 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-805:
---

 Summary: Verify which types of Azure storage support Hudi
 Key: HUDI-805
 URL: https://issues.apache.org/jira/browse/HUDI-805
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: Yanjia Gary Li


Azure has the following storage options:

Azure Data Lake Storage Gen 1

Azure Data Lake Storage Gen 2

Azure Blob Storage (legacy name: Windows Azure Storage Blob)
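
For reference, the path schemes usually associated with each option (based on general 
Azure documentation, still to be verified with Hudi):
{code}
ADLS Gen 1:         adl://<account>.azuredatalakestore.net/<path>
ADLS Gen 2:         abfs://<container>@<account>.dfs.core.windows.net/<path>   (abfss:// over TLS)
Azure Blob Storage: wasb://<container>@<account>.blob.core.windows.net/<path>  (wasbs:// over TLS)
{code}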

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-804) Add Azure Support to Hudi Doc

2020-04-17 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-804:
---

 Summary: Add Azure Support to Hudi Doc
 Key: HUDI-804
 URL: https://issues.apache.org/jira/browse/HUDI-804
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-16 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085409#comment-17085409
 ] 

Yanjia Gary Li commented on HUDI-773:
-

Hello [~sasikumar.venkat], thanks for sharing!

I am able to write Hudi data without OAuth. We are probably among the first few people 
in the community using Hudi on Azure, so I believe we need to figure this out :)

I will try to reproduce your issue and will update here once I have tried. 

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-13 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082773#comment-17082773
 ] 

Yanjia Gary Li commented on HUDI-69:


After a closer look, I think Spark datasource support for realtime table needs:
 * Refactor HoodieRealtimeFormat and the (file split, record reader) pair, and decouple 
Hudi logic from MapredParquetInputFormat. I think we can maintain the Hudi 
file split and path filtering in a central place so they can be adopted by 
different query engines. With bootstrap support, the file format maintenance 
could get more complicated, so I think this is essential. 
 * Implement the extension of ParquetInputFormat from Spark or a custom data 
source reader to handle merge. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala]
 * Use Datasource V2 to be the default data source. 

Please let me know what you guys think. 

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-791) Replace null by Option in Delta Streamer

2020-04-13 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-791:
---

 Summary: Replace null by Option in Delta Streamer
 Key: HUDI-791
 URL: https://issues.apache.org/jira/browse/HUDI-791
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: DeltaStreamer, newbie
Reporter: Yanjia Gary Li


There is a lot of null handling in Delta Streamer. It would be great if we could 
replace those nulls with Option. 
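
A minimal sketch of the pattern (shown with Scala's Option purely for illustration; 
the DeltaStreamer code itself is Java and would use Hudi's own Option utility, and 
the names below are hypothetical):
{code:scala}
// Hypothetical holder: absence of a checkpoint is explicit in the type,
// so callers must handle it instead of accidentally dereferencing a null.
case class CheckpointState(checkpoint: Option[String] = None)

def resumeFrom(state: CheckpointState, default: String): String =
  state.checkpoint.getOrElse(default)

val fresh = CheckpointState()                         // no checkpoint persisted yet
val resumed = CheckpointState(Some("20200413000000")) // checkpoint read from commit metadata
assert(resumeFrom(fresh, "EARLIEST") == "EARLIEST")
assert(resumeFrom(resumed, "EARLIEST") == "20200413000000")
{code}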



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-791) Replace null by Option in Delta Streamer

2020-04-13 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-791:

Issue Type: Improvement  (was: New Feature)

> Replace null by Option in Delta Streamer
> 
>
> Key: HUDI-791
> URL: https://issues.apache.org/jira/browse/HUDI-791
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> There is a lot of null in Delta Streamer. That will be great if we can 
> replace those null by Option. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-30) Explore support for Spark Datasource V2

2020-04-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-30:
--

Assignee: Yanjia Gary Li

> Explore support for Spark Datasource V2
> ---
>
> Key: HUDI-30
> URL: https://issues.apache.org/jira/browse/HUDI-30
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> https://github.com/uber/hudi/issues/501



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-30) Explore support for Spark Datasource V2

2020-04-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-30:
---
Status: In Progress  (was: Open)

> Explore support for Spark Datasource V2
> ---
>
> Key: HUDI-30
> URL: https://issues.apache.org/jira/browse/HUDI-30
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
>
> https://github.com/uber/hudi/issues/501



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-20 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087994#comment-17087994
 ] 

Yanjia Gary Li commented on HUDI-773:
-

[~sasikumar.venkat] I haven't tried Databricks Spark myself, but one of my 
colleagues tried it before and had some issues with the Hudi write, probably 
related to yours. As Vinoth mentioned, any debugging info would be helpful. I 
will also try it myself later.

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats

2020-04-20 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-822:
---

 Summary: Decouple hoodie related methods with Hoodie Input Formats
 Key: HUDI-822
 URL: https://issues.apache.org/jira/browse/HUDI-822
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


In order to support multiple query engines, we need to generalize the Hudi 
input format and the Hudi record merging logic, and decouple them from 
MapredParquetInputFormat, which depends on Hive. 
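
A rough sketch of the kind of engine-agnostic abstraction this could lead to (all 
names below are hypothetical, not the actual design):
{code:scala}
// Describes one Hudi file slice (base file plus log files) without referencing
// Hive or mapred classes, so each engine can build its own split from it.
trait HoodieFileSliceDescriptor {
  def basePath: Option[String] // base parquet file, if present
  def logPaths: Seq[String]    // delta log files to merge on read
  def maxCommitTime: String    // latest commit to read up to
}

// Record merging expressed against engine-neutral records, so Hive, Spark and
// other engines can reuse it instead of going through MapredParquetInputFormat.
trait HoodieRecordMerger[R] {
  def merge(baseRecord: R, logRecord: R): R
}
{code}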



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081030#comment-17081030
 ] 

Yanjia Gary Li commented on HUDI-773:
-

Surprisingly easy... I tried the following test using a Spark 2.4 HDInsight cluster 
with Azure Data Lake Storage Gen2. Hudi ran out of the box, no extra config 
needed.
{code:java}
// in spark-shell / notebooks, `spark` (SparkSession) is already in scope
import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Initial Batch
val outputPath = "/Test/HudiWrite"
val df1 = Seq(
  ("0", "year=2019", "test1", "pass", "201901"),
  ("1", "year=2019", "test1", "pass", "201901"),
  ("2", "year=2020", "test1", "pass", "201901"),
  ("3", "year=2020", "test1", "pass", "201901")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
val bulk_insert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
df1.write.format("org.apache.hudi").options(bulk_insert_ops).mode(SaveMode.Overwrite).save(outputPath)

// Upsert
val upsert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
val df2 = Seq(
  ("0", "year=2019", "test1", "pass", "201910"),
  ("1", "year=2019", "test1", "pass", "201910"),
  ("2", "year=2020", "test1", "pass", "201910"),
  ("3", "year=2020", "test1", "pass", "201910")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
df2.write.format("org.apache.hudi").options(upsert_ops).mode(SaveMode.Append).save(outputPath)

// Read as hudi format
val df_read = 
spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
 DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).load(outputPath)
assert(df_read.count() == 4){code}
 

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081032#comment-17081032
 ] 

Yanjia Gary Li commented on HUDI-773:
-

Are any extra tests needed? What tests have you done for AWS and GCP? 
[~vinoth] [~vbalaji]

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-773:

Status: In Progress  (was: Open)

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-773:

Fix Version/s: 0.6.0

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-03-31 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072382#comment-17072382
 ] 

Yanjia Gary Li commented on HUDI-69:


[~vinoth] I am happy to work on this ticket. Please assign it to me.

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-759) Integrate checkpoint provider

2020-04-03 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-759:
---

 Summary: Integrate checkpoint provider
 Key: HUDI-759
 URL: https://issues.apache.org/jira/browse/HUDI-759
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li
 Fix For: 0.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-759) Integrate checkpoint provider

2020-04-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-759:

Status: Open  (was: New)

> Integrate checkpoint provider
> -
>
> Key: HUDI-759
> URL: https://issues.apache.org/jira/browse/HUDI-759
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-759) Integrate checkpoint provider

2020-04-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-759:

Status: In Progress  (was: Open)

> Integrate checkpoint provider
> -
>
> Key: HUDI-759
> URL: https://issues.apache.org/jira/browse/HUDI-759
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-644) checkpoint generator tool for delta streamer

2020-04-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-644.
-
Resolution: Fixed

> checkpoint generator tool for delta streamer
> 
>
> Key: HUDI-644
> URL: https://issues.apache.org/jira/browse/HUDI-644
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This ticket is to resolve the following problem:
> The user has finished the initial load and write to Hudi table
> The user would like to migrate to Delta Streamer
> The user needs a tool to provide the checkpoint for the Delta Streamer in the 
> first run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-04 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Status: In Progress  (was: Open)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-05 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076023#comment-17076023
 ] 

Yanjia Gary Li commented on HUDI-69:


Hello [~bhasudha], I found your commit 
[https://github.com/apache/incubator-hudi/commit/d09eacdc13b9f19f69a317c8d08bda69a43678bc]
 which could be related to this ticket.

Is InputPathHandler able to provide MOR snapshot paths (avro + parquet)? If 
not, I could probably start from the path selector. 

To add Spark Datasource support for RealtimeUnmergedRecordReader, we may simply use 
the Spark SQL API to read the two formats separately and then union them together, 
as sketched below. Does that make sense? 

To merge them, I might need to dig deeper. 
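
A rough sketch of that unmerged read, assuming the delta records have already been 
extracted into plain Avro files (real Hudi log files are block-encoded, so this is 
only conceptual; paths are made up and the avro reader needs the spark-avro package):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("mor-unmerged-sketch").getOrCreate()

// Hypothetical inputs: parquet base files plus delta records already extracted as Avro
val baseDf = spark.read.parquet("/data/hudi_table/base")
val deltaDf = spark.read.format("avro").load("/data/hudi_table/delta_avro")

// Unmerged snapshot: project to the common columns and union, with no record-level merge
val commonCols = baseDf.columns.intersect(deltaDf.columns).map(col(_))
val unmergedDf = baseDf.select(commonCols: _*).union(deltaDf.select(commonCols: _*))
unmergedDf.show(10)
{code}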

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-644) checkpoint generator tool for delta streamer

2020-03-27 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-644:

Description: 
This ticket is to resolve the following problem:

The user has finished the initial load and written to the Hudi table.

The user would like to migrate to the Delta Streamer.

The user needs a tool to provide the checkpoint for the Delta Streamer on its first 
run.

  was:
This ticket is to resolve the following problem:

The user is using a homebrew Spark data source to read new data and write to 
Hudi table

The user would like to migrate to Delta Streamer

But the Delta Streamer only checks the last commit metadata, if there is no 
checkpoint info, then the Delta Streamer will use the default. For Kafka 
source, it is LATEST. 

The user would like to run the homebrew Spark data source reader and Delta 
Streamer in parallel to prevent data loss, but the Spark data source writer 
will make commit without checkpoint info, which will reset the delta streamer. 

So if we have an option to allow the user to retrieve the checkpoint from 
previous commits instead of the latest commit would be helpful for the 
migration. 


> checkpoint generator tool for delta streamer
> 
>
> Key: HUDI-644
> URL: https://issues.apache.org/jira/browse/HUDI-644
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This ticket is to resolve the following problem:
> The user has finished the initial load and write to Hudi table
> The user would like to migrate to Delta Streamer
> The user needs a tool to provide the checkpoint for the Delta Streamer in the 
> first run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-25 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Description: 
[https://github.com/uber/hudi/issues/136]

RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]

WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1]

  was:
[https://github.com/uber/hudi/issues/136]

RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]


> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-23 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091042#comment-17091042
 ] 

Yanjia Gary Li commented on HUDI-773:
-

Hello [~sasikumar.venkat], could you try the following:

Mount your storage account in Databricks:
{code:java}
// "configs" holds your storage credentials (e.g. OAuth or account key settings)
dbutils.fs.mount(
source = "abfss://x...@xxx.dfs.core.windows.net",
mountPoint = "/mountpoint",
extraConfigs = configs)
{code}
When writing to Hudi, use the abfss URL
{code:java}
save("abfss://<>.dfs.core.windows.net/hudi-tables/customer"){code}
When reading Hudi data, use the mount point
{code:java}
load("/mountpoint/hudi-tables/customer")
{code}
I believe this error could be related to the Databricks internal setup.

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-21 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Description: 
[https://github.com/uber/hudi/issues/136]

RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]

  was:https://github.com/uber/hudi/issues/136


> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-69) Support realtime view in Spark datasource #136

2020-05-04 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reopened HUDI-69:


> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (HUDI-69) Support realtime view in Spark datasource #136

2020-05-04 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Comment: was deleted

(was: Can anyone reopen this ticket? I accidentally closed this :))

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-05-04 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Status: Closed  (was: Patch Available)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-05-04 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099507#comment-17099507
 ] 

Yanjia Gary Li commented on HUDI-69:


Can anyone reopen this ticket? I accidentally closed this :)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats

2020-05-04 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-822:

Status: In Progress  (was: Open)

> Decouple hoodie related methods with Hoodie Input Formats
> -
>
> Key: HUDI-822
> URL: https://issues.apache.org/jira/browse/HUDI-822
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>
> In order to support multiple query engines, we need to generalize the Hudi 
> input format and Hudi record merging logic. And decouple from 
> MapredParquetInputFormat, which is depending on Hive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-05-04 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Description: 
[https://github.com/uber/hudi/issues/136]

RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]

PR: [https://github.com/apache/incubator-hudi/pull/1592]

  was:
[https://github.com/uber/hudi/issues/136]

RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]

WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1]


> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-05-04 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-69:
---
Status: Patch Available  (was: In Progress)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> WIP commit: [https://github.com/garyli1019/incubator-hudi/pull/1]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Fix Version/s: 0.5.3

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.5.3
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-528:

Fix Version/s: 0.5.3

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
> Fix For: 0.5.3
>
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-318) Update Migration Guide to Include Delta Streamer

2020-05-12 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-318:
---

Assignee: (was: Yanjia Gary Li)

> Update Migration Guide to Include Delta Streamer
> 
>
> Key: HUDI-318
> URL: https://issues.apache.org/jira/browse/HUDI-318
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Yanjia Gary Li
>Priority: Minor
>  Labels: doc
>
> [http://hudi.apache.org/migration_guide.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-110) Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer

2020-05-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-110:

Status: In Progress  (was: Open)

> Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Currently
> SlashEncodedDayPartitionValueExtractor is the default being used. This is not 
> a common format outside Uber.
>  
> Also, Spark DataSource provides partitionedBy clauses which has not been 
> integrated for Hudi Data Source.  We need to investigate how we can leverage 
> partitionBy clause for partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-890) Prepare for 0.5.3 patch release

2020-05-17 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109805#comment-17109805
 ] 

Yanjia Gary Li commented on HUDI-890:
-

Hi [~bhavanisudha], #1602 (HUDI-494: fix incorrect record size estimation) was 
pushed to 0.6.0. Thanks

> Prepare for 0.5.3 patch release
> ---
>
> Key: HUDI-890
> URL: https://issues.apache.org/jira/browse/HUDI-890
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.5.3
>
>
> The following commits are included in this release.
>  * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
> the inheritance chain
>  * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
>  * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
> NumericUtils.java
>  * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
> filterDupes is enabled on UPSERT mode.
>  * #1517 HUDI-799 Use appropriate FS when loading configs
>  * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
>  * #1394 HUDI-656[Performance] Return a dummy Spark relation after writing 
> the DataFrame
>  * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
>  * #1421 HUDI-724 Parallelize getSmallFiles for partitions
>  * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
> Date type columns
>  * #1413 Add constructor to HoodieROTablePathFilter
>  * #1415 HUDI-539 Make ROPathFilter conf member serializable
>  * #1578 Add changes for presto mor queries
>  * #1506 HUDI-782 Add support of Aliyun object storage service.
>  * #1432 HUDI-716 Exception: Not an Avro data file when running 
> HoodieCleanClient.runClean
>  * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
>  * #1448 [MINOR] Update DOAP with 0.5.2 Release
>  * #1466 HUDI-742 Fix Java Math Exception
>  * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
>  * #1427 HUDI-727: Copy default values of fields if not present when 
> rewriting incoming record with new schema
>  * #1515 HUDI-795 Handle auto-deleted empty aux folder
>  * #1547 [MINOR]: Fix cli docs for DeltaStreamer
>  * #1580 HUDI-852 adding check for table name for Append Save mode
>  * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
> HoodieGlobalBloomIndex class
>  * #1434 HUDI-616 Fixed parquet files getting created on local FS
>  * #1633 HUDI-858 Allow multiple operations to be executed within a single 
> commit
>  * #1634 HUDI-846Enable Incremental cleaning and embedded timeline-server by 
> default
>  * #1596 HUDI-863 get decimal properties from derived spark DataType
>  * #1602 HUDI-494 fix incorrect record size estimation
>  * #1636 HUDI-895 Remove unnecessary listing .hoodie folder when using 
> timeline server
>  * #1584 HUDI-902 Avoid exception when getSchemaProvider
>  * #1612 HUDI-528 Handle empty commit in incremental pulling
>  * #1511 HUDI-789Adjust logic of upsert in HDFSParquetImporter
>  * #1627 HUDI-889 Writer supports useJdbc configuration when hive 
> synchronization is enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Fix Version/s: (was: 0.5.3)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-905) Support native filter pushdown for Spark Datasource

2020-05-17 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-905:
---

 Summary: Support native filter pushdown for Spark Datasource
 Key: HUDI-905
 URL: https://issues.apache.org/jira/browse/HUDI-905
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101207#comment-17101207
 ] 

Yanjia Gary Li commented on HUDI-494:
-

 

Commit 1:
{code:java}

"partitionToWriteStats" : {
"year=2020/month=5/day=0/hour=0" : [ {
  "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-112-1773_20200504101048.parquet",
  "prevCommit" : "null",
  "numWrites" : 21,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 21,
  "totalWriteBytes" : 14397559,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14397559
}
{code}
Commit 2:
{code:java}
  "partitionToWriteStats" : {
"year=2020/month=5/day=0/hour=0" : [ {
  "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-248-163129_20200505023830.parquet",
  "prevCommit" : "20200504101048",
  "numWrites" : 12817,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 12796,
  "totalWriteBytes" : 16297335,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 16297335
}, {
  "fileId" : "9d0c9e79-00dd-41d2-a217-0944f8428e1c-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/9d0c9e79-00dd-41d2-a217-0944f8428e1c-0_1-248-163130_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 200,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 200,
  "totalWriteBytes" : 14428883,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14428883
}, {
  "fileId" : "5990beb4-bd0c-40c9-84f1-a4107287971e-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/5990beb4-bd0c-40c9-84f1-a4107287971e-0_2-248-163131_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 198,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 198,
  "totalWriteBytes" : 14428338,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14428338
}, {
  "fileId" : "673c5550-39c3-4611-ac68-bc0c7da065e2-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/673c5550-39c3-4611-ac68-bc0c7da065e2-0_3-248-163132_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 179,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 179,
  "totalWriteBytes" : 14425571,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14425571
}
{code}
 

 

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> 

[jira] [Assigned] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-528:
---

Assignee: Yanjia Gary Li

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Status: In Progress  (was: Open)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-528:

Status: In Progress  (was: Open)

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-15 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-528.
-
Resolution: Fixed

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
> Fix For: 0.5.3
>
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100967#comment-17100967
 ] 

Yanjia Gary Li commented on HUDI-494:
-

Hi folks, this issue seems to be coming back again...

!example2_hdfs.png!

!example2_sparkui.png!

A very small (2 GB) upsert job creates 60,000+ files in a single partition and 
gets stuck for 10+ hours. I believe there might be a bug in the bloom indexing 
stage.
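
For reference, this is roughly how I spot-check the file count in the affected 
partition (the path below is just an example, not the real table location):

{code:scala}
// Quick check of file count / total size under one Hudi partition.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val partition = new Path("hdfs:///data/hudi_table/day=05")   // example path
val parquetFiles = fs.listStatus(partition).filter(_.getPath.getName.endsWith(".parquet"))
println(s"files=${parquetFiles.length}, totalBytes=${parquetFiles.map(_.getLen).sum}")
{code}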

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with parallelism set 
> to 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Status: Open  (was: New)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with parallelism set 
> to 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-494:
---

Assignee: Yanjia Gary Li  (was: Vinoth Chandar)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with parallelism set 
> to 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101055#comment-17101055
 ] 

Yanjia Gary Li commented on HUDI-494:
-

Ok, I see what happened here. Root cause is 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]

So basically commit 1 wrote a very small file (let's say 200 records) to a new 
partition, day=05. Then, when commit 2 tries to write to day=05, it looks up 
the affected partition and uses the Bloom index range from the existing files, 
so it uses 200 here. Commit 2 has many more records than 200, so it creates 
tons of files since the Bloom index range is too small.

I am not really familiar with the indexing part of the code. Please let me know 
if I understand this correctly and we can figure out a fix. [~lamber-ken] 
[~vinoth]

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with parallelism set 
> to 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Attachment: example2_hdfs.png

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with parallelism set 
> to 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Attachment: example2_sparkui.png

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with parallelism set 
> to 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-07 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101055#comment-17101055
 ] 

Yanjia Gary Li edited comment on HUDI-494 at 5/8/20, 1:38 AM:
--

-Ok, I see what happened here. Root cause is 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]-

So basically commit 1 wrote a very small file (let's say 200 records) to a new 
partition, day=05. Then, when commit 2 tries to write, it looks back at commit 1 
to estimate the size of every record, but because commit 1 has so few records 
the estimate is inaccurate and way too big. Hudi then calculates records-per-file 
using that inflated record size and ends up with a very small records-per-file 
value. This leads to many small files. 
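
To make the effect concrete, a back-of-the-envelope sketch; all numbers below 
are assumed, just to illustrate the sizing math as I understand it:

{code:scala}
// Assumed numbers for illustration only -- not from the actual job.
val commit1Bytes        = 4L * 1024 * 1024                     // one ~4 MB parquet file, mostly format overhead
val commit1Records      = 200L                                 // very few records in it
val estBytesPerRecord   = commit1Bytes / commit1Records        // ~20 KB "per record", wildly inflated

val maxFileSizeBytes    = 120L * 1024 * 1024                   // parquet max file size limit
val recordsPerNewFile   = maxFileSizeBytes / estBytesPerRecord // only ~6,000 records per file

val commit2Records      = 20L * 1000 * 1000                    // the next upsert brings ~20M records
val newFilesCreated     = commit2Records / recordsPerNewFile   // ~3,300 small files instead of a handful
println(s"estBytesPerRecord=$estBytesPerRecord, recordsPerNewFile=$recordsPerNewFile, newFiles=$newFilesCreated")
{code}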


was (Author: garyli1019):
Ok, I see what happened here. Root cause is 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]

So basically commit 1 wrote a very small file (let's say 200 records) to a new 
partition, day=05. Then, when commit 2 tries to write to day=05, it looks up 
the affected partition and uses the Bloom index range from the existing files, 
so it uses 200 here. Commit 2 has many more records than 200, so it creates 
tons of files since the Bloom index range is too small.

I am not really familiar with the indexing part of the code. Please let me know 
if I understand this correctly and we can figure out a fix. [~lamber-ken] 
[~vinoth]

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with parallelism set 
> to 10. I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer 
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Priority: Minor  (was: Major)

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter.
> If we want to use Spark predicate pushdown in a native way, we need to 
> implement PrunedFilteredScan for the Hudi Datasource.
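
For context, a minimal sketch of what a relation implementing PrunedFilteredScan 
could look like; only the trait and the buildScan signature come from Spark SQL, 
the class and helper names here are illustrative, not the actual Hudi code:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Illustrative placeholder relation; readFiles stands in for whatever reads
// the Hudi file slices with the given column pruning and filters applied.
class HudiPrunedFilteredRelation(
    val sqlContext: SQLContext,
    tableSchema: StructType,
    readFiles: (Array[String], Array[Filter]) => RDD[Row])
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = tableSchema

  // Spark passes in only the columns and filters the query actually needs,
  // so the relation can prune parquet columns and skip files instead of
  // scanning everything and filtering afterwards.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    readFiles(requiredColumns, filters)
}
{code}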



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Status: Open  (was: New)

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter.
> If we want to use Spark predicate pushdown in a native way, we need to 
> implement PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-05-20 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-905:

Component/s: Spark Integration

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down the filter.
> If we want to use Spark predicate pushdown in a native way, we need to 
> implement PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   >