[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-03-01 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-597:

Description: 
My use case only needs the incremental changes from certain partitions, but 
today I have to run the incremental pull against the entire dataset first and 
then filter the result in Spark.

If the folder partitions could be used directly as part of the input path, the 
query could run faster by loading only the relevant parquet files.

Example:

 
{code:java}
spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
  .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/year=2016/*/*/*")
  .load(path)
{code}
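
To illustrate why a path glob helps, here is a minimal standalone sketch of how a pattern like "/year=2016/*/*/*" narrows a file list before anything is read. It is not Hudi code and does not claim to show how the option is implemented. The class name PartitionGlobSketch, the select helper, and the year/month/day file layout are all hypothetical, and the JDK glob matcher stands in for whatever matching the data source actually uses.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: this is NOT Hudi code. It shows how a partition glob
// narrows a list of file paths so that only matching files need to be loaded.
class PartitionGlobSketch {

    // Returns the paths (relative to a table base path) that match the glob.
    // In a JDK glob, '*' matches within one path component and never crosses '/'.
    static List<String> select(List<String> relativePaths, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String p : relativePaths) {
            if (matcher.matches(Paths.get(p))) {
                selected.add(p);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical layout: /year=YYYY/<month>/<day>/<file>
        List<String> files = Arrays.asList(
            "/year=2016/01/15/a.parquet",
            "/year=2016/02/01/b.parquet",
            "/year=2017/01/15/c.parquet");
        // Only the year=2016 files are kept; the 2017 file is never touched.
        System.out.println(select(files, "/year=2016/*/*/*"));
    }
}
```

The point of the sketch is only that the filtering happens on paths, before any parquet file is opened, rather than on rows after the whole dataset has been pulled.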
 

  was:
My use case only needs the incremental changes from certain partitions, but 
today I have to run the incremental pull against the entire dataset first and 
then filter the result in Spark.

If the folder partitions could be used directly as part of the input path, the 
query could run faster by loading only the relevant parquet files.

Example:

 
{code:java}
spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
  .load(path, "year=2020/*/*/*")
{code}
 


> Enable incremental pulling from defined partitions
> --------------------------------------------------
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.5.2
>
> Time Spent: 20m
> Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-27 Thread Yanjia Gary Li (Jira)



Yanjia Gary Li updated HUDI-597:

Fix Version/s: 0.5.2

> Enable incremental pulling from defined partitions
> --------------------------------------------------
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.5.2
>
> Time Spent: 20m
> Remaining Estimate: 0h





[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-27 Thread Yanjia Gary Li (Jira)



Yanjia Gary Li updated HUDI-597:

Status: Open  (was: New)

> Enable incremental pulling from defined partitions
> --------------------------------------------------
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h





[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-27 Thread Yanjia Gary Li (Jira)



Yanjia Gary Li updated HUDI-597:

Status: In Progress  (was: Open)

> Enable incremental pulling from defined partitions
> --------------------------------------------------
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h





[jira] [Updated] (HUDI-597) Enable incremental pulling from defined partitions

2020-02-21 Thread ASF GitHub Bot (Jira)



ASF GitHub Bot updated HUDI-597:

Labels: pull-request-available  (was: )

> Enable incremental pulling from defined partitions
> --------------------------------------------------
>
> Key: HUDI-597
> URL: https://issues.apache.org/jira/browse/HUDI-597
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Minor
> Labels: pull-request-available


