[DISCUSS] Default partition path in TimestampBasedKeyGenerator

2019-12-11 Thread Pratyaksh Sharma
Hi,

When the value for the configured partitionPathField is not present, every
key generator class falls back to the default partition path, except
TimestampBasedKeyGenerator. In TimestampBasedKeyGenerator, we directly
throw an exception if the value is null.

I wanted to know whether this behaviour is intentional. Ideally, we should
handle such cases gracefully everywhere.
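
To illustrate the graceful handling I have in mind, here is a minimal sketch. It is not the actual TimestampBasedKeyGenerator code: the class name, the method name, the "default" fallback value and the yyyy/MM/dd output format are all assumptions made for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch; none of these names are taken from the Hudi codebase.
class TimestampPartitionSketch {
  // Assumed fallback value, standing in for the "default partition path".
  static final String DEFAULT_PARTITION_PATH = "default";

  // Assumed output format for the timestamp-based partition path.
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

  // Returns a date-based partition path, or the default when the field is null.
  static String partitionPath(Long epochSeconds) {
    if (epochSeconds == null) {
      return DEFAULT_PARTITION_PATH; // fall back instead of throwing
    }
    return FORMAT.format(Instant.ofEpochSecond(epochSeconds));
  }

  public static void main(String[] args) {
    System.out.println(partitionPath(1576022400L)); // midnight UTC on 2019-12-11
    System.out.println(partitionPath(null));
  }
}
```

Whether falling back silently is preferable to failing fast is exactly the question for this thread; the sketch only shows what the fallback would look like.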


Re: [DISCUSS] Next Apache Release

2019-12-11 Thread Balaji Varadarajan
+1 from me as well for having @leesf be the release manager for 0.5.1.
@leesf - Appreciate your spirit in helping Hudi community.

Balaji.V

On Wednesday, December 11, 2019, 06:52:21 PM PST, Vinoth Chandar wrote:
 
 +1 for leesf, driving the release..

From http://www.apache.org/dev/release-publishing.html#release_manager, it
does explicitly confirm that any committer can be RM.
I am happy to volunteer my services to assist leesf in the process.

@all : Please speak up if you have concerns with the release/timelines.

Side note: There are OPEN jiras here
https://issues.apache.org/jira/issues/?jql=project%20%3D%20HUDI%20AND%20fixVersion%20%3D%200.5.1
,
targeted against the release. So if you are interested, please grab them so
we can expedite our progress.

On Wed, Dec 11, 2019 at 5:27 PM leesf  wrote:

> Hi Balaji,
>
> Thanks for kicking the discussion off.
>
> +1 to release next version as we made many improvements since last released
> version and Jan is reasonable considering the upcoming holidays.
>
> Besides I am wondering if I can be the release manager of 0.5.1 to work
> with you. It is always meaningful to help the community as much as
> possible.
>
> Best,
> Leesf
>
> On Thu, Dec 12, 2019 at 1:08 AM Balaji Varadarajan wrote:
>
> > Hello all,
> >
> > In the spirit of making Apache Hudi (incubating) releases at regular
> > cadence, we are starting this thread to kickstart the planning and
> > preparatory work for next release (0.5.1).
> >
> > As discussed in yesterday's meeting, the current plan is to have a release
> > by end of Jan 2020.
> >
> > As described in the release guide (see References), the first step would be
> > to identify the release manager for 0.5.1. This is a consensus-based decision
> > of the entire community. The only requirement is that the release manager
> > be an Apache Hudi Committer as they have permissions to perform some of the
> > release manager's work. The committer would still need to work with PPMC to
> > write to Apache release repositories.
> >
> > There’s no formal process, no vote requirements, and no timing requirements
> > when identifying release manager. Any objections should be resolved by
> > consensus before starting the release.
> >
> > In general, the community prefers to have a rotating set of 3-5 Release
> > Managers. Keeping a small core set of managers allows enough people to
> > build expertise in this area and improve processes over time, without
> > Release Managers needing to re-learn the processes for each release. That
> > said, if you are a committer interested in serving the community in this
> > way, please reach out to the community on the dev@ mailing list.
> >
> > If any Hudi committer is interested in being the next release manager,
> > please reply to this email.
> >
> > References:
> > Planned Tickets: Jira Tickets
> > <https://jira.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+fixVersion+%3D+0.5.1>
> > Release Guide: Release Guide
> > <https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+%28incubating%29+-+Release+Guide>
> >
> > Thanks,
> > Balaji.V
> > (On behalf of Apache Hudi PPMC)
> >
>  

Re: [DISCUSS] Next Apache Release

2019-12-11 Thread Pratyaksh Sharma
Hi Vinoth,

We are also targeting HUDI-288 as part of the 0.5.1 release. I will change
the fix version of that jira as well. Right now, it is not included in the
list you shared above.

On Thu, Dec 12, 2019 at 8:22 AM Vinoth Chandar  wrote:

> +1 for leesf, driving the release..
>
> From http://www.apache.org/dev/release-publishing.html#release_manager, it
> does explicitly confirm that any committer can be RM.
> I am happy to volunteer my services to assist leesf in the process.
>
> @all : Please speak up if you have concerns with the release/timelines.
>
> Side note: There are OPEN jiras here
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20HUDI%20AND%20fixVersion%20%3D%200.5.1
> ,
> targeted against the release. So if you are interested, please grab them so
> we can expedite our progress.
>
> On Wed, Dec 11, 2019 at 5:27 PM leesf  wrote:
>
> > Hi Balaji,
> >
> > Thanks for kicking the discussion off.
> >
> > +1 to release next version as we made many improvements since last released
> > version and Jan is reasonable considering the upcoming holidays.
> >
> > Besides I am wondering if I can be the release manager of 0.5.1 to work
> > with you. It is always meaningful to help the community as much as
> > possible.
> >
> > Best,
> > Leesf
> >
> > On Thu, Dec 12, 2019 at 1:08 AM Balaji Varadarajan wrote:
> >
> > > Hello all,
> > >
> > > In the spirit of making Apache Hudi (incubating) releases at regular
> > > cadence, we are starting this thread to kickstart the planning and
> > > preparatory work for next release (0.5.1).
> > >
> > > As discussed in yesterday's meeting, the current plan is to have a release
> > > by end of Jan 2020.
> > >
> > > As described in the release guide (see References), the first step would be
> > > to identify the release manager for 0.5.1. This is a consensus-based decision
> > > of the entire community. The only requirement is that the release manager
> > > be an Apache Hudi Committer as they have permissions to perform some of the
> > > release manager's work. The committer would still need to work with PPMC to
> > > write to Apache release repositories.
> > >
> > > There’s no formal process, no vote requirements, and no timing requirements
> > > when identifying release manager. Any objections should be resolved by
> > > consensus before starting the release.
> > >
> > > In general, the community prefers to have a rotating set of 3-5 Release
> > > Managers. Keeping a small core set of managers allows enough people to
> > > build expertise in this area and improve processes over time, without
> > > Release Managers needing to re-learn the processes for each release. That
> > > said, if you are a committer interested in serving the community in this
> > > way, please reach out to the community on the dev@ mailing list.
> > >
> > > If any Hudi committer is interested in being the next release manager,
> > > please reply to this email.
> > >
> > > References:
> > > Planned Tickets: Jira Tickets
> > > <https://jira.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+fixVersion+%3D+0.5.1>
> > > Release Guide: Release Guide
> > > <https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+%28incubating%29+-+Release+Guide>
> > >
> > > Thanks,
> > > Balaji.V
> > > (On behalf of Apache Hudi PPMC)
> > >
> >
>


Re:Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread lamberken


Hi @vinoth,


1. Currently, the Hoodie*Config classes are only used to set default values
when their build method is called.
They will be replaced by HoodieMemoryOptions, HoodieIndexOptions,
HoodieHBaseIndexOptions, etc.
2. I don't understand the question "It is not clear to me whether there is any
external facing changes which changes this model."


Best,
lamber-ken


At 2019-12-12 11:01:36, "Vinoth Chandar"  wrote:
>I actually prefer the builder pattern for making the configs, because I can
>do `builder.` in the IDE and actually see all the options... That said,
>most developers program against the Spark datasource and so this may not be
>useful, unless we expose a builder for that.. I will concede that since its
>also subjective anyway.
>
>But, to clarify Siva's question, you do intend to keep the different
>component level config classes right - HoodieIndexConfig,
>HoodieCompactionConfig?
>
>Once again, can you please explicitly address the following question, so we
>can get on the same page?
>>> It is not clear to me whether there is any external facing changes which
>changes this model.
>This is still the most critical question from both me and balaji.
>
>On Wed, Dec 11, 2019 at 11:35 AM lamberken  wrote:
>
>>  hi, @Sivabalan
>>
>> Yes, thanks very much for help me explain my initial proposal.
>>
>>
>> Answer your question, we can call HoodieWriteConfig as a SystemConfig, we
>> need to pass it everywhere.
>> Actually, it may just contains a few custom configurations( does not
>> include default configurations)
>> Because each component has its own ConfigOptions.
>>
>>
>> The old version HoodieWriteConfig includes all keys(custom configurations,
>> default configurations), it is a fat.
>>
>>
>> Best,
>> lamber-ken
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2019-12-12 03:14:11, "Sivabalan"  wrote:
>> >Let me summarize your initial proposal and then will get into details.
>> >- Introduce ConfigOptions for ease of handling of default values.
>> >- Remove all Hoodie*Config classes and just have HoodieWriteConfig. What
>> >this means is that, every other config file will be replaced by
>> >ConfigOptions. eg, HoodieIndexConfigOption, HoodieCompactionConfigOption,
>> >etc.
>> >- Config option will take care of returning defaults for any property, even
>> >if an entire Config(eg IndexConfig) is not explicitly set.
>> >
>> >Here are the positives I see.
>> >- By way of having component level ConfigOptions, we bucketize the configs
>> >and have defaults set(same as before)
>> >- User doesn't need to set each component's config(eg IndexConfig)
>> >explicitly with HoodieWriteConfig.
>> >
>> >But have one question:
>> >- I see Bucketizing only in write path. How does one get hold of
>> >IndexConfigOptions as a consumer?  For eg, If some class is using just
>> >IndexConfig alone, how will it consume? From your eg, I see only
>> >HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then?
>> >Wouldn't that contradicts your initial proposal to not have a fat config
>> >class? May be can you expand your example below to show how a consumer of
>> >IndexConfig look like.
>> >
>> >Your eg:
>> >/**
>> > * New version
>> > */
>> >// set value overrite the default value
>> >HoodieWriteConfig config = new HoodieWriteConfig();
>> >config.set(HoodieIndexConfigOptions.INDEX_TYPE,
>> >HoodieIndex.IndexType.HBASE.name())
>> >
>> >
>> >
>> >
>> >On Wed, Dec 11, 2019 at 8:33 AM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi,
>> >>
>> >>
>> >>
>> >>
>> >> On 1,2. Yes, you are right, moving the getter to the component level
>> >> Config class itself.
>> >>
>> >>
>> >> On 3, HoodieWriteConfig can also set value through ConfigOption, small
>> >> code snippets.
>> >> From the bellow snippets, we can see that clients need to know each
>> >> component's builders
>> >> and also call their "with" methods to override the default value in old
>> >> version.
>> >>
>> >>
>> >> But, in new version, clients just need to know each component's public
>> >> config options, just like constants.
>> >> So, these builders are redundant.
>> >>
>> >>
>> >>
>> >> /---/
>> >>
>> >>
>> >> public class HoodieIndexConfigOptions {
>> >>   public static final ConfigOption INDEX_TYPE = ConfigOption
>> >>   .key("hoodie.index.type")
>> >>   .defaultValue(HoodieIndex.IndexType.BLOOM.name());
>> >> }
>> >>
>> >>
>> >> public class HoodieWriteConfig {
>> >>   public void setString(ConfigOption option, String value) {
>> >> this.props.put(option.key(), value);
>> >>   }
>> >> }
>> >>
>> >>
>> >>
>> >>
>> >> /**
>> >>  * New version
>> >>  */
>> >> // set value overrite the default value
>> >> HoodieWriteConfig config = new HoodieWriteConfig();
>> >> config.set(HoodieIndexConfigOptions.INDEX_TYPE,
>> >> HoodieIndex.IndexType.HBASE.name())
>> >>
>> >>
>> >>
>> >>
>> >> /**
>> >>  * Old version
>> >>  */
>> >> HoodieWriteConfig.Builder buil

Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread Vinoth Chandar
I actually prefer the builder pattern for making the configs, because I can
type `builder.` in the IDE and actually see all the options. That said,
most developers program against the Spark datasource, so this may not be
useful unless we expose a builder for that. I will concede this point, since
it's subjective anyway.

But, to clarify Siva's question, you do intend to keep the different
component-level config classes, right (HoodieIndexConfig,
HoodieCompactionConfig)?

Once again, can you please explicitly address the following question, so we
can get on the same page?
>> It is not clear to me whether there is any external facing changes which
>> changes this model.
This is still the most critical question from both me and Balaji.
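
For anyone skimming the thread, the builder style I am referring to can be sketched roughly as follows. These are simplified stand-ins written for illustration, not the real HoodieIndexConfig; the class names and the "BLOOM" default are assumptions.

```java
// Hypothetical stand-ins for illustration; these are not the real Hudi classes.
class BuilderSketch {
  static final class IndexConfig {
    private final String indexType;

    private IndexConfig(String indexType) {
      this.indexType = indexType;
    }

    String getIndexType() {
      return indexType;
    }

    static Builder newBuilder() {
      return new Builder();
    }

    static final class Builder {
      // The builder owns the default, so callers only override what they need.
      private String indexType = "BLOOM";

      Builder withIndexType(String indexType) {
        this.indexType = indexType;
        return this;
      }

      IndexConfig build() {
        return new IndexConfig(indexType);
      }
    }
  }

  public static void main(String[] args) {
    IndexConfig defaults = IndexConfig.newBuilder().build();
    IndexConfig hbase = IndexConfig.newBuilder().withIndexType("HBASE").build();
    System.out.println(defaults.getIndexType());
    System.out.println(hbase.getIndexType());
  }
}
```

The discoverability argument is that every `with...` method shows up in IDE completion on the builder, whereas option constants have to be known by name.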

On Wed, Dec 11, 2019 at 11:35 AM lamberken  wrote:

>  hi, @Sivabalan
>
> Yes, thanks very much for help me explain my initial proposal.
>
>
> Answer your question, we can call HoodieWriteConfig as a SystemConfig, we
> need to pass it everywhere.
> Actually, it may just contains a few custom configurations( does not
> include default configurations)
> Because each component has its own ConfigOptions.
>
>
> The old version HoodieWriteConfig includes all keys(custom configurations,
> default configurations), it is a fat.
>
>
> Best,
> lamber-ken
>
>
>
>
>
>
>
>
> At 2019-12-12 03:14:11, "Sivabalan"  wrote:
> >Let me summarize your initial proposal and then will get into details.
> >- Introduce ConfigOptions for ease of handling of default values.
> >- Remove all Hoodie*Config classes and just have HoodieWriteConfig. What
> >this means is that, every other config file will be replaced by
> >ConfigOptions. eg, HoodieIndexConfigOption, HoodieCompactionConfigOption,
> >etc.
> >- Config option will take care of returning defaults for any property, even
> >if an entire Config(eg IndexConfig) is not explicitly set.
> >
> >Here are the positives I see.
> >- By way of having component level ConfigOptions, we bucketize the configs
> >and have defaults set(same as before)
> >- User doesn't need to set each component's config(eg IndexConfig)
> >explicitly with HoodieWriteConfig.
> >
> >But have one question:
> >- I see Bucketizing only in write path. How does one get hold of
> >IndexConfigOptions as a consumer?  For eg, If some class is using just
> >IndexConfig alone, how will it consume? From your eg, I see only
> >HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then?
> >Wouldn't that contradicts your initial proposal to not have a fat config
> >class? May be can you expand your example below to show how a consumer of
> >IndexConfig look like.
> >
> >Your eg:
> >/**
> > * New version
> > */
> >// set value overrite the default value
> >HoodieWriteConfig config = new HoodieWriteConfig();
> >config.set(HoodieIndexConfigOptions.INDEX_TYPE,
> >HoodieIndex.IndexType.HBASE.name())
> >
> >
> >
> >
> >On Wed, Dec 11, 2019 at 8:33 AM lamberken  wrote:
> >
> >>
> >>
> >> Hi,
> >>
> >>
> >>
> >>
> >> On 1,2. Yes, you are right, moving the getter to the component level
> >> Config class itself.
> >>
> >>
> >> On 3, HoodieWriteConfig can also set value through ConfigOption, small
> >> code snippets.
> >> From the bellow snippets, we can see that clients need to know each
> >> component's builders
> >> and also call their "with" methods to override the default value in old
> >> version.
> >>
> >>
> >> But, in new version, clients just need to know each component's public
> >> config options, just like constants.
> >> So, these builders are redundant.
> >>
> >>
> >>
> >> /---/
> >>
> >>
> >> public class HoodieIndexConfigOptions {
> >>   public static final ConfigOption INDEX_TYPE = ConfigOption
> >>   .key("hoodie.index.type")
> >>   .defaultValue(HoodieIndex.IndexType.BLOOM.name());
> >> }
> >>
> >>
> >> public class HoodieWriteConfig {
> >>   public void setString(ConfigOption option, String value) {
> >> this.props.put(option.key(), value);
> >>   }
> >> }
> >>
> >>
> >>
> >>
> >> /**
> >>  * New version
> >>  */
> >> // set value overrite the default value
> >> HoodieWriteConfig config = new HoodieWriteConfig();
> >> config.set(HoodieIndexConfigOptions.INDEX_TYPE,
> >> HoodieIndex.IndexType.HBASE.name())
> >>
> >>
> >>
> >>
> >> /**
> >>  * Old version
> >>  */
> >> HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder()
> >>
> >>
> >> builder.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
> >>
> >>
> >>
> >>
> >> /---/
> >>
> >>
> >> Another, users use hudi like bellow, here're all keys.
> >>
> >>
> >> /---/
> >>
> >>
> >> df.write.format("hudi").
> >> option("hoodie.insert.shuffle.parallelism", "10").
> >> option("hoodie.upsert.shuffle.parallel

Re: [DISCUSS] Next Apache Release

2019-12-11 Thread Vinoth Chandar
+1 for leesf, driving the release..

From http://www.apache.org/dev/release-publishing.html#release_manager, it
does explicitly confirm that any committer can be RM.
I am happy to volunteer my services to assist leesf in the process.

@all : Please speak up if you have concerns with the release/timelines.

Side note: There are OPEN jiras here
https://issues.apache.org/jira/issues/?jql=project%20%3D%20HUDI%20AND%20fixVersion%20%3D%200.5.1
,
targeted against the release. So if you are interested, please grab them so
we can expedite our progress.

On Wed, Dec 11, 2019 at 5:27 PM leesf  wrote:

> Hi Balaji,
>
> Thanks for kicking the discussion off.
>
> +1 to release next version as we made many improvements since last released
> version and Jan is reasonable considering the upcoming holidays.
>
> Besides I am wondering if I can be the release manager of 0.5.1 to work
> with you. It is always meaningful to help the community as much as
> possible.
>
> Best,
> Leesf
>
> On Thu, Dec 12, 2019 at 1:08 AM Balaji Varadarajan wrote:
>
> > Hello all,
> >
> > In the spirit of making Apache Hudi (incubating) releases at regular
> > cadence, we are starting this thread to kickstart the planning and
> > preparatory work for next release (0.5.1).
> >
> > As discussed in yesterday's meeting, the current plan is to have a release
> > by end of Jan 2020.
> >
> > As described in the release guide (see References), the first step would be
> > to identify the release manager for 0.5.1. This is a consensus-based decision
> > of the entire community. The only requirement is that the release manager
> > be an Apache Hudi Committer as they have permissions to perform some of the
> > release manager's work. The committer would still need to work with PPMC to
> > write to Apache release repositories.
> >
> > There’s no formal process, no vote requirements, and no timing requirements
> > when identifying release manager. Any objections should be resolved by
> > consensus before starting the release.
> >
> > In general, the community prefers to have a rotating set of 3-5 Release
> > Managers. Keeping a small core set of managers allows enough people to
> > build expertise in this area and improve processes over time, without
> > Release Managers needing to re-learn the processes for each release. That
> > said, if you are a committer interested in serving the community in this
> > way, please reach out to the community on the dev@ mailing list.
> >
> > If any Hudi committer is interested in being the next release manager,
> > please reply to this email.
> >
> > References:
> > Planned Tickets: Jira Tickets
> > <https://jira.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+fixVersion+%3D+0.5.1>
> > Release Guide: Release Guide
> > <https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+%28incubating%29+-+Release+Guide>
> >
> > Thanks,
> > Balaji.V
> > (On behalf of Apache Hudi PPMC)
> >
>


Re: [DISCUSS] Next Apache Release

2019-12-11 Thread leesf
Hi Balaji,

Thanks for kicking the discussion off.

+1 to releasing the next version, as we have made many improvements since the
last released version, and Jan is reasonable considering the upcoming holidays.

Besides, I am wondering if I can be the release manager for 0.5.1 and work
with you. It is always meaningful to help the community as much as possible.

Best,
Leesf

On Thu, Dec 12, 2019 at 1:08 AM Balaji Varadarajan wrote:

> Hello all,
>
> In the spirit of making Apache Hudi (incubating) releases at regular
> cadence, we are starting this thread to kickstart the planning and
> preparatory work for next release (0.5.1).
>
> As discussed in yesterday's meeting, the current plan is to have a release
> by end of Jan 2020.
>
> As described in the release guide (see References), the first step would be
> to identify the release manager for 0.5.1. This is a consensus-based decision
> of the entire community. The only requirement is that the release manager
> be an Apache Hudi Committer as they have permissions to perform some of the
> release manager's work. The committer would still need to work with PPMC to
> write to Apache release repositories.
>
> There’s no formal process, no vote requirements, and no timing requirements
> when identifying release manager. Any objections should be resolved by
> consensus before starting the release.
>
> In general, the community prefers to have a rotating set of 3-5 Release
> Managers. Keeping a small core set of managers allows enough people to
> build expertise in this area and improve processes over time, without
> Release Managers needing to re-learn the processes for each release. That
> said, if you are a committer interested in serving the community in this
> way, please reach out to the community on the dev@ mailing list.
>
> If any Hudi committer is interested in being the next release manager,
> please reply to this email.
>
> References:
> Planned Tickets: Jira Tickets
> <https://jira.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+fixVersion+%3D+0.5.1>
> Release Guide: Release Guide
> <https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+%28incubating%29+-+Release+Guide>
>
> Thanks,
> Balaji.V
> (On behalf of Apache Hudi PPMC)
>


Re:Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread lamberken
Hi @Sivabalan,

Yes, thanks very much for helping me explain my initial proposal.


To answer your question: we can treat HoodieWriteConfig as a SystemConfig
that we need to pass everywhere.
In practice, it may contain just a few custom configurations (it does not
include the default configurations),
because each component has its own ConfigOptions.


The old version of HoodieWriteConfig includes all keys (custom configurations
and default configurations), so it is fat.
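
To make this concrete, below is a rough sketch of the idea, loosely following the snippets quoted further down. It is not the code from the sample PR: the getString method, the wrapper class names and the use of plain strings for the index type are assumptions for illustration.

```java
import java.util.Properties;

// Hypothetical sketch; simplified from the idea in this thread, not the PR code.
class ConfigOptionSketch {
  // An option is a key plus its default value, declared once as a constant.
  static final class ConfigOption {
    private final String key;
    private final String defaultValue;

    private ConfigOption(String key, String defaultValue) {
      this.key = key;
      this.defaultValue = defaultValue;
    }

    static Builder key(String key) {
      return new Builder(key);
    }

    String key() {
      return key;
    }

    String defaultValue() {
      return defaultValue;
    }

    static final class Builder {
      private final String key;

      Builder(String key) {
        this.key = key;
      }

      ConfigOption defaultValue(String value) {
        return new ConfigOption(key, value);
      }
    }
  }

  // Component-level option, in the style of HoodieIndexConfigOptions.
  static final ConfigOption INDEX_TYPE =
      ConfigOption.key("hoodie.index.type").defaultValue("BLOOM");

  // The config object stores only explicit overrides; defaults live in the options.
  static final class WriteConfig {
    private final Properties props = new Properties();

    void setString(ConfigOption option, String value) {
      props.put(option.key(), value);
    }

    String getString(ConfigOption option) {
      return props.getProperty(option.key(), option.defaultValue());
    }

    int overrideCount() {
      return props.size();
    }
  }

  public static void main(String[] args) {
    WriteConfig config = new WriteConfig();
    System.out.println(config.getString(INDEX_TYPE)); // no override stored yet
    config.setString(INDEX_TYPE, "HBASE");
    System.out.println(config.getString(INDEX_TYPE));
  }
}
```

With this shape, the config that gets passed around holds only the keys a user actually set, and each component resolves its own defaults from its option constants.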


Best,
lamber-ken








At 2019-12-12 03:14:11, "Sivabalan"  wrote:
>Let me summarize your initial proposal and then will get into details.
>- Introduce ConfigOptions for ease of handling of default values.
>- Remove all Hoodie*Config classes and just have HoodieWriteConfig. What
>this means is that, every other config file will be replaced by
>ConfigOptions. eg, HoodieIndexConfigOption, HoodieCompactionConfigOption,
>etc.
>- Config option will take care of returning defaults for any property, even
>if an entire Config(eg IndexConfig) is not explicitly set.
>
>Here are the positives I see.
>- By way of having component level ConfigOptions, we bucketize the configs
>and have defaults set(same as before)
>- User doesn't need to set each component's config(eg IndexConfig)
>explicitly with HoodieWriteConfig.
>
>But have one question:
>- I see Bucketizing only in write path. How does one get hold of
>IndexConfigOptions as a consumer?  For eg, If some class is using just
>IndexConfig alone, how will it consume? From your eg, I see only
>HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then?
>Wouldn't that contradicts your initial proposal to not have a fat config
>class? May be can you expand your example below to show how a consumer of
>IndexConfig look like.
>
>Your eg:
>/**
> * New version
> */
>// set value overrite the default value
>HoodieWriteConfig config = new HoodieWriteConfig();
>config.set(HoodieIndexConfigOptions.INDEX_TYPE,
>HoodieIndex.IndexType.HBASE.name())
>
>
>
>
>On Wed, Dec 11, 2019 at 8:33 AM lamberken  wrote:
>
>>
>>
>> Hi,
>>
>>
>>
>>
>> On 1,2. Yes, you are right, moving the getter to the component level
>> Config class itself.
>>
>>
>> On 3, HoodieWriteConfig can also set value through ConfigOption, small
>> code snippets.
>> From the bellow snippets, we can see that clients need to know each
>> component's builders
>> and also call their "with" methods to override the default value in old
>> version.
>>
>>
>> But, in new version, clients just need to know each component's public
>> config options, just like constants.
>> So, these builders are redundant.
>>
>>
>> /---/
>>
>>
>> public class HoodieIndexConfigOptions {
>>   public static final ConfigOption INDEX_TYPE = ConfigOption
>>   .key("hoodie.index.type")
>>   .defaultValue(HoodieIndex.IndexType.BLOOM.name());
>> }
>>
>>
>> public class HoodieWriteConfig {
>>   public void setString(ConfigOption option, String value) {
>> this.props.put(option.key(), value);
>>   }
>> }
>>
>>
>>
>>
>> /**
>>  * New version
>>  */
>> // set value overrite the default value
>> HoodieWriteConfig config = new HoodieWriteConfig();
>> config.set(HoodieIndexConfigOptions.INDEX_TYPE,
>> HoodieIndex.IndexType.HBASE.name())
>>
>>
>>
>>
>> /**
>>  * Old version
>>  */
>> HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder()
>>
>> builder.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
>>
>>
>>
>> /---/
>>
>>
>> Another, users use hudi like bellow, here're all keys.
>>
>> /---/
>>
>>
>> df.write.format("hudi").
>> option("hoodie.insert.shuffle.parallelism", "10").
>> option("hoodie.upsert.shuffle.parallelism", "10").
>> option("hoodie.delete.shuffle.parallelism", "10").
>> option("hoodie.bulkinsert.shuffle.parallelism", "10").
>> option("hoodie.datasource.write.recordkey.field", "name").
>> option("hoodie.datasource.write.partitionpath.field", "location").
>> option("hoodie.datasource.write.precombine.field", "ts").
>> option("hoodie.table.name", tableName).
>> mode(Overwrite).
>> save(basePath);
>>
>>
>>
>> /---/
>>
>>
>>
>>
>> Last, as I responsed to @vino, it's reasonable to handle fallbackkeys. I
>> think we need to do this step by step,
>> it's easy to integrate FallbackKey in the future, it is not what we need
>> right now in my opinion.
>>
>>
>> If some places are still not very clear, feel free to feedback.
>>
>>
>>
>>
>> Best,
>> lamber-ken
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2019-12-11 23:41:31, "Vinoth Chandar"  wrote:
>> >Hi Lamber-ken,
>> >
>> >I looked at the sample PR you put up as well.
>> >
>> >On 1,2 => Seems your intent is t

Re: [QUESTION] Handle record partition change

2019-12-11 Thread Sivabalan
It depends on whether you are using the regular BLOOM index or GLOBAL_BLOOM.
May I know which one you are talking about?
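
As a toy illustration of why the answer differs (this is not Hudi code, and the class and method names are made up), assume a partition-scoped index identifies a record by partition path plus record key, while a global index identifies it by record key alone:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; it does not reflect how Hudi indexes are implemented.
class IndexScopeSketch {
  // Each upsert is {partitionPath, recordKey, value}.

  // Partition-scoped: identity is (partitionPath, recordKey), so a record
  // that moves to a new partition is treated as a brand-new record.
  static int countWithPartitionScopedKeys(String[][] upserts) {
    Map<String, String> store = new HashMap<>();
    for (String[] r : upserts) {
      store.put(r[0] + "|" + r[1], r[2]);
    }
    return store.size();
  }

  // Global: identity is recordKey alone, so the later upsert replaces the
  // earlier one regardless of partition path.
  static int countWithGlobalKeys(String[][] upserts) {
    Map<String, String> store = new HashMap<>();
    for (String[] r : upserts) {
      store.put(r[1], r[2]);
    }
    return store.size();
  }

  public static void main(String[] args) {
    String[][] upserts = {
        {"2019/12/10", "id-1", "v1"},
        {"2019/12/11", "id-1", "v2"} // same key, new partition path
    };
    System.out.println(countWithPartitionScopedKeys(upserts));
    System.out.println(countWithGlobalKeys(upserts));
  }
}
```

In this model, the partition-scoped variant ends up with two entries for the same record key, which is the duplicate described in the question below.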


On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu wrote:

> Hi Hudi devs,
>
> Upon upsert operations, does Hudi detect record's partition path change? As
> for the same record, the partition path field may get updated while the
> record key (the primary id) stays the same, then the insert would result in
> duplicate record (based on record key) in the dataset. Is there any
> relevant logic of this kind of detection and/or clean-up in the codebase?
>
> Best,
> Raymond
>


-- 
Regards,
-Sivabalan


Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread Sivabalan
Let me summarize your initial proposal and then get into the details.
- Introduce ConfigOptions for easier handling of default values.
- Remove all Hoodie*Config classes and just have HoodieWriteConfig. What
this means is that every other config file will be replaced by
ConfigOptions, e.g. HoodieIndexConfigOption, HoodieCompactionConfigOption,
etc.
- Config option will take care of returning defaults for any property, even
if an entire Config (e.g. IndexConfig) is not explicitly set.

Here are the positives I see.
- By having component-level ConfigOptions, we bucketize the configs
and have defaults set (same as before).
- Users don't need to set each component's config (e.g. IndexConfig)
explicitly with HoodieWriteConfig.

But I have one question:
- I see bucketizing only in the write path. How does one get hold of
IndexConfigOptions as a consumer? For example, if some class uses just
IndexConfig alone, how will it consume it? From your example, I see only
HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then?
Wouldn't that contradict your initial proposal to not have a fat config
class? Maybe you can expand your example below to show what a consumer of
IndexConfig would look like.

Your eg:
/**
 * New version
 */
// set a value to override the default value
HoodieWriteConfig config = new HoodieWriteConfig();
config.set(HoodieIndexConfigOptions.INDEX_TYPE,
HoodieIndex.IndexType.HBASE.name())




On Wed, Dec 11, 2019 at 8:33 AM lamberken  wrote:

>
>
> Hi,
>
>
>
>
> On 1,2. Yes, you are right, moving the getter to the component level
> Config class itself.
>
>
> On 3, HoodieWriteConfig can also set value through ConfigOption, small
> code snippets.
> From the bellow snippets, we can see that clients need to know each
> component's builders
> and also call their "with" methods to override the default value in old
> version.
>
>
> But, in new version, clients just need to know each component's public
> config options, just like constants.
> So, these builders are redundant.
>
>
> /---/
>
>
> public class HoodieIndexConfigOptions {
>   public static final ConfigOption INDEX_TYPE = ConfigOption
>   .key("hoodie.index.type")
>   .defaultValue(HoodieIndex.IndexType.BLOOM.name());
> }
>
>
> public class HoodieWriteConfig {
>   public void setString(ConfigOption option, String value) {
> this.props.put(option.key(), value);
>   }
> }
>
>
>
>
> /**
>  * New version
>  */
> // set value overrite the default value
> HoodieWriteConfig config = new HoodieWriteConfig();
> config.set(HoodieIndexConfigOptions.INDEX_TYPE,
> HoodieIndex.IndexType.HBASE.name())
>
>
>
>
> /**
>  * Old version
>  */
> HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder()
>
> builder.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
>
>
>
> /---/
>
>
> Another, users use hudi like bellow, here're all keys.
>
> /---/
>
>
> df.write.format("hudi").
> option("hoodie.insert.shuffle.parallelism", "10").
> option("hoodie.upsert.shuffle.parallelism", "10").
> option("hoodie.delete.shuffle.parallelism", "10").
> option("hoodie.bulkinsert.shuffle.parallelism", "10").
> option("hoodie.datasource.write.recordkey.field", "name").
> option("hoodie.datasource.write.partitionpath.field", "location").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.table.name", tableName).
> mode(Overwrite).
> save(basePath);
>
>
>
> /---/
>
>
>
>
> Last, as I responsed to @vino, it's reasonable to handle fallbackkeys. I
> think we need to do this step by step,
> it's easy to integrate FallbackKey in the future, it is not what we need
> right now in my opinion.
>
>
> If some places are still not very clear, feel free to feedback.
>
>
>
>
> Best,
> lamber-ken
>
>
>
>
>
>
>
>
>
>
>
>
> At 2019-12-11 23:41:31, "Vinoth Chandar"  wrote:
> >Hi Lamber-ken,
> >
> >I looked at the sample PR you put up as well.
> >
> >On 1,2 => Seems your intent is to replace these with moving the getter to
> >the component level Config class itself? I am fine with that (although I
> >think its not that big of a hurdle really to use atm). But, once we do that
> >we could pass just the specific component config into parts of code versus
> >passing in the entire HoodieWriteConfig object. I am fine with moving the
> >classes to a ConfigOption class as you suggested as well.
> >
> >On 3, I still we feel we will need the builder pattern going forward. to
> >build the HoodieWriteConfig object. Like below? Cannot understand why we
> >would want to change this. Could you please clarify?
> >
> >HoodieWriteConfig.Builder builder =
> >
> HoodieWrit

[QUESTION] Handle record partition change

2019-12-11 Thread Shiyan Xu
Hi Hudi devs,

On upsert, does Hudi detect a change in a record's partition path? For the
same record, the partition path field may be updated while the record key
(the primary ID) stays the same, in which case the insert would produce a
duplicate record (by record key) in the dataset. Is there any logic for this
kind of detection and/or clean-up in the codebase?
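
To make the scenario concrete, here is a toy model of the two behaviours. It
is only a sketch and not Hudi code: PartitionScopedTable and
GloballyKeyedTable are hypothetical classes, and the assumption is that one
lookup is scoped to the incoming partition while the other tracks record keys
across all partitions.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical classes, not Hudi APIs.
class PartitionScopedTable {
  // partition path -> (record key -> value)
  private final Map<String, Map<String, String>> partitions = new HashMap<>();

  // Looks for an existing record only inside the incoming partition.
  void upsert(String recordKey, String partitionPath, String value) {
    partitions.computeIfAbsent(partitionPath, p -> new HashMap<>()).put(recordKey, value);
  }

  // Number of partitions that hold a copy of the given record key.
  long countCopies(String recordKey) {
    return partitions.values().stream().filter(m -> m.containsKey(recordKey)).count();
  }
}

class GloballyKeyedTable {
  private final Map<String, Map<String, String>> partitions = new HashMap<>();
  // record key -> partition path that currently holds it
  private final Map<String, String> keyToPartition = new HashMap<>();

  void upsert(String recordKey, String partitionPath, String value) {
    String previous = keyToPartition.put(recordKey, partitionPath);
    if (previous != null && !previous.equals(partitionPath)) {
      // Partition change detected: drop the stale copy from the old partition.
      partitions.get(previous).remove(recordKey);
    }
    partitions.computeIfAbsent(partitionPath, p -> new HashMap<>()).put(recordKey, value);
  }

  long countCopies(String recordKey) {
    return partitions.values().stream().filter(m -> m.containsKey(recordKey)).count();
  }
}
```

Upserting the same key first under one partition path and then under another
leaves two copies in the partition-scoped model and one copy in the globally
keyed model, which is the duplicate I am asking about.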

Best,
Raymond


[DISCUSS] Next Apache Release

2019-12-11 Thread Balaji Varadarajan
Hello all,

In the spirit of making Apache Hudi (incubating) releases at regular
cadence, we are starting this thread to kickstart the planning and
preparatory work for next release (0.5.1).

As discussed in yesterday's meeting, the current plan is to have a release
by the end of Jan 2020.

As described in the release guide (see References), the first step would be
to identify the release manager for 0.5.1. This is a consensus-based
decision of the entire community. The only requirement is that the release
manager be an Apache Hudi committer, as committers have permissions to
perform some of the release manager's work. The committer would still need
to work with the PPMC to write to Apache release repositories.

There’s no formal process, no vote requirements, and no timing requirements
when identifying release manager. Any objections should be resolved by
consensus before starting the release.

In general, the community prefers to have a rotating set of 3-5 Release
Managers. Keeping a small core set of managers allows enough people to
build expertise in this area and improve processes over time, without
Release Managers needing to re-learn the processes for each release. That
said, if you are a committer interested in serving the community in this
way, please reach out to the community on the dev@ mailing list.

If any Hudi committer is interested in being the next release manager,
please reply to this email.

References:
Planned Tickets:   Jira Tickets

Release Guide:  Release Guide


Thanks,
Balaji.V
(On behalf of Apache Hudi PPMC)


Re:Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread lamberken


Hi,


On 1 and 2: yes, you are right, the idea is to move the getters to the
component-level Config classes themselves.


On 3: HoodieWriteConfig can also set values through ConfigOption; see the
small code snippets below. They show that, in the old version, clients need
to know each component's builder and call its "with" methods to override the
default values.


In the new version, clients only need to know each component's public config
options, which behave like constants. So these builders are redundant.

/---/


public class HoodieIndexConfigOptions {
  public static final ConfigOption INDEX_TYPE = ConfigOption
      .key("hoodie.index.type")
      .defaultValue(HoodieIndex.IndexType.BLOOM.name());
}


public class HoodieWriteConfig {
  public void setString(ConfigOption option, String value) {
    this.props.put(option.key(), value);
  }
}




/**
 * New version
 */
// Setting a value overrides the default value.
HoodieWriteConfig config = new HoodieWriteConfig();
config.setString(HoodieIndexConfigOptions.INDEX_TYPE, HoodieIndex.IndexType.HBASE.name());


/**
 * Old version
 */
HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder();
builder.withIndexConfig(
    HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build());


/---/
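
For anyone who wants to try the idea outside the Hudi codebase, here is a
minimal self-contained sketch. It assumes a generic ConfigOption<T> with a
small builder; IndexOptions, SketchWriteConfig and getString are made-up
names for illustration, and this is not the code in the PR.

```java
import java.util.Properties;

// Hypothetical sketch; names and shapes are assumptions, not Hudi's API.
final class ConfigOption<T> {
  private final String key;
  private final T defaultValue;

  private ConfigOption(String key, T defaultValue) {
    this.key = key;
    this.defaultValue = defaultValue;
  }

  // Entry point of the builder: ConfigOption.key("...").defaultValue(...)
  static OptionBuilder key(String key) {
    return new OptionBuilder(key);
  }

  String key() {
    return key;
  }

  T defaultValue() {
    return defaultValue;
  }

  static final class OptionBuilder {
    private final String key;

    private OptionBuilder(String key) {
      this.key = key;
    }

    <T> ConfigOption<T> defaultValue(T value) {
      return new ConfigOption<>(key, value);
    }
  }
}

// Component-level options are plain constants that clients can reference.
final class IndexOptions {
  static final ConfigOption<String> INDEX_TYPE =
      ConfigOption.key("hoodie.index.type").defaultValue("BLOOM");

  private IndexOptions() {}
}

final class SketchWriteConfig {
  private final Properties props = new Properties();

  void setString(ConfigOption<String> option, String value) {
    props.setProperty(option.key(), value);
  }

  // Falls back to the option's default when the key was never set.
  String getString(ConfigOption<String> option) {
    return props.getProperty(option.key(), option.defaultValue());
  }
}
```

With this shape, reading IndexOptions.INDEX_TYPE from a fresh
SketchWriteConfig yields the default, and a single setString call overrides
it, without the client touching any component builder.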


In addition, users use Hudi as shown below; these are all keys.
/---/


df.write.format("hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode(Overwrite).
save(basePath);


/---/




Lastly, as I responded to @vino, it is reasonable to handle fallback keys. I
think we need to do this step by step: it will be easy to integrate
FallbackKey in the future, but in my opinion it is not what we need right
now.
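
To make the fallback-key idea concrete, here is a minimal sketch, assuming an
option simply carries a list of older keys that are tried in order when the
primary key is absent. FallbackOption, resolve and the key
"hoodie.index.kind" are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of fallback keys; not part of Hudi.
final class FallbackOption {
  private final String key;
  private final String defaultValue;
  private final List<String> fallbackKeys;

  FallbackOption(String key, String defaultValue, String... fallbackKeys) {
    this.key = key;
    this.defaultValue = defaultValue;
    this.fallbackKeys = Collections.unmodifiableList(Arrays.asList(fallbackKeys.clone()));
  }

  // Resolution order: primary key, then each fallback key in order, then the default.
  String resolve(Properties props) {
    String value = props.getProperty(key);
    if (value != null) {
      return value;
    }
    for (String fallback : fallbackKeys) {
      value = props.getProperty(fallback);
      if (value != null) {
        return value;
      }
    }
    return defaultValue;
  }
}
```

An existing job that still sets only the old key keeps working, and the
primary key wins as soon as both are present.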


If anything is still unclear, feel free to give feedback.




Best,
lamber-ken

At 2019-12-11 23:41:31, "Vinoth Chandar"  wrote:
>Hi Lamber-ken,
>
>I looked at the sample PR you put up as well.
>
>On 1,2 => Seems your intent is to replace these with moving the getter to
>the component level Config class itself? I am fine with that (although I
>think its not that big of a hurdle really to use atm). But, once we do that
>we could pass just the specific component config into parts of code versus
>passing in the entire HoodieWriteConfig object. I am fine with moving the
>classes to a ConfigOption class as you suggested as well.
>
>On 3, I still we feel we will need the builder pattern going forward. to
>build the HoodieWriteConfig object. Like below? Cannot understand why we
>would want to change this. Could you please clarify?
>
>HoodieWriteConfig.Builder builder =
>
> HoodieWriteConfig.newBuilder().withPath(cfg.targetBasePath).combineInput(cfg.filterDupes,
>true)
>
> .withCompactionConfig(HoodieCompactionConfig.newBuilder().withPayloadClass(cfg.payloadClassName)
>// Inline compaction is disabled for continuous mode.
>otherwise enabled for MOR
>.withInlineCompaction(cfg.isInlineCompactionEnabled()).build())
>.forTable(cfg.targetTableName)
>
> .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
>.withAutoCommit(false).withProps(props);
>
>
>Typically, we write RFCs for large changes that breaks existing behavior or
>introduces significantly complex new features.. If you are just planning to
>do the refactoring into ConfigOption class, per se you don't need a RFC.
>But , if you plan to address the fallback keys (or) your changes are going
>to break/change existing jobs, we would need a RFC.
>
>>> It is not clear to me whether there is any external facing changes which
>changes this model.
>I am still unclear on this as well. can you please explicitly clarify?
>
>thanks
>vinoth
>
>
>On Tue, Dec 10, 2019 at 12:35 PM lamberken  wrote:
>
>>
>> Hi, @Balaji @Vinoth
>>
>>
>> I'm sorry, some places are not very clear,
>>
>>
>> 1, We can see that HoodieMetricsConfig, HoodieStorageConfig, etc.. already
>> defined in project.
>>But we get property value by methods which defined in
>> HoodieWriteConfig, like HoodieWriteConfig#getParquetMaxFileSize,
>>HoodieWriteConfig#getParquetBlockSize, etc. It's means that
>> Hoodie*Config are redundant.
>>
>>
>> 2, These Hoodie*Config classes are used to set default value when call
>> their build method, nothing e

Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread Vinoth Chandar
Hi Lamber-ken,

I looked at the sample PR you put up as well.

On 1,2 => It seems your intent is to replace these by moving the getters to
the component-level Config class itself? I am fine with that (although I
think it's not that big of a hurdle to use at the moment). But once we do
that, we could pass just the specific component config into parts of the
code instead of passing in the entire HoodieWriteConfig object. I am fine
with moving the classes to a ConfigOption class as you suggested as well.

On 3, I still feel we will need the builder pattern going forward to build
the HoodieWriteConfig object, like below. I cannot understand why we would
want to change this. Could you please clarify?

HoodieWriteConfig.Builder builder =
    HoodieWriteConfig.newBuilder().withPath(cfg.targetBasePath).combineInput(cfg.filterDupes, true)
        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withPayloadClass(cfg.payloadClassName)
            // Inline compaction is disabled for continuous mode; otherwise enabled for MOR.
            .withInlineCompaction(cfg.isInlineCompactionEnabled()).build())
        .forTable(cfg.targetTableName)
        .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
        .withAutoCommit(false).withProps(props);


Typically, we write RFCs for large changes that break existing behavior or
introduce significantly complex new features. If you are just planning to
do the refactoring into a ConfigOption class, per se you don't need an RFC.
But if you plan to address the fallback keys, or your changes are going to
break/change existing jobs, we would need an RFC.

>> It is not clear to me whether there is any external facing changes which
>> changes this model.
I am still unclear on this as well. Can you please explicitly clarify?

thanks
vinoth


On Tue, Dec 10, 2019 at 12:35 PM lamberken  wrote:

>
> Hi, @Balaji @Vinoth
>
>
> I'm sorry, some places are not very clear,
>
>
> 1, We can see that HoodieMetricsConfig, HoodieStorageConfig, etc.. already
> defined in project.
>But we get property value by methods which defined in
> HoodieWriteConfig, like HoodieWriteConfig#getParquetMaxFileSize,
>HoodieWriteConfig#getParquetBlockSize, etc. It's means that
> Hoodie*Config are redundant.
>
>
> 2, These Hoodie*Config classes are used to set default value when call
> their build method, nothing else.
>
>
> 3, For current plan is keep the Builder pattern when configuring, when we
> are familiar with the config framework,
>We will find that Hoodie*Config class are redundant and methods
> prefixed with "get" in HoodieWriteConfig are also redundant.
>
>
> In addition, I create a pr[1] for initializing with a demo. At this demo,
> I create
> MetricsGraphiteReporterOptions which contains HOST, PORT, PREFIX, and
> remove getGraphiteServerHost,
> getGraphiteServerPort, getGraphiteMetricPrefix in HoodieMetricsConfig.
>
>
> https://github.com/apache/incubator-hudi/pull/1094
>
>
> Best,
> lamber-ken
>
>
>
>
>
>
>
> At 2019-12-11 02:35:30, "Balaji Varadarajan" 
> wrote:
> > Hi Lamber-Ken,
> >Thanks for the time writing the proposal and thinking about improving
> Hudi usability.
> >My preference would be to keep the Builder pattern when configuring. It
> is something I find it natural when configuring. It is not clear to me
> whether there is any external facing changes which changes this model.
> Would you mind adding some more details on the RFC. It would save time to
> read it in one place as opposed to checking out github repo :)
> >Thanks,Balaji.V
> >On Tuesday, December 10, 2019, 07:55:01 AM PST, Vinoth Chandar <
> vin...@apache.org> wrote:
> >
> > Hi ,
> >
> >Thanks for the proposal. Some parts I agree, some parts I don't and some
> >parts are unclear
> >
> >Agree :
> >- On introducing a class that binds key, default value, provided value,
> and
> >also may be a doc along with it (?).
> >- Designing the framework to have fallback keys is good IMO. It helps us
> do
> >things like https://issues.apache.org/jira/browse/HUDI-89
> >
> >Disagree :
> >- Not all configuration values are in HoodieWriteConfig, its not accurate.
> >Configs are already split by components into HoodieIndexConfig,
> >HoodieCompactionConfig etc..
> >- There are helpers for all these conveniently located in
> >HoodieWriteConfig. I think some of the claims of usability seem subjective
> >to me, speaking from hands-on experience writing jobs. So, if you
> proposing
> >a large shake up (e.g not have a single properties file load all
> >components), I would love to understand this at more depth. From my
> >experience, well namespaced configs in a single properties file keeps it
> >simple and understandable.
> >
> >Unclear :
> >- What is impact on existing jobs - using  RDD/WriteClient API,
> DataSource,
> >DeltaStreamer level? Do you intend to change namespacing of configs?
> >
> >
> >Thanks
> >Vinoth
> >
> >On Tue, Dec 10, 2019 at 6:44 AM lamberken  wrote:
> >
> >>
> >>
> >> Hi,