Re: [DISCUSS] Readiness for graduation to TLP

2020-04-28 Thread lamberken
+1

On 2020/04/28 05:05:44, Vinoth Chandar  wrote: 
> Hello all,
> 
> I would like to start a discussion on our readiness to pursue graduation to
> TLP and potentially follow up with a VOTE with a formal resolution. To seed
> the discussion, our community's achievements since entering the Incubator
> in early 2019 include the following:
> 
> - Accepted > 500 patches from 90 contributors, including 15+ new design
> proposals
> - Performed 3 releases with 3 different release managers
> - Invited 5 new committers (all of them accepted)
> - Invited 3 of those new committers to join the PMC (all of them accepted)
> - Migrated our web site to ASF infrastructure [1]
> - Migrated developer conversations to the list at dev@hudi.apache.org
> - Migrated all issue tracking to JIRA [2]
> - Apache Hudi name search has been approved [3]
> - We have built a meritocratic, open collaborative process, the Apache way
> - Our PMC is diverse and consists of members from ~10 organizations
> 
> Please chime in with your thoughts.
> 
> Thanks
> Vinoth
> 


Re: Checking out the asf svn repo

2020-04-23 Thread lamberken


Hi Vinoth,

Minor: at line 55, it is better to use https://hudi.incubator.apache.org than
http://hudi.incubator.apache.org

https://svn.apache.org/repos/asf/incubator/public/trunk/content/projects/hudi.xml

Best,
Lamber-Ken

On 2020/04/23 19:04:19, Vinoth Chandar  wrote: 
> Good catch.. Fixed!
> 
> On Thu, Apr 23, 2020 at 11:57 AM lamberken  wrote:
> 
> > Hi Vinoth,
> >
> > The browser shows that hudi.xml contains a syntax error.
> >
> > https://svn.apache.org/repos/asf/incubator/public/trunk/content/projects/hudi.xml
> >
> > Best,
> > Lamber-Ken
> >
> > On 2020/04/23 16:51:49, Vinoth Chandar  wrote:
> > > Finally figured out.. :/ Updated the status file now, to reflect latest
> > > information
> > >
> > > all, please take a look and spot any errors (if any)
> > >
> > https://svn.apache.org/viewvc/incubator/public/trunk/content/projects/hudi.xml?revision=1876904&view=markup
> > >
> > >
> > > On Mon, Apr 20, 2020 at 5:03 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Thanks, that's what I am following. Don't see the checkout done
> > properly,
> > > > only see the top level folder without contents
> > > >
> > > > On Thu, Apr 16, 2020 at 6:16 PM lamber-ken  wrote:
> > > >
> > > >>
> > > >>
> > > >> Hi Vinoth,
> > > >>
> > > >>
> > > >> You can get help from the following documentation[1]
> > > >>
> > > >>
> > > >> [1] https://infra.apache.org/version-control.html
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Best,
> > > >> Lamber-Ken
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> At 2020-04-17 08:24:01, "Vinoth Chandar"  wrote:
> > > >> >Hello all,
> > > >> >
> > > >> >Can anyone here (potentially from prior experience with other apache
> > > >> >projects) point me, to how I can checkout the apache svn repo here?
> > > >> >
> > > >>
> > https://svn.apache.org/viewvc/incubator/public/trunk/content/projects/hudi.xml?view=log
> > > >> >
> > > >> >
> > > >> >Would like to make some edits to our status file.. Specifically, I am
> > > >> >trying to understand how I authenticate to commit the changes?  (I
> > have
> > > >> not
> > > >> >used svn in years. So apologize if I am asking some basic qs)
> > > >> >
> > > >> >Thanks
> > > >> >Vinoth
> > > >>
> > > >
> > >
> >
> 


Re: Checking out the asf svn repo

2020-04-23 Thread lamberken
Hi Vinoth,

The browser shows that hudi.xml contains a syntax error.
https://svn.apache.org/repos/asf/incubator/public/trunk/content/projects/hudi.xml

Best,
Lamber-Ken

On 2020/04/23 16:51:49, Vinoth Chandar  wrote: 
> Finally figured out.. :/ Updated the status file now, to reflect latest
> information
> 
> all, please take a look and spot any errors (if any)
> https://svn.apache.org/viewvc/incubator/public/trunk/content/projects/hudi.xml?revision=1876904&view=markup
> 
> 
> On Mon, Apr 20, 2020 at 5:03 PM Vinoth Chandar  wrote:
> 
> > Thanks, that's what I am following. Don't see the checkout done properly,
> > only see the top level folder without contents
> >
> > On Thu, Apr 16, 2020 at 6:16 PM lamber-ken  wrote:
> >
> >>
> >>
> >> Hi Vinoth,
> >>
> >>
> >> You can get help from the following documentation[1]
> >>
> >>
> >> [1] https://infra.apache.org/version-control.html
> >>
> >>
> >>
> >>
> >> Best,
> >> Lamber-Ken
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> At 2020-04-17 08:24:01, "Vinoth Chandar"  wrote:
> >> >Hello all,
> >> >
> >> >Can anyone here (potentially from prior experience with other apache
> >> >projects) point me, to how I can checkout the apache svn repo here?
> >> >
> >> https://svn.apache.org/viewvc/incubator/public/trunk/content/projects/hudi.xml?view=log
> >> >
> >> >
> >> >Would like to make some edits to our status file.. Specifically, I am
> >> >trying to understand how I authenticate to commit the changes?  (I have
> >> not
> >> >used svn in years. So apologize if I am asking some basic qs)
> >> >
> >> >Thanks
> >> >Vinoth
> >>
> >
> 


[DISCUSS] Troubleshooting flow

2020-03-31 Thread lamberken
Hi team,




Many users currently ask for support on Slack when they hit bugs or problems,

but there are some disadvantages we need to consider:

1. Code snippets do not display well.

2. We may miss some questions when several come up at the same time.

3. Threads can't be indexed by search engines.

...




So, I suggest we guide users to use GitHub issues as much as we can:

Step 1: guide users to report their questions via GitHub issues.

Step 2: developers pick up the issues they are interested in.

Step 3: raise a related JIRA if needed.

Step 4: add useful notes to the troubleshooting guide.



Any thoughts are welcome, thanks : )


Best,
Lamber-Ken

Re: [NOTIFICATION] Auto generation asf-site feedback

2020-03-25 Thread lamberken



Thanks  : )




At 2020-03-25 09:50:59, "vino yang"  wrote:
>Great job!
>
>Thanks to lamber-ken for driving and getting this done!
>
>Best,
>Vino
>
>Vinoth Chandar  于2020年3月25日周三 上午8:34写道:
>
>> Currently, the new site is published to a "test-content" folder.  Our plan
>> is to try this for 1 week and then actually cut over to "content" which is
>> what powers the site.
>>
>> Kudos to lamber-ken for the perseverance in getting this done!
>>
>> On Tue, Mar 24, 2020 at 5:19 PM lamberken  wrote:
>>
>> > Hi team,
>> >
>> >
>> >
>> >
>> > After HUDI-504[1] landed, travis will build asf-site branch and update
>> > site automatically,
>> >
>> > developers can focus on add/edit/remove *.md files, will don't need to
>> > learn about how to build site.
>> >
>> >
>> >
>> >
>> > Fell free to report any issues if you see, thanks very much.
>> >
>> >
>> >
>> >
>> > [1] https://github.com/apache/incubator-hudi/pull/1412
>>


[NOTIFICATION] Auto generation asf-site feedback

2020-03-24 Thread lamberken
Hi team,




After HUDI-504 [1] landed, Travis builds the asf-site branch and updates the site
automatically,

so developers can focus on adding/editing/removing *.md files and no longer need to
learn how to build the site.




Feel free to report any issues you see, thanks very much.




[1] https://github.com/apache/incubator-hudi/pull/1412

Re:Re: Re: upsert on COW Takes 6 min for 150K Record

2020-03-11 Thread lamberken


Hi, 


The unit is bytes; the value is just an example, so you need to adjust it for your own
environment.
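
For reference, a minimal spark-shell sketch of where that option goes (a sketch only:
it assumes Hudi 0.5.x option names, an existing SparkSession `spark`, and placeholder
paths, table name, and record key / precombine fields):

import org.apache.spark.sql.SaveMode

// read some placeholder source data
val df = spark.read.parquet("/tmp/source_parquet")

df.write.format("org.apache.hudi").
  option("hoodie.table.name", "example_table").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  // merge buffer size in bytes (~191 MB here); size it to your executor memory
  option("hoodie.memory.merge.max.size", "200485760").
  mode(SaveMode.Append).
  save("/tmp/hudi/example_table")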


Best,
Lamber-Ken



At 2020-03-12 01:51:20, "selvaraj periyasamy" 
 wrote:
>Thanks . What is this number 200485760? is it in bits or bytes?
>
>Thanks,
>Selva
>
>On Tue, Mar 10, 2020 at 2:57 AM lamberken  wrote:
>
>>
>>
>> hi,
>>
>>
>> IMO, when upsert 150K record with 100columns, these records need
>> serializate to disk and deserialize from disk.
>> You can try add < option("hoodie.memory.merge.max.size", "200485760") >
>>
>>
>> best,
>> lamber-ken
>>
>>
>>
>>
>>
>> At 2020-03-10 17:07:58, "selvaraj periyasamy" <
>> selvaraj.periyasamy1...@gmail.com> wrote:
>>
>> Sorry for the partial emails. My company portal don’t allow me to add test
>> code .  Am using 0.5.0 version of Hudi Jars built from my local.  While
>> running upsert , it takes more than 6 or 7 mins for processing 150k records.
>>
>>
>>
>> Is there any tuning that could reduce the processing time from 6 or 7 mins
>> ? Overwrite just takes less than a min ? Each row has 100 columns .
>>
>>
>>
>> Thanks,
>> Selva
>>
>>
>> On Tue, Mar 10, 2020 at 1:51 AM selvaraj periyasamy <
>> selvaraj.periyasamy1...@gmail.com> wrote:
>>
>> Team,
>>
>>
>> Am using 0.5.0 version of Hudi Jars built from my local.  While running
>> upsert , it takes more than 6 or 7 mins for processing 150k records. Below
>> are the code and logs.
>>
>>
>> 20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer
>> records
>> 20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
>> 20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering
>> records
>> 20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done;
>> notifying producer threads
>>
>>
>> 20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer
>> records
>> 20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
>> 20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering
>> records
>> 20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done;
>> notifying producer threads
>>
>>
>> While running insert
>>
>>
>> On Tue, Mar 10, 2020 at 1:45 AM selvaraj periyasamy <
>> selvaraj.periyasamy1...@gmail.com> wrote:
>>
>> Team,
>>
>>
>> Am using 0.5.0 version of Hudi Jars built from my local.  While running
>> upsert
>>
>>
>> 20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer
>> records
>> 20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
>> 20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering
>> records
>> 20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done;
>> notifying producer threads
>>
>>
>> 20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer
>> records
>> 20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
>> 20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering
>> records
>> 20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done;
>> notifying producer threads
>>
>>
>>
>>


Re:Re: upsert on COW Takes 6 min for 150K Record

2020-03-10 Thread lamberken


Also, we improved the performance issues around DiskBasedMap & Kryo on the master
branch.
You can also try building the Hudi jar from the master branch.


best,
lamber-ken





At 2020-03-10 17:07:58, "selvaraj periyasamy" 
 wrote:

Sorry for the partial emails. My company portal don’t allow me to add test code 
.  Am using 0.5.0 version of Hudi Jars built from my local.  While running 
upsert , it takes more than 6 or 7 mins for processing 150k records.



Is there any tuning that could reduce the processing time from 6 or 7 mins ? 
Overwrite just takes less than a min ? Each row has 100 columns .



Thanks,
Selva


On Tue, Mar 10, 2020 at 1:51 AM selvaraj periyasamy 
 wrote:

Team,


Am using 0.5.0 version of Hudi Jars built from my local.  While running upsert 
, it takes more than 6 or 7 mins for processing 150k records. Below are the 
code and logs.  


20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads


20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads


While running insert 


On Tue, Mar 10, 2020 at 1:45 AM selvaraj periyasamy 
 wrote:

Team,


Am using 0.5.0 version of Hudi Jars built from my local.  While running upsert 


20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads


20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads





Re:Re: upsert on COW Takes 6 min for 150K Record

2020-03-10 Thread lamberken


hi, 


IMO, when upserting 150K records with 100 columns, these records need to be serialized
to disk and deserialized from disk.
You can try adding < option("hoodie.memory.merge.max.size", "200485760") >


best,
lamber-ken





At 2020-03-10 17:07:58, "selvaraj periyasamy" 
 wrote:

Sorry for the partial emails. My company portal don’t allow me to add test code 
.  Am using 0.5.0 version of Hudi Jars built from my local.  While running 
upsert , it takes more than 6 or 7 mins for processing 150k records.



Is there any tuning that could reduce the processing time from 6 or 7 mins ? 
Overwrite just takes less than a min ? Each row has 100 columns .



Thanks,
Selva


On Tue, Mar 10, 2020 at 1:51 AM selvaraj periyasamy 
 wrote:

Team,


Am using 0.5.0 version of Hudi Jars built from my local.  While running upsert 
, it takes more than 6 or 7 mins for processing 150k records. Below are the 
code and logs.  


20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads


20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads


While running insert 


On Tue, Mar 10, 2020 at 1:45 AM selvaraj periyasamy 
 wrote:

Team,


Am using 0.5.0 version of Hudi Jars built from my local.  While running upsert 


20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads


20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; 
notifying producer threads





Re:Re: Re: Re: [DISCUSS] Improve the merge performance for cow

2020-03-03 Thread lamberken


Hi Vinoth,


Yes, it's incorrect to draw the conclusion from only one test.


It's just a new idea to improve the merge performance; it's not necessarily the best.
E.g. when reading the old records, there is a series of conversion operations (Row to
GenericRecord to HoodieRecord), etc.


> Also let's separate the RDD vs DataFrame discussion out of this
Okay. I mentioned it here because if we used Dataset/DataFrame, we might not need so
many conversions in the Hudi project.
IMO, the new merge logic would be much clearer; it's a great project to do.
As you suggested, let's separate it out of this discussion.


Best,
Lamber-Ken





At 2020-03-03 02:16:04, "Vinoth Chandar"  wrote:
>Hi Lamber-ken,
>
>If you agree reduceByKey() will shuffle data, then it would serialize and
>deserialize anyway correct?
>
>I am not denying that this may be a valid approach.. But we need much more
>rigorous testing and potentially implement both approaches side-by-side to
>compare.. IMO We cannot conclude based on this on the one test we had -
>where the metadata overhead was so high . First step would be to introduce
>abstractions so that these two ways can be implemented side-by-side and
>controlled by a flag..
>
>Also let's separate the RDD vs DataFrame discussion out of this? Since that
>orthogonal anyway..
>
>Thanks
>Vinoth
>
>
>On Fri, Feb 28, 2020 at 11:02 AM lamberken  wrote:
>
>>
>>
>> Hi vinoth,
>>
>>
>> Thanks for reviewing the initial design :)
>> I know there are many problems at present(e.g shuffling, parallelism
>> issue). We can discussed the practicability of the idea first.
>>
>>
>> > ExternalSpillableMap itself was not the issue right, the serialization
>> was
>> Right, the new design will not have this issue, because will not use it at
>> all.
>>
>>
>> > This map is also used on the query side
>> Right, the proposal aims to improve the merge performance of cow table.
>>
>>
>> > HoodieWriteClient.java#L546 We cannot collect() the recordRDD at all ...
>> OOM driver
>> Here, in order to get the Map, had executed distinct()
>> before collect(), the result is very small.
>> Also, it can be implemented in FileSystemViewManager, and lazy loading
>> also ok.
>>
>>
>> > Doesn't this move the problem to tuning spark simply?
>> there are two serious performance problems in the old merge logic.
>> 1, when upsert many records, it will serialize record to disk, then
>> deserialize it when merge old record
>> 2, only single thread comsume the old record one by one, then handle the
>> merge process, it is much less efficient.
>>
>>
>> > doing a sort based merge repartitionAndSortWithinPartitions
>> Trying to understand your point :)
>>
>>
>> Compare to old version, may there are serveral improvements
>> 1. use spark built-in operators, it's easier to understand.
>> 2. during my testing, the upsert performance doubled.
>> 3. if possible, we can write data in batch by using Dataframe in the
>> futher.
>>
>>
>> [1]
>> https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
>>
>>
>> Best,
>> Lamber-Ken
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2020-02-29 01:40:36, "Vinoth Chandar"  wrote:
>> >Does n't this move the problem to tuning spark simply? the
>> >ExternalSpillableMap itself was not the issue right, the serialization
>> >was.  This map is also used on the query side btw, where we need something
>> >like that.
>> >
>> >I took a pass at the code. I think we are shuffling data again for the
>> >reduceByKey step in this approach? For MOR, note that this is unnecessary
>> >since we simply log the. records and there is no merge. This approach
>> might
>> >have a better parallelism of merging when that's costly.. But ultimately,
>> >our write parallelism is limited by number of affected files right?  So
>> its
>> >not clear to me, that this would be a win always..
>> >
>> >On the code itself,
>> >
>> https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L546
>> > We cannot collect() the recordRDD at all.. It will OOM the driver .. :)
>> >
>> >Orthogonally, one thing we think of is : doing a sort based merge.. i.e
>> >repartitionAndSortWithinPartitions()  the input records to mergehandle,
>> and
>> >if the 

Re:Re: Re: [DISCUSS] Improve the merge performance for cow

2020-02-28 Thread lamberken


Hi vinoth,


Thanks for reviewing the initial design :)
I know there are many problems at present (e.g. shuffling, parallelism issues). We
can discuss the practicability of the idea first.


> ExternalSpillableMap itself was not the issue right, the serialization was
Right, the new design will not have this issue, because it will not use ExternalSpillableMap at all.


> This map is also used on the query side
Right; the proposal aims to improve the merge performance of the COW table.


> HoodieWriteClient.java#L546 We cannot collect() the recordRDD at all ... OOM 
> driver
Here, in order to get the map, distinct() is executed
before collect(), so the collected result is very small.
Also, it can be implemented in FileSystemViewManager, and lazy loading is also OK.


> Doesn't this move the problem to tuning spark simply?
There are two serious performance problems in the old merge logic:
1. When upserting many records, it serializes records to disk, then deserializes
them when merging the old records.
2. Only a single thread consumes the old records one by one and handles the merge
process, which is much less efficient.


> doing a sort based merge repartitionAndSortWithinPartitions
Trying to understand your point :) 


Compared to the old version, there may be several improvements:
1. It uses Spark built-in operators, so it's easier to understand.
2. During my testing, the upsert performance doubled.
3. If possible, we can write data in batches using DataFrame in the future.


[1] 
https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java


Best,
Lamber-Ken









At 2020-02-29 01:40:36, "Vinoth Chandar"  wrote:
>Does n't this move the problem to tuning spark simply? the
>ExternalSpillableMap itself was not the issue right, the serialization
>was.  This map is also used on the query side btw, where we need something
>like that.
>
>I took a pass at the code. I think we are shuffling data again for the
>reduceByKey step in this approach? For MOR, note that this is unnecessary
>since we simply log the. records and there is no merge. This approach might
>have a better parallelism of merging when that's costly.. But ultimately,
>our write parallelism is limited by number of affected files right?  So its
>not clear to me, that this would be a win always..
>
>On the code itself,
>https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L546
> We cannot collect() the recordRDD at all.. It will OOM the driver .. :)
>
>Orthogonally, one thing we think of is : doing a sort based merge.. i.e
>repartitionAndSortWithinPartitions()  the input records to mergehandle, and
>if the file is also sorted on disk (its not today), then we can do a
>merge_sort like algorithm to perform the merge.. We can probably write code
>to bear one time sorting costs... This will eliminate the need for memory
>for merging altogether..
>
>On Wed, Feb 26, 2020 at 10:11 PM lamberken  wrote:
>
>>
>>
>> hi, vinoth
>>
>>
>> > What do you mean by spark built in operators
>> We may can not depency on ExternalSpillableMap again when upsert to cow
>> table.
>>
>>
>> > Are you suggesting that we perform the merging in sql
>> No, just only use spark built-in operators like mapToPair, reduceByKey etc
>>
>>
>> Details has been described in this article[1], also finished draft
>> implementation and test.
>> mainly modified HoodieWriteClient#upsertRecordsInternal method.
>>
>>
>> [1]
>> https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
>> [2]
>> https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
>>
>>
>>
>> At 2020-02-27 13:45:57, "Vinoth Chandar"  wrote:
>> >Hi lamber-ken,
>> >
>> >Thanks for this. I am not quite following the proposal. What do you mean
>> by
>> >spark built in operators? Dont we use the RDD based spark operations.
>> >
>> >Are you suggesting that we perform the merging in sql? Not following.
>> >Please clarify.
>> >
>> >On Wed, Feb 26, 2020 at 10:08 AM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi guys,
>> >>
>> >>
>> >> Motivation
>> >> Impove the merge performance for cow table when upsert, handle merge
>> >> operation by using spark built-in operators.
>> >>
>> >>
>> >> Background
>> >> When do a upsert operation, for each bucket, hudi needs to put new input
>> >> elements

Re:Re: [DISCUSS] Improve the merge performance for cow

2020-02-26 Thread lamberken


hi, vinoth


> What do you mean by spark built in operators
We may no longer need to depend on ExternalSpillableMap when upserting to a COW table.


> Are you suggesting that we perform the merging in sql
No, we just use Spark built-in operators like mapToPair, reduceByKey, etc.


Details are described in this doc [1]; I have also finished a draft implementation
and test,
mainly modifying the HoodieWriteClient#upsertRecordsInternal method (a toy sketch of
the idea follows the links below).


[1] 
https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
[2] 
https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
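
To make the idea concrete, here is a toy, self-contained sketch of the keyBy/reduceByKey
shape of the merge (a hypothetical illustration only, not the PR code; `Rec` and the
latest-timestamp rule stand in for HoodieRecord and its payload combine logic):

import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for HoodieRecord.
case class Rec(key: String, ts: Long, value: String)

object ReduceByKeyMergeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merge-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Old records already in the file slice, and the new incoming batch.
    val oldRecords = sc.parallelize(Seq(Rec("k1", 1L, "old"), Rec("k2", 1L, "old")))
    val newRecords = sc.parallelize(Seq(Rec("k1", 2L, "new"), Rec("k3", 2L, "new")))

    // Instead of streaming old records through a per-file ExternalSpillableMap,
    // union old and new records, key them by record key, and let reduceByKey
    // pick the winning version per key across the cluster.
    val merged = (oldRecords union newRecords)
      .keyBy(_.key)
      .reduceByKey((a, b) => if (a.ts >= b.ts) a else b)
      .values

    merged.collect().sortBy(_.key).foreach(println)  // Rec(k1,2,new), Rec(k2,1,old), Rec(k3,2,new)
    spark.stop()
  }
}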



At 2020-02-27 13:45:57, "Vinoth Chandar"  wrote:
>Hi lamber-ken,
>
>Thanks for this. I am not quite following the proposal. What do you mean by
>spark built in operators? Dont we use the RDD based spark operations.
>
>Are you suggesting that we perform the merging in sql? Not following.
>Please clarify.
>
>On Wed, Feb 26, 2020 at 10:08 AM lamberken  wrote:
>
>>
>>
>> Hi guys,
>>
>>
>> Motivation
>> Impove the merge performance for cow table when upsert, handle merge
>> operation by using spark built-in operators.
>>
>>
>> Background
>> When do a upsert operation, for each bucket, hudi needs to put new input
>> elements to memory cache map, and will
>> need an external map that spills content to disk when there is
>> insufficient space for it to grow.
>>
>>
>> There are several performance issuses:
>> 1. We may need an external disk map, serialize / deserialize records
>> 2. Only single thread do the I/O operation when check
>> 3. Can't take advantage of built-in spark operators
>>
>>
>> Based on above, reworked the merge logic and done draft test.
>> If you are also interested in this, please go ahead with this doc[1], any
>> suggestion are welcome. :)
>>
>>
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>> [1]
>> https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
>>
>>


[DISCUSS] Improve the merge performance for cow

2020-02-26 Thread lamberken


Hi guys,


Motivation
Improve the merge performance for the COW table on upsert by handling the merge
operation with Spark built-in operators.


Background
When doing an upsert operation, for each bucket, Hudi needs to put the new input
records into an in-memory cache map, and
needs an external map that spills its content to disk when there is insufficient
space for it to grow.


There are several performance issues:
1. We may need an external disk map and have to serialize / deserialize records.
2. Only a single thread does the I/O when checking.
3. We can't take advantage of built-in Spark operators.


Based on the above, I reworked the merge logic and did a draft test.
If you are also interested in this, please go through this doc [1]; any
suggestions are welcome. :)




Thanks,
Lamber-Ken


[1] 
https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing



Re:Re: [DISCUSS] How to correct the license header of entrypoint.sh script

2020-02-22 Thread lamberken


Right, will do.


Thanks,
Lamber-Ken

At 2020-02-22 22:35:13, "vbal...@apache.org"  wrote:
> 
>+1 on ensuring all scripts in Hudi codebase follow same convention for 
>licensing.
>Balaji.VOn Saturday, February 22, 2020, 06:16:29 AM PST, Suneel Marthi 
> wrote:  
> 
> Please go ahead and make the change @lamberken
>
>I was just looking at scripts from Hive and Kafka projects, see below.
>
>https://github.com/apache/hive/blob/master/bin/init-hive-dfs.sh
>https://github.com/apache/hive/blob/master/bin/hive-config.sh
>
>https://github.com/apache/kafka/blob/trunk/bin/connect-distributed.sh
>https://github.com/apache/kafka/blob/trunk/bin/kafka-leader-election.sh
>
>I suggest to fix all the script files to be consistent with apache license
>guide.
>
>
>
>On Sat, Feb 22, 2020 at 8:53 AM lamberken  wrote:
>
>>
>>
>> Hi all,
>>
>>
>> During the voting process on rc1 0.5.1-incubating release, Justin pointed
>> out
>> docker/hoodie/hadoop/base/entrypoint.sh has an incorrect license header,
>> But, many script files used the same license header like "entrypoint.sh"
>> has.
>>
>>
>> From apache license guide[2], it says "The text should be enclosed in the
>> appropriate comment syntax for the file format."
>> So, need to remove the repeated "#", like following changes?
>>
>>
>>
>> 
>> #  Licensed to the Apache Software Foundation (ASF) under one
>> #  or more contributor license agreements.  See the NOTICE file
>> #  distributed with this work for additional information
>> #  regarding copyright ownership.  The ASF licenses this file
>> #  to you under the Apache License, Version 2.0 (the
>> #  "License"); you may not use this file except in compliance
>> #  with the License.  You may obtain a copy of the License at
>> #
>> #  http://www.apache.org/licenses/LICENSE-2.0
>> #
>> #  Unless required by applicable law or agreed to in writing, software
>> #  distributed under the License is distributed on an "AS IS" BASIS,
>> #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> #  See the License for the specific language governing permissions and
>> # limitations under the License.
>>
>> 
>>
>>
>> #
>> #  Licensed to the Apache Software Foundation (ASF) under one
>> #  or more contributor license agreements.  See the NOTICE file
>> #  distributed with this work for additional information
>> #  regarding copyright ownership.  The ASF licenses this file
>> #  to you under the Apache License, Version 2.0 (the
>> #  "License"); you may not use this file except in compliance
>> #  with the License.  You may obtain a copy of the License at
>> #
>> #  http://www.apache.org/licenses/LICENSE-2.0
>> #
>> #  Unless required by applicable law or agreed to in writing, software
>> #  distributed under the License is distributed on an "AS IS" BASIS,
>> #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> #  See the License for the specific language governing permissions and
>> # limitations under the License.
>> #
>>
>>
>> Any thought are welcome, thanks.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>> [1]
>> https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E
>> [2] https://www.apache.org/licenses/LICENSE-2.0
>>
>>
>  


Re:Re: [DISCUSS] How to correct the license header of entrypoint.sh script

2020-02-22 Thread lamberken


Thanks Suneel Marthi,


Right, I will fix all the script files to be consistent with the Apache license
guide.


Thanks,
Lamber-Ken

At 2020-02-22 22:16:16, "Suneel Marthi"  wrote:
>Please go ahead and make the change @lamberken
>
>I was just looking at scripts from Hive and Kafka projects, see below.
>
>https://github.com/apache/hive/blob/master/bin/init-hive-dfs.sh
>https://github.com/apache/hive/blob/master/bin/hive-config.sh
>
>https://github.com/apache/kafka/blob/trunk/bin/connect-distributed.sh
>https://github.com/apache/kafka/blob/trunk/bin/kafka-leader-election.sh
>
>I suggest to fix all the script files to be consistent with apache license
>guide.
>
>
>
>On Sat, Feb 22, 2020 at 8:53 AM lamberken  wrote:
>
>>
>>
>> Hi all,
>>
>>
>> During the voting process on rc1 0.5.1-incubating release, Justin pointed
>> out
>> docker/hoodie/hadoop/base/entrypoint.sh has an incorrect license header,
>> But, many script files used the same license header like "entrypoint.sh"
>> has.
>>
>>
>> From apache license guide[2], it says "The text should be enclosed in the
>> appropriate comment syntax for the file format."
>> So, need to remove the repeated "#", like following changes?
>>
>>
>>
>> 
>> #  Licensed to the Apache Software Foundation (ASF) under one
>> #  or more contributor license agreements.  See the NOTICE file
>> #  distributed with this work for additional information
>> #  regarding copyright ownership.  The ASF licenses this file
>> #  to you under the Apache License, Version 2.0 (the
>> #  "License"); you may not use this file except in compliance
>> #  with the License.  You may obtain a copy of the License at
>> #
>> #  http://www.apache.org/licenses/LICENSE-2.0
>> #
>> #  Unless required by applicable law or agreed to in writing, software
>> #  distributed under the License is distributed on an "AS IS" BASIS,
>> #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> #  See the License for the specific language governing permissions and
>> # limitations under the License.
>>
>> 
>>
>>
>> #
>> #  Licensed to the Apache Software Foundation (ASF) under one
>> #  or more contributor license agreements.  See the NOTICE file
>> #  distributed with this work for additional information
>> #  regarding copyright ownership.  The ASF licenses this file
>> #  to you under the Apache License, Version 2.0 (the
>> #  "License"); you may not use this file except in compliance
>> #  with the License.  You may obtain a copy of the License at
>> #
>> #  http://www.apache.org/licenses/LICENSE-2.0
>> #
>> #  Unless required by applicable law or agreed to in writing, software
>> #  distributed under the License is distributed on an "AS IS" BASIS,
>> #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> #  See the License for the specific language governing permissions and
>> # limitations under the License.
>> #
>>
>>
>> Any thought are welcome, thanks.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>> [1]
>> https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E
>> [2] https://www.apache.org/licenses/LICENSE-2.0
>>
>>


[DISCUSS] How to correct the license header of entrypoint.sh script

2020-02-22 Thread lamberken


Hi all,


During the voting process on the rc1 0.5.1-incubating release, Justin pointed out that
docker/hoodie/hadoop/base/entrypoint.sh has an incorrect license header,
but many script files use the same license header that "entrypoint.sh" has.


The Apache license guide [2] says "The text should be enclosed in the
appropriate comment syntax for the file format."
So, do we need to remove the repeated "#", with changes like the following?



#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.



#
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.
#


Any thoughts are welcome, thanks.


Thanks,
Lamber-Ken


[1] 
https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E
[2] https://www.apache.org/licenses/LICENSE-2.0



Re:Re: Re: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-20 Thread lamberken






Thank you all, I have updated the PR [1].


Thanks
Lamber-Ken


[1] https://github.com/apache/incubator-hudi/pull/1290




At 2020-02-21 02:33:50, "Vinoth Chandar"  wrote:
>If there are no more comments/objections, we could re work the PR based on
>the discussion here..
>
>Points made by Udit are also pretty valid..
>
>Thanks for the constructive conversation. :)
>
>On Wed, Feb 19, 2020 at 3:12 PM lamberken  wrote:
>
>>
>>
>> @Vinoth, glad to see your reply.
>>
>>
>> >> SchemaConverters does import things like types
>> I checked the git history of package "org.apache.spark.sql.types", it
>> hasn't changed in a year,
>> means that spark does not change types often.
>>
>>
>> >> let's have a flag in maven to skip
>> Good suggestion. bundling it like we bundling
>> com.databricks:spark-avro_2.11 by default.
>> But how to use maven-shade-plugin with the flag, need to study.
>>
>>
>> Also, looking forward to others thoughts.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>>
>>
>>
>> At 2020-02-20 03:50:12, "Vinoth Chandar"  wrote:
>> >Apologies for the delayed response..
>> >
>> >I think SchemaConverters does import things like types and those will be
>> >tied to the spark version. if there are new types for e.g, our bundled
>> >spark-avro may not recognize them for e.g..
>> >
>> >import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
>> >import org.apache.spark.sql.types._
>> >import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes,
>> >minBytesForPrecision}
>> >
>> >
>> >I also verified that we are bundling avro in the spark-bundle.. So, that
>> >part we are in the clear.
>> >
>> >Here is what I suggest.. let's try bundling in the hope that it works i.e
>> >spark does not change types etc often and spark-avro interplays.
>> >But let's have a flag in maven to skip this bundling if need be.. We
>> should
>> >doc his clearly on the build instructions in the README?
>> >
>> >What do others think?
>> >
>> >
>> >
>> >On Sat, Feb 15, 2020 at 10:54 PM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi @Vinoth, sorry delay for ensure the following analysis is correct
>> >>
>> >>
>> >> In hudi project, spark-avro module is only used for converting between
>> >> spark's struct type and avro schema, only used two methods
>> >> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, these
>> two
>> >> methods are in `org.apache.spark.sql.avro.SchemaConverters` class.
>> >>
>> >>
>> >> Analyse:
>> >> 1, the `SchemaConverters` class are same in spark-master[1] and
>> >> branch-3.0[2].
>> >> 2, from the import statements in `SchemaConverters`, we can learn that
>> >> `SchemaConverters` doesn't depend on
>> >>other class in spark-avro module.
>> >>Also, I tried to move it hudi project and use a different package,
>> >> compile go though.
>> >>
>> >>
>> >> Use the hudi jar with shaded spark-avro module:
>> >> 1, spark-2.4.4-bin-hadoop2.7, everything is ok(create, upsert)
>> >> 2, spark-3.0.0-preview2-bin-hadoop2.7, everything is ok(create, upsert)
>> >>
>> >>
>> >> So, if we shade the spark-avro is safe and will has better user
>> >> experience, and we needn't shade it when spark-avro module is not
>> external
>> >> in spark project.
>> >>
>> >>
>> >> Thanks,
>> >> Lamber-Ken
>> >>
>> >>
>> >> [1]
>> >>
>> https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> >> [2]
>> >>
>> https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> At 2020-02-14 10:30:35, "Vinoth Chandar"  wrote:
>> >> >Just kicking this thread again, to make forward progress :)
>> >> >
>> >> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar 
>> wrote:
>> >> >
>> >> >> First of all.. No apo

Re:Re: Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-19 Thread lamberken


@Vinoth, glad to see your reply.


>> SchemaConverters does import things like types
I checked the git history of the package "org.apache.spark.sql.types"; it hasn't
changed in a year,
which means that Spark does not change types often.


>> let's have a flag in maven to skip
Good suggestion: bundle it by default, like we bundle com.databricks:spark-avro_2.11.
I still need to study how to use maven-shade-plugin with such a flag.


Also, looking forward to others' thoughts.


Thanks,
Lamber-Ken





At 2020-02-20 03:50:12, "Vinoth Chandar"  wrote:
>Apologies for the delayed response..
>
>I think SchemaConverters does import things like types and those will be
>tied to the spark version. if there are new types for e.g, our bundled
>spark-avro may not recognize them for e.g..
>
>import org.apache.spark.sql.catalyst.util.RandomUUIDGenerator
>import org.apache.spark.sql.types._
>import org.apache.spark.sql.types.Decimal.{maxPrecisionForBytes,
>minBytesForPrecision}
>
>
>I also verified that we are bundling avro in the spark-bundle.. So, that
>part we are in the clear.
>
>Here is what I suggest.. let's try bundling in the hope that it works i.e
>spark does not change types etc often and spark-avro interplays.
>But let's have a flag in maven to skip this bundling if need be.. We should
>doc his clearly on the build instructions in the README?
>
>What do others think?
>
>
>
>On Sat, Feb 15, 2020 at 10:54 PM lamberken  wrote:
>
>>
>>
>> Hi @Vinoth, sorry delay for ensure the following analysis is correct
>>
>>
>> In hudi project, spark-avro module is only used for converting between
>> spark's struct type and avro schema, only used two methods
>> `SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, these two
>> methods are in `org.apache.spark.sql.avro.SchemaConverters` class.
>>
>>
>> Analyse:
>> 1, the `SchemaConverters` class are same in spark-master[1] and
>> branch-3.0[2].
>> 2, from the import statements in `SchemaConverters`, we can learn that
>> `SchemaConverters` doesn't depend on
>>other class in spark-avro module.
>>Also, I tried to move it hudi project and use a different package,
>> compile go though.
>>
>>
>> Use the hudi jar with shaded spark-avro module:
>> 1, spark-2.4.4-bin-hadoop2.7, everything is ok(create, upsert)
>> 2, spark-3.0.0-preview2-bin-hadoop2.7, everything is ok(create, upsert)
>>
>>
>> So, if we shade the spark-avro is safe and will has better user
>> experience, and we needn't shade it when spark-avro module is not external
>> in spark project.
>>
>>
>> Thanks,
>> Lamber-Ken
>>
>>
>> [1]
>> https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>> [2]
>> https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
>>
>>
>>
>>
>>
>>
>>
>> At 2020-02-14 10:30:35, "Vinoth Chandar"  wrote:
>> >Just kicking this thread again, to make forward progress :)
>> >
>> >On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar  wrote:
>> >
>> >> First of all.. No apologies, no feeling bad.  We are all having fun
>> here..
>> >> :)
>> >>
>> >> I think we are all on the same page on the tradeoffs here.. let's see if
>> >> we can decide one way or other.
>> >>
>> >> Bundling spark-avro has better user experience, one less package to
>> >> remember adding. But even with the valid points raised by udit and
>> hmatu, I
>> >> was just worried about specific things in spark-avro that may not be
>> >> compatible with the spark version.. Can someone analyze how coupled
>> >> spark-avro is with rest of spark.. For e.g, what if the spark 3.x uses a
>> >> different avro version than spark 2.4.4 and when hudi-spark-bundle is
>> used
>> >> in a spark 3.x cluster, the spark-avro:2.4.4 won't work with that avro
>> >> version?
>> >>
>> >> If someone can provide data points on the above and if we can convince
>> >> ourselves that we can bundle a different spark-avro version (even
>> >> spark-avro:3.x on spark 2.x cluster), then I am happy to reverse my
>> >> position. Otherwise, if we might face a barrage of support issues with
>> >> NoClassDefFound /NoSuchMethodError etc, its not worth it IMO ..
>> >>
>> >> TBH longer term, I a

Re:Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-15 Thread lamberken


Hi @Vinoth, sorry for the delay; I wanted to ensure the following analysis is correct.


In the Hudi project, the spark-avro module is only used for converting between Spark's
struct type and an Avro schema, via just two methods,
`SchemaConverters.toAvroType` and `SchemaConverters.toSqlType`, both in the
`org.apache.spark.sql.avro.SchemaConverters` class.
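
For illustration, here is a small self-contained sketch of those two calls (a sketch
only, assuming spark-avro 2.4.x on the classpath; the struct fields below are made up
for the example):

import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaConvertersSketch {
  def main(args: Array[String]): Unit = {
    // Spark struct type -> Avro schema (record name is arbitrary here)
    val structType = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val avroSchema: Schema =
      SchemaConverters.toAvroType(structType, nullable = false, recordName = "example_record")

    // Avro schema -> Spark SQL type (round trip back to a struct)
    val sqlType = SchemaConverters.toSqlType(avroSchema).dataType

    println(avroSchema.toString(true))
    println(sqlType.simpleString)
  }
}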


Analysis:
1. The `SchemaConverters` class is the same in Spark master [1] and branch-3.0 [2].
2. From the import statements in `SchemaConverters`, we can see that it doesn't depend
   on other classes in the spark-avro module.
   Also, I tried moving it into the Hudi project under a different package, and the
   compilation goes through.


Using the Hudi jar with the shaded spark-avro module:
1. spark-2.4.4-bin-hadoop2.7: everything is OK (create, upsert)
2. spark-3.0.0-preview2-bin-hadoop2.7: everything is OK (create, upsert)


So shading spark-avro is safe and gives a better user experience, and we won't need
to shade it once the spark-avro module is no longer external in the Spark project.


Thanks,
Lamber-Ken


[1] 
https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
[2] 
https://github.com/apache/spark/blob/branch-3.0/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala







At 2020-02-14 10:30:35, "Vinoth Chandar"  wrote:
>Just kicking this thread again, to make forward progress :)
>
>On Thu, Feb 6, 2020 at 10:46 AM Vinoth Chandar  wrote:
>
>> First of all.. No apologies, no feeling bad.  We are all having fun here..
>> :)
>>
>> I think we are all on the same page on the tradeoffs here.. let's see if
>> we can decide one way or other.
>>
>> Bundling spark-avro has better user experience, one less package to
>> remember adding. But even with the valid points raised by udit and hmatu, I
>> was just worried about specific things in spark-avro that may not be
>> compatible with the spark version.. Can someone analyze how coupled
>> spark-avro is with rest of spark.. For e.g, what if the spark 3.x uses a
>> different avro version than spark 2.4.4 and when hudi-spark-bundle is used
>> in a spark 3.x cluster, the spark-avro:2.4.4 won't work with that avro
>> version?
>>
>> If someone can provide data points on the above and if we can convince
>> ourselves that we can bundle a different spark-avro version (even
>> spark-avro:3.x on spark 2.x cluster), then I am happy to reverse my
>> position. Otherwise, if we might face a barrage of support issues with
>> NoClassDefFound /NoSuchMethodError etc, its not worth it IMO ..
>>
>> TBH longer term, I am looking into if we can eliminate need for Row ->
>> Avro conversion that we need spark-avro for. But lets ignore that for
>> purposes of this discussion.
>>
>> Thanks
>> Vinoth
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Feb 5, 2020 at 10:54 PM hmatu  wrote:
>>
>>> Thanks for raising this! +1 to @Udit Mehrotra's point.
>>>
>>>
>>>  It's right that recommend users to actually build their  own hudi jars,
>>> with the spark version they use. It avoid the compatibility issues
>>>
>>> between user's local jars and pre-built hudi spark version(2.4.4).
>>>
>>> Or can remove "org.apache.spark:spark-avro_2.11:2.4.4"? Because user
>>> local env will contains that external dependency if they use avro.
>>>
>>> If not, to run hudi(release-0.5.1) is more complex for me, when using
>>> Delta Lake, it's more simpler:
>>> just "bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0"
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -- Original --
>>> From: "lamberken">> Date: Thu, Feb 6, 2020 07:42 AM
>>> To: "dev">>
>>> Subject: Re:[DISCUSS] Relocate spark-avro dependency by
>>> maven-shade-plugin
>>>
>>>
>>>
>>>
>>>
>>> Dear team,
>>>
>>>
>>> About this topic, there are some previous discussions in PR[1]. It's
>>> better to visit it carefully before chiming in, thanks.
>>>
>>>
>>> Current State:
>>> Lamber-Ken: +1
>>> Udit Mehrotra: +1
>>> Bhavani Sudha: -1
>>> Vinoth Chandar: -1
>>>
>>>
>>> Thanks,
>>> Lamber-Ken
>>>
>>>
>>>
>>> At 2020-02-06 06:10:52, "lamberken" >> >
>>> >
>>> >Dear team,
&

Re:Re: Please welcome our new PPMCs and Committer

2020-02-14 Thread lamberken


Congratulations to Leesf, Vino Yang and Siva, +1 very well deserved :) 
Best,
Lamber-Ken





At 2020-02-15 12:58:27, "vino yang"  wrote:
>Thanks, folks. It's a great honor.
>
>Hudi community is great! Let us continue to make Hudi better.
>
>Best,
>Vino
>
>Noway <957029...@qq.com> wrote on Sat, Feb 15, 2020 at 11:42 AM:
>
>> Congratulations to Vino Yang.
>> -- Original Message --
>> From: "vbal...@apache.org"> Sent: Saturday, Feb 15, 2020, 5:11 AM
>> To: "dev">
>> Subject: Re: Please welcome our new PPMCs and Committer
>>
>>
>>
>>  Congratulations to Leesf, Vino Yang and Siva.
>> +1 Very well deserved :) Looking forward to your continued contributions.
>> Balaji.V
>>     On Friday, February 14, 2020, 12:11:18 PM PST, Bhavani
>> Sudha >  
>>  Hearty congratulations to all of you - @leesf 
>> > @vinoyang
>> and @Sivabalan . Very well deserved.
>>
>> Thanks,
>> Sudha
>>
>> On Fri, Feb 14, 2020 at 11:58 AM Vinoth Chandar > wrote:
>>
>> > Hello all,
>> >
>> > I am incredibly excited to share that we have two new PPMC members :
>> > *leesf*
>> > and *vinoyang*, who have been doing such sustained, great work on the
>> > project over a good part of the last year! I and rest of the PPMC, do
>> hope
>> > there a bigger and better things to come!
>> >
>> > We also have a new committer : *Sivabalan*, who has stepped up to own
>> the
>> > indexing component in the past few months, and has already delivered
>> > several key contributions and currently driving some foundational
>> work on
>> > record level indexing.
>> >
>> > Please join me in congratulating them!
>> >
>> > Thanks
>> > Vinoth
>> >
>>  


Re:[DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-05 Thread lamberken


Dear team,


About this topic, there are some previous discussions in the PR [1]. It's better to
review them carefully before chiming in, thanks.


Current State:
Lamber-Ken: +1
Udit Mehrotra: +1
Bhavani Sudha: -1
Vinoth Chandar: -1


Thanks,
Lamber-Ken



At 2020-02-06 06:10:52, "lamberken"  wrote:
>
>
>Dear team,
>
>
>With the 0.5.1 version released, user need to add 
>`org.apache.spark:spark-avro_2.11:2.4.4` when starting hudi command, like 
>bellow
>/-/
>spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>  --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>/-/
>
>
>From spark-avro-guide[1], we know that the spark-avro module is external, it 
>is not exists in spark-2.4.4-bin-hadoop2.7.tgz.
>So may it's better to relocate spark-avro dependency by using 
>maven-shade-plugin. If so, user will starting hudi like 0.5.0 version does.
>/-/
>spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>/-/
>
>
>I created a pr to fix this[3], we may need have more discussion about this, 
>any suggestion is welcome, thanks very much :)
>Current state:
>@bhasudha : +1
>@vinoth   : -1
>
>
>[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>[2] 
>http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> 
>[3] https://github.com/apache/incubator-hudi/pull/1290
>


Re:Re: [DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-05 Thread lamberken


Hi @bhasudha,


No need to say sorry; I think this discussion is meaningful for the Hudi project.


Thanks,
Lamber-Ken











At 2020-02-06 07:07:49, "Bhavani Sudha"  wrote:
>Hi @lamberken Sorry I missed to see this earlier. I also left this comment
>in the PR. I think Vinoth brings up a valid point. Although your PR intends
>to make it easier for users to not care about scala 2.11 or scala 2.12, we
>also need to avoid coupling Hudi with specific spark_avro versions be it
>2.4.4 or 3.0-preview2.
>
>Please consider my vote as -1.
>
>Thanks,
>Sudha
>
>On Wed, Feb 5, 2020 at 2:11 PM lamberken  wrote:
>
>>
>>
>> Dear team,
>>
>>
>> With the 0.5.1 version released, user need to add
>> `org.apache.spark:spark-avro_2.11:2.4.4` when starting hudi command, like
>> bellow
>>
>> /-/
>> spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>>   --packages
>> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>> \
>>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>>
>> /-/
>>
>>
>> From spark-avro-guide[1], we know that the spark-avro module is external,
>> it is not exists in spark-2.4.4-bin-hadoop2.7.tgz.
>> So may it's better to relocate spark-avro dependency by using
>> maven-shade-plugin. If so, user will starting hudi like 0.5.0 version does.
>>
>> /-/
>> spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>>   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
>>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>>
>> /-/
>>
>>
>> I created a pr to fix this[3], we may need have more discussion about
>> this, any suggestion is welcome, thanks very much :)
>> Current state:
>> @bhasudha : +1
>> @vinoth   : -1
>>
>>
>> [1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
>> [2]
>> http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
>> [3] https://github.com/apache/incubator-hudi/pull/1290
>>
>>


[DISCUSS] Relocate spark-avro dependency by maven-shade-plugin

2020-02-05 Thread lamberken


Dear team,


With the 0.5.1 version released, users need to add
`org.apache.spark:spark-avro_2.11:2.4.4` when starting the Hudi command, like below:
/-/
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
/-/


From the spark-avro guide [1], we know that the spark-avro module is external; it does
not exist in spark-2.4.4-bin-hadoop2.7.tgz.
So it may be better to relocate the spark-avro dependency using
maven-shade-plugin. If so, users will start Hudi the same way the 0.5.0 version does:
/-/
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
/-/


I created a PR to fix this [3]; we may need more discussion about this. Any
suggestions are welcome, thanks very much :)
Current state:
@bhasudha : +1
@vinoth   : -1


[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
[2] 
http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz 
[3] https://github.com/apache/incubator-hudi/pull/1290



Re:Re: [DISCUSS] Unify Hudi code cleanup and improvement

2020-01-30 Thread lamberken


Hi @Vinoth @Vino


IMO, we can use the SonarQube [1] and SonarLint [2] tools to help us detect and fix
quality issues.


For a local env, follow the steps below:
--
1, docker run -d --name sonarqube -p 9000:9000 sonarqube
2, mvn sonar:sonar
3, http://localhost:9000
--


[1] https://www.sonarqube.org
[2] https://www.sonarlint.org




Thanks
Lamber-Ken




At 2020-01-27 03:25:02, "Vinoth Chandar"  wrote:
>Hi Vino,
>
>You raise a valid point on what "MINOR" PR should be. All JIRAs start out
>in "NEW" state and committers have to "Accept" the issue already (to force
>early conversations like this).
>
>May be we should draw some bounds on it like, "cannot be more than 50
>lines", "No functionality changes" .. etc? WDYT?  This seems to be the core
>of the issue.
>
>On Thu, Jan 23, 2020 at 4:17 PM vino yang  wrote:
>
>> Hi Vinoth,
>>
>> Thank you for your thoughts, I agree that focusing on some higher priority
>> work is more valuable.
>>
>> This discussion is to sort out and manage the work that the community is
>> already doing. There are currently some PRs working on this type of work,
>> such as PR[1][2][3][4]. The community has not given guidance on these
>> tasks. I think it's not very appropriate to open a "MINOR" PR directly. So,
>> I want to hear from the community and how to manage them more effectively.
>> The discussion does not encourage to give a higher priority to such work.
>>
>> We haven't stopped this kind of work, so we should provide effective
>> guidance and organization so that it doesn't look disorganized. WYDT?
>>
>> Best,
>> Vino
>>
>> [1]: https://github.com/apache/incubator-hudi/pull/1237
>> [2]: https://github.com/apache/incubator-hudi/pull/1139
>> [3]: https://github.com/apache/incubator-hudi/pull/1137
>> [4]: https://github.com/apache/incubator-hudi/pull/1136
>>
>> Vinoth Chandar  于2020年1月23日周四 下午1:20写道:
>>
>> > Hi,
>> >
>> > Thanks everyone for sharing your views!
>> >
>> > Some of this conversation is starting to feel like boiling the ocean. I
>> > believe in refactoring with purpose and discussing class-by-class or
>> > module-by-module does not make sense to me. Can we first list down what
>> we
>> > want to achieve? So far, I have only heard fixing IDE/IntelliJ warnings.
>> > Also instead of focussing on new work, how about looking at the pending
>> > JIRAs under "Testing" "Code Cleanup" components first and see if those
>> are
>> > worth tackling.
>> >
>> > We went down this path for code formatting and today we still have
>> > inconsistencies. Looking back, I feel we should have clearly defined end
>> > goals for the cleanups and we can then rank them based on ROI.
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Wed, Jan 22, 2020 at 7:05 PM vino yang  wrote:
>> >
>> > > Hi Shiyan and Bhavani:
>> > >
>> > > Thanks for sharing your thoughts.
>> > >
>> > > As I originally stated. The advantage of using modules as a unit to
>> split
>> > > work is that the decomposition is clear, but the disadvantage is that
>> the
>> > > volume of changes may be huge, which brings huge risks (considering
>> that
>> > > Hudi's test coverage is still not very high) and the workload of
>> review.
>> > > The advantage of splitting by class is that the volume of changes is
>> > small
>> > > and the review is more convenient, but the disadvantages are too many
>> > tasks
>> > > and high maintenance costs.
>> > >
>> > >
>> > > *In addition, we need to define the boundaries of the "code cleanup" I
>> > > expressed in this topic: it is limited to the smart tips shown by
>> > Intellij
>> > > IDEA. If the boundaries are too wide, then this discussion will lose
>> > > control.*
>> > > I agree with Bhavani that we don't take it as the actual goal. But we
>> are
>> > > not opposed to the community to help improve the quality of the code
>> > > (basically, these tips given by the IDE are more reasonable).
>> > >
>> > >
>> > > So, I still give my thoughts: We manage this work with Jira. Before we
>> > > start working, we need to find a committer as a mentor. The mentor must
>> > > decide whether the scale of the subtasks is reasonable and whether
>> > > additional unit tests need to be added to verify the changes. And the
>> > > mentor should be responsible for merged changes.
>> > >
>> > > What do you think?
>> > >
>> > > Best,
>> > > Vino
>> > >
>> > > Bhavani Sudha  于2020年1月22日周三 下午2:22写道:
>> > >
>> > > > Hi @vinoyang thanks for bringing this to discussion. I feel it would
>> be
>> > > > less disruptive to clean up code as part of individual classes being
>> > > > touched for a specific goal rather than code cleanup being the actual
>> > > goal.
>> > > > This would narrow the touch point and ensure test coverage (both unit
>> > and
>> > > > integration tests)  catches any accidental/unintentional changes.
>> Also
>> > it
>> > > > would give chance to chan

Re:Re: Re: Re: [DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-23 Thread lamberken


Thanks @Vino :)


Thanks
Lamber-Ken





At 2020-01-24 10:18:28, "vino yang"  wrote:
>Hi Lamber,
>
>+1 from my side.
>
>Best,
>Vino
>
>lamberken  于2020年1月24日周五 上午7:11写道:
>
>>
>>
>> Thanks you all. :)
>>
>>
>> Hi @nishith, good catch. I fixed it.
>>
>> https://github.com/apache/incubator-hudi/pull/1276/files?short_path=55fa8a8#diff-55fa8a81e6bf8c8d9d11d293b41511b5
>>
>>
>> Thanks
>> Lamber-Ken
>>
>>
>>
>>
>>
>>
>>
>> At 2020-01-24 04:43:02, "nishith agarwal"  wrote:
>> >+1 looks great
>> >
>> >Nit : I see that the old diagram has "Raw Ingest Tables" vs the new one
>> >"Row Ingest Tables". IMO, "Raw Ingest Tables" sounds more logical.
>> >
>> >-Nishith
>> >
>> >On Thu, Jan 23, 2020 at 10:57 AM Vinoth Chandar 
>> wrote:
>> >
>> >> +1. on that :)
>> >>
>> >> On Thu, Jan 23, 2020 at 10:22 AM hmatu <3480388...@qq.com> wrote:
>> >>
>> >> > The whole site looks better than old currently, big thanks for your
>> work!
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Hmatu
>> >> >
>> >> >
>> >> >
>> >> > -- Original --
>> >> > From: "Balaji Varadarajan"> >> > Date: Fri, Jan 24, 2020 01:21 AM
>> >> > To: "dev"> >> >
>> >> > Subject: Re: [DISCUSS] Redraw of hudi data lake architecture
>> diagram
>> >> > on langing page
>> >> >
>> >> >
>> >> >
>> >> >  +1 as well. Looks great.
>> >> > Balaji.V
>> >> >     On Thursday, January 23, 2020, 08:17:47 AM PST, Vinoth
>> >> > Chandar > >> >  
>> >> >  Looks good . +1 !
>> >> >
>> >> > On Wed, Jan 22, 2020 at 11:44 PM lamberken > wrote:
>> >> >
>> >> > >
>> >> > >
>> >> > > Hello everyone,
>> >> > >
>> >> > >
>> >> > > I redrawed the hudi data lake architecture diagram on landing
>> page.
>> >> > If you
>> >> > > have time, go ahead with hudi website[1] and test site[2].
>> >> > > Any thoughts are welcome, thanks very much. :)
>> >> > >
>> >> > >
>> >> > > [1] https://hudi.apache.org
>> >> > > [2] https://lamber-ken.github.io
>> >> > >
>> >> > >
>> >> > > Thanks
>> >> > > Lamber-Ken
>> >> >  
>> >>
>>


Re:Re: Re: [DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-23 Thread lamberken


Thank you all. :)


Hi @nishith, good catch. I fixed it. 
https://github.com/apache/incubator-hudi/pull/1276/files?short_path=55fa8a8#diff-55fa8a81e6bf8c8d9d11d293b41511b5


Thanks
Lamber-Ken







At 2020-01-24 04:43:02, "nishith agarwal"  wrote:
>+1 looks great
>
>Nit : I see that the old diagram has "Raw Ingest Tables" vs the new one
>"Row Ingest Tables". IMO, "Raw Ingest Tables" sounds more logical.
>
>-Nishith
>
>On Thu, Jan 23, 2020 at 10:57 AM Vinoth Chandar  wrote:
>
>> +1. on that :)
>>
>> On Thu, Jan 23, 2020 at 10:22 AM hmatu <3480388...@qq.com> wrote:
>>
>> > The whole site looks better than old currently, big thanks for your work!
>> >
>> >
>> > Thanks,
>> > Hmatu
>> >
>> >
>> >
>> > -- Original --
>> > From: "Balaji Varadarajan"> > Date: Fri, Jan 24, 2020 01:21 AM
>> > To: "dev"> >
>> > Subject: Re: [DISCUSS] Redraw of hudi data lake architecture diagram
>> > on langing page
>> >
>> >
>> >
>> >  +1 as well. Looks great.
>> > Balaji.V
>> >     On Thursday, January 23, 2020, 08:17:47 AM PST, Vinoth
>> > Chandar > >  
>> >  Looks good . +1 !
>> >
>> > On Wed, Jan 22, 2020 at 11:44 PM lamberken > >
>> > >
>> > >
>> > > Hello everyone,
>> > >
>> > >
>> > > I redrawed the hudi data lake architecture diagram on landing page.
>> > If you
>> > > have time, go ahead with hudi website[1] and test site[2].
>> > > Any thoughts are welcome, thanks very much. :)
>> > >
>> > >
>> > > [1] https://hudi.apache.org
>> > > [2] https://lamber-ken.github.io
>> > >
>> > >
>> > > Thanks
>> > > Lamber-Ken
>> >  
>>


Re:Re: [DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-23 Thread lamberken


Thanks @Balaji.V and @Vinoth.











At 2020-01-24 01:21:51, "Balaji Varadarajan"  wrote:
> +1 as well. Looks great.
>Balaji.V
>On Thursday, January 23, 2020, 08:17:47 AM PST, Vinoth Chandar 
>  wrote:  
> 
> Looks good . +1 !
>
>On Wed, Jan 22, 2020 at 11:44 PM lamberken  wrote:
>
>>
>>
>> Hello everyone,
>>
>>
>> I redrawed the hudi data lake architecture diagram on landing page. If you
>> have time, go ahead with hudi website[1] and test site[2].
>> Any thoughts are welcome, thanks very much. :)
>>
>>
>> [1] https://hudi.apache.org
>> [2] https://lamber-ken.github.io
>>
>>
>> Thanks
>> Lamber-Ken
>  


[DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-22 Thread lamberken


Hello everyone, 


I redrew the Hudi data lake architecture diagram on the landing page. If you have 
time, please take a look at the Hudi website[1] and the test site[2].
Any thoughts are welcome, thanks very much. :)


[1] https://hudi.apache.org
[2] https://lamber-ken.github.io


Thanks
Lamber-Ken

Re:Re: HUDI-555 & supporting docs for multiple versions

2020-01-18 Thread lamberken


Hello @Vinoth,


Those are very smart tricks, great job!
In the `_config.yml` file, we need to add the version to `previous_docs`:
//
previous_docs:
  - version: latest
en: /docs/quick-start-guide.html
cn: /cn/docs/quick-start-guide.html
  - version: 0.5.0
en: /docs/0.5.0-quick-start-guide.html
cn: /cn/docs/0.5.0-quick-start-guide.html
//


In the `navigation.yml` file, we need to add `0.5.0_cn_docs` like `0.5.0_docs`:
//
0.5.0_cn_docs:
  - title: Getting Started
children:
  - title: "Quick Start"
url: /cn/docs/0.5.0-quick-start-guide.html
  - title: "Use Cases"
url: /cn/docs/0.5.0-use_cases.html
  - title: "Talks & Powered By"
url: /cn/docs/0.5.0-powered_by.html
  - title: "Comparison"
url: /cn/docs/0.5.0-comparison.html
  - title: "Docker Demo"
url: /cn/docs/0.5.0-docker_demo.html

//


BTW, we can refer to the Flink project; I found that Flink has faced the same 
situation before.
Here is the Flink issue[1] which talks about building the website automatically. 
The Flink project has
resolved this problem in a simple way, so I think we can learn from it.


The solution uses Apache Buildbot[2], which can build and deploy snapshots 
automatically. It seems
to need a PMC member to complete the next steps.


[1] https://issues.apache.org/jira/browse/FLINK-1370 
[2] https://ci.apache.org/buildbot.html
[3] https://ci.apache.org/projects/flink/flink-docs-master 


thanks,
lamber-ken









At 2020-01-19 10:11:54, "Vinoth Chandar"  wrote:
>I figured out some tricks and gone ahead with a basic support
>https://hudi.apache.org/docs/0.5.0-quick-start-guide.html
>
>Feel free to fix this in a more elegant way, going forward, for 0.6.0
>release
>
>On Sat, Jan 18, 2020 at 2:10 PM Vinoth Chandar  wrote:
>
>> Hello all,
>>
>> I am looking at doing this, so we can preserve the 0.5.0 release docs for
>> users who can't move to 0.5.1. Any suggestions? esp lamberken?
>>
>> I tried adding a subfolder under _docs in the hope that it will get picked
>> up and new html generated.. but does not seem to work.
>>
>> Thanks
>> Vinoth
>>


Re:[ANNOUNCE] Hudi Weekly Community Update (2020-01-05 ~ 2020-01-12)

2020-01-12 Thread lamberken


Good Job !!



At 2020-01-13 09:05:07, "leesf"  wrote:
>Dear community,
>
>Nice to share Hudi community weekly update for 2020-01-05 ~ 2020-01-12 with
>updates on develpment, features, bug fixes.
>
>
>Development
>
>[Terminologies simplification] A full version to introduce the design and
>architecture of HUDI has been written[1], and you are welcome to
>contribute.
>[JDBC Incremental Puller] A disscussion about introducing JDBC Delta
>Streamer to make HUDI more powerful[2] has been started. and a RFC[3] has
>been draft for comments.
>[New Website] The PR provided by lamberKen to introduce new hudi web site
>has been merged, you would check it out[4] and kindly feedback are
>welcome[5].
>[Weekly update] A disscussion thread about giving a weekly update of hudi
>commnuity to expand the visibility of hudi.
>[Configuration refactor] A disscussion thread about refactoring the
>configuration framework of hudi is going to start [6].
>[Release] A disscussion about the code freeze date(Jan 15) for next release
>(0.5.1) reached a consensus.[7]
>
>[1] https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture
>[2]
>https://lists.apache.org/thread.html/r31b03a964c234e0903847ba60d9d7b340d0b59daa5232ae922a5b38d%40%3Cdev.hudi.apache.org%3E
>[3]
>https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller
>[4] https://hudi.apache.org/newsite-content/
>[5] https://github.com/apache/incubator-hudi/issues/1196
>[6]
>https://lists.apache.org/thread.html/1fd96c9ff258aa35c030d07b929fdc15c2ebe93b155e1067ff45259c%40%3Cdev.hudi.apache.org%3E
>[7]
>https://lists.apache.org/thread.html/r14291a41be93ff178f22faa292d5e2a09fc7c294b7d89216c132083a%40%3Cdev.hudi.apache.org%3E
>
>
>Features
>
>[DeltaStreamer] Adding Delete() support to DeltaStreamer[8]
>[Client] Refactor HoodieWriteClient so that commit logic can be shareable
>by both bootstrap and normal write operations[9]
>[Docs] Add a new maven profile to generate unified Javadoc for all Java and
>Scala classes[10]
>[Hive Integration] Optimize HoodieInputformat.listStatus() for faster Hive
>incremental queries on Hoodie[11]
>[Writer] added option to overwrite payload implementation in
>hoodie.properties file[12]
>[DeltaStreamer] Introduce Default partition path in
>TimestampBasedKeyGenerator[13]
>[Spark Integration] Replace Databricks spark-avro with native
>spark-avro[14]
>[Writer] Upgrade Hudi to Spark 2.4[15]
>[Utilities] Provide a custom time zone definition for
>TimestampBasedKeyGenerator[16]
>
>[8] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-377
>[9] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-417
>[10] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-319
>[11] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-25
>[12] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-114
>[13] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-406
>[14] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-91
>[15] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-12
>[16] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-502
>
>
>Bugs
>
>[Incremental Pull] Fix NPE when reading IncrementalPull.sqltemplate in
>HiveIncrementalPuller[17]
>[CLI] HoodieCommitMetadata only show first commit insert rows[18]
>[CLI] CLI doesn't allow rolling back a Delta commit[19]
>[DeltaStreamer] DeltaSteamer should pick checkpoints off only deltacommits
>for MOR tables[20]
>
>[17] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-484
>[18] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-469
>[19] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-248
>[20] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-322


Re:Re: Re: Re: Re: Re: Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2020-01-09 Thread lamberken


hi @Vinoth Chandar,


Got it, thanks.


best,
lamber-ken







At 2020-01-09 23:52:52, "Vinoth Chandar"  wrote:
>Hi lamber-ken,
>
>A ConfigOption class would be good indeed. +1 on starting incrementally
>with DataSource first and then iterating..
>
>Thanks
>Vinoth
>
>On Tue, Jan 7, 2020 at 6:58 PM lamberken  wrote:
>
>>
>>
>> Hi @Vinoth,
>>
>>
>> It's time to pick up this topic. Based on the content we talked about,
>> here are my thoughts
>>
>>
>> 1, Initial proposal aims to rework configuration framework
>> includes(DataSource and WriteClient level),
>> for compatibility, we can introduce a ConfigOption class and rework it on
>> DataSource level.
>>
>>
>> 2, It's very right that the scoped down version does not need a RFC[1], so
>> change state from 'Under Discussion' to 'Close' ?
>>
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project
>>
>>
>> Best,
>> Lamber-Ken
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2019-12-19 11:05:16, "Vinoth Chandar"  wrote:
>> >Sounds good.. This scoped down version per se, does not need a RFC.
>> >
>> >On Wed, Dec 18, 2019 at 3:09 PM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi @Vinoth
>> >>
>> >>
>> >> I understand what you mean, I will continue to work on this when I
>> finish
>> >> reworking the new UI. :)
>> >>
>> >>
>> >> best,
>> >> lamber-ken
>> >>
>> >>
>> >>
>> >>
>> >> At 2019-12-18 11:39:30, "Vinoth Chandar"  wrote:
>> >> >Expect most users to use inputDF.write() approach...  Uber uses the
>> lower
>> >> >level RDD apis, like the DeltaStreamer tool does..
>> >> >If we don't rename configs and still support a builder, it should be
>> fine.
>> >> >
>> >> >I think we can scope this down to introducing a ConfigOption class that
>> >> >ties, the key,value, default together.. That definitely seems like a
>> >> better
>> >> >abstraction.
>> >> >
>> >> >On Fri, Dec 13, 2019 at 5:18 PM lamberken  wrote:
>> >> >
>> >> >>
>> >> >>
>> >> >> Hi, @vinoth
>> >> >>
>> >> >>
>> >> >> Okay, I see. If we don't want existing users to do any upgrading or
>> >> >> reconfigurations, then this refactor work will not make much sense.
>> >> >> This issue can be closed, because ConfigOptions and these builders do
>> >> the
>> >> >> same things.
>> >> >> From another side, if we finish this work before a stable release, we
>> >> will
>> >> >> benefit a lot from it. We need to make a choice.
>> >> >>
>> >> >>
>> >> >> btw, I have a question that users will use HoodieWriteConfig /
>> >> >> HoodieWriteClient in their program?
>> >> >>
>> >> >>
>> >>
>> /
>> >> >> HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
>> >> >> .withPath(basePath)
>> >> >> .forTable(tableName)
>> >> >> .withSchema(schemaStr)
>> >> >> .withProps(props) // pass raw k,v pairs from a property file.
>> >> >>
>> >> >>
>> >>
>> .withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
>> >> >>
>> >> >> .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
>> >> >> ...
>> >> >> .build();
>> >> >>
>> >> >>
>> >>
>> /----
>> >> >> OR
>> >> >>
>> >> >>
>> >>
>> /
>> >> >> inputDF.write()
>> >> >> .format("org.apache.hudi")
>> >> >> .options(clientOpts) // any of th

Re:Re: Re: Re: Re: Re: Re: Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2020-01-09 Thread lamberken


Hi @Y Ethan Guo,


Thanks for your enthusiasm; most of the translation work has been done.
The next step is to sync the Chinese translation docs from the old site to the new site 
after the new site is fully released.


Best,
Lamber-Ken









At 2020-01-09 14:12:27, "Y Ethan Guo"  wrote:
>@lamber-ken  Great to hear that you've led the comms with ApacheCN!  Let me
>know if any help is needed.  I'm also willing to help the translation work.
>
>On Wed, Jan 8, 2020 at 3:58 PM lamberken  wrote:
>
>> Hello @Sudha,
>>
>>
>> You are welcome, no need to say sorry :) . I just did something within my
>> ability to promote hudi project, thanks.
>>
>>
>> Best,
>> Lamber-Ken
>>
>>
>>
>>
>>
>> At 2020-01-09 05:41:11, "Bhavani Sudha Saktheeswaran"
>>  wrote:
>> >Sorry for the late response. Just catching up on mailing list thread after
>> >vacation.
>> >@lamber-ken The new site looks cool. Thanks for the time and effort you
>> >have put into this.
>> >
>> >Thanks,
>> >Sudha
>> >
>> >
>> >
>> >On Tue, Jan 7, 2020 at 11:45 PM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi @Y Ethan Guo,
>> >>
>> >>
>> >> Thanks, I've already been in touch with ApacheCN.
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__hudi.apachecn.org&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=gJyseoGhOPV9H5GouGaXHxsbKLvxsku_7Z9SqOlAmK0&e=
>> >> is coming.
>> >>
>> >>
>> >> Best,
>> >> Lamber-Ken
>> >>
>> >> At 2020-01-08 15:21:51, "Y Ethan Guo"  wrote:
>> >> >@lamber-ken
>> >> >
>> >> >Got it.  It would be great if the ApacheCN organization can also help
>> >> >translation and promotion.
>> >> >
>> >> >The reason I'm asking about the Chinese docs is that the pages under
>> >> >"Documentation" (e.g.,
>> >> >
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__hudi.apache.org_newsite-2Dcontent_docs_writing-5Fdata.html&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=ZR16ZVJRPPS4lUbdX70bp15_nOsGiIfizlDOTVVpDHU&e=
>> >> ) already have
>> >> >the companion Chinese version on the old website (e.g.,
>> >> >
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__hudi.apache.org_cn_writing-5Fdata.html&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=fFTrXIduD5fGSt2DF2RFdF8bytizFfaWQVRuWjsMdQ0&e=
>> >> ).  So if it's not hard to port
>> >> >them to the new website, they are still useful for the users.
>> >> >
>> >> >Best,
>> >> >- Ethan
>> >> >
>> >> >On Tue, Jan 7, 2020 at 11:05 PM lamberken  wrote:
>> >> >
>> >> >>
>> >> >>
>> >> >> Hi @Y Ethan Guo,
>> >> >>
>> >> >>
>> >> >> Thank you very much for your advice, I'll consider adjusting the font
>> >> >> size.
>> >> >>
>> >> >>
>> >> >> For Chinese docs, I talked with @leesf about the chinese docs before,
>> >> our
>> >> >> initial aim is to help user to learn hudi quickly, we should not
>> >> translate
>> >> >> the whole site, it doesn't work very well.
>> >> >>
>> >> >>
>> >> >> We can discuss about chinese docs in a new thread, btw we can work
>> with
>> >> >> ApacheCN organization to translate and promote the hudi project.
>> >> Apachecn
>> >> >> organization has already translate manay popular projects, like
>> kafka,
>> >> >> flink, spark and etc.
>> >> >>
>> >> >>
>> >> >> ApacheCN & Projects
>> >> >>
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apachecn&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=ViJF5LL7QpBRHXivRf5OBLWh

Re:Re: Re: Re: Re: Re: Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2020-01-08 Thread lamberken
Hello @Sudha,


You are welcome, no need to say sorry :). I just did something within my 
ability to promote the Hudi project, thanks.


Best,
Lamber-Ken





At 2020-01-09 05:41:11, "Bhavani Sudha Saktheeswaran" 
 wrote:
>Sorry for the late response. Just catching up on mailing list thread after
>vacation.
>@lamber-ken The new site looks cool. Thanks for the time and effort you
>have put into this.
>
>Thanks,
>Sudha
>
>
>
>On Tue, Jan 7, 2020 at 11:45 PM lamberken  wrote:
>
>>
>>
>> Hi @Y Ethan Guo,
>>
>>
>> Thanks, I've already been in touch with ApacheCN.
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__hudi.apachecn.org&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=gJyseoGhOPV9H5GouGaXHxsbKLvxsku_7Z9SqOlAmK0&e=
>> is coming.
>>
>>
>> Best,
>> Lamber-Ken
>>
>> At 2020-01-08 15:21:51, "Y Ethan Guo"  wrote:
>> >@lamber-ken
>> >
>> >Got it.  It would be great if the ApacheCN organization can also help
>> >translation and promotion.
>> >
>> >The reason I'm asking about the Chinese docs is that the pages under
>> >"Documentation" (e.g.,
>> >
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__hudi.apache.org_newsite-2Dcontent_docs_writing-5Fdata.html&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=ZR16ZVJRPPS4lUbdX70bp15_nOsGiIfizlDOTVVpDHU&e=
>> ) already have
>> >the companion Chinese version on the old website (e.g.,
>> >
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__hudi.apache.org_cn_writing-5Fdata.html&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=fFTrXIduD5fGSt2DF2RFdF8bytizFfaWQVRuWjsMdQ0&e=
>> ).  So if it's not hard to port
>> >them to the new website, they are still useful for the users.
>> >
>> >Best,
>> >- Ethan
>> >
>> >On Tue, Jan 7, 2020 at 11:05 PM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi @Y Ethan Guo,
>> >>
>> >>
>> >> Thank you very much for your advice, I'll consider adjusting the font
>> >> size.
>> >>
>> >>
>> >> For Chinese docs, I talked with @leesf about the chinese docs before,
>> our
>> >> initial aim is to help user to learn hudi quickly, we should not
>> translate
>> >> the whole site, it doesn't work very well.
>> >>
>> >>
>> >> We can discuss about chinese docs in a new thread, btw we can work with
>> >> ApacheCN organization to translate and promote the hudi project.
>> Apachecn
>> >> organization has already translate manay popular projects, like kafka,
>> >> flink, spark and etc.
>> >>
>> >>
>> >> ApacheCN & Projects
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apachecn&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=ViJF5LL7QpBRHXivRf5OBLWhhVr4JMMpkrCM7uU0Ua8&e=
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.apachecn.org&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=medAR5LGJxTR8BDiMlszOpQVuKXcIcithelbvc1SK_Y&e=
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__kafka.apachecn.org&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=qa1stm_1K7Oib3ZO3aNZGDPKhjCBy6LwZfWqINrTae0&e=
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__flink.apachecn.org&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=i7OmLy8BjLNcxxt2okNp1VSlufHvf7M_r9D2yyfLEKc&e=
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__storm.apachecn.org&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=aCjSoCFupGfS7MZcvAg7nG5Dwm57SggFa42uPFaBdP4&s=aq4lFFhpvsR2L07HjtOSni3-osz7mjv5eelj-W0aFEY&e=
>> >>
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__spark.apachecn.org&d=DwIGbg&c=r2dcLCtU9q6n0vrtnDw9vg&

Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2020-01-08 Thread lamberken


Re:Re: Re: Re: Re: Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2020-01-07 Thread lamberken


Hi @Y Ethan Guo,


Thanks, I've already been in touch with ApacheCN.  http://hudi.apachecn.org is 
coming.


Best,
Lamber-Ken

At 2020-01-08 15:21:51, "Y Ethan Guo"  wrote:
>@lamber-ken
>
>Got it.  It would be great if the ApacheCN organization can also help
>translation and promotion.
>
>The reason I'm asking about the Chinese docs is that the pages under
>"Documentation" (e.g.,
>http://hudi.apache.org/newsite-content/docs/writing_data.html) already have
>the companion Chinese version on the old website (e.g.,
>http://hudi.apache.org/cn/writing_data.html).  So if it's not hard to port
>them to the new website, they are still useful for the users.
>
>Best,
>- Ethan
>
>On Tue, Jan 7, 2020 at 11:05 PM lamberken  wrote:
>
>>
>>
>> Hi @Y Ethan Guo,
>>
>>
>> Thank you very much for your advice, I'll consider adjusting the font
>> size.
>>
>>
>> For Chinese docs, I talked with @leesf about the chinese docs before, our
>> initial aim is to help user to learn hudi quickly, we should not translate
>> the whole site, it doesn't work very well.
>>
>>
>> We can discuss about chinese docs in a new thread, btw we can work with
>> ApacheCN organization to translate and promote the hudi project. Apachecn
>> organization has already translate manay popular projects, like kafka,
>> flink, spark and etc.
>>
>>
>> ApacheCN & Projects
>> https://github.com/apachecn
>> https://docs.apachecn.org
>> http://kafka.apachecn.org
>> http://flink.apachecn.org
>> http://storm.apachecn.org
>> http://spark.apachecn.org
>>
>>
>> Best,
>> Lamber-Ken
>>
>>
>>
>> At 2020-01-08 14:05:41, "Y Ethan Guo"  wrote:
>> >@lamber-ken,  Thanks for the great effort!  The new website looks slick,
>> >with a much better browsing experience.
>> >
>> >One thing I noticed is that there seems to be no link to the Chinese
>> >version of the docs on the new website.  Wondering where I can find them.
>> >
>> >Another minor thing is that the font size of the docs is bigger than the
>> >old one, so it takes more scrolls to the end of the page.  IMHO, one point
>> >smaller might be better.
>> >
>> >- Ethan
>> >
>> >On Tue, Jan 7, 2020 at 3:11 PM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi Pratyaksh Sharma,
>> >>
>> >>
>> >> Good catch!
>> >>
>> >> Best,
>> >> Lamber-ken
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> At 2020-01-07 21:50:54, "Pratyaksh Sharma" 
>> wrote:
>> >>
>> >> Hi lamberken,
>> >>
>> >>
>> >> Thank you for your efforts. The new website definitely looks a lot
>> better.
>> >>
>> >>
>> >> I found a minor issue. At the top where we are giving link to go back to
>> >> old site, the language seems incorrect. It says "Click here back to old
>> >> site". Rather it should be "Click here to go back to old site".
>> >>
>> >>
>> >> On Tue, Jan 7, 2020 at 12:22 PM Vinoth Chandar 
>> wrote:
>> >>
>> >> The new site is here http://hudi.apache.org/newsite-content/ if you are
>> >> wondering.. Based on the feedback, we plan to deprecate the old in the
>> next
>> >> week or so. So please chime in.
>> >>
>> >> Thanks Lamber-ken for the champion effort! As someone who shouldered the
>> >> site design since the beginning of the project, I am very happy to see
>> >> something better finally replace it :D
>> >>
>> >> On Mon, Jan 6, 2020 at 10:38 PM lamberken  wrote:
>> >>
>> >> >
>> >> >
>> >> > hello everyone,
>> >> >
>> >> >
>> >> > The new site has been merged into asf-site branch, the official
>> website
>> >> > has been updated, hope you all enjoy the new site style.
>> >> > Please visit the new web site, if you notice any issues, feel free
>> >> provide
>> >> > feedback.
>> >> >
>> >> >
>> >> > btw, thanks @Vinoth for reviewing carefully.
>> >> >
>> >> >
>> >> > best,
>> >> > lamber-ken
>> >> >
>> >> > At 2019

Re:Re: Re: Re: Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2020-01-07 Thread lamberken


Hi @Y Ethan Guo,


Thank you very much for your advice, I'll consider adjusting the font size. 


For the Chinese docs, I talked with @leesf about them before; our 
initial aim is to help users learn Hudi quickly. We should not translate the 
whole site, as that doesn't work very well. 


We can discuss the Chinese docs in a new thread. BTW, we can work with the 
ApacheCN organization to translate and promote the Hudi project. The ApacheCN 
organization has already translated many popular projects, like Kafka, Flink, 
Spark, etc.


ApacheCN & Projects
https://github.com/apachecn
https://docs.apachecn.org
http://kafka.apachecn.org
http://flink.apachecn.org
http://storm.apachecn.org
http://spark.apachecn.org


Best,
Lamber-Ken



At 2020-01-08 14:05:41, "Y Ethan Guo"  wrote:
>@lamber-ken,  Thanks for the great effort!  The new website looks slick,
>with a much better browsing experience.
>
>One thing I noticed is that there seems to be no link to the Chinese
>version of the docs on the new website.  Wondering where I can find them.
>
>Another minor thing is that the font size of the docs is bigger than the
>old one, so it takes more scrolls to the end of the page.  IMHO, one point
>smaller might be better.
>
>- Ethan
>
>On Tue, Jan 7, 2020 at 3:11 PM lamberken  wrote:
>
>>
>>
>> Hi Pratyaksh Sharma,
>>
>>
>> Good catch!
>>
>> Best,
>> Lamber-ken
>>
>>
>>
>>
>>
>> At 2020-01-07 21:50:54, "Pratyaksh Sharma"  wrote:
>>
>> Hi lamberken,
>>
>>
>> Thank you for your efforts. The new website definitely looks a lot better.
>>
>>
>> I found a minor issue. At the top where we are giving link to go back to
>> old site, the language seems incorrect. It says "Click here back to old
>> site". Rather it should be "Click here to go back to old site".
>>
>>
>> On Tue, Jan 7, 2020 at 12:22 PM Vinoth Chandar  wrote:
>>
>> The new site is here http://hudi.apache.org/newsite-content/ if you are
>> wondering.. Based on the feedback, we plan to deprecate the old in the next
>> week or so. So please chime in.
>>
>> Thanks Lamber-ken for the champion effort! As someone who shouldered the
>> site design since the beginning of the project, I am very happy to see
>> something better finally replace it :D
>>
>> On Mon, Jan 6, 2020 at 10:38 PM lamberken  wrote:
>>
>> >
>> >
>> > hello everyone,
>> >
>> >
>> > The new site has been merged into asf-site branch, the official website
>> > has been updated, hope you all enjoy the new site style.
>> > Please visit the new web site, if you notice any issues, feel free
>> provide
>> > feedback.
>> >
>> >
>> > btw, thanks @Vinoth for reviewing carefully.
>> >
>> >
>> > best,
>> > lamber-ken
>> >
>> > At 2019-12-21 07:59:25, "Vinoth Chandar"  wrote:
>> > >Hi lamber,
>> > >
>> > >Given we have enough +1s on the look and feel aspects, I propose we
>> open a
>> > >PR and iron out the content/remaining issues there one by one.
>> > >
>> > >I think a full line by line review is the best way to go, as with any
>> code
>> > >change
>> > >
>> > >Please share the PR here once you have it
>> > >
>> > >Thanks
>> > >Vinoth
>> > >
>> > >On Fri, Dec 20, 2019 at 3:55 PM lamberken  wrote:
>> > >
>> > >>
>> > >>
>> > >> Hi leesf,
>> > >>
>> > >>
>> > >> Thank you for your affirmation.
>> > >>
>> > >>
>> > >> best,
>> > >> lamber-ken
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> At 2019-12-21 07:28:50, "leesf"  wrote:
>> > >>
>> > >> Hi lamber,
>> > >>
>> > >>
>> > >> Thanks for your great work, the new website looks much better.
>> > >>
>> > >>
>> > >> Also if you guys have other companies(logos) needed to add to powered
>> > >> by(Hudi Users)[1], please let lamberken/me know before using new
>> > website.
>> > >>
>> > >>
>> > >> Best,
>> > >> Leesf
>> > >>
>> > >>
>> > >> [1] https://lamber-ken.github.io/
>> > >>

Re:Re: Re: Re: Re: Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2020-01-07 Thread lamberken


Hi @Vinoth,


It's time to pick up this topic. Based on the content we talked about, here are 
my thoughts:


1. The initial proposal aims to rework the configuration framework at both the DataSource 
and WriteClient levels; 
for compatibility, we can introduce a ConfigOption class and rework it at the 
DataSource level first (a rough sketch follows below).


2. It's right that the scoped-down version does not need an RFC[1], so shall we 
change its state from 'Under Discussion' to 'Closed'?


[1] 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project
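

Here is a rough sketch of what such a ConfigOption could look like. The class shape, 
the sample key and the default value below are only illustrative assumptions, not the 
final API:
/-/
// Illustrative sketch only: a ConfigOption ties the key, the default value and an
// optional doc string together, instead of spreading separate *_OPT_KEY /
// *_DEFAULT_OPT_VAL string constants around the codebase.
public final class ConfigOption<T> {

  private final String key;
  private final T defaultValue;
  private final String doc;

  private ConfigOption(String key, T defaultValue, String doc) {
    this.key = key;
    this.defaultValue = defaultValue;
    this.doc = doc;
  }

  // Short factory so call sites read as: ConfigOption.key("...", "uuid")
  public static <T> ConfigOption<T> key(String key, T defaultValue) {
    return new ConfigOption<>(key, defaultValue, "");
  }

  public ConfigOption<T> withDoc(String doc) {
    return new ConfigOption<>(this.key, this.defaultValue, doc);
  }

  public String key() { return key; }

  public T defaultValue() { return defaultValue; }

  public String doc() { return doc; }
}

// Hypothetical migration of one existing DataSource option. The key string stays
// the same, so existing jobs keep working without any reconfiguration:
final class DataSourceWriteOptionsSketch {
  static final ConfigOption<String> RECORDKEY_FIELD = ConfigOption
      .key("hoodie.datasource.write.recordkey.field", "uuid")
      .withDoc("Record key field used to uniquely identify a record");
}
/-/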


Best,
Lamber-Ken









At 2019-12-19 11:05:16, "Vinoth Chandar"  wrote:
>Sounds good.. This scoped down version per se, does not need a RFC.
>
>On Wed, Dec 18, 2019 at 3:09 PM lamberken  wrote:
>
>>
>>
>> Hi @Vinoth
>>
>>
>> I understand what you mean, I will continue to work on this when I finish
>> reworking the new UI. :)
>>
>>
>> best,
>> lamber-ken
>>
>>
>>
>>
>> At 2019-12-18 11:39:30, "Vinoth Chandar"  wrote:
>> >Expect most users to use inputDF.write() approach...  Uber uses the lower
>> >level RDD apis, like the DeltaStreamer tool does..
>> >If we don't rename configs and still support a builder, it should be fine.
>> >
>> >I think we can scope this down to introducing a ConfigOption class that
>> >ties, the key,value, default together.. That definitely seems like a
>> better
>> >abstraction.
>> >
>> >On Fri, Dec 13, 2019 at 5:18 PM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi, @vinoth
>> >>
>> >>
>> >> Okay, I see. If we don't want existing users to do any upgrading or
>> >> reconfigurations, then this refactor work will not make much sense.
>> >> This issue can be closed, because ConfigOptions and these builders do
>> the
>> >> same things.
>> >> From another side, if we finish this work before a stable release, we
>> will
>> >> benefit a lot from it. We need to make a choice.
>> >>
>> >>
>> >> btw, I have a question that users will use HoodieWriteConfig /
>> >> HoodieWriteClient in their program?
>> >>
>> >>
>> /
>> >> HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
>> >> .withPath(basePath)
>> >> .forTable(tableName)
>> >> .withSchema(schemaStr)
>> >> .withProps(props) // pass raw k,v pairs from a property file.
>> >>
>> >>
>> .withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
>> >>
>> >> .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
>> >> ...
>> >> .build();
>> >>
>> >>
>> /
>> >> OR
>> >>
>> >>
>> /
>> >> inputDF.write()
>> >> .format("org.apache.hudi")
>> >> .options(clientOpts) // any of the Hudi client opts can be passed in
>> >> as well
>> >> .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
>> "_row_key")
>> >>     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
>> >> "partition")
>> >> .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
>> "timestamp")
>> >> .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> >> .mode(SaveMode.Append)
>> >> .save(basePath);
>> >>
>> >>
>> /
>> >>
>> >>
>> >>
>> >>
>> >> best,
>> >> lamber-ken
>> >>
>> >> At 2019-12-14 08:43:06, "Vinoth Chandar"  wrote:
>> >> >Hi,
>> >> >
>> >> >Are you saying these classes needs to change? If so, understood. But
>> are
>> >> >you planning on renaming configs or relocating them? We dont want
>> existing
>> >> >users to do any upgrading or reconfigurations
>> >> >
>> >> >On Fri, Dec 13, 2019 at 10:28 AM lamberken  wrote:
>> >> >
>

Re:Re: Re: Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2020-01-07 Thread lamberken


Hi Pratyaksh Sharma,


Good catch!

Best,
Lamber-ken





At 2020-01-07 21:50:54, "Pratyaksh Sharma"  wrote:

Hi lamberken, 


Thank you for your efforts. The new website definitely looks a lot better. 


I found a minor issue. At the top where we are giving link to go back to old 
site, the language seems incorrect. It says "Click here back to old site". 
Rather it should be "Click here to go back to old site".


On Tue, Jan 7, 2020 at 12:22 PM Vinoth Chandar  wrote:

The new site is here http://hudi.apache.org/newsite-content/ if you are
wondering.. Based on the feedback, we plan to deprecate the old in the next
week or so. So please chime in.

Thanks Lamber-ken for the champion effort! As someone who shouldered the
site design since the beginning of the project, I am very happy to see
something better finally replace it :D

On Mon, Jan 6, 2020 at 10:38 PM lamberken  wrote:

>
>
> hello everyone,
>
>
> The new site has been merged into asf-site branch, the official website
> has been updated, hope you all enjoy the new site style.
> Please visit the new web site, if you notice any issues, feel free provide
> feedback.
>
>
> btw, thanks @Vinoth for reviewing carefully.
>
>
> best,
> lamber-ken
>
> At 2019-12-21 07:59:25, "Vinoth Chandar"  wrote:
> >Hi lamber,
> >
> >Given we have enough +1s on the look and feel aspects, I propose we open a
> >PR and iron out the content/remaining issues there one by one.
> >
> >I think a full line by line review is the best way to go, as with any code
> >change
> >
> >Please share the PR here once you have it
> >
> >Thanks
> >Vinoth
> >
> >On Fri, Dec 20, 2019 at 3:55 PM lamberken  wrote:
> >
> >>
> >>
> >> Hi leesf,
> >>
> >>
> >> Thank you for your affirmation.
> >>
> >>
> >> best,
> >> lamber-ken
> >>
> >>
> >>
> >>
> >>
> >> At 2019-12-21 07:28:50, "leesf"  wrote:
> >>
> >> Hi lamber,
> >>
> >>
> >> Thanks for your great work, the new website looks much better.
> >>
> >>
> >> Also if you guys have other companies(logos) needed to add to powered
> >> by(Hudi Users)[1], please let lamberken/me know before using new
> website.
> >>
> >>
> >> Best,
> >> Leesf
> >>
> >>
> >> [1] https://lamber-ken.github.io/
> >>
> >>
> >> lamberken  于2019年12月20日周五 上午9:29写道:
> >>
> >>
> >>
> >> Hi nishith,
> >>
> >>
> >> Thank you for your affirmation. The content in the blue box is to help
> us
> >> understand the highlighted content.
> >> It is different from the body content, so we need it. There are several
> >> ways to present it, for examples.
> >>
> >>
> >> best,
> >> lamber-ken
> >>
> >>
> >>
> >> At 2019-12-20 05:57:16, "nishith agarwal"  wrote:
> >> >Great job Lamber!
> >> >
> >> >The website looks really slick and has a much better experience of
> moving
> >> >from one page to another (mostly I think because it's faster), also
> find
> >> it
> >> >the text much more conducive to absorb.
> >> >
> >> >While going through the quick start, I noticed that under the
> highlighted
> >> >box in dark (showing the code pieces), there's another highlighted box
> (in
> >> >light blue) which talks about more details. Do we need that ? May be
> the
> >> >details in that box can just follow the plain text style of other
> parts on
> >> >that page.
> >> >
> >> >-Nishith
> >> >
> >> >On Wed, Dec 18, 2019 at 10:59 PM vino yang 
> wrote:
> >> >
> >> >> Hi Lamber,
> >> >>
> >> >> Awesome! Thanks for your hard work.
> >> >>
> >> >> Best,
> >> >> Vino
> >> >>
> >> >> lamberken  于2019年12月19日周四 下午2:11写道:
> >> >>
> >> >> >
> >> >> >
> >> >> > Hi everyone,
> >> >> >
> >> >> >
> >> >> > I finished the rework of the new UI, if you have time, please visit
> >> the
> >> >> > website[1].
> >> >> > Any questions are welcome.
> >> >> >
> >> >> >
> >> >> > [1]h

Re:Re: Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2020-01-06 Thread lamberken


hello everyone,


The new site has been merged into the asf-site branch and the official website has 
been updated; hope you all enjoy the new site style.
Please visit the new website, and if you notice any issues, feel free to provide 
feedback. 


btw, thanks @Vinoth for reviewing carefully.


best,
lamber-ken

At 2019-12-21 07:59:25, "Vinoth Chandar"  wrote:
>Hi lamber,
>
>Given we have enough +1s on the look and feel aspects, I propose we open a
>PR and iron out the content/remaining issues there one by one.
>
>I think a full line by line review is the best way to go, as with any code
>change
>
>Please share the PR here once you have it
>
>Thanks
>Vinoth
>
>On Fri, Dec 20, 2019 at 3:55 PM lamberken  wrote:
>
>>
>>
>> Hi leesf,
>>
>>
>> Thank you for your affirmation.
>>
>>
>> best,
>> lamber-ken
>>
>>
>>
>>
>>
>> At 2019-12-21 07:28:50, "leesf"  wrote:
>>
>> Hi lamber,
>>
>>
>> Thanks for your great work, the new website looks much better.
>>
>>
>> Also if you guys have other companies(logos) needed to add to powered
>> by(Hudi Users)[1], please let lamberken/me know before using new website.
>>
>>
>> Best,
>> Leesf
>>
>>
>> [1] https://lamber-ken.github.io/
>>
>>
>> lamberken  于2019年12月20日周五 上午9:29写道:
>>
>>
>>
>> Hi nishith,
>>
>>
>> Thank you for your affirmation. The content in the blue box is to help us
>> understand the highlighted content.
>> It is different from the body content, so we need it. There are several
>> ways to present it, for examples.
>>
>>
>> best,
>> lamber-ken
>>
>>
>>
>> At 2019-12-20 05:57:16, "nishith agarwal"  wrote:
>> >Great job Lamber!
>> >
>> >The website looks really slick and has a much better experience of moving
>> >from one page to another (mostly I think because it's faster), also find
>> it
>> >the text much more conducive to absorb.
>> >
>> >While going through the quick start, I noticed that under the highlighted
>> >box in dark (showing the code pieces), there's another highlighted box (in
>> >light blue) which talks about more details. Do we need that ? May be the
>> >details in that box can just follow the plain text style of other parts on
>> >that page.
>> >
>> >-Nishith
>> >
>> >On Wed, Dec 18, 2019 at 10:59 PM vino yang  wrote:
>> >
>> >> Hi Lamber,
>> >>
>> >> Awesome! Thanks for your hard work.
>> >>
>> >> Best,
>> >> Vino
>> >>
>> >> lamberken  于2019年12月19日周四 下午2:11写道:
>> >>
>> >> >
>> >> >
>> >> > Hi everyone,
>> >> >
>> >> >
>> >> > I finished the rework of the new UI, if you have time, please visit
>> the
>> >> > website[1].
>> >> > Any questions are welcome.
>> >> >
>> >> >
>> >> > [1]https://lamber-ken.github.io/docs/quick-start-guide/
>> >> >
>> >> >
>> >> > best,
>> >> > lamber-ken
>> >> >
>> >> >
>> >> >
>> >> > At 2019-12-19 07:38:47, "lamberken"  wrote:
>> >> > >
>> >> > >
>> >> > >Hi @Shiyan Xu
>> >> > >
>> >> > >
>> >> > >Thanks. :)
>> >> > >best,
>> >> > >lamber-ken
>> >> > >
>> >> > >
>> >> > >At 2019-12-19 00:53:51, "Shiyan Xu" 
>> >> wrote:
>> >> > >>Thank you @lamber-ken for the work! It is definitely a greater
>> browsing
>> >> > >>experience.
>> >> > >>
>> >> > >>On Tue, Dec 17, 2019 at 8:28 PM lamberken 
>> wrote:
>> >> > >>
>> >> > >>>
>> >> > >>> Hi, @Vinoth
>> >> > >>>
>> >> > >>>
>> >> > >>>
>> >> > >>> I'm glad to hear your thoughts on the new UI, thanks. So we keep
>> its
>> >> > style
>> >> > >>> as it is now.
>> >> > >>> The development of new UI can be completed these days, any
>> questions
>> >> > are
>> >> >

Re:Re: Re: Re: Re: Facing issues when using HiveIncrementalPuller

2020-01-03 Thread lamberken


Hi Vinoth Chandar / Pratyaksh Sharma,


I reset many commits from git and checked whether HiveIncrementalPuller works 
normally. It seems that HiveIncrementalPuller has been broken for 
a long time.


For detailed reproduction steps, please visit HUDI-486 
<https://issues.apache.org/jira/browse/HUDI-486>


best,
lamber-ken





At 2020-01-01 09:15:01, "Vinoth Chandar"  wrote:
>This does sound like a fair bit of pain.
>I am wondering if it makes sense to change the integ-test setup/docker demo
>to use incremental  puller. Bunch of the packaging issues around jars, seem
>like regressions that the hudi-utilities is not a fat jar anymore?
>
>if there are nt any takers, I can also try my hand at fixing this, once I
>get done with few things on my end. left a comment on HUDI-485
>
>
>
>On Tue, Dec 31, 2019 at 4:19 PM lamberken  wrote:
>
>>
>>
>> Hi @Pratyaksh Sharma,
>>
>>
>> Thanks for your detail stackstrace and reproduce steps. And your
>> suggestion is reasonable.
>>
>>
>> 1, For NPE issue, please tracking pr #1167 <
>> https://github.com/apache/incubator-hudi/pull/1167>
>> 2, For TTransportException issue, I have a question that can other
>> statements be executed except create statement?
>>
>>
>> best,
>> lamber-ken
>>
>> At 2019-12-30 23:11:17, "Pratyaksh Sharma"  wrote:
>> >Thank you Lamberken, the above issue gets resolved with what you
>> suggested.
>> >However, still HiveIncrementalPuller is not working.
>> >Subsequently I found and fixed a bug raised here -
>> >https://issues.apache.org/jira/browse/HUDI-485.
>> >
>> >Currently I am facing the below exception when trying to run the create
>> >table statement on docker cluster. Any leads for solving this are welcome
>> -
>> >
>> >6811 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  -
>> >Exception when executing SQL
>> >
>> >java.sql.SQLException: org.apache.thrift.transport.TTransportException
>> >
>> >at
>>
>> >org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:399)
>> >
>> >at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.executeStatement(HiveIncrementalPuller.java:233)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:200)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
>> >
>> >Caused by: org.apache.thrift.transport.TTransportException
>> >
>> >at
>>
>> >org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>> >
>> >at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>> >
>> >at
>>
>> >org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
>> >
>> >at
>>
>> >org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
>> >
>> >at
>> org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
>> >
>> >at
>>
>> >org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:38)
>> >
>> >at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>> >
>> >at
>>
>> >org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
>> >
>> >at
>>
>> >org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
>> >
>> >at
>>
>> >org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
>> >
>> >at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>> >
>> >at
>>
>> >org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:467)
>> >
>> >at
>>
>> >org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:454)
>> >
>> >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >
>> >at
>>
>> >sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> >
>> >at
>

Re: Facing issues when using HiveIncrementalPuller

2020-01-01 Thread lamberken


Hi @Pratyaksh Sharma,


Okay, all right. BTW, thanks for raising this issue.


best,
lamber-ken


On 01/2/2020 13:47,Pratyaksh Sharma wrote:
Hi Lamberken,

I am also trying to fix this issue. Please let us know if you come up with
anything.

On Thu, Jan 2, 2020 at 11:12 AM lamberken  wrote:



Hi @Vinoth,


Got it, thank you for reminding me. I just made a mistake just now.


best,
lamber-ken


On 01/2/2020 13:08,Vinoth Chandar wrote:
Hi Lamber,

utilities-bundle has always been a fat jar.. I was talking about
hudi-utilities.
Sure. take a swing at it. Happy to help as needed

On Wed, Jan 1, 2020 at 8:57 PM lamberken  wrote:



Hi @Vinoth,


I'm willing to solve this problem. I'm trying to find out from the history
when hudi-utilities-bundle becoming not a fatjar.



Git History
2019-08-29 FAT-JAR ---> 5f9fa82f47e1cc14a22b869250fe23c8f9c033cd
2019-09-14 NOT-FATJAR ---> d2525c31b7dad7bae2d4899d8df2a353ca39af50
best,
lamber-ken


At 2020-01-01 09:15:01, "Vinoth Chandar"  wrote:
This does sound like a fair bit of pain.
I am wondering if it makes sense to change the integ-test setup/docker
demo
to use incremental  puller. Bunch of the packaging issues around jars,
seem
like regressions that the hudi-utilities is not a fat jar anymore?

if there are nt any takers, I can also try my hand at fixing this, once I
get done with few things on my end. left a comment on HUDI-485



On Tue, Dec 31, 2019 at 4:19 PM lamberken  wrote:



Hi @Pratyaksh Sharma,


Thanks for your detail stackstrace and reproduce steps. And your
suggestion is reasonable.


1, For NPE issue, please tracking pr #1167 <
https://github.com/apache/incubator-hudi/pull/1167>
2, For TTransportException issue, I have a question that can other
statements be executed except create statement?


best,
lamber-ken

At 2019-12-30 23:11:17, "Pratyaksh Sharma" 
wrote:
Thank you Lamberken, the above issue gets resolved with what you
suggested.
However, still HiveIncrementalPuller is not working.
Subsequently I found and fixed a bug raised here -
https://issues.apache.org/jira/browse/HUDI-485.

Currently I am facing the below exception when trying to run the create
table statement on docker cluster. Any leads for solving this are
welcome
-

6811 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  -
Exception when executing SQL

java.sql.SQLException: org.apache.thrift.transport.TTransportException

at



org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:399)

at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)

at



org.apache.hudi.utilities.HiveIncrementalPuller.executeStatement(HiveIncrementalPuller.java:233)

at



org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:200)

at



org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)

at



org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

Caused by: org.apache.thrift.transport.TTransportException

at



org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)

at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at



org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)

at



org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)

at
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)

at



org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:38)

at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at



org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)

at



org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)

at



org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)

at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)

at



org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:467)

at



org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:454)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at



sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at



sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at



org.apache.hive.jdbc.HiveConnection$SynchronizedHandler.invoke(HiveConnection.java:1524)

at com.sun.proxy.$Proxy5.GetOperationStatus(Unknown Source)

at



org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:367)

... 5 more

6812 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  -
Could
not close the resultset opened

java.sql.SQLException: org.apache.thrift.transport.TTransportException

at



org.apache.hive.jdbc.HiveStatement.closeC

Re: Facing issues when using HiveIncrementalPuller

2020-01-01 Thread lamberken


Hi @Vinoth,


Got it, thank you for reminding me. I just made a mistake just now.


best,
lamber-ken


On 01/2/2020 13:08,Vinoth Chandar wrote:
Hi Lamber,

utilities-bundle has always been a fat jar.. I was talking about
hudi-utilities.
Sure. take a swing at it. Happy to help as needed

On Wed, Jan 1, 2020 at 8:57 PM lamberken  wrote:



Hi @Vinoth,


I'm willing to solve this problem. I'm trying to find out from the history
when hudi-utilities-bundle becoming not a fatjar.



Git History
2019-08-29 FAT-JAR ---> 5f9fa82f47e1cc14a22b869250fe23c8f9c033cd
2019-09-14 NOT-FATJAR ---> d2525c31b7dad7bae2d4899d8df2a353ca39af50
best,
lamber-ken


At 2020-01-01 09:15:01, "Vinoth Chandar"  wrote:
This does sound like a fair bit of pain.
I am wondering if it makes sense to change the integ-test setup/docker
demo
to use incremental  puller. Bunch of the packaging issues around jars,
seem
like regressions that the hudi-utilities is not a fat jar anymore?

if there are nt any takers, I can also try my hand at fixing this, once I
get done with few things on my end. left a comment on HUDI-485



On Tue, Dec 31, 2019 at 4:19 PM lamberken  wrote:



Hi @Pratyaksh Sharma,


Thanks for your detail stackstrace and reproduce steps. And your
suggestion is reasonable.


1, For NPE issue, please tracking pr #1167 <
https://github.com/apache/incubator-hudi/pull/1167>
2, For TTransportException issue, I have a question that can other
statements be executed except create statement?


best,
lamber-ken

At 2019-12-30 23:11:17, "Pratyaksh Sharma" 
wrote:
Thank you Lamberken, the above issue gets resolved with what you
suggested.
However, still HiveIncrementalPuller is not working.
Subsequently I found and fixed a bug raised here -
https://issues.apache.org/jira/browse/HUDI-485.

Currently I am facing the below exception when trying to run the create
table statement on docker cluster. Any leads for solving this are
welcome
-

6811 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  -
Exception when executing SQL

java.sql.SQLException: org.apache.thrift.transport.TTransportException

at


org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:399)

at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)

at


org.apache.hudi.utilities.HiveIncrementalPuller.executeStatement(HiveIncrementalPuller.java:233)

at


org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:200)

at


org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)

at


org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

Caused by: org.apache.thrift.transport.TTransportException

at


org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)

at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at


org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)

at


org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)

at
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)

at


org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:38)

at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at


org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)

at


org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)

at


org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)

at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)

at


org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:467)

at


org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:454)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at


sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at


sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at


org.apache.hive.jdbc.HiveConnection$SynchronizedHandler.invoke(HiveConnection.java:1524)

at com.sun.proxy.$Proxy5.GetOperationStatus(Unknown Source)

at


org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:367)

... 5 more

6812 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  -
Could
not close the resultset opened

java.sql.SQLException: org.apache.thrift.transport.TTransportException

at


org.apache.hive.jdbc.HiveStatement.closeClientOperation(HiveStatement.java:214)

at org.apache.hive.jdbc.HiveStatement.close(HiveStatement.java:231)

at


org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:165)

at


org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)

Caused by: org.apach

Re:Re: Re: Re: Re: Facing issues when using HiveIncrementalPuller

2020-01-01 Thread lamberken


Hi @Vinoth,


I'm willing to solve this problem. I'm trying to find out from the history when 
hudi-utilities-bundle stopped being a fat jar.



Git History
2019-08-29 FAT-JAR ---> 5f9fa82f47e1cc14a22b869250fe23c8f9c033cd
2019-09-14 NOT-FATJAR ---> d2525c31b7dad7bae2d4899d8df2a353ca39af50
best,
lamber-ken


At 2020-01-01 09:15:01, "Vinoth Chandar"  wrote:
>This does sound like a fair bit of pain.
>I am wondering if it makes sense to change the integ-test setup/docker demo
>to use incremental  puller. Bunch of the packaging issues around jars, seem
>like regressions that the hudi-utilities is not a fat jar anymore?
>
>if there are nt any takers, I can also try my hand at fixing this, once I
>get done with few things on my end. left a comment on HUDI-485
>
>
>
>On Tue, Dec 31, 2019 at 4:19 PM lamberken  wrote:
>
>>
>>
>> Hi @Pratyaksh Sharma,
>>
>>
>> Thanks for your detail stackstrace and reproduce steps. And your
>> suggestion is reasonable.
>>
>>
>> 1, For NPE issue, please tracking pr #1167 <
>> https://github.com/apache/incubator-hudi/pull/1167>
>> 2, For TTransportException issue, I have a question that can other
>> statements be executed except create statement?
>>
>>
>> best,
>> lamber-ken
>>
>> At 2019-12-30 23:11:17, "Pratyaksh Sharma"  wrote:
>> >Thank you Lamberken, the above issue gets resolved with what you
>> suggested.
>> >However, still HiveIncrementalPuller is not working.
>> >Subsequently I found and fixed a bug raised here -
>> >https://issues.apache.org/jira/browse/HUDI-485.
>> >
>> >Currently I am facing the below exception when trying to run the create
>> >table statement on docker cluster. Any leads for solving this are welcome
>> -
>> >
>> >6811 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  -
>> >Exception when executing SQL
>> >
>> >java.sql.SQLException: org.apache.thrift.transport.TTransportException
>> >
>> >at
>>
>> >org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:399)
>> >
>> >at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.executeStatement(HiveIncrementalPuller.java:233)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:200)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
>> >
>> >at
>>
>> >org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
>> >
>> >Caused by: org.apache.thrift.transport.TTransportException
>> >
>> >at
>>
>> >org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>> >
>> >at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>> >
>> >at
>>
>> >org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
>> >
>> >at
>>
>> >org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
>> >
>> >at
>> org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
>> >
>> >at
>>
>> >org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:38)
>> >
>> >at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>> >
>> >at
>>
>> >org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
>> >
>> >at
>>
>> >org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
>> >
>> >at
>>
>> >org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
>> >
>> >at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>> >
>> >at
>>
>> >org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:467)
>> >
>> >at
>>
>> >org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:454)
>> >
>> >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >
>> >at
>>
>> >sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> >
>> &

Re:Re: Re: Re: Facing issues when using HiveIncrementalPuller

2019-12-31 Thread lamberken


Hi @Pratyaksh Sharma,


Thanks for your detailed stack trace and reproduction steps. Your suggestion is 
reasonable.


1, For the NPE issue, please track PR #1167 
<https://github.com/apache/incubator-hudi/pull/1167>
2, For the TTransportException issue, I have a question: can statements other than 
the create statement be executed?


best,
lamber-ken

At 2019-12-30 23:11:17, "Pratyaksh Sharma"  wrote:
>Thank you Lamberken, the above issue gets resolved with what you suggested.
>However, still HiveIncrementalPuller is not working.
>Subsequently I found and fixed a bug raised here -
>https://issues.apache.org/jira/browse/HUDI-485.
>
>Currently I am facing the below exception when trying to run the create
>table statement on docker cluster. Any leads for solving this are welcome -
>
>6811 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  -
>Exception when executing SQL
>
>java.sql.SQLException: org.apache.thrift.transport.TTransportException
>
>at
>org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:399)
>
>at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
>
>at
>org.apache.hudi.utilities.HiveIncrementalPuller.executeStatement(HiveIncrementalPuller.java:233)
>
>at
>org.apache.hudi.utilities.HiveIncrementalPuller.executeIncrementalSQL(HiveIncrementalPuller.java:200)
>
>at
>org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:157)
>
>at
>org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
>
>Caused by: org.apache.thrift.transport.TTransportException
>
>at
>org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>
>at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
>at
>org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
>
>at
>org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
>
>at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
>
>at
>org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:38)
>
>at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
>at
>org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
>
>at
>org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
>
>at
>org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
>
>at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>
>at
>org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:467)
>
>at
>org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:454)
>
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
>at
>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
>at
>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
>at java.lang.reflect.Method.invoke(Method.java:498)
>
>at
>org.apache.hive.jdbc.HiveConnection$SynchronizedHandler.invoke(HiveConnection.java:1524)
>
>at com.sun.proxy.$Proxy5.GetOperationStatus(Unknown Source)
>
>at
>org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:367)
>
>... 5 more
>
>6812 [main] ERROR org.apache.hudi.utilities.HiveIncrementalPuller  - Could
>not close the resultset opened
>
>java.sql.SQLException: org.apache.thrift.transport.TTransportException
>
>at
>org.apache.hive.jdbc.HiveStatement.closeClientOperation(HiveStatement.java:214)
>
>at org.apache.hive.jdbc.HiveStatement.close(HiveStatement.java:231)
>
>at
>org.apache.hudi.utilities.HiveIncrementalPuller.saveDelta(HiveIncrementalPuller.java:165)
>
>at
>org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:345)
>
>Caused by: org.apache.thrift.transport.TTransportException
>
>at
>org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>
>at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
>at
>org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:374)
>
>at
>org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:451)
>
>at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:433)
>
>at
>org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:38)
>
>at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
>at
>org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
>
>at
>org.apache.thrift.protocol.TBinaryProtocol.readI32(

Re:Re: Re:How to write a performance test program

2019-12-30 Thread lamberken


Hi @mayu1,



This issue has been fixed in the master branch 
<https://github.com/apache/incubator-hudi>; you can check it out, build from source, 
and continue your test program.


Looking forward to your feedback on whether the problem has been solved.



best,
lamber-ken



At 2019-12-26 06:43:55, "Vinoth Chandar"  wrote:
>Filed HUDI-468 - Not an avro data file - error while archiving post
>rename() change <https://issues.apache.org/jira/browse/HUDI-468> to track
>this
>
>On Mon, Dec 23, 2019 at 11:40 PM ma...@bonc.com.cn 
>wrote:
>
>> Thank you, I have replaced it with hubi-spark-bundle-0.5.0-incubating.jar,
>> and the program seems to be stable.
>>
>> --
>> ma...@bonc.com.cn
>>
>>
>> *From:* lamberken 
>> *Date:* 2019-12-24 11:24
>> *To:* dev 
>> *Subject:* Re:How to write a performance test program
>>
>> Hi @mayu1,
>>
>> I guess you used the latest master branch, this bug seems happened after
>> HUDI-398 merged.
>> I met the same exception, and I am trying to fix it [1].
>>
>> You can try to build source before that commit, then continue your test.
>>
>> [1] https://issues.apache.org/jira/browse/HUDI-453
>>
>> best,
>> lamber-ken
>>
>>
>>
>> At 2019-12-24 11:11:41, "ma...@bonc.com.cn"  wrote:
>> >hello!
>> >I want to modify the quickstart program for performance testing and 
>> >generate a dataset of ten million rows. However, the program will report an 
>> >error after running it multiple times.
>> >
>> >error:
>> >Exception in thread "main" org.apache.hudi.exception.HoodieCommitException: 
>> >Failed to archive commits
>> >at 
>> >org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:266)
>> >at 
>> >org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
>> >at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
>> >at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
>> >at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
>> >at 
>> >org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:152)
>> >at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>> >at 
>> >org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>> >at 
>> >org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>> >at 
>> >org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>> >at 
>> >org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>> >at 
>> >org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>> >at 
>> >org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>> >at 
>> >org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>> >at 
>> >org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>> >at 
>> >org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>> >at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>> >at 
>> >org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>> >at 
>> >org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>> >at 
>> >org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>> >at 
>> >org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>> >at 
>> >org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>> >at 
>> >org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>> >at 
>> >org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>> >at 
>> >org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>> >at 
>> >org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>> >at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>> >at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
>> >at

Re:Re: Re: Facing issues when using HiveIncrementalPuller

2019-12-30 Thread lamberken


Hi @Pratyaksh Sharma


Thanks for your steps to reproduce this issue. Try modifying the code below, and 
test again.


In org.apache.hudi.utilities.HiveIncrementalPuller#HiveIncrementalPuller:

String templateContent = FileIOUtils.readAsUTFString(this.getClass().getResourceAsStream("IncrementalPull.sqltemplate"));

changed to:

String templateContent = FileIOUtils.readAsUTFString(this.getClass().getResourceAsStream("/IncrementalPull.sqltemplate"));
 best,
lamber-ken





At 2019-12-30 19:25:08, "Pratyaksh Sharma"  wrote:
>Hi Vinoth,
>
>I am able to reproduce this error on docker setup and have filed a jira -
>https://issues.apache.org/jira/browse/HUDI-484.
>
>Steps to reproduce are mentioned in the jira description itself.
>
>On Thu, Dec 26, 2019 at 12:42 PM Pratyaksh Sharma 
>wrote:
>
>> Hi Vinoth,
>>
>> I will try to reproduce the error on docker cluster and keep you updated.
>>
>> On Tue, Dec 24, 2019 at 11:23 PM Vinoth Chandar  wrote:
>>
>>> Pratyaksh,
>>>
>>> If you are still having this issue, could you try reproducing this on the
>>> docker setup
>>>
>>> https://hudi.apache.org/docker_demo.html#step-7--incremental-query-for-copy-on-write-table
>>> similar to this and raise a JIRA.
>>> Happy to look into it and get it fixed if needed
>>>
>>> Thanks
>>> Vinoth
>>>
>>> On Tue, Dec 24, 2019 at 8:43 AM lamberken  wrote:
>>>
>>> >
>>> >
>>> > Hi, @Pratyaksh Sharma
>>> >
>>> >
>>> > The log4j-1.2.17.jar lib also needs to added to the classpath, for
>>> example:
>>> > java -cp
>>> >
>>> /path/to/hive-jdbc-2.3.1.jar:/path/to/log4j-1.2.17.jar:packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-0.5.1-SNAPSHOT.jar
>>> > org.apache.hudi.utilities.HiveIncrementalPuller --help
>>> >
>>> >
>>> > best,
>>> > lamber-ken
>>> >
>>> > At 2019-12-24 17:23:20, "Pratyaksh Sharma" 
>>> wrote:
>>> > >Hi Vinoth,
>>> > >
>>> > >Sorry my bad, I did not realise earlier that spark is not needed for
>>> this
>>> > >class. I tried running it with the below command to get the mentioned
>>> > >exception -
>>> > >
>>> > >Command -
>>> > >
>>> > >java -cp
>>> >
>>> >
>>> >/path/to/hive-jdbc-2.3.1.jar:packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-0.5.1-SNAPSHOT.jar
>>> > >org.apache.hudi.utilities.HiveIncrementalPuller --help
>>> > >
>>> > >Exception -
>>> > >Exception in thread "main" java.lang.NoClassDefFoundError:
>>> > >org/apache/log4j/LogManager
>>> > >at
>>> >
>>> >
>>> >org.apache.hudi.utilities.HiveIncrementalPuller.(HiveIncrementalPuller.java:64)
>>> > >Caused by: java.lang.ClassNotFoundException:
>>> org.apache.log4j.LogManager
>>> > >at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>> > >at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> > >at
>>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>>> > >at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> > >... 1 more
>>> > >
>>> > >I was able to fix it by including the corresponding jar in the bundle.
>>> > >
>>> > >After fixing the above, still I am getting the NPE even though the
>>> > template
>>> > >is bundled in the jar.
>>> > >
>>> > >On Mon, Dec 23, 2019 at 10:45 PM Vinoth Chandar 
>>> > wrote:
>>> > >
>>> > >> Hi Pratyaksh,
>>> > >>
>>> > >> HveIncrementalPuller is just a java program. Does not need Spark,
>>> since
>>> > it
>>> > >> just runs a HiveQL remotely..
>>> > >>
>>> > >> On the error you specified, seems like it can't find the template?
>>> Can
>>> > you
>>> > >> see if the bundle does not have the template file.. May be this got
>>> > broken
>>> > >> during the bundling changes.. (since its no longer part of the
>>> resources
>>> > >> folder of the b

Re: Commit time issue in DeltaStreamer (Real-Time)

2019-12-27 Thread lamberken


Hi @Shahida Khan,


In the past few days, I faced a similar issue. This bug seems to have appeared after 
HUDI-398 was merged. 
You can try building from source before that commit, then continue your work.


Here are the details:
https://lists.apache.org/thread.html/f7834b3389e67b2b66b65386f59eb6646942206865133300c0416a6a%40%3Cdev.hudi.apache.org%3E


best,
lamber-ken
On 12/27/2019 21:02,Shahida Khan wrote:
@lamberken, when I have checked, folder .aux was empty ...
:(

On Fri, 27 Dec 2019 at 6:28 PM, lamberken  wrote:



Hi @Shahida Khan,


I have a question that the size of *.clean.requested files is 0 ?


best,
lamber-ken




On 12/27/2019 19:54,Shahida Khan wrote:
Hi,

Greetings!!
I have currently using Delta Streamer and upserting data via hudi in
real-time.
Have used the latest master branch.
Job was running fine from last 10days, suddenly, most of the streaming job
started failing and below is the error which I am facing :

















*java.util.concurrent.ExecutionException:
org.apache.hudi.exception.HoodieException: Could not read commit
details from
hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)  at
org.apache.hudi.utilities.deltastreamer.AbstractDeltaStreamerService.waitForShutdown(AbstractDeltaStreamerService.java:72)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:117)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:297)
at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at
java.lang.reflect.Method.invoke(Method.java:498)  at

org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)Caused
by: org.apache.hudi.exception.HoodieException: Could not read commit
details from
hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:411)
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at
java.lang.Thread.run(Thread.java:748)*


It seems issue has already been raised : hudi-1128
<https://github.com/apache/incubator-hudi/pull/1128/files>
Is this issue related to same which i am facing..??


*Regards,*
*Shahida R. Khan*

--
Regards,
Shahida Rashid Khan
9167538366




kindly ignore typo error  Sent from handheld device ...*


Re:Commit time issue in DeltaStreamer (Real-Time)

2019-12-27 Thread lamberken


Hi @Shahida Khan,


I have a question: is the size of the *.clean.requested files 0?


best,
lamber-ken




On 12/27/2019 19:54,Shahida Khan wrote:
Hi,

Greetings!!
I have currently using Delta Streamer and upserting data via hudi in
real-time.
Have used the latest master branch.
Job was running fine from last 10days, suddenly, most of the streaming job
started failing and below is the error which I am facing :

















*java.util.concurrent.ExecutionException:
org.apache.hudi.exception.HoodieException: Could not read commit
details from 
hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
  at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)  at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)  at
org.apache.hudi.utilities.deltastreamer.AbstractDeltaStreamerService.waitForShutdown(AbstractDeltaStreamerService.java:72)
  at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:117)
  at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:297)
  at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)  
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at
java.lang.reflect.Method.invoke(Method.java:498)  at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)Caused
by: org.apache.hudi.exception.HoodieException: Could not read commit
details from 
hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
  at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:411)
  at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
  at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
 at
java.lang.Thread.run(Thread.java:748)*


It seems issue has already been raised : hudi-1128

Is this issue related to same which i am facing..??


*Regards,*
*Shahida R. Khan*


Re: insert too slow

2019-12-26 Thread lamberken
Hi @mayu1,


Here is the Tuning Guide[1]; it may help you improve performance.


[1]https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
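On the question quoted below about inserting 100 million records without OOM: besides 
the memory settings in the guide, one option to try is a bulk-style initial load with 
larger shuffle parallelism instead of looping over many small inserts. A rough sketch 
follows, reusing the field names and the df DataFrame from the program quoted further 
down; the bulk_insert operation key and the parallelism value are assumptions to adapt 
to your own cluster, not a verified recipe:

import org.apache.spark.sql.SaveMode._

val tableName = "hudi_perf_table"               // illustrative table name
val basePath = "file:///tmp/hudi_perf_table"    // illustrative path

// df is the DataFrame built in the test program quoted below
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.operation", "bulk_insert").  // assumed key, see the write configs in the docs
  option("hoodie.bulkinsert.shuffle.parallelism", "200").      // size this to your data volume and cluster
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", tableName).
  mode(Append).
  save(basePath)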




best,
lamber-ken




On 12/26/2019 15:46,ma...@bonc.com.cn wrote:
Thank you for your reply, it really works. And how to insert 100 million 
records without OOM?



ma...@bonc.com.cn

From: lamberken
Date: 2019-12-26 15:33
To: dev@hudi.apache.org
Subject: Re:insert too slow


Hi @mayu1,


Can you run the below program in cosole? looking forward to your feedback.



${SPARK_HOME}/bin/spark-shell \
--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'


import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._
import scala.collection.JavaConversions._


val tableName = "tableName"
val basePath = "file:///tmp/data"


for (i <- 1 to 1) {
println("start:" + System.currentTimeMillis())
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(1))
println("start insert:" + System.currentTimeMillis())
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 32))
df.write.format("org.apache.hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).   
mode(Append).
save(basePath);
println("finish" + i + " " + System.currentTimeMillis())
}



best,
lamber-ken
On 12/26/2019 15:08,ma...@bonc.com.cn wrote:
Hello!
What is the throughput of Hudi? I currently use spark to insert 10,000 records 
(300 bytes each), which takes one minute. Is it too slow?
my program:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object HudiDataGen {
def main(args: Array[String]): Unit = {
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._

import scala.collection.JavaConversions._

// Initialization
val conf = new SparkConf().setAppName("HudiTest")
.setMaster("local[*]")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 
// Use the Kryo serialization library
val sc = new SparkContext(conf)
val spark = new SQLContext(sc)

// Set the table name, base path, and data generator to generate records for this guide.
val tableName = Constant.tableName
//val basePath = Constant.hdfsPath
val basePath = args(0)
//val basePath = "file:///e:/hudi_cow_table"
val count = args(1)
for (i <- 1 to count.toInt) {
println("start:" + System.currentTimeMillis())
val dataGen = new DataGenerator
// Generate some new trip samples, load them into a DataFrame, and write the DataFrame into the Hudi dataset, as shown below.
val inserts = convertToStringList(dataGen.generateInserts(1))
//println("insert:"+System.currentTimeMillis())
println("start insert:" + System.currentTimeMillis())
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 32))
df.write.format("org.apache.hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
//option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
mode(Append).
save(basePath);
println("finish" + i + " " + System.currentTimeMillis())
}

}
}


ma...@bonc.com.cn


Re:insert too slow

2019-12-25 Thread lamberken


Hi @mayu1,


Can you run the program below in the console? Looking forward to your feedback.



${SPARK_HOME}/bin/spark-shell \
--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'


import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._
import scala.collection.JavaConversions._


val tableName = "tableName"
val basePath = "file:///tmp/data"


for (i <- 1 to 1) {
println("start:" + System.currentTimeMillis())
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(1))
println("start insert:" + System.currentTimeMillis())
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 32))
df.write.format("org.apache.hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).   
mode(Append).
save(basePath);
println("finish" + i + " " + System.currentTimeMillis())
}



best,
lamber-ken
On 12/26/2019 15:08,ma...@bonc.com.cn wrote:
Hello!
What is the throughput of Hudi? I currently use spark to insert 10,000 records 
(300 bytes each), which takes one minute. Is it too slow?
my program:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object HudiDataGen {
def main(args: Array[String]): Unit = {
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._

import scala.collection.JavaConversions._

// Initialization
val conf = new SparkConf().setAppName("HudiTest")
.setMaster("local[*]")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 
// Use the Kryo serialization library
val sc = new SparkContext(conf)
val spark = new SQLContext(sc)

// Set the table name, base path, and data generator to generate records for this guide.
val tableName = Constant.tableName
//val basePath = Constant.hdfsPath
val basePath = args(0)
//val basePath = "file:///e:/hudi_cow_table"
val count = args(1)
for (i <- 1 to count.toInt) {
println("start:" + System.currentTimeMillis())
val dataGen = new DataGenerator
// Generate some new trip samples, load them into a DataFrame, and write the DataFrame into the Hudi dataset, as shown below.
val inserts = convertToStringList(dataGen.generateInserts(1))
//println("insert:"+System.currentTimeMillis())
println("start insert:" + System.currentTimeMillis())
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 32))
df.write.format("org.apache.hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
//option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
mode(Append).
save(basePath);
println("finish" + i + " " + System.currentTimeMillis())
}

}
}


ma...@bonc.com.cn


Re:CDH6.3 hadoop3 quickstart

2019-12-25 Thread lamberken


Hi,


It's because there are spaces in the basePath. Can you try this instead?
val basePath = "file:///tmp/hudi_cow_table"


best, 
lamber-ken


On 12/25/2019 20:50,965147...@qq.com<965147...@qq.com> wrote:

hi,all


The environment I use is CDH6.3,
Use hadoop3 maven dependency to compile hudi,  
3.0.0
execute quickstart,
bin / spark-shell --jars 
/home/t3cx/apps/hudi/hudi-spark-bundle-0.5.1-SNAPSHOT.jar --conf 
'spark.serializer = org.apache.spark.serializer.KryoSerializer'

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_cow_table"
val basePath = "file: /// tmp / hudi_cow_table"
val dataGen = new DataGenerator

val inserts = convertToStringList (dataGen.generateInserts (10))
val df = spark.read.json (spark.sparkContext.parallelize (inserts, 2))
df.write.format ("org.apache.hudi").
options (getQuickstartWriteConfigs).
option (PRECOMBINE_FIELD_OPT_KEY, "ts").
option (RECORDKEY_FIELD_OPT_KEY, "uuid").
option (PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option (TABLE_NAME, tableName).
mode (Overwrite).
save (basePath);

/ tmp / hudi_cow_table / No data in this directory

please help


965147...@qq.com


Re:Re: Facing issues when using HiveIncrementalPuller

2019-12-24 Thread lamberken


Hi, @Pratyaksh Sharma


The log4j-1.2.17.jar lib also needs to be added to the classpath, for example:
java -cp 
/path/to/hive-jdbc-2.3.1.jar:/path/to/log4j-1.2.17.jar:packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-0.5.1-SNAPSHOT.jar
 org.apache.hudi.utilities.HiveIncrementalPuller --help


best,
lamber-ken

At 2019-12-24 17:23:20, "Pratyaksh Sharma"  wrote:
>Hi Vinoth,
>
>Sorry my bad, I did not realise earlier that spark is not needed for this
>class. I tried running it with the below command to get the mentioned
>exception -
>
>Command -
>
>java -cp
>/path/to/hive-jdbc-2.3.1.jar:packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-0.5.1-SNAPSHOT.jar
>org.apache.hudi.utilities.HiveIncrementalPuller --help
>
>Exception -
>Exception in thread "main" java.lang.NoClassDefFoundError:
>org/apache/log4j/LogManager
>at
>org.apache.hudi.utilities.HiveIncrementalPuller.<init>(HiveIncrementalPuller.java:64)
>Caused by: java.lang.ClassNotFoundException: org.apache.log4j.LogManager
>at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>... 1 more
>
>I was able to fix it by including the corresponding jar in the bundle.
>
>After fixing the above, still I am getting the NPE even though the template
>is bundled in the jar.
>
>On Mon, Dec 23, 2019 at 10:45 PM Vinoth Chandar  wrote:
>
>> Hi Pratyaksh,
>>
>> HveIncrementalPuller is just a java program. Does not need Spark, since it
>> just runs a HiveQL remotely..
>>
>> On the error you specified, seems like it can't find the template? Can you
>> see if the bundle does not have the template file.. May be this got broken
>> during the bundling changes.. (since its no longer part of the resources
>> folder of the bundle module).. We should also probably be throwing a better
>> error than NPE..
>>
>> We can raise a JIRA, once you confirm.
>>
>> String templateContent =
>>
>> FileIOUtils.readAsUTFString(this.getClass().getResourceAsStream("IncrementalPull.sqltemplate"));
>>
>>
>> On Mon, Dec 23, 2019 at 6:02 AM Pratyaksh Sharma 
>> wrote:
>>
>> > Hi,
>> >
>> > Can someone guide me or share some documentation regarding how to use
>> > HiveIncrementalPuller. I already went through the documentation on
>> > https://hudi.apache.org/querying_data.html. I tried using this puller
>> > using
>> > the below command and facing the given exception.
>> >
>> > Any leads are appreciated.
>> >
>> > Command -
>> > spark-submit --name incremental-puller --queue etl --files
>> > incremental_sql.txt --master yarn --deploy-mode cluster --driver-memory
>> 4g
>> > --executor-memory 4g --num-executors 2 --class
>> > org.apache.hudi.utilities.HiveIncrementalPuller
>> > hudi-utilities-bundle-0.5.1-SNAPSHOT.jar --hiveUrl
>> > jdbc:hive2://HOST:PORT/ --hiveUser  --hivePass 
>> > --extractSQLFile incremental_sql.txt --sourceDb  --sourceTable
>> >  --targetDb tmp --targetTable tempTable --fromCommitTime 0
>> > --maxCommits 1
>> >
>> > Error -
>> >
>> > java.lang.NullPointerException
>> > at org.apache.hudi.common.util.FileIOUtils.copy(FileIOUtils.java:73)
>> > at
>> >
>> >
>> org.apache.hudi.common.util.FileIOUtils.readAsUTFString(FileIOUtils.java:66)
>> > at
>> >
>> >
>> org.apache.hudi.common.util.FileIOUtils.readAsUTFString(FileIOUtils.java:61)
>> > at
>> >
>> >
>> org.apache.hudi.utilities.HiveIncrementalPuller.<init>(HiveIncrementalPuller.java:113)
>> > at
>> >
>> >
>> org.apache.hudi.utilities.HiveIncrementalPuller.main(HiveIncrementalPuller.java:343)
>> >
>>


Re:Re: How to write a performance test program

2019-12-23 Thread lamberken


Hi @Vinoth,
 
Here are the steps to reproduce.


1, Build from latest source
mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true


2, Write Data
export SPARK_HOME=/work/BigData/install/spark/spark-2.3.3-bin-hadoop2.6
${SPARK_HOME}/bin/spark-shell --jars `ls 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar` --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer'


import org.apache.spark.sql.SaveMode._


var datas = List("{ \"name\": \"kenken\", \"ts\": 1574297893836, \"age\": 12, 
\"location\": \"latitude\"}")
val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", "hudi_mor_table").
mode(Overwrite).
save("file:///tmp/hudi_mor_table")


3, Append Data

df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
option("hoodie.table.name", "hudi_mor_table").
mode(Append).
save("file:///tmp/hudi_mor_table")


4, Repeat the Append Data operation (above) about six times, and you will get the stack trace:
19/12/24 13:34:09 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
.commit file: 20191224132942.clean.requested
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)


BTW, I'm not familiar with this logic for now. If you have any ideas, feel free 
to take it over.


best,
lamber-ken



At 2019-12-24 13:01:50, "Vinoth Chandar"  wrote:

Could you give 0.5.0-incubating (last release) a shot in the meantime? 


Lamberken, do you have steps to reproduce this issue. Love to get a JIRA filed 
so we could fix before the next release. 


On Mon, Dec 23, 2019 at 7:25 PM lamberken  wrote:



Hi @mayu1,


I guess you used the latest master branch

Re:Re: IDE setup for code formatting

2019-12-23 Thread lamberken


Hi @Vinoth


Okay, I will talk with @leesf about the checkstyle setup. In the end, we will come up 
with a better solution.

best,
lamber-ken


At 2019-12-24 11:00:12, "Vinoth Chandar"  wrote:
>Ironically, google style + checkstyle is what we had few months ago :)
>
>Can we have an owner to drive this to a point where, the code formatting is
>well-documented for contributors?
>leesf. and lamber, seems like you have the most context?
>
>On Mon, Dec 23, 2019 at 6:24 PM lamber...@163.com  wrote:
>
>>


Re: IDE setup for code formatting

2019-12-23 Thread lamberken


Hi @Minh Pham


I agree with what @Y Ethan Guo says; we can disable those checkstyle rules which cannot 
be automated for reformatting for now.
If we replace all the new rules with the Google code style, it may take more time to 
fix.


best,
lamber-ken
On 12/24/2019 03:34,Minh Pham wrote:
What do you guys think about Google Java Formater?
My opinion is that code style that can’t be automated is not worth enforcing.

Re:Re: Re: IDE setup for code formatting

2019-12-23 Thread lamberken

Hi @Y Ethan Guo,


I am very willing to follow the community's decision. Your idea is very 
good; we can disable those checkstyle rules that cannot be automated for 
reformatting in the IDE for now.



best,
lamber-ken


At 2019-12-24 03:38:55, "Y Ethan Guo"  wrote:
>+1 on auto-formatting the code in IDE based on the checkstyle rules.
>
>Based on my experience with Java and Scala in IntelliJ, there's is indeed
>discrepancy on auto formatting code on some custom checkstyle rules.  For
>such cases, I tried to avoid using them if they do not sacrifice too much
>on the code style.  Reformatting code in IDE based on a different rule
>causing checkstyle errors has decreased my productivity before.
>
>So I'm wondering if @lamber-ken is willing to disable those checkstyle
>rules that cannot be automated for reformatting in IDE for now.  Once
>spotless or another plugin can solve this issue, we can re-enable those
>rules.
>
>On Mon, Dec 23, 2019 at 11:19 AM nishith agarwal 
>wrote:
>
>> Vinoth,
>>
>> +1 on automating the manual work required at the moment to fix the
>> checkstyle errors. I think if we are able to use spotless and
>> at the same time know upfront all the things that would require manual
>> work, there are few options IMO :
>>
>> a) Have a template of steps that can easily fix it -> for eg. selecting a
>> specific file, forcing checkstyle corrections through some intellij
>> sequence of steps. (I have noticed sometimes selecting certain parts of the
>> code and reformat manually helped, may be my intellij wasn't setup
>> correctly at the time). This documented, repeatable process can slightly
>> reduce the time taken to fix them.
>> b) Find a different tool that can address these shortcomings
>> c) Relax some of the checkstyles that are ok to not have (not sure if we
>> have scope for such a trade-off)
>>
>> -Nishith
>>
>> On Mon, Dec 23, 2019 at 10:16 AM Sivabalan  wrote:
>>
>> > My 2 cents:
>> >  I am also a big fan of code formatting in general, given that its
>> > fully automated. But if that comes at the cost of taking some time off
>> > everyone's time(manual fixing), we need to think through if its really
>> > worth it. Especially, wrt import order, I am not sure if that really
>> adds a
>> > lot of value. Anyone who work with an IDE, all imports are collapsed and
>> no
>> > one gets to see that only. So, what ever order or grouping we follow, it
>> > doesn't matter much. So, having said that, I am not up for spending 10
>> mins
>> > everytime we create or update a PR for this import ordering rule. If I
>> plan
>> > to work on two PRs one followed by other, and if I spend 15 odd mins in
>> > fixing first PR just for code formatting when creating one, I might lose
>> > interest to continue working with 2nd. So, would prefer to avoid spending
>> > time manually for code formatting in general.
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Dec 23, 2019 at 7:43 AM Vinoth Chandar 
>> wrote:
>> >
>> > > Can we exhaustively list all that will be manually even after spotless
>> > > plugin is brought back?
>> > >
>> > > On Mon, Dec 23, 2019 at 3:01 AM leesf  wrote:
>> > >
>> > > > After bringing spotless plugin back to project, it would
>> automatically
>> > > fix
>> > > > comment check error except for import order error, we need to fix
>> this
>> > > > error manually. In Apache Flink/Calcite, we also fix it manually, and
>> > > will
>> > > > also look for other plugins to fix import order error if exist.
>> > > >
>> > > > Best,
>> > > > Leesf
>> > > >
>> > > > Vinoth Chandar  于2019年12月23日周一 下午4:55写道:
>> > > >
>> > > > > I understand. I am saying - we should automate all of this
>> > formatting..
>> > > > :)
>> > > > >
>> > > > > How do other projects do it? Other folks, who worked on the code
>> > > > > refactoring/formatting, may be you can also chime in?
>> > > > >
>> > > > > On Mon, Dec 23, 2019 at 12:24 AM lamberken 
>> > wrote:
>> > > > >
>> > > > > > Hi @Vinoth,
>> > > > > >
>> > > > > >
>> > > > > > The ImportOrder is a custom rule, IDE may can not reformat codes
>> > > > r

Re:Re: IDE setup for code formatting

2019-12-23 Thread lamberken
Hi @Vinoth,


ImportOrder is a custom rule, so the IDE may not reformat the code correctly. We 
can highlight this rule in the contributing guide.


The new ImportOrder rule splits import statements into groups, and groups are 
separated by one blank line. 
These groups are 1) org.apache.hudi   2) third party imports   3) javax   4) 
java   5) static


For example
/---
package org.apache.hudi.metrics;

import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.exception.HoodieException;

import com.google.common.base.Preconditions;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

import java.io.Closeable;
import java.lang.management.ManagementFactory;
import java.rmi.registry.LocateRegistry;

public class JmxMetricsReporter extends MetricsReporter {

/---


best,
lamber-ken





At 2019-12-23 15:45:27, "Vinoth Chandar"  wrote:
>+1 on 1/3 and improving the contributing guide. But on 2,  IMO it would be
>overloading PULL_REQUEST_TEMPLATE.
>
>Bigger point here is: We need a fully automated way of formatting code
>either using IDE or using something like spotless.
>I pulled the checkstyle rules into intellij and even then I noticed that it
>does not apply all rules while formatting (e.g line breaks between import
>groups).
>
>On Sun, Dec 22, 2019 at 11:36 PM lamberken  wrote:
>
>> Hi Vinoth,
>>
>>
>> Here are some of my points:
>>
>>
>> 1, When developers are not familiar with checkstyle rules, they feel
>> uncomfortable. I think its a good idea to
>> make the instructions on contributing guide work with the checkstyle rules
>> we already have.
>>
>>
>> 2, We can also prompt users in the PULL_REQUEST_TEMPLATE about how to
>> check code style by themself manually.
>>
>>
>> 3, The code of the current project has met the checkstyle rules, we just
>> need handle the incremental codes.
>>
>>
>> 4, here are some useful tools which can be placed on contributing guide.
>> 1) mvn scalastyle:check
>> 2) mvn checkstyle:check
>> 3) https://checkstyle.sourceforge.io/index.html
>> 4) http://www.scalastyle.org/rules-1.0.0.html
>>
>>
>> best,
>> lamber-ken
>> On 12/23/2019 13:03,Vinoth Chandar wrote:
>> Hello all,
>>
>> I know a bunch of work has happened to format the code base, closer to what
>> other project are doing..
>>
>> While working through some checkstyle violations today, I noticed that the
>> IDE formatting is now out of date with the checkstyle enforced? Manually
>> fixing these checkstyle issues are not a very productive use of time IMHO.
>>
>> http://hudi.apache.org/contributing.html#ide-setup
>>
>> Can we make the instructions here work with the checkstyle rules we already
>> have? Goal should be that formatting code in IntelliJ (or other IDEs)
>> should autofix it so that checkstyle passes..
>>
>> thoughts?
>>
>> Thanks
>> Vinoth
>>


Re:IDE setup for code formatting

2019-12-22 Thread lamberken
Hi Vinoth,


Here are some of my points:


1, When developers are not familiar with the checkstyle rules, they feel 
uncomfortable. I think it's a good idea to 
make the instructions in the contributing guide work with the checkstyle rules we 
already have.


2, We can also prompt users in the PULL_REQUEST_TEMPLATE about how to check 
code style by themselves manually.


3, The current project's code already meets the checkstyle rules; we just need to 
handle the incremental changes.


4, Here are some useful tools which can be placed in the contributing guide.
1) mvn scalastyle:check 
2) mvn checkstyle:check
3) https://checkstyle.sourceforge.io/index.html
4) http://www.scalastyle.org/rules-1.0.0.html


best,
lamber-ken
On 12/23/2019 13:03,Vinoth Chandar wrote:
Hello all,

I know a bunch of work has happened to format the code base, closer to what
other project are doing..

While working through some checkstyle violations today, I noticed that the
IDE formatting is now out of date with the checkstyle enforced? Manually
fixing these checkstyle issues are not a very productive use of time IMHO.

http://hudi.apache.org/contributing.html#ide-setup

Can we make the instructions here work with the checkstyle rules we already
have? Goal should be that formatting code in IntelliJ (or other IDEs)
should autofix it so that checkstyle passes..

thoughts?

Thanks
Vinoth


[DISCUSS] RFC-10 Restructuring and auto-generation of docs

2019-12-20 Thread lamberken


Hi @Y Ethan Guo @Vinoth


I have some ideas for RFC-10, which aims to improve the Hudi web documentation for 
users and the process of updating docs for developers.


At the beginning, I tried to learn how to realize it from other projects, like 
Pulsar, Druid, etc.
After a period of research, I learned that the implementation is complex.


But when researching the Flink project, I found that Flink has faced the same 
situation before.
Here is the Flink issue[1] which talks about building the website automatically. 
The Flink project has
resolved this problem in a simple way, so I think we can learn from it.


The solution uses Apache Buildbot[2], which can build and deploy snapshots 
automatically. It seems
to need the PMC to complete the next steps.


Hope the above works well, thanks.


[1] https://issues.apache.org/jira/browse/FLINK-1370
[2] https://ci.apache.org/buildbot.html
[3] https://ci.apache.org/projects/flink/flink-docs-master


thanks,
lamber-ken





Re: [DISCUSS] Rework of new web site

2019-12-20 Thread lamberken




Hi Vinoth,


Thanks for your affirmation, here is the PR: 
https://github.com/apache/incubator-hudi/pull/1120


best,
lamber-ken
On 12/21/2019 07:59,Vinoth Chandar wrote:
Hi lamber,

Given we have enough +1s on the look and feel aspects, I propose we open a
PR and iron out the content/remaining issues there one by one.

I think a full line by line review is the best way to go, as with any code
change

Please share the PR here once you have it

Thanks
Vinoth

On Fri, Dec 20, 2019 at 3:55 PM lamberken  wrote:



Hi leesf,


Thank you for your affirmation.


best,
lamber-ken





At 2019-12-21 07:28:50, "leesf"  wrote:

Hi lamber,


Thanks for your great work, the new website looks much better.


Also if you guys have other companies(logos) needed to add to powered
by(Hudi Users)[1], please let lamberken/me know before using new website.


Best,
Leesf


[1] https://lamber-ken.github.io/


lamberken  wrote on Friday, December 20, 2019 at 9:29 AM:



Hi nishith,


Thank you for your affirmation. The content in the blue box is to help us
understand the highlighted content.
It is different from the body content, so we need it. There are several
ways to present it, for examples.


best,
lamber-ken



At 2019-12-20 05:57:16, "nishith agarwal"  wrote:
Great job Lamber!

The website looks really slick and has a much better experience of moving
from one page to another (mostly I think because it's faster), also find
it
the text much more conducive to absorb.

While going through the quick start, I noticed that under the highlighted
box in dark (showing the code pieces), there's another highlighted box (in
light blue) which talks about more details. Do we need that ? May be the
details in that box can just follow the plain text style of other parts on
that page.

-Nishith

On Wed, Dec 18, 2019 at 10:59 PM vino yang  wrote:

Hi Lamber,

Awesome! Thanks for your hard work.

Best,
Vino

lamberken  wrote on Thursday, December 19, 2019 at 2:11 PM:



Hi everyone,


I finished the rework of the new UI, if you have time, please visit
the
website[1].
Any questions are welcome.


[1]https://lamber-ken.github.io/docs/quick-start-guide/


best,
lamber-ken



At 2019-12-19 07:38:47, "lamberken"  wrote:


Hi @Shiyan Xu


Thanks. :)
best,
lamber-ken


At 2019-12-19 00:53:51, "Shiyan Xu" 
wrote:
Thank you @lamber-ken for the work! It is definitely a greater
browsing
experience.

On Tue, Dec 17, 2019 at 8:28 PM lamberken 
wrote:


Hi, @Vinoth



I'm glad to hear your thoughts on the new UI, thanks. So we keep
its
style
as it is now.
The development of new UI can be completed these days, any
questions
are
welcome.


best,
lamber-ken


At 2019-12-18 11:44:27, "Vinoth Chandar" <
mail.vinoth.chan...@gmail.com>
wrote:
The case for right navigation for me, is mainly from pages like

https://lamber-ken.github.io/docs/docker_demo
https://lamber-ken.github.io/docs/querying_data
https://lamber-ken.github.io/docs/writing_data

which often have commands/text you want to selectively copy paste
from a
section.
For content you read sequentially, it matters less. I agree..

BTW the new site looks very sleek.. :)



On Tue, Dec 17, 2019 at 4:50 PM lamberken 
wrote:


hi, allOne more thing that is missing.In the new UI, I put a
"BACK
TO
TOP"
button at the bottom of all pages to help us back to top.
We can also discuss whether we need the right navigation at the
community
meeting today.best,
lamber-ken









At 2019-12-18 08:41:49, "lamberken"  wrote:

Hi @Vinoth,


Thanks for raising this point, but I have some different
views.


I've thought about it very seriously before, and I remove the
right
navigation finally.
1, I have a deep analysis of the characteristics of our
documents,
most
of them have many
commands, if the right navigation exists, it will affect
us
to
read.
2, Most documents are short, we can visit them all just at one
page.
3, The max width of web page is 1280px, left navigation is
250px(at
least), right navigation is 250px(at least),
if so, the width of the main content is only left 800px,
may
it's
not
suitable for readers.
4, I also analysised other projects, like
1) flink, spark, zeppelin, kafka, superset, elasticsearch,
arrow,
kudu, hadoop don't have right navigation
2) druid, kylin, beam have right navigation.
These are my personal views. Welcome all community members to
join
in
the
discussion.
In the end, I will follow our community, thanks.


BTW, I have synced most of the documents[1], we can use these
documents
as a reference to see
if we need the navigation bar on the right in the new UI.


[1] https://lamber-ken.github.io/docs/admin_guide
[2] https://lamber-ken.github.io/docs/writing_data
[3] https://lamber-ken.github.io/docs/quick-start-guide/


best,
lamber-ken




At 2019-12-18 04:44:04, "Vinoth Chandar" 
wrote:
One more thing that is missing.

Current site has a navigation links on the right, which lets
you
jump
to
the right section directly. This i

Re:Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-20 Thread lamberken


Hi leesf,


Thank you for your affirmation.


best,
lamber-ken





At 2019-12-21 07:28:50, "leesf"  wrote:

Hi lamber,


Thanks for your great work, the new website looks much better. 


Also if you guys have other companies(logos) needed to add to powered by(Hudi 
Users)[1], please let lamberken/me know before using new website.


Best,
Leesf


[1] https://lamber-ken.github.io/


lamberken  wrote on Friday, December 20, 2019 at 9:29 AM:



Hi nishith,


Thank you for your affirmation. The content in the blue box is to help us 
understand the highlighted content.
It is different from the body content, so we need it. There are several ways to 
present it, for examples.


best,
lamber-ken



At 2019-12-20 05:57:16, "nishith agarwal"  wrote:
>Great job Lamber!
>
>The website looks really slick and has a much better experience of moving
>from one page to another (mostly I think because it's faster), also find it
>the text much more conducive to absorb.
>
>While going through the quick start, I noticed that under the highlighted
>box in dark (showing the code pieces), there's another highlighted box (in
>light blue) which talks about more details. Do we need that ? May be the
>details in that box can just follow the plain text style of other parts on
>that page.
>
>-Nishith
>
>On Wed, Dec 18, 2019 at 10:59 PM vino yang  wrote:
>
>> Hi Lamber,
>>
>> Awesome! Thanks for your hard work.
>>
>> Best,
>> Vino
>>
>> lamberken  于2019年12月19日周四 下午2:11写道:
>>
>> >
>> >
>> > Hi everyone,
>> >
>> >
>> > I finished the rework of the new UI, if you have time, please visit the
>> > website[1].
>> > Any questions are welcome.
>> >
>> >
>> > [1]https://lamber-ken.github.io/docs/quick-start-guide/
>> >
>> >
>> > best,
>> > lamber-ken
>> >
>> >
>> >
>> > At 2019-12-19 07:38:47, "lamberken"  wrote:
>> > >
>> > >
>> > >Hi @Shiyan Xu
>> > >
>> > >
>> > >Thanks. :)
>> > >best,
>> > >lamber-ken
>> > >
>> > >
>> > >At 2019-12-19 00:53:51, "Shiyan Xu" 
>> wrote:
>> > >>Thank you @lamber-ken for the work! It is definitely a greater browsing
>> > >>experience.
>> > >>
>> > >>On Tue, Dec 17, 2019 at 8:28 PM lamberken  wrote:
>> > >>
>> > >>>
>> > >>> Hi, @Vinoth
>> > >>>
>> > >>>
>> > >>>
>> > >>> I'm glad to hear your thoughts on the new UI, thanks. So we keep its
>> > style
>> > >>> as it is now.
>> > >>> The development of new UI can be completed these days, any questions
>> > are
>> > >>> welcome.
>> > >>>
>> > >>>
>> > >>> best,
>> > >>> lamber-ken
>> > >>>
>> > >>>
>> > >>> At 2019-12-18 11:44:27, "Vinoth Chandar" <
>> > mail.vinoth.chan...@gmail.com>
>> > >>> wrote:
>> > >>> >The case for right navigation for me, is mainly from pages like
>> > >>> >
>> > >>> >https://lamber-ken.github.io/docs/docker_demo
>> > >>> >https://lamber-ken.github.io/docs/querying_data
>> > >>> >https://lamber-ken.github.io/docs/writing_data
>> > >>> >
>> > >>> >which often have commands/text you want to selectively copy paste
>> > from a
>> > >>> >section.
>> > >>> >For content you read sequentially, it matters less. I agree..
>> > >>> >
>> > >>> >BTW the new site looks very sleek.. :)
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >On Tue, Dec 17, 2019 at 4:50 PM lamberken 
>> wrote:
>> > >>> >
>> > >>> >>
>> > >>> >> hi, allOne more thing that is missing.In the new UI, I put a "BACK
>> > TO
>> > >>> TOP"
>> > >>> >> button at the bottom of all pages to help us back to top.
>> > >>> >> We can also discuss whether we need the right navigation at the
>> > >>> community
>> > >>> >> meeting today.best,
>> > >>> >> lamber-ken
>> > >>> >>

Re:Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-18 Thread lamberken


Hi everyone,


I finished the rework of the new UI. If you have time, please visit the
website [1].
Any questions are welcome.


[1]https://lamber-ken.github.io/docs/quick-start-guide/


best,
lamber-ken



At 2019-12-19 07:38:47, "lamberken"  wrote:
>
>
>Hi @Shiyan Xu
>
>
>Thanks. :)
>best,
>lamber-ken
>
>
>At 2019-12-19 00:53:51, "Shiyan Xu"  wrote:
>>Thank you @lamber-ken for the work! It is definitely a greater browsing
>>experience.
>>
>>On Tue, Dec 17, 2019 at 8:28 PM lamberken  wrote:
>>
>>>
>>> Hi, @Vinoth
>>>
>>>
>>>
>>> I'm glad to hear your thoughts on the new UI, thanks. So we keep its style
>>> as it is now.
>>> The development of new UI can be completed these days, any questions are
>>> welcome.
>>>
>>>
>>> best,
>>> lamber-ken
>>>
>>>
>>> At 2019-12-18 11:44:27, "Vinoth Chandar" 
>>> wrote:
>>> >The case for right navigation for me, is mainly from pages like
>>> >
>>> >https://lamber-ken.github.io/docs/docker_demo
>>> >https://lamber-ken.github.io/docs/querying_data
>>> >https://lamber-ken.github.io/docs/writing_data
>>> >
>>> >which often have commands/text you want to selectively copy paste from a
>>> >section.
>>> >For content you read sequentially, it matters less. I agree..
>>> >
>>> >BTW the new site looks very sleek.. :)
>>> >
>>> >
>>> >
>>> >On Tue, Dec 17, 2019 at 4:50 PM lamberken  wrote:
>>> >
>>> >>
>>> >> hi, allOne more thing that is missing.In the new UI, I put a "BACK TO
>>> TOP"
>>> >> button at the bottom of all pages to help us back to top.
>>> >> We can also discuss whether we need the right navigation at the
>>> community
>>> >> meeting today.best,
>>> >> lamber-ken
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> At 2019-12-18 08:41:49, "lamberken"  wrote:
>>> >> >
>>> >> >Hi @Vinoth,
>>> >> >
>>> >> >
>>> >> >Thanks for raising this point, but I have some different views.
>>> >> >
>>> >> >
>>> >> >I've thought about it very seriously before, and I remove the right
>>> >> navigation finally.
>>> >> >1, I have a deep analysis of the characteristics of our documents, most
>>> >> of them have many
>>> >> >commands, if the right navigation exists, it will affect us to
>>> read.
>>> >> >2, Most documents are short, we can visit them all just at one page.
>>> >> >3, The max width of web page is 1280px, left navigation is 250px(at
>>> >> least), right navigation is 250px(at least),
>>> >> >if so, the width of the main content is only left 800px, may it's
>>> not
>>> >> suitable for readers.
>>> >> >4, I also analysised other projects, like
>>> >> >1) flink, spark, zeppelin, kafka, superset, elasticsearch, arrow,
>>> >> kudu, hadoop don't have right navigation
>>> >> >2) druid, kylin, beam have right navigation.
>>> >> >These are my personal views. Welcome all community members to join in
>>> the
>>> >> discussion.
>>> >> >In the end, I will follow our community, thanks.
>>> >> >
>>> >> >
>>> >> >BTW, I have synced most of the documents[1], we can use these documents
>>> >> as a reference to see
>>> >> >if we need the navigation bar on the right in the new UI.
>>> >> >
>>> >> >
>>> >> >[1] https://lamber-ken.github.io/docs/admin_guide
>>> >> >[2] https://lamber-ken.github.io/docs/writing_data
>>> >> >[3] https://lamber-ken.github.io/docs/quick-start-guide/
>>> >> >
>>> >> >
>>> >> >best,
>>> >> >lamber-ken
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >At 2019-12-18 04:44:04, "Vinoth Chandar"  wrote:
>>> >>One more thing that is missing.

Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-18 Thread lamberken


Hi @Shiyan Xu


Thanks. :)
best,
lamber-ken


At 2019-12-19 00:53:51, "Shiyan Xu"  wrote:
>Thank you @lamber-ken for the work! It is definitely a greater browsing
>experience.
>
>On Tue, Dec 17, 2019 at 8:28 PM lamberken  wrote:
>
>>
>> Hi, @Vinoth
>>
>>
>>
>> I'm glad to hear your thoughts on the new UI, thanks. So we keep its style
>> as it is now.
>> The development of new UI can be completed these days, any questions are
>> welcome.
>>
>>
>> best,
>> lamber-ken
>>
>>
>> At 2019-12-18 11:44:27, "Vinoth Chandar" 
>> wrote:
>> >The case for right navigation for me, is mainly from pages like
>> >
>> >https://lamber-ken.github.io/docs/docker_demo
>> >https://lamber-ken.github.io/docs/querying_data
>> >https://lamber-ken.github.io/docs/writing_data
>> >
>> >which often have commands/text you want to selectively copy paste from a
>> >section.
>> >For content you read sequentially, it matters less. I agree..
>> >
>> >BTW the new site looks very sleek.. :)
>> >
>> >
>> >
>> >On Tue, Dec 17, 2019 at 4:50 PM lamberken  wrote:
>> >
>> >>
>> >> hi, allOne more thing that is missing.In the new UI, I put a "BACK TO
>> TOP"
>> >> button at the bottom of all pages to help us back to top.
>> >> We can also discuss whether we need the right navigation at the
>> community
>> >> meeting today.best,
>> >> lamber-ken
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> At 2019-12-18 08:41:49, "lamberken"  wrote:
>> >> >
>> >> >Hi @Vinoth,
>> >> >
>> >> >
>> >> >Thanks for raising this point, but I have some different views.
>> >> >
>> >> >
>> >> >I've thought about it very seriously before, and I remove the right
>> >> navigation finally.
>> >> >1, I have a deep analysis of the characteristics of our documents, most
>> >> of them have many
>> >> >commands, if the right navigation exists, it will affect us to
>> read.
>> >> >2, Most documents are short, we can visit them all just at one page.
>> >> >3, The max width of web page is 1280px, left navigation is 250px(at
>> >> least), right navigation is 250px(at least),
>> >> >if so, the width of the main content is only left 800px, may it's
>> not
>> >> suitable for readers.
>> >> >4, I also analysised other projects, like
>> >> >1) flink, spark, zeppelin, kafka, superset, elasticsearch, arrow,
>> >> kudu, hadoop don't have right navigation
>> >> >2) druid, kylin, beam have right navigation.
>> >> >These are my personal views. Welcome all community members to join in
>> the
>> >> discussion.
>> >> >In the end, I will follow our community, thanks.
>> >> >
>> >> >
>> >> >BTW, I have synced most of the documents[1], we can use these documents
>> >> as a reference to see
>> >> >if we need the navigation bar on the right in the new UI.
>> >> >
>> >> >
>> >> >[1] https://lamber-ken.github.io/docs/admin_guide
>> >> >[2] https://lamber-ken.github.io/docs/writing_data
>> >> >[3] https://lamber-ken.github.io/docs/quick-start-guide/
>> >> >
>> >> >
>> >> >best,
>> >> >lamber-ken
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >At 2019-12-18 04:44:04, "Vinoth Chandar"  wrote:
>> >> >>One more thing that is missing.
>> >> >>
>> >> >>Current site has a navigation links on the right, which lets you jump
>> to
>> >> >>the right section directly. This is also a must-have IMHO.
>> >> >>I would suggest wait for more folks to come back from vacation,
>> before we
>> >> >>finalize anything on this, as there could be more feedback
>> >> >>
>> >> >>
>> >> >>
>> >> >>On Mon, Dec 16, 2019 at 9:15 PM lamberken  wrote:
>> >> >>
>> >> >>>
>> >> >>> Hi Vinoth,
>>

Re:Re: Re: Re: Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-18 Thread lamberken


Hi @Vinoth


I understand what you mean. I will continue to work on this once I finish
reworking the new UI. :)


best,
lamber-ken




At 2019-12-18 11:39:30, "Vinoth Chandar"  wrote:
>Expect most users to use inputDF.write() approach...  Uber uses the lower
>level RDD apis, like the DeltaStreamer tool does..
>If we don't rename configs and still support a builder, it should be fine.
>
>I think we can scope this down to introducing a ConfigOption class that
>ties, the key,value, default together.. That definitely seems like a better
>abstraction.
>
>On Fri, Dec 13, 2019 at 5:18 PM lamberken  wrote:
>
>>
>>
>> Hi, @vinoth
>>
>>
>> Okay, I see. If we don't want existing users to do any upgrading or
>> reconfigurations, then this refactor work will not make much sense.
>> This issue can be closed, because ConfigOptions and these builders do the
>> same things.
>> From another side, if we finish this work before a stable release, we will
>> benefit a lot from it. We need to make a choice.
>>
>>
>> btw, I have a question that users will use HoodieWriteConfig /
>> HoodieWriteClient in their program?
>>
>> /
>> HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
>> .withPath(basePath)
>> .forTable(tableName)
>> .withSchema(schemaStr)
>> .withProps(props) // pass raw k,v pairs from a property file.
>>
>> .withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
>>
>> .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
>> ...
>> .build();
>>
>> /
>> OR
>>
>> /
>> inputDF.write()
>> .format("org.apache.hudi")
>> .options(clientOpts) // any of the Hudi client opts can be passed in
>> as well
>> .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
>> .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
>> "partition")
>> .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
>> .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> .mode(SaveMode.Append)
>> .save(basePath);
>>
>> /----
>>
>>
>>
>>
>> best,
>> lamber-ken
>>
>> At 2019-12-14 08:43:06, "Vinoth Chandar"  wrote:
>> >Hi,
>> >
>> >Are you saying these classes needs to change? If so, understood. But are
>> >you planning on renaming configs or relocating them? We dont want existing
>> >users to do any upgrading or reconfigurations
>> >
>> >On Fri, Dec 13, 2019 at 10:28 AM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi,
>> >>
>> >>
>> >> They need to change due to this, because only HoodieWriteConfig and
>> >> *Options will be kept.
>> >>
>> >>
>> >> best,
>> >> lamber-ken
>> >>
>> >>
>> >> At 2019-12-14 01:23:35, "Vinoth Chandar"  wrote:
>> >> >Hi,
>> >> >
>> >> >We are trying to understand if existing jobs (datasource,
>> deltastreamer,
>> >> >anything else) needs to change due to this.
>> >> >
>> >> >On Wed, Dec 11, 2019 at 7:18 PM lamberken  wrote:
>> >> >
>> >> >>
>> >> >>
>> >> >> Hi, @vinoth
>> >> >>
>> >> >>
>> >> >> 1, Hoodie*Config classes are only used to set default value when call
>> >> >> their build method currently.
>> >> >> They will be replaced by HoodieMemoryOptions, HoodieIndexOptions,
>> >> >> HoodieHBaseIndexOptions, etc...
>> >> >> 2, I don't understand the question "It is not clear to me whether
>> there
>> >> is
>> >> >> any external facing changes which changes this model.".
>> >> >>
>> >> >>
>> >> >> Best,
>> >> >> lamber-ken
>> >> >>
>> >> >>
>> >> >> At 2019-12-12 11:01:36, "Vinoth Chandar"  wrote:

Re:Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-17 Thread lamberken

Hi, @Vinoth



I'm glad to hear your thoughts on the new UI, thanks. So we will keep its style
as it is now.
The development of the new UI can be completed in the next few days; any
questions are welcome.


best,
lamber-ken


At 2019-12-18 11:44:27, "Vinoth Chandar"  wrote:
>The case for right navigation for me, is mainly from pages like
>
>https://lamber-ken.github.io/docs/docker_demo
>https://lamber-ken.github.io/docs/querying_data
>https://lamber-ken.github.io/docs/writing_data
>
>which often have commands/text you want to selectively copy paste from a
>section.
>For content you read sequentially, it matters less. I agree..
>
>BTW the new site looks very sleek.. :)
>
>
>
>On Tue, Dec 17, 2019 at 4:50 PM lamberken  wrote:
>
>>
>> hi, allOne more thing that is missing.In the new UI, I put a "BACK TO TOP"
>> button at the bottom of all pages to help us back to top.
>> We can also discuss whether we need the right navigation at the community
>> meeting today.best,
>> lamber-ken
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2019-12-18 08:41:49, "lamberken"  wrote:
>> >
>> >Hi @Vinoth,
>> >
>> >
>> >Thanks for raising this point, but I have some different views.
>> >
>> >
>> >I've thought about it very seriously before, and I remove the right
>> navigation finally.
>> >1, I have a deep analysis of the characteristics of our documents, most
>> of them have many
>> >commands, if the right navigation exists, it will affect us to read.
>> >2, Most documents are short, we can visit them all just at one page.
>> >3, The max width of web page is 1280px, left navigation is 250px(at
>> least), right navigation is 250px(at least),
>> >if so, the width of the main content is only left 800px, may it's not
>> suitable for readers.
>> >4, I also analysised other projects, like
>> >1) flink, spark, zeppelin, kafka, superset, elasticsearch, arrow,
>> kudu, hadoop don't have right navigation
>> >2) druid, kylin, beam have right navigation.
>> >These are my personal views. Welcome all community members to join in the
>> discussion.
>> >In the end, I will follow our community, thanks.
>> >
>> >
>> >BTW, I have synced most of the documents[1], we can use these documents
>> as a reference to see
>> >if we need the navigation bar on the right in the new UI.
>> >
>> >
>> >[1] https://lamber-ken.github.io/docs/admin_guide
>> >[2] https://lamber-ken.github.io/docs/writing_data
>> >[3] https://lamber-ken.github.io/docs/quick-start-guide/
>> >
>> >
>> >best,
>> >lamber-ken
>> >
>> >
>> >
>> >
>> >At 2019-12-18 04:44:04, "Vinoth Chandar"  wrote:
>> >>One more thing that is missing.
>> >>
>> >>Current site has a navigation links on the right, which lets you jump to
>> >>the right section directly. This is also a must-have IMHO.
>> >>I would suggest wait for more folks to come back from vacation, before we
>> >>finalize anything on this, as there could be more feedback
>> >>
>> >>
>> >>
>> >>On Mon, Dec 16, 2019 at 9:15 PM lamberken  wrote:
>> >>
>> >>>
>> >>> Hi Vinoth,
>> >>>
>> >>>
>> >>> 1, I'll update the site content this week, clean some useless templete
>> >>> codes, adjust the content etc...
>> >>> It will take a little long time for syncing the content.
>> >>> 2, I will adjust the style as much as I can to keep the theming blue
>> and
>> >>> white.
>> >>>
>> >>>
>> >>> When the above work is completed, I will notify you all again.
>> >>> best,
>> >>> lamber-ken
>> >>>
>> >>>
>> >>> At 2019-12-17 12:49:23, "Vinoth Chandar"  wrote:
>> >>> >Hi Lamber,
>> >>> >
>> >>> >+1 on the look and feel. Definitely feels slick and fast. Love the
>> syntax
>> >>> >highlighting.
>> >>> >
>> >>> >
>> >>> >Few things :
>> >>> >- Can we just update the site content as-is? ( I'd rather change just
>> the
>> >>> >look-and-feel and evolve the content from there, per usual means)
>> >>> >- Can we keep the theming blue and white, like now, since it gels well with the logo and images.

Re:Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-17 Thread lamberken

hi, all
One more thing that is missing. In the new UI, I put a "BACK TO TOP"
button at the bottom of all pages to help us back to top.
We can also discuss whether we need the right navigation at the community
meeting today.
best,
lamber-ken









At 2019-12-18 08:41:49, "lamberken"  wrote:
>
>Hi @Vinoth,
>
>
>Thanks for raising this point, but I have some different views.
>
>
>I've thought about it very seriously before, and I remove the right navigation 
>finally.
>1, I have a deep analysis of the characteristics of our documents, most of 
>them have many
>commands, if the right navigation exists, it will affect us to read.
>2, Most documents are short, we can visit them all just at one page.
>3, The max width of web page is 1280px, left navigation is 250px(at least), 
>right navigation is 250px(at least),
>if so, the width of the main content is only left 800px, may it's not 
> suitable for readers.
>4, I also analysised other projects, like
>1) flink, spark, zeppelin, kafka, superset, elasticsearch, arrow, kudu, 
> hadoop don't have right navigation
>2) druid, kylin, beam have right navigation.
>These are my personal views. Welcome all community members to join in the 
>discussion.
>In the end, I will follow our community, thanks.
>
>
>BTW, I have synced most of the documents[1], we can use these documents as a 
>reference to see 
>if we need the navigation bar on the right in the new UI.
>
>
>[1] https://lamber-ken.github.io/docs/admin_guide
>[2] https://lamber-ken.github.io/docs/writing_data
>[3] https://lamber-ken.github.io/docs/quick-start-guide/
>
>
>best,
>lamber-ken
>
>
>
>
>At 2019-12-18 04:44:04, "Vinoth Chandar"  wrote:
>>One more thing that is missing.
>>
>>Current site has a navigation links on the right, which lets you jump to
>>the right section directly. This is also a must-have IMHO.
>>I would suggest wait for more folks to come back from vacation, before we
>>finalize anything on this, as there could be more feedback
>>
>>
>>
>>On Mon, Dec 16, 2019 at 9:15 PM lamberken  wrote:
>>
>>>
>>> Hi Vinoth,
>>>
>>>
>>> 1, I'll update the site content this week, clean some useless templete
>>> codes, adjust the content etc...
>>> It will take a little long time for syncing the content.
>>> 2, I will adjust the style as much as I can to keep the theming blue and
>>> white.
>>>
>>>
>>> When the above work is completed, I will notify you all again.
>>> best,
>>> lamber-ken
>>>
>>>
>>> At 2019-12-17 12:49:23, "Vinoth Chandar"  wrote:
>>> >Hi Lamber,
>>> >
>>> >+1 on the look and feel. Definitely feels slick and fast. Love the syntax
>>> >highlighting.
>>> >
>>> >
>>> >Few things :
>>> >- Can we just update the site content as-is? ( I'd rather change just the
>>> >look-and-feel and evolve the content from there, per usual means)
>>> >- Can we keep the theming blue and white, like now, since it gels well
>>> with
>>> >the logo and images.
>>> >
>>> >
>>> >On Mon, Dec 16, 2019 at 8:02 AM lamberken  wrote:
>>> >
>>> >>
>>> >>
>>> >> Thanks for your reply @lees @vino @vinoth :)
>>> >>
>>> >>
>>> >> best,
>>> >> lamber-ken
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> 在 2019-12-16 12:24:26,"leesf"  写道:
>>> >> >Hi Lamber,
>>> >> >
>>> >> >Thanks for your work, have gone through the new web ui, looks good.
>>> >> >Hence +1 from my side.
>>> >> >
>>> >> >Best,
>>> >> >Leesf
>>> >> >
>>> >> >vino yang  于2019年12月16日周一 上午10:17写道:
>>> >> >
>>> >> >> Hi Lamber,
>>> >> >>
>>> >> >> I am not an expert on Jekyll. But big +1 for your proposal to improve
>>> >> the
>>> >> >> site.
>>> >> >>
>>> >> >> Best,
>>> >> >> Vino
>>> >> >>
>>> >> >> Vinoth Chandar  于2019年12月16日周一 上午3:15写道:
>>> >> >>
>> >> >> > Thanks for taking the time to improve the site. Will review closely and get back to you.

Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-17 Thread lamberken

Hi @Vinoth,


Thanks for raising this point, but I have some different views.


I've thought about it very seriously before, and I finally removed the right
navigation.
1, I did a deep analysis of the characteristics of our documents: most of them
contain many commands, and if the right navigation exists, it gets in the way of
reading them.
2, Most documents are short; we can read each of them on a single page.
3, The max width of the web page is 1280px, the left navigation is 250px (at
least), and a right navigation would be another 250px (at least).
If so, only about 800px would be left for the main content, which may not be
suitable for readers.
4, I also analysed other projects:
1) flink, spark, zeppelin, kafka, superset, elasticsearch, arrow, kudu,
hadoop don't have a right navigation
2) druid, kylin, beam have a right navigation.
These are my personal views; all community members are welcome to join the
discussion.
In the end, I will follow our community, thanks.


BTW, I have synced most of the documents [1]; we can use these documents as a
reference to decide whether we need the navigation bar on the right in the new UI.


[1] https://lamber-ken.github.io/docs/admin_guide
[2] https://lamber-ken.github.io/docs/writing_data
[3] https://lamber-ken.github.io/docs/quick-start-guide/


best,
lamber-ken




At 2019-12-18 04:44:04, "Vinoth Chandar"  wrote:
>One more thing that is missing.
>
>Current site has a navigation links on the right, which lets you jump to
>the right section directly. This is also a must-have IMHO.
>I would suggest wait for more folks to come back from vacation, before we
>finalize anything on this, as there could be more feedback
>
>
>
>On Mon, Dec 16, 2019 at 9:15 PM lamberken  wrote:
>
>>
>> Hi Vinoth,
>>
>>
>> 1, I'll update the site content this week, clean some useless templete
>> codes, adjust the content etc...
>> It will take a little long time for syncing the content.
>> 2, I will adjust the style as much as I can to keep the theming blue and
>> white.
>>
>>
>> When the above work is completed, I will notify you all again.
>> best,
>> lamber-ken
>>
>>
>> At 2019-12-17 12:49:23, "Vinoth Chandar"  wrote:
>> >Hi Lamber,
>> >
>> >+1 on the look and feel. Definitely feels slick and fast. Love the syntax
>> >highlighting.
>> >
>> >
>> >Few things :
>> >- Can we just update the site content as-is? ( I'd rather change just the
>> >look-and-feel and evolve the content from there, per usual means)
>> >- Can we keep the theming blue and white, like now, since it gels well
>> with
>> >the logo and images.
>> >
>> >
>> >On Mon, Dec 16, 2019 at 8:02 AM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Thanks for your reply @lees @vino @vinoth :)
>> >>
>> >>
>> >> best,
>> >> lamber-ken
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> 在 2019-12-16 12:24:26,"leesf"  写道:
>> >> >Hi Lamber,
>> >> >
>> >> >Thanks for your work, have gone through the new web ui, looks good.
>> >> >Hence +1 from my side.
>> >> >
>> >> >Best,
>> >> >Leesf
>> >> >
>> >> >vino yang  于2019年12月16日周一 上午10:17写道:
>> >> >
>> >> >> Hi Lamber,
>> >> >>
>> >> >> I am not an expert on Jekyll. But big +1 for your proposal to improve
>> >> the
>> >> >> site.
>> >> >>
>> >> >> Best,
>> >> >> Vino
>> >> >>
>> >> >> Vinoth Chandar  于2019年12月16日周一 上午3:15写道:
>> >> >>
>> >> >> > Thanks for taking the time to improve the site. Will review closely
>> >> and
>> >> >> get
>> >> >> > back to you.
>> >> >> >
>> >> >> > On Sun, Dec 15, 2019 at 11:02 AM lamberken 
>> wrote:
>> >> >> >
>> >> >> > >
>> >> >> > >
>> >> >> > > Hello, everyone.
>> >> >> > >
>> >> >> > >
>> >> >> > > Compare to the web site of Delta Lake[1] and Apache Iceberg[2],
>> they
>> >> >> may
>> >> >> > > looks better than hudi project[3].
>> >> >> > >
>> >> >> > >
>> >> >> > > I delved into our web ui and try to improve it, I learned that
>> hudi
>> >> web
>> >> >> > ui
>> >> >> > > is based on jekyll-doc[4] theme
>> >> >> > > which is not active. So it needs us to find a new active theme.
>> >> >> > >
>> >> >> > >
>> >> >> > > So I try my best to find a free and beatiful theme in the past.
>> >> >> > > Fortunately, I found a suitable theme
>> >> >> > > in the huge amount of themes(check them one by one). It is
>> >> >> > > minimal-mistakes[5], it's very popular and 100% free.
>> >> >> > >
>> >> >> > >
>> >> >> > > Based on minimal theme, I rework a basic new web ui framework. I
>> >> adjust
>> >> >> > > some css styles, nav bars and etc..
>> >> >> > > If you are interested in this, please visit
>> >> >> https://lamber-ken.github.io
>> >> >> > > for a quick overview.
>> >> >> > >
>> >> >> > >
>> >> >> > > I’m looking forward to your reply, thanks!
>> >> >> > >
>> >> >> > >
>> >> >> > > [1] https://delta.io
>> >> >> > > [2] https://iceberg.apache.org
>> >> >> > > [3] http://hudi.apache.org
>> >> >> > > [4] https://github.com/tomjoht/documentation-theme-jekyll
>> >> >> > > [5] https://github.com/mmistakes/minimal-mistakes
>> >> >> > >
>> >> >> > >
>> >> >> > > best,
>> >> >> > > lamber-ken
>> >> >> > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>


Re:Re: Re: [DISCUSS] Rework of new web site

2019-12-16 Thread lamberken

Hi Vinoth,


1, I'll update the site content this week: clean up some useless template code,
adjust the content, etc.
Syncing the content will take a fairly long time.
2, I will adjust the style as much as I can to keep the theming blue and white.


When the above work is completed, I will notify you all again.
best,
lamber-ken


At 2019-12-17 12:49:23, "Vinoth Chandar"  wrote:
>Hi Lamber,
>
>+1 on the look and feel. Definitely feels slick and fast. Love the syntax
>highlighting.
>
>
>Few things :
>- Can we just update the site content as-is? ( I'd rather change just the
>look-and-feel and evolve the content from there, per usual means)
>- Can we keep the theming blue and white, like now, since it gels well with
>the logo and images.
>
>
>On Mon, Dec 16, 2019 at 8:02 AM lamberken  wrote:
>
>>
>>
>> Thanks for your reply @lees @vino @vinoth :)
>>
>>
>> best,
>> lamber-ken
>>
>>
>>
>>
>>
>>
>> 在 2019-12-16 12:24:26,"leesf"  写道:
>> >Hi Lamber,
>> >
>> >Thanks for your work, have gone through the new web ui, looks good.
>> >Hence +1 from my side.
>> >
>> >Best,
>> >Leesf
>> >
>> >vino yang  于2019年12月16日周一 上午10:17写道:
>> >
>> >> Hi Lamber,
>> >>
>> >> I am not an expert on Jekyll. But big +1 for your proposal to improve
>> the
>> >> site.
>> >>
>> >> Best,
>> >> Vino
>> >>
>> >> Vinoth Chandar  于2019年12月16日周一 上午3:15写道:
>> >>
>> >> > Thanks for taking the time to improve the site. Will review closely
>> and
>> >> get
>> >> > back to you.
>> >> >
>> >> > On Sun, Dec 15, 2019 at 11:02 AM lamberken  wrote:
>> >> >
>> >> > >
>> >> > >
>> >> > > Hello, everyone.
>> >> > >
>> >> > >
>> >> > > Compare to the web site of Delta Lake[1] and Apache Iceberg[2], they
>> >> may
>> >> > > looks better than hudi project[3].
>> >> > >
>> >> > >
>> >> > > I delved into our web ui and try to improve it, I learned that hudi
>> web
>> >> > ui
>> >> > > is based on jekyll-doc[4] theme
>> >> > > which is not active. So it needs us to find a new active theme.
>> >> > >
>> >> > >
>> >> > > So I try my best to find a free and beatiful theme in the past.
>> >> > > Fortunately, I found a suitable theme
>> >> > > in the huge amount of themes(check them one by one). It is
>> >> > > minimal-mistakes[5], it's very popular and 100% free.
>> >> > >
>> >> > >
>> >> > > Based on minimal theme, I rework a basic new web ui framework. I
>> adjust
>> >> > > some css styles, nav bars and etc..
>> >> > > If you are interested in this, please visit
>> >> https://lamber-ken.github.io
>> >> > > for a quick overview.
>> >> > >
>> >> > >
>> >> > > I’m looking forward to your reply, thanks!
>> >> > >
>> >> > >
>> >> > > [1] https://delta.io
>> >> > > [2] https://iceberg.apache.org
>> >> > > [3] http://hudi.apache.org
>> >> > > [4] https://github.com/tomjoht/documentation-theme-jekyll
>> >> > > [5] https://github.com/mmistakes/minimal-mistakes
>> >> > >
>> >> > >
>> >> > > best,
>> >> > > lamber-ken
>> >> > >
>> >> > >
>> >> >
>> >>
>>


Re:Re: [DISCUSS] Rework of new web site

2019-12-16 Thread lamberken


Thanks for your reply @lees @vino @vinoth :)


best,
lamber-ken






At 2019-12-16 12:24:26, "leesf"  wrote:
>Hi Lamber,
>
>Thanks for your work, have gone through the new web ui, looks good.
>Hence +1 from my side.
>
>Best,
>Leesf
>
>vino yang  于2019年12月16日周一 上午10:17写道:
>
>> Hi Lamber,
>>
>> I am not an expert on Jekyll. But big +1 for your proposal to improve the
>> site.
>>
>> Best,
>> Vino
>>
>> Vinoth Chandar  于2019年12月16日周一 上午3:15写道:
>>
>> > Thanks for taking the time to improve the site. Will review closely and
>> get
>> > back to you.
>> >
>> > On Sun, Dec 15, 2019 at 11:02 AM lamberken  wrote:
>> >
>> > >
>> > >
>> > > Hello, everyone.
>> > >
>> > >
>> > > Compare to the web site of Delta Lake[1] and Apache Iceberg[2], they
>> may
>> > > looks better than hudi project[3].
>> > >
>> > >
>> > > I delved into our web ui and try to improve it, I learned that hudi web
>> > ui
>> > > is based on jekyll-doc[4] theme
>> > > which is not active. So it needs us to find a new active theme.
>> > >
>> > >
>> > > So I try my best to find a free and beatiful theme in the past.
>> > > Fortunately, I found a suitable theme
>> > > in the huge amount of themes(check them one by one). It is
>> > > minimal-mistakes[5], it's very popular and 100% free.
>> > >
>> > >
>> > > Based on minimal theme, I rework a basic new web ui framework. I adjust
>> > > some css styles, nav bars and etc..
>> > > If you are interested in this, please visit
>> https://lamber-ken.github.io
>> > > for a quick overview.
>> > >
>> > >
>> > > I’m looking forward to your reply, thanks!
>> > >
>> > >
>> > > [1] https://delta.io
>> > > [2] https://iceberg.apache.org
>> > > [3] http://hudi.apache.org
>> > > [4] https://github.com/tomjoht/documentation-theme-jekyll
>> > > [5] https://github.com/mmistakes/minimal-mistakes
>> > >
>> > >
>> > > best,
>> > > lamber-ken
>> > >
>> > >
>> >
>>


[DISCUSS] Rework of new web site

2019-12-15 Thread lamberken


Hello, everyone.


The web sites of Delta Lake [1] and Apache Iceberg [2] may look better than the
hudi project's site [3].


I delved into our web UI to try to improve it, and learned that the hudi web UI
is based on the jekyll-doc theme [4],
which is no longer active. So we need to find a new, actively maintained theme.


I have tried my best to find a free and beautiful theme. Fortunately, I found a
suitable theme
among a huge number of themes (checking them one by one): minimal-mistakes [5],
which is very popular and 100% free.


Based on the minimal-mistakes theme, I reworked a basic new web UI framework,
adjusting some
CSS styles, nav bars, etc.
If you are interested in this, please visit https://lamber-ken.github.io for a 
quick overview.


I’m looking forward to your reply, thanks!


[1] https://delta.io
[2] https://iceberg.apache.org
[3] http://hudi.apache.org
[4] https://github.com/tomjoht/documentation-theme-jekyll
[5] https://github.com/mmistakes/minimal-mistakes


best,
lamber-ken



Re:Re: Re: Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-13 Thread lamberken


Hi, @vinoth


Okay, I see. If we don't want existing users to do any upgrading or
reconfiguration, then this refactoring work will not make much sense.
This issue can be closed, because ConfigOptions and these builders do the same
things.
On the other hand, if we finish this work before a stable release, we will
benefit a lot from it. We need to make a choice.


BTW, I have a question: will users use HoodieWriteConfig /
HoodieWriteClient directly in their programs?
/
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
.withPath(basePath)
.forTable(tableName)
.withSchema(schemaStr)
.withProps(props) // pass raw k,v pairs from a property file.

.withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
.withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
...
.build();
/
OR
/
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
/




best,
lamber-ken

At 2019-12-14 08:43:06, "Vinoth Chandar"  wrote:
>Hi,
>
>Are you saying these classes needs to change? If so, understood. But are
>you planning on renaming configs or relocating them? We dont want existing
>users to do any upgrading or reconfigurations
>
>On Fri, Dec 13, 2019 at 10:28 AM lamberken  wrote:
>
>>
>>
>> Hi,
>>
>>
>> They need to change due to this, because only HoodieWriteConfig and
>> *Options will be kept.
>>
>>
>> best,
>> lamber-ken
>>
>>
>> At 2019-12-14 01:23:35, "Vinoth Chandar"  wrote:
>> >Hi,
>> >
>> >We are trying to understand if existing jobs (datasource, deltastreamer,
>> >anything else) needs to change due to this.
>> >
>> >On Wed, Dec 11, 2019 at 7:18 PM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi, @vinoth
>> >>
>> >>
>> >> 1, Hoodie*Config classes are only used to set default value when call
>> >> their build method currently.
>> >> They will be replaced by HoodieMemoryOptions, HoodieIndexOptions,
>> >> HoodieHBaseIndexOptions, etc...
>> >> 2, I don't understand the question "It is not clear to me whether there
>> is
>> >> any external facing changes which changes this model.".
>> >>
>> >>
>> >> Best,
>> >> lamber-ken
>> >>
>> >>
>> >> At 2019-12-12 11:01:36, "Vinoth Chandar"  wrote:
>> >> >I actually prefer the builder pattern for making the configs, because I
>> >> can
>> >> >do `builder.` in the IDE and actually see all the options... That said,
>> >> >most developers program against the Spark datasource and so this may
>> not
>> >> be
>> >> >useful, unless we expose a builder for that.. I will concede that since
>> >> its
>> >> >also subjective anyway.
>> >> >
>> >> >But, to clarify Siva's question, you do intend to keep the different
>> >> >component level config classes right - HoodieIndexConfig,
>> >> >HoodieCompactionConfig?
>> >> >
>> >> >Once again, can you please explicitly address the following question,
>> so
>> >> we
>> >> >can get on the same page?
>> >> >>> It is not clear to me whether there is any external facing changes
>> >> which
>> >> >changes this model.
>> >> >This is still the most critical question from both me and balaji.
>> >> >
>> >> >On Wed, Dec 11, 2019 at 11:35 AM lamberken  wrote:
>> >> >
>> >> >>  hi, @Sivabalan
>> >> >>
>> >> >> Yes, thanks very much for help me explain my initial proposal.
>> >> >>
>> >> >>
>> >> >> Answer your question, we can call HoodieWriteConfig as a SystemConfig, we need to pass it everywhere.

Re:Re: Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-13 Thread lamberken


Hi, 


They need to change due to this, because only HoodieWriteConfig and *Options 
will be kept.


best,
lamber-ken


At 2019-12-14 01:23:35, "Vinoth Chandar"  wrote:
>Hi,
>
>We are trying to understand if existing jobs (datasource, deltastreamer,
>anything else) needs to change due to this.
>
>On Wed, Dec 11, 2019 at 7:18 PM lamberken  wrote:
>
>>
>>
>> Hi, @vinoth
>>
>>
>> 1, Hoodie*Config classes are only used to set default value when call
>> their build method currently.
>> They will be replaced by HoodieMemoryOptions, HoodieIndexOptions,
>> HoodieHBaseIndexOptions, etc...
>> 2, I don't understand the question "It is not clear to me whether there is
>> any external facing changes which changes this model.".
>>
>>
>> Best,
>> lamber-ken
>>
>>
>> At 2019-12-12 11:01:36, "Vinoth Chandar"  wrote:
>> >I actually prefer the builder pattern for making the configs, because I
>> can
>> >do `builder.` in the IDE and actually see all the options... That said,
>> >most developers program against the Spark datasource and so this may not
>> be
>> >useful, unless we expose a builder for that.. I will concede that since
>> its
>> >also subjective anyway.
>> >
>> >But, to clarify Siva's question, you do intend to keep the different
>> >component level config classes right - HoodieIndexConfig,
>> >HoodieCompactionConfig?
>> >
>> >Once again, can you please explicitly address the following question, so
>> we
>> >can get on the same page?
>> >>> It is not clear to me whether there is any external facing changes
>> which
>> >changes this model.
>> >This is still the most critical question from both me and balaji.
>> >
>> >On Wed, Dec 11, 2019 at 11:35 AM lamberken  wrote:
>> >
>> >>  hi, @Sivabalan
>> >>
>> >> Yes, thanks very much for help me explain my initial proposal.
>> >>
>> >>
>> >> Answer your question, we can call HoodieWriteConfig as a SystemConfig,
>> we
>> >> need to pass it everywhere.
>> >> Actually, it may just contains a few custom configurations( does not
>> >> include default configurations)
>> >> Because each component has its own ConfigOptions.
>> >>
>> >>
>> >> The old version HoodieWriteConfig includes all keys(custom
>> configurations,
>> >> default configurations), it is a fat.
>> >>
>> >>
>> >> Best,
>> >> lamber-ken
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> At 2019-12-12 03:14:11, "Sivabalan"  wrote:
>> >> >Let me summarize your initial proposal and then will get into details.
>> >> >- Introduce ConfigOptions for ease of handling of default values.
>> >> >- Remove all Hoodie*Config classes and just have HoodieWriteConfig.
>> What
>> >> >this means is that, every other config file will be replaced by
>> >> >ConfigOptions. eg, HoodieIndexConfigOption,
>> HoodieCompactionConfigOption,
>> >> >etc.
>> >> >- Config option will take care of returning defaults for any property,
>> >> even
>> >> >if an entire Config(eg IndexConfig) is not explicitly set.
>> >> >
>> >> >Here are the positives I see.
>> >> >- By way of having component level ConfigOptions, we bucketize the
>> configs
>> >> >and have defaults set(same as before)
>> >> >- User doesn't need to set each component's config(eg IndexConfig)
>> >> >explicitly with HoodieWriteConfig.
>> >> >
>> >> >But have one question:
>> >> >- I see Bucketizing only in write path. How does one get hold of
>> >> >IndexConfigOptions as a consumer?  For eg, If some class is using just
>> >> >IndexConfig alone, how will it consume? From your eg, I see only
>> >> >HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then?
>> >> >Wouldn't that contradicts your initial proposal to not have a fat
>> config
>> >> >class? May be can you expand your example below to show how a consumer
>> of
>> >> >IndexConfig look like.
>> 

Re:Re: Checkstyle changes?

2019-12-12 Thread lamberken

You are welcome. For details, see HUDI-363:
https://issues.apache.org/jira/browse/HUDI-363



best,
lamber-ken

At 2019-12-13 03:49:12, "Sivabalan"  wrote:
>thanks lamber-ken. Sorry, I wasn't paying close attention to these changes.
>Don't we make a separate PR (with just the changes pertaining to new check
>style rules across entire repo) whenever a new change is made to check
>style ? I rebased with latest and in order to get my build pass, I have
>already fixed like 20 files and the list keeps growing.
>
>
>
>
>
>
>
>On Thu, Dec 12, 2019 at 10:14 AM lamberken  wrote:
>
>>
>>
>> Hi, @Sivabalan
>>
>> The new ImportOrder rule split import statements into groups and groups
>> are separated by one blank line.
>> These groups are 1) org.apache.hudi   2) third party imports   3) javax
>>  4) java   5) static
>>
>>
>> For example
>>
>> /---
>> package org.apache.hudi.metrics;
>>
>> import org.apache.hudi.config.HoodieWriteConfig;
>> import org.apache.hudi.exception.HoodieException;
>>
>> import com.google.common.base.Preconditions;
>> import org.apache.log4j.LogManager;
>> import org.apache.log4j.Logger;
>>
>> import javax.management.remote.JMXConnectorServer;
>> import javax.management.remote.JMXConnectorServerFactory;
>> import javax.management.remote.JMXServiceURL;
>>
>> import java.io.Closeable;
>> import java.lang.management.ManagementFactory;
>> import java.rmi.registry.LocateRegistry;
>>
>> public class JmxMetricsReporter extends MetricsReporter {
>>
>>
>> /---
>>
>>
>> best,
>> lamber-ken
>>
>> 在 2019-12-13 01:01:05,"Sivabalan"  写道:
>>
>> Hi folks,
>> Is there any recent change wrt checkstyle? Usually I run "mvn package
>> -DskipTests" locally to check for any checkstyle and build errors. And
>> travis CI usually stays in line with that. But recently(probably a week or
>> 10 days), even though my local maven package command succeeds, travis CI
>> fails specifically wrt import ordering.
>>
>>
>> When I apply reformat code via intellij, usually I choose just "Optimize
>> Imports". But this time around, I also tried choosing "Rearrange entries",
>> but none helped me in fixing the travis CI failure.
>>
>>
>> Here is my travis CI build:
>> https://travis-ci.org/apache/incubator-hudi/jobs/624228722?utm_medium=notification&utm_source=github_status
>>
>>
>> - Do others face this issue or it is just me?
>> - Can someone give some pointers on how to go about fixing this?
>>
>>
>> --
>>
>> Regards,
>> -Sivabalan
>
>
>
>-- 
>Regards,
>-Sivabalan


Re:Checkstyle changes?

2019-12-12 Thread lamberken


Hi, @Sivabalan
 
The new ImportOrder rule splits import statements into groups, and groups are
separated by one blank line.
These groups are 1) org.apache.hudi   2) third-party imports   3) javax   4)
java   5) static


For example
/---
package org.apache.hudi.metrics;

import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.exception.HoodieException;

import com.google.common.base.Preconditions;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

import java.io.Closeable;
import java.lang.management.ManagementFactory;
import java.rmi.registry.LocateRegistry;

public class JmxMetricsReporter extends MetricsReporter {

/---


best,
lamber-ken

At 2019-12-13 01:01:05, "Sivabalan"  wrote:

Hi folks,
Is there any recent change wrt checkstyle? Usually I run "mvn package 
-DskipTests" locally to check for any checkstyle and build errors. And travis 
CI usually stays in line with that. But recently(probably a week or 10 days), 
even though my local maven package command succeeds, travis CI fails 
specifically wrt import ordering. 


When I apply reformat code via intellij, usually I choose just "Optimize 
Imports". But this time around, I also tried choosing "Rearrange entries", but 
none helped me in fixing the travis CI failure. 


Here is my travis CI build: 
https://travis-ci.org/apache/incubator-hudi/jobs/624228722?utm_medium=notification&utm_source=github_status


- Do others face this issue or it is just me? 
- Can someone give some pointers on how to go about fixing this?


--

Regards,
-Sivabalan

Re:Re: Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread lamberken


Hi, @vinoth


1, Hoodie*Config classes are currently only used to set default values when
their build methods are called.
They will be replaced by HoodieMemoryOptions, HoodieIndexOptions,
HoodieHBaseIndexOptions, etc.
2, I don't understand the question "It is not clear to me whether there is any 
external facing changes which changes this model.".


Best,
lamber-ken


At 2019-12-12 11:01:36, "Vinoth Chandar"  wrote:
>I actually prefer the builder pattern for making the configs, because I can
>do `builder.` in the IDE and actually see all the options... That said,
>most developers program against the Spark datasource and so this may not be
>useful, unless we expose a builder for that.. I will concede that since its
>also subjective anyway.
>
>But, to clarify Siva's question, you do intend to keep the different
>component level config classes right - HoodieIndexConfig,
>HoodieCompactionConfig?
>
>Once again, can you please explicitly address the following question, so we
>can get on the same page?
>>> It is not clear to me whether there is any external facing changes which
>changes this model.
>This is still the most critical question from both me and balaji.
>
>On Wed, Dec 11, 2019 at 11:35 AM lamberken  wrote:
>
>>  hi, @Sivabalan
>>
>> Yes, thanks very much for help me explain my initial proposal.
>>
>>
>> Answer your question, we can call HoodieWriteConfig as a SystemConfig, we
>> need to pass it everywhere.
>> Actually, it may just contains a few custom configurations( does not
>> include default configurations)
>> Because each component has its own ConfigOptions.
>>
>>
>> The old version HoodieWriteConfig includes all keys(custom configurations,
>> default configurations), it is a fat.
>>
>>
>> Best,
>> lamber-ken
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2019-12-12 03:14:11, "Sivabalan"  wrote:
>> >Let me summarize your initial proposal and then will get into details.
>> >- Introduce ConfigOptions for ease of handling of default values.
>> >- Remove all Hoodie*Config classes and just have HoodieWriteConfig. What
>> >this means is that, every other config file will be replaced by
>> >ConfigOptions. eg, HoodieIndexConfigOption, HoodieCompactionConfigOption,
>> >etc.
>> >- Config option will take care of returning defaults for any property,
>> even
>> >if an entire Config(eg IndexConfig) is not explicitly set.
>> >
>> >Here are the positives I see.
>> >- By way of having component level ConfigOptions, we bucketize the configs
>> >and have defaults set(same as before)
>> >- User doesn't need to set each component's config(eg IndexConfig)
>> >explicitly with HoodieWriteConfig.
>> >
>> >But have one question:
>> >- I see Bucketizing only in write path. How does one get hold of
>> >IndexConfigOptions as a consumer?  For eg, If some class is using just
>> >IndexConfig alone, how will it consume? From your eg, I see only
>> >HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then?
>> >Wouldn't that contradicts your initial proposal to not have a fat config
>> >class? May be can you expand your example below to show how a consumer of
>> >IndexConfig look like.
>> >
>> >Your eg:
>> >/**
>> > * New version
>> > */
>> >// set value overrite the default value
>> >HoodieWriteConfig config = new HoodieWriteConfig();
>> >config.set(HoodieIndexConfigOptions.INDEX_TYPE,
>> >HoodieIndex.IndexType.HBASE.name <
>> http://hoodieindex.indextype.hbase.name/>
>> >())
>> >
>> >
>> >
>> >
>> >On Wed, Dec 11, 2019 at 8:33 AM lamberken  wrote:
>> >
>> >>
>> >>
>> >> Hi,
>> >>
>> >>
>> >>
>> >>
>> >> On 1,2. Yes, you are right, moving the getter to the component level
>> >> Config class itself.
>> >>
>> >>
>> >> On 3, HoodieWriteConfig can also set value through ConfigOption, small
>> >> code snippets.
>> >> From the bellow snippets, we can see that clients need to know each
>> >> component's builders
>> >> and also call their "with" methods to override the default value in old
>> >> version.
>> >>
>> >>
>> >> But, in new version, clients just need to know each component's public
>> >> config options, just like constants.
>>

Re:Re: Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread lamberken
 hi, @Sivabalan
 
Yes, thanks very much for helping me explain my initial proposal.


To answer your question: we can think of HoodieWriteConfig as a system-level
config that we need to pass everywhere.
Actually, it may contain just a few custom configurations (it does not need to
include the default configurations),
because each component has its own ConfigOptions.


The old version of HoodieWriteConfig includes all keys (custom configurations
and default configurations), so it is a fat config class.
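
To make the lookup order concrete, here is a rough sketch (illustrative only; it
assumes HoodieWriteConfig keeps a java.util.Properties field named "props" that
holds just the user-supplied overrides, and a ConfigOption holder like the one
discussed in this thread):

/---/
// Sketch only: a user-provided value wins; otherwise we fall back to the
// component option's default.
public String getString(ConfigOption<String> option) {
  return props.getProperty(option.key(), option.defaultValue());
}
/---/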


Best,
lamber-ken








At 2019-12-12 03:14:11, "Sivabalan"  wrote:
>Let me summarize your initial proposal and then will get into details.
>- Introduce ConfigOptions for ease of handling of default values.
>- Remove all Hoodie*Config classes and just have HoodieWriteConfig. What
>this means is that, every other config file will be replaced by
>ConfigOptions. eg, HoodieIndexConfigOption, HoodieCompactionConfigOption,
>etc.
>- Config option will take care of returning defaults for any property, even
>if an entire Config(eg IndexConfig) is not explicitly set.
>
>Here are the positives I see.
>- By way of having component level ConfigOptions, we bucketize the configs
>and have defaults set(same as before)
>- User doesn't need to set each component's config(eg IndexConfig)
>explicitly with HoodieWriteConfig.
>
>But have one question:
>- I see Bucketizing only in write path. How does one get hold of
>IndexConfigOptions as a consumer?  For eg, If some class is using just
>IndexConfig alone, how will it consume? From your eg, I see only
>HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then?
>Wouldn't that contradicts your initial proposal to not have a fat config
>class? May be can you expand your example below to show how a consumer of
>IndexConfig look like.
>
>Your eg:
>/**
> * New version
> */
>// set value overrite the default value
>HoodieWriteConfig config = new HoodieWriteConfig();
>config.set(HoodieIndexConfigOptions.INDEX_TYPE,
>HoodieIndex.IndexType.HBASE.name <http://hoodieindex.indextype.hbase.name/>
>())
>
>
>
>
>On Wed, Dec 11, 2019 at 8:33 AM lamberken  wrote:
>
>>
>>
>> Hi,
>>
>>
>>
>>
>> On 1,2. Yes, you are right, moving the getter to the component level
>> Config class itself.
>>
>>
>> On 3, HoodieWriteConfig can also set value through ConfigOption, small
>> code snippets.
>> From the bellow snippets, we can see that clients need to know each
>> component's builders
>> and also call their "with" methods to override the default value in old
>> version.
>>
>>
>> But, in new version, clients just need to know each component's public
>> config options, just like constants.
>> So, these builders are redundant.
>>
>>
>> /---/
>>
>>
>> public class HoodieIndexConfigOptions {
>>   public static final ConfigOption INDEX_TYPE = ConfigOption
>>   .key("hoodie.index.type")
>>   .defaultValue(HoodieIndex.IndexType.BLOOM.name());
>> }
>>
>>
>> public class HoodieWriteConfig {
>>   public void setString(ConfigOption option, String value) {
>> this.props.put(option.key(), value);
>>   }
>> }
>>
>>
>>
>>
>> /**
>>  * New version
>>  */
>> // set value overrite the default value
>> HoodieWriteConfig config = new HoodieWriteConfig();
>> config.set(HoodieIndexConfigOptions.INDEX_TYPE,
>> HoodieIndex.IndexType.HBASE.name())
>>
>>
>>
>>
>> /**
>>  * Old version
>>  */
>> HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder()
>>
>> builder.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
>>
>>
>>
>> /---/
>>
>>
>> Another, users use hudi like bellow, here're all keys.
>>
>> /---/
>>
>>
>> df.write.format("hudi").
>> option("hoodie.insert.shuffle.parallelism", "10").
>> option("hoodie.upsert.shuffle.parallelism", "10").
>> option("hoodie.delete.shuffle.parallelism", "10").
>> option("hoodie.bulkinsert.shuffle.parallelism", "10").
>> option("hoodie.datasource.write.recordkey.field", "name").
>> option("hoodie.datasource.write.partitionpath.field", "location").
>>   

Re:Re: Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-11 Thread lamberken


Hi, 




On 1 and 2: yes, you are right, the getters move to the component-level Config
class itself.


On 3: HoodieWriteConfig can also set values through ConfigOption; see the small
code snippets below.
From the snippets below, we can see that in the old version clients need to know
each component's builder
and also call its "with" methods to override the default values.


But in the new version, clients just need to know each component's public config
options, which read like constants.
So these builders are redundant.
 
/---/


public class HoodieIndexConfigOptions {
  public static final ConfigOption INDEX_TYPE = ConfigOption
  .key("hoodie.index.type")
  .defaultValue(HoodieIndex.IndexType.BLOOM.name());
}


public class HoodieWriteConfig {
  public void setString(ConfigOption option, String value) {
this.props.put(option.key(), value);
  }
}




/**
 * New version
 */
// set value overrite the default value
HoodieWriteConfig config = new HoodieWriteConfig();
config.set(HoodieIndexConfigOptions.INDEX_TYPE, 
HoodieIndex.IndexType.HBASE.name())




/**
 * Old version
 */
HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder()
builder.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())


/---/
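
For clarity, here is a minimal sketch of what the ConfigOption holder itself
could look like. This is only an illustration: the generic type and the
builder/method names below are my assumptions, not the final API.

/---/
public class ConfigOption<T> {

  private final String key;
  private final T defaultValue;

  private ConfigOption(String key, T defaultValue) {
    this.key = key;
    this.defaultValue = defaultValue;
  }

  // Entry point used above: ConfigOption.key("...").defaultValue(...)
  public static OptionBuilder key(String key) {
    return new OptionBuilder(key);
  }

  public String key() {
    return key;
  }

  public T defaultValue() {
    return defaultValue;
  }

  public static final class OptionBuilder {
    private final String key;

    private OptionBuilder(String key) {
      this.key = key;
    }

    // Binds the default value and fixes the option's type.
    public <T> ConfigOption<T> defaultValue(T defaultValue) {
      return new ConfigOption<>(key, defaultValue);
    }
  }
}
/---/

With something like this, HoodieWriteConfig only needs a generic getter/setter
keyed by ConfigOption, and each default value stays next to the key it belongs to.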


Also, users use hudi like below; here are all the keys.
/---/


df.write.format("hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode(Overwrite).
save(basePath);


/---/




Last, as I responded to @vino, it is reasonable to handle fallback keys. I think
we need to do this step by step:
it will be easy to integrate FallbackKey in the future, but it is not what we
need right now in my opinion. A rough sketch of what that could look like is below.
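
Purely as an illustration (modelled loosely on Flink's FallbackKey idea; none of
these class or method names exist in hudi today), resolving a value with
fallback (deprecated) keys could look like this:

/---/
import java.util.List;
import java.util.Properties;

public class FallbackKeyResolver {

  public static String resolve(Properties props, String key,
                               List<String> fallbackKeys, String defaultValue) {
    if (props.containsKey(key)) {
      return props.getProperty(key);
    }
    // Check the older (deprecated) key names in order, so existing jobs
    // that still set the old keys keep working unchanged.
    for (String fallback : fallbackKeys) {
      if (props.containsKey(fallback)) {
        return props.getProperty(fallback);
      }
    }
    return defaultValue;
  }
}
/---/

With that in place, a renamed config key could keep serving existing jobs by
listing the old key name as a fallback.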


If some places are still not very clear, feel free to give feedback.




Best,
lamber-ken












At 2019-12-11 23:41:31, "Vinoth Chandar"  wrote:
>Hi Lamber-ken,
>
>I looked at the sample PR you put up as well.
>
>On 1,2 => Seems your intent is to replace these with moving the getter to
>the component level Config class itself? I am fine with that (although I
>think its not that big of a hurdle really to use atm). But, once we do that
>we could pass just the specific component config into parts of code versus
>passing in the entire HoodieWriteConfig object. I am fine with moving the
>classes to a ConfigOption class as you suggested as well.
>
>On 3, I still we feel we will need the builder pattern going forward. to
>build the HoodieWriteConfig object. Like below? Cannot understand why we
>would want to change this. Could you please clarify?
>
>HoodieWriteConfig.Builder builder =
>
> HoodieWriteConfig.newBuilder().withPath(cfg.targetBasePath).combineInput(cfg.filterDupes,
>true)
>
> .withCompactionConfig(HoodieCompactionConfig.newBuilder().withPayloadClass(cfg.payloadClassName)
>// Inline compaction is disabled for continuous mode.
>otherwise enabled for MOR
>.withInlineCompaction(cfg.isInlineCompactionEnabled()).build())
>.forTable(cfg.targetTableName)
>
> .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
>.withAutoCommit(false).withProps(props);
>
>
>Typically, we write RFCs for large changes that breaks existing behavior or
>introduces significantly complex new features.. If you are just planning to
>do the refactoring into ConfigOption class, per se you don't need a RFC.
>But , if you plan to address the fallback keys (or) your changes are going
>to break/change existing jobs, we would need a RFC.
>
>>> It is not clear to me whether there is any external facing changes which
>changes this model.
>I am still unclear on this as well. can you please explicitly clarify?
>
>thanks
>vinoth
>
>
>On Tue, Dec 10, 2019 at 12:35 PM lamberken  wrote:
>
>>
>> Hi, @Balaji @Vinoth
>>
>>
>> I'm sorry, some places are not very clear,
>>
>>
>> 1, We can see that HoodieMetricsConfig, HoodieStorageConfig, etc.. already defined in project.

Re:Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-10 Thread lamberken

Hi, @Balaji @Vinoth


I'm sorry, some places are not very clear, 


1, We can see that HoodieMetricsConfig, HoodieStorageConfig, etc. are already
defined in the project.
   But we read property values through methods defined in HoodieWriteConfig,
like HoodieWriteConfig#getParquetMaxFileSize,
   HoodieWriteConfig#getParquetBlockSize, etc. This means the Hoodie*Config
classes are redundant.


2, These Hoodie*Config classes are only used to set default values when their
build methods are called, nothing else.


3, The current plan is to keep the Builder pattern when configuring. Once we are
familiar with the config framework,
   we will find that the Hoodie*Config classes are redundant and the methods
prefixed with "get" in HoodieWriteConfig are also redundant.


In addition, I created a PR [1] with an initial demo. In this demo, I
create
MetricsGraphiteReporterOptions, which contains HOST, PORT, and PREFIX, and remove
getGraphiteServerHost,
getGraphiteServerPort, and getGraphiteMetricPrefix from HoodieMetricsConfig.


https://github.com/apache/incubator-hudi/pull/1094
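
For anyone who doesn't want to open the PR, a much-simplified sketch of the 
direction; the keys and default values below are only illustrative, the real 
change is in [1]:

// Simplified illustration only; see the PR above for the actual change.
public class MetricsGraphiteReporterOptions {

  // Key and default value are declared together, next to the component they configure.
  public static final ConfigOption<String> HOST =
      ConfigOption.of("hoodie.metrics.graphite.host", "localhost");

  public static final ConfigOption<Integer> PORT =
      ConfigOption.of("hoodie.metrics.graphite.port", 4756);

  public static final ConfigOption<String> PREFIX =
      ConfigOption.of("hoodie.metrics.graphite.metric.prefix", "");

  // Minimal option holder used by this sketch.
  public static final class ConfigOption<T> {
    public final String key;
    public final T defaultValue;

    private ConfigOption(String key, T defaultValue) {
      this.key = key;
      this.defaultValue = defaultValue;
    }

    public static <T> ConfigOption<T> of(String key, T defaultValue) {
      return new ConfigOption<>(key, defaultValue);
    }
  }
}

// The Graphite reporter would then read HOST / PORT / PREFIX from this class
// instead of calling getGraphiteServerHost() / getGraphiteServerPort() /
// getGraphiteMetricPrefix().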


Best,
lamber-ken







At 2019-12-11 02:35:30, "Balaji Varadarajan"  wrote:
> Hi Lamber-Ken, 
>Thanks for the time writing the proposal and thinking about improving Hudi 
>usability.
>My preference would be to keep the Builder pattern when configuring. It is 
>something I find it natural when configuring. It is not clear to me whether 
>there is any external facing changes which changes this model. Would you mind 
>adding some more details on the RFC. It would save time to read it in one 
>place as opposed to checking out github repo :)
>Thanks,Balaji.V
>On Tuesday, December 10, 2019, 07:55:01 AM PST, Vinoth Chandar 
>  wrote:  
> 
> Hi ,
>
>Thanks for the proposal. Some parts I agree, some parts I don't and some
>parts are unclear
>
>Agree :
>- On introducing a class that binds key, default value, provided value, and
>also may be a doc along with it (?).
>- Designing the framework to have fallback keys is good IMO. It helps us do
>things like https://issues.apache.org/jira/browse/HUDI-89
>
>Disagree :
>- Not all configuration values are in HoodieWriteConfig, its not accurate.
>Configs are already split by components into HoodieIndexConfig,
>HoodieCompactionConfig etc..
>- There are helpers for all these conveniently located in
>HoodieWriteConfig. I think some of the claims of usability seem subjective
>to me, speaking from hands-on experience writing jobs. So, if you proposing
>a large shake up (e.g not have a single properties file load all
>components), I would love to understand this at more depth. From my
>experience, well namespaced configs in a single properties file keeps it
>simple and understandable.
>
>Unclear :
>- What is impact on existing jobs - using  RDD/WriteClient API, DataSource,
>DeltaStreamer level? Do you intend to change namespacing of configs?
>
>
>Thanks
>Vinoth
>
>On Tue, Dec 10, 2019 at 6:44 AM lamberken  wrote:
>
>>
>>
>> Hi, vino
>>
>>
>> Reasonable,  we can refactor this step by step. The first step now is to
>> introduce the config framework.
>> When our community is familiar with the config framework mechanism, it's
>> easy to integrate FallbackKey in the future.
>>
>>
>> Best,
>> lamber-ken
>>
>>
>>
>> At 2019-12-10 11:51:22, "vino yang"  wrote:
>> >Hi Lamber,
>> >
>> >Thanks for the proposal. +1 from my side.
>> >
>> >When it comes to configuration, it will involve how we handle deprecated
>> >configuration items in the future. In my opinion, we need to take this
>> into
>> >consideration when designing. There are already some successful practices
>> >for our reference. For example, Flink defines some deprecated
>> >configurations as FallbackKey[1]. Maybe we can learn from these designs.
>> >
>> >WDYT?
>> >
>> >[1]:
>> >
>> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/FallbackKey.java
>> >
>> >Best,
>> >Vino
>> >
>> >lamberken wrote on Mon, Dec 9, 2019 at 11:19 PM:
>> >
>> >>
>> >>
>> >> Hi, all
>> >>
>> >>
>> >> Currently, many configuration items and their default values are
>> dispersed
>> >> in the config file like HoodieWriteConfig. It’s very confused for
>> >> developers, and it's easy for developers to use them in a wrong place
>> >> especially when there are more and more configuration items. If we can
>> >> solve this, developers will benefit from it and the code structure will
>> be
>> >> more concise.
>> >>
>> >>
>> >> I had create a JIRA[1] and a under discuss RFC[2] to explain how to
>> solve
>> >> the problem, if you are interested in this, you can visit jira and RFC
>> for
>> >> detail. Any comments and feedback are welcome, WDYT?
>> >>
>> >>
>> >> Best,
>> >> lamber-ken
>> >>
>> >>
>> >> [1] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-375
>> >> [2]
>> >>
>> https://cwiki.apache.org/confluence/display/HUDI/RFC-11+%3A+Refactor+of+the+configuration+framework+of+hudi+project
>>  


Re:Re: [DISCUSS] Refactor of the configuration framework of hudi project

2019-12-10 Thread lamberken


Hi, vino


That's reasonable; we can refactor this step by step. The first step now is to 
introduce the config framework. Once our community is familiar with the config 
framework mechanism, it will be easy to integrate FallbackKey in the future. 


Best,
lamber-ken



At 2019-12-10 11:51:22, "vino yang"  wrote:
>Hi Lamber,
>
>Thanks for the proposal. +1 from my side.
>
>When it comes to configuration, it will involve how we handle deprecated
>configuration items in the future. In my opinion, we need to take this into
>consideration when designing. There are already some successful practices
>for our reference. For example, Flink defines some deprecated
>configurations as FallbackKey[1]. Maybe we can learn from these designs.
>
>WDYT?
>
>[1]:
>https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/FallbackKey.java
>
>Best,
>Vino
>
>lamberken wrote on Mon, Dec 9, 2019 at 11:19 PM:
>
>>
>>
>> Hi, all
>>
>>
>> Currently, many configuration items and their default values are dispersed
>> in the config file like HoodieWriteConfig. It’s very confused for
>> developers, and it's easy for developers to use them in a wrong place
>> especially when there are more and more configuration items. If we can
>> solve this, developers will benefit from it and the code structure will be
>> more concise.
>>
>>
>> I had create a JIRA[1] and a under discuss RFC[2] to explain how to solve
>> the problem, if you are interested in this, you can visit jira and RFC for
>> detail. Any comments and feedback are welcome, WDYT?
>>
>>
>> Best,
>> lamber-ken
>>
>>
>> [1] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-375
>> [2]
>> https://cwiki.apache.org/confluence/display/HUDI/RFC-11+%3A+Refactor+of+the+configuration+framework+of+hudi+project


[DISCUSS] Refactor of the configuration framework of hudi project

2019-12-09 Thread lamberken


Hi, all


Currently, many configuration items and their default values are dispersed in 
config files like HoodieWriteConfig. This is very confusing for developers, and 
it's easy to use them in the wrong place, especially as more and more 
configuration items are added. If we can solve this, developers will benefit 
from it and the code structure will be more concise.
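
To make the pain point concrete, the current layout looks roughly like this 
(heavily simplified; the constant names and the default value are only for 
illustration, not the exact source):

// HoodieStorageConfig.java -- the key and its default are declared here ...
public class HoodieStorageConfig {
  public static final String PARQUET_FILE_MAX_BYTES = "hoodie.parquet.max.file.size";
  public static final String DEFAULT_PARQUET_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);
}

// HoodieWriteConfig.java -- ... while the getter that callers actually use lives
// here, so the key, its default and its accessor are spread across classes.
public class HoodieWriteConfig {
  private final java.util.Properties props = new java.util.Properties();

  public long getParquetMaxFileSize() {
    return Long.parseLong(props.getProperty(
        HoodieStorageConfig.PARQUET_FILE_MAX_BYTES,
        HoodieStorageConfig.DEFAULT_PARQUET_FILE_MAX_BYTES));
  }
}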


I have created a JIRA [1] and an RFC under discussion [2] that explain how to 
solve the problem; if you are interested, please see the JIRA and RFC for 
details. Any comments and feedback are welcome. WDYT?


Best,
lamber-ken


[1] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-375
[2] 
https://cwiki.apache.org/confluence/display/HUDI/RFC-11+%3A+Refactor+of+the+configuration+framework+of+hudi+project

Re:Re: Re:[DISCUSS] Scaling community support

2019-12-08 Thread lamberken


Okay, thanks for reminding me. I'll look at the earlier discuss thread.


At 2019-12-09 14:09:56, "Vinoth Chandar"  wrote:

Please see an earlier discuss thread on the same topic - GH issues. 


Lets please keep this thread to discuss support process, not logistics, if I 
may say so :)


On Sun, Dec 8, 2019 at 10:03 PM lamberken  wrote:



In addition, we can use some tags to mark these issues, like "question", "bug", 
"new feature", and address the bugs first.




Best,
lamber-ken








At 2019-12-09 13:43:38, "lamberken"  wrote:
>
>
>Hi, I'd like to make suggestions from the perspective of contributor, just for 
>reference only.
>
>
>About [1]
>As hudi project grows, users / developers will encounter various problems, 
>will asking issues on this mailing list or GH issues or occasionally slack. I 
>think committers should guide them to create a related jira about their 
>problems firstly.
>Because committers or PMC may focusing on thier work(fix a bug / develop new 
>features), and don't have enough time to answer these
>occasionally issues. We can see that Spark, Flink, Hadoop or other popular 
>projects have turned the issue off on github. Users can not
>create issue on GH, they can create a jira or send a email, so committers / 
>PMC can solve these issues in order. 
>
>
>https://github.com/apache/spark
>https://github.com/apache/flink
>https://github.com/apache/calcite
>https://github.com/apache/hadoop
>
>
>Best,
>lamber-ken
>
>
>
>At 2019-12-08 04:01:13, "Vinoth Chandar"  wrote:
>>Hello all,
>>
>>As we grow, we need a scalable way for new users/contributors to either
>>easily use Hudi or ramp up on the project. Last month alone, we had close
>>to 1600 notifications on commits@. and few hundred emails on this list. In
>>addition, to authoring RFCs and implementing JIRAs we need to share the
>>following responsibilities amongst us to be able to scale this process.
>>
>>1) Answering issues on this mailing list or GH issues or occasionally
>>slack. We need a clear owner to triage the problem, reproduce it if needed,
>>either provide suggestions or file a JIRA - AND always look for ways to
>>update the FAQ. We need a clear hand off process also.
>>2) Code review process currently spreads the load amongst all the
>>committers. But PRs vary dramatically in their complexity and we need more
>>committers who can review any part of the codebase.
>>3) Responding to pings/clarifications and unblocking . IMHO committers
>>should prioritize this higher than working on their own stuff (I know I
>>have been doing this at some cost to my productivity on the project). This
>>is the only way to scale and add new committers. committers need to be
>>nurturing in this process.
>>
>>I don't have a clear proposals for scaling 2 & 3, which fall heavily on
>>committers.. Love to hear suggestions.
>>
>>But for 1, I propose we have 2-3 day "Support Rotations" where any
>>contributor can assume responsibility for support the community. This
>>brings more focus to support and also fast tracks learning/ramping for the
>>person on the rotation. It also minimizes interruptions for other folks and
>>we gain more velocity. I am sure this is familiar to a lot of you at your
>>own companies. We have at-least 10-15 active contributors at this point..
>>So  the investment is minimal : doing this once a month.
>>
>> A committer and a PMC member will always be designated secondary/backup in
>>case the primary cannot field a question. I am happy to additionally
>>volunteer as "always on rotation" as a third level backup, to get this
>>process booted up.
>>
>>Please let me know what you all think. Please be specific in what issue
>>[1][2] or [3] you are talking about in your feedback
>>
>>thanks
>>vinoth


Re:[DISCUSS] Scaling community support

2019-12-08 Thread lamberken


Hi, I'd like to make some suggestions from a contributor's perspective, just for 
reference.


About [1]
As the Hudi project grows, users / developers will encounter various problems and 
will raise questions on this mailing list, GH issues, or occasionally Slack. I 
think committers should first guide them to create a related JIRA for their 
problem, because committers or PMC members may be focused on their own work 
(fixing a bug / developing new features) and don't have enough time to answer 
these occasional questions. We can see that Spark, Flink, Hadoop and other 
popular projects have turned issues off on GitHub. Users cannot create issues on 
GH; they can create a JIRA or send an email, so committers / PMC can address 
these issues in order. 


https://github.com/apache/spark
https://github.com/apache/flink
https://github.com/apache/calcite
https://github.com/apache/hadoop


Best,
lamber-ken



At 2019-12-08 04:01:13, "Vinoth Chandar"  wrote:
>Hello all,
>
>As we grow, we need a scalable way for new users/contributors to either
>easily use Hudi or ramp up on the project. Last month alone, we had close
>to 1600 notifications on commits@. and few hundred emails on this list. In
>addition, to authoring RFCs and implementing JIRAs we need to share the
>following responsibilities amongst us to be able to scale this process.
>
>1) Answering issues on this mailing list or GH issues or occasionally
>slack. We need a clear owner to triage the problem, reproduce it if needed,
>either provide suggestions or file a JIRA - AND always look for ways to
>update the FAQ. We need a clear hand off process also.
>2) Code review process currently spreads the load amongst all the
>committers. But PRs vary dramatically in their complexity and we need more
>committers who can review any part of the codebase.
>3) Responding to pings/clarifications and unblocking . IMHO committers
>should prioritize this higher than working on their own stuff (I know I
>have been doing this at some cost to my productivity on the project). This
>is the only way to scale and add new committers. committers need to be
>nurturing in this process.
>
>I don't have a clear proposals for scaling 2 & 3, which fall heavily on
>committers.. Love to hear suggestions.
>
>But for 1, I propose we have 2-3 day "Support Rotations" where any
>contributor can assume responsibility for support the community. This
>brings more focus to support and also fast tracks learning/ramping for the
>person on the rotation. It also minimizes interruptions for other folks and
>we gain more velocity. I am sure this is familiar to a lot of you at your
>own companies. We have at-least 10-15 active contributors at this point..
>So  the investment is minimal : doing this once a month.
>
> A committer and a PMC member will always be designated secondary/backup in
>case the primary cannot field a question. I am happy to additionally
>volunteer as "always on rotation" as a third level backup, to get this
>process booted up.
>
>Please let me know what you all think. Please be specific in what issue
>[1][2] or [3] you are talking about in your feedback
>
>thanks
>vinoth


Re:Re: Re: [DISCUSS] Refactor scala checkstyle

2019-12-06 Thread lamberken


OK, thank you for your reply. I will start to work on this.


At 2019-12-06 22:31:22, "Vinoth Chandar"  wrote:
>+1 from me as well.
>
>On Fri, Dec 6, 2019 at 6:25 AM leesf  wrote:
>
>> +1 to refractor the scala checkstyle.
>>
>> Best,
>> Leesf
>>
>> lamberken wrote on Fri, Dec 6, 2019 at 8:00 PM:
>>
>> > Right, refactor step by step like java style.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > At 2019-12-06 16:35:04, "vino yang"  wrote:
>> > >Hi lamber,
>> > >
>> > >+1 from my side.
>> > >
>> > >IMO, it would be better to refactor step by step like java style.
>> Firstly,
>> > >we should refactor code based on warning message, then change the
>> > >checkstyle rule level.
>> > >
>> > >WDYT? Is it what you prepare to do?
>> > >
>> > >Best,
>> > >Vino
>> > >
>> > >
>> > >lamberken wrote on Fri, Dec 6, 2019 at 2:39 PM:
>> > >
>> > >> Hi,
>> > >>
>> > >>
>> > >> Currently, the level of scala codestyle rule is warning, it's better
>> > check
>> > >> these rules one by one
>> > >> and refactor scala codes then now.
>> > >>
>> > >>
>> > >> Furthermore, in order to sync to java codestyle, needs to add two
>> rules.
>> > >> One is BlockImportChecker
>> > >> which allows to ensure that only single imports are used in order to
>> > >> minimize merge errors in import declarations, another is
>> > ImportOrderChecker
>> > >> which checks that imports are grouped and ordered according to the
>> style
>> > >> configuration.
>> > >>
>> > >>
>> > >> Summary
>> > >> 1, check scala checkstyle rules one by one, change some warning level
>> to
>> > >> error.
>> > >> 2, add ImportOrderChecker and BlockImportChecker.
>> > >>
>> > >>
>> > >> Any comments and feedback are welcome, WDYT?
>> > >>
>> > >>
>> > >> Best,
>> > >> lamber-ken
>> >
>>


Re:Re: [DISCUSS] Refactor scala checkstyle

2019-12-06 Thread lamberken
Right, we'll refactor step by step, like the Java style.







At 2019-12-06 16:35:04, "vino yang"  wrote:
>Hi lamber,
>
>+1 from my side.
>
>IMO, it would be better to refactor step by step like java style. Firstly,
>we should refactor code based on warning message, then change the
>checkstyle rule level.
>
>WDYT? Is it what you prepare to do?
>
>Best,
>Vino
>
>
>lamberken wrote on Fri, Dec 6, 2019 at 2:39 PM:
>
>> Hi,
>>
>>
>> Currently, the level of scala codestyle rule is warning, it's better check
>> these rules one by one
>> and refactor scala codes then now.
>>
>>
>> Furthermore, in order to sync to java codestyle, needs to add two rules.
>> One is BlockImportChecker
>> which allows to ensure that only single imports are used in order to
>> minimize merge errors in import declarations, another is ImportOrderChecker
>> which checks that imports are grouped and ordered according to the style
>> configuration.
>>
>>
>> Summary
>> 1, check scala checkstyle rules one by one, change some warning level to
>> error.
>> 2, add ImportOrderChecker and BlockImportChecker.
>>
>>
>> Any comments and feedback are welcome, WDYT?
>>
>>
>> Best,
>> lamber-ken


[DISCUSS] Refactor scala checkstyle

2019-12-05 Thread lamberken
Hi,


Currently, the level of the Scala codestyle rules is warning; it's better to 
check these rules one by one and refactor the Scala code now.


Furthermore, in order to sync with the Java codestyle, we need to add two rules. 
One is BlockImportChecker, which ensures that only single imports are used in 
order to minimize merge errors in import declarations; the other is 
ImportOrderChecker, which checks that imports are grouped and ordered according 
to the style configuration.


Summary
1. Check the Scala checkstyle rules one by one and change some warning levels to error.
2. Add ImportOrderChecker and BlockImportChecker.


Any comments and feedback are welcome, WDYT?


Best,
lamber-ken

Re:Re: Error when running TestHoodieDeltaStreamer

2019-11-29 Thread lamberken

Right, I think your analysis is correct. BTW, how do you run the 
TestHoodieDeltaStreamer tests? In the IDE, or some other way?


At 2019-11-29 23:38:54, "Pratyaksh Sharma"  wrote:
>Looks like jetty version was causing an issue for me. Fixed the same by
>excluding jetty-util artifact from hudi-client dependency in hudi-utilities
>pom.
>
>In version 9.4.15, Container class is an interface while in lower version
>7.6.0, it is a class. This conflict of versions was causing mentioned
>exception for me.
>
>On Fri, Nov 29, 2019 at 5:37 PM Pratyaksh Sharma 
>wrote:
>
>> Hi Lamberken,
>>
>> Here are the details -
>>
>> MacOS Mojave version 10.14.6
>> Java version - 1.8.0_212
>> Docker version - 19.03.5
>>
>> Please let me know if anything else is needed.
>>
>> On Fri, Nov 29, 2019 at 5:21 PM lamberken  wrote:
>>
>>>
>>>
>>> Hi, Pratyaksh Sharma
>>>
>>>
>>> In order to solve the problem better, please provides the running
>>> environment.
>>> For example, what is your operating system or the java version and so on.
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> At 2019-11-29 18:15:45, "Pratyaksh Sharma"  wrote:
>>> >Hi,
>>> >
>>> >Every time I try to run test cases of TestHoodieDeltaStreamer class
>>> >individually, I get the following error -
>>> >
>>> >java.lang.InstantiationError: org.eclipse.jetty.util.component.Container
>>> >
>>> >at org.eclipse.jetty.server.Server.(Server.java:66)
>>> >at org.apache.hive.http.HttpServer.(HttpServer.java:98)
>>> >at org.apache.hive.http.HttpServer.(HttpServer.java:80)
>>> >at org.apache.hive.http.HttpServer$Builder.build(HttpServer.java:133)
>>> >at org.apache.hive.service.server.HiveServer2.init(HiveServer2.java:227)
>>> >at
>>>
>>> >org.apache.hudi.hive.util.HiveTestService.startHiveServer(HiveTestService.java:214)
>>> >at
>>> org.apache.hudi.hive.util.HiveTestService.start(HiveTestService.java:106)
>>> >at
>>>
>>> >org.apache.hudi.utilities.UtilitiesTestBase.initClass(UtilitiesTestBase.java:88)
>>> >at
>>>
>>> >org.apache.hudi.utilities.TestHoodieDeltaStreamer.initClass(TestHoodieDeltaStreamer.java:91)
>>> >
>>> >All other test cases run fine. If I try to run using the command man
>>> clean
>>> >install -DskipITs, everything seems to be fine.
>>> >
>>> >Has anyone faced a similar issue?
>>>
>>


Re:Error when running TestHoodieDeltaStreamer

2019-11-29 Thread lamberken


Hi, Pratyaksh Sharma


In order to solve the problem better, please provide your running environment:
for example, your operating system, Java version, and so on. Thanks.











At 2019-11-29 18:15:45, "Pratyaksh Sharma"  wrote:
>Hi,
>
>Every time I try to run test cases of TestHoodieDeltaStreamer class
>individually, I get the following error -
>
>java.lang.InstantiationError: org.eclipse.jetty.util.component.Container
>
>at org.eclipse.jetty.server.Server.(Server.java:66)
>at org.apache.hive.http.HttpServer.(HttpServer.java:98)
>at org.apache.hive.http.HttpServer.(HttpServer.java:80)
>at org.apache.hive.http.HttpServer$Builder.build(HttpServer.java:133)
>at org.apache.hive.service.server.HiveServer2.init(HiveServer2.java:227)
>at
>org.apache.hudi.hive.util.HiveTestService.startHiveServer(HiveTestService.java:214)
>at org.apache.hudi.hive.util.HiveTestService.start(HiveTestService.java:106)
>at
>org.apache.hudi.utilities.UtilitiesTestBase.initClass(UtilitiesTestBase.java:88)
>at
>org.apache.hudi.utilities.TestHoodieDeltaStreamer.initClass(TestHoodieDeltaStreamer.java:91)
>
>All other test cases run fine. If I try to run using the command man clean
>install -DskipITs, everything seems to be fine.
>
>Has anyone faced a similar issue?


Re:Re: [Discuss] Migrate from log4j to slf4j

2019-11-25 Thread lamberken
Thanks, 


I didn't know there was a JIRA for this already (HUDI-233). We can discuss it 
under that issue.


Best,
lamberken


At 2019-11-26 03:09:40, "Vinoth Chandar"  wrote:
>Hi,
>
>Its log4j actually across the board. (I think there are a couple files that
>have non log4j loggers? might be good to fix  to log4j as well for now to
>be consistent)
>
>Nonetheless, there is a JIRA for this already
>https://issues.apache.org/jira/browse/HUDI-233
>
>Main thing we need to be mindful of is to ensure all the shading and
>everything works properly across the bundles.
>
>On Mon, Nov 25, 2019 at 9:57 AM lamberken  wrote:
>
>> Hi, everyone
>>
>>
>> Currently, there are three kinds of java logging framework in hudi's
>> project (java.util.logging.Logger、org.apache.log4j.Logger、org.slf4j.Logger).
>>
>>
>> The org.apache.log4j.Logger doesn't support placeholders, so it‘s hard for
>> us to format the message like
>> | logger.info(String.format("The job needs to copy %d partitions.",
>> partitions.size())); |
>>
>>
>> So, I suggest migrate from log4j to slf4j, what dou you think?
>>
>>
>> Best,
>> lamberken


[Discuss] Migrate from log4j to slf4j

2019-11-25 Thread lamberken
Hi, everyone


Currently, there are three kinds of Java logging frameworks in the Hudi project 
(java.util.logging.Logger, org.apache.log4j.Logger, org.slf4j.Logger).


The org.apache.log4j.Logger doesn't support placeholders, so it's hard for us to 
format messages; we have to write things like

logger.info(String.format("The job needs to copy %d partitions.", partitions.size()));
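
For comparison, the slf4j version would look roughly like this (the class below 
is just a made-up example):

// Illustrative sketch only.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PartitionCopyExample {

  private static final Logger LOG = LoggerFactory.getLogger(PartitionCopyExample.class);

  void report(java.util.List<String> partitions) {
    // The {} placeholder is substituted by slf4j, and only when INFO is enabled,
    // so no String.format call is needed.
    LOG.info("The job needs to copy {} partitions.", partitions.size());
  }
}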


So, I suggest migrating from log4j to slf4j. What do you think? 


Best,
lamberken

Re:Re: [DISCUSS] Introduce stricter comment and code style validation rules

2019-11-21 Thread lamberken
Hi, vino. I think we can set the severity to info level; if so, users can check 
the style themselves.









At 2019-11-21 15:43:34, "vino yang"  wrote:
>Hi all,
>
>The umbrella issue[1] has been created, please feel free to join us to
>improve the comment and code quality.
>
>Best,
>Vino
>
>[1]: https://issues.apache.org/jira/browse/HUDI-354
><https://issues.apache.org/jira/browse/HUDI-354#>
>
>vino yang  于2019年11月20日周三 下午7:33写道:
>
>>
>> Hi guys,
>>
>> Since there is no objection. I will create an umbrella issue to track this
>> work. The plan is:
>>
>> 1) Given relevant check style rules to find all the illegal points;
>> 2) We will refactor modules one by one, each module mappings to one
>> subtask;
>> 3) Add global check style rule for the whole project
>>
>> Best,
>> Vino
>>
>>
>> On 11/20/2019 12:59, Y Ethan Guo  wrote:
>> +1 on all of the proposed rules.  These will also make the javadoc more
>> readable.
>>
>> On Mon, Nov 18, 2019 at 5:55 PM Vinoth Chandar  wrote:
>>
>> > +1 on all three.
>> >
>> > Would there be a overhaul of existing code to add comments to all
>> classes?
>> > We are pretty reasonable already, but good to get this in shape.
>> >
>> > 17:54:37 [incubator-hudi]$ grep -R -B 1 "public class"
>> hudi-*/src/main/java
>> > | grep "public class" | wc -l
>> >  274
>> > 17:54:50 [incubator-hudi]$ grep -R -B 1 "public class"
>> hudi-*/src/main/java
>> > | grep "*/" | wc -l
>> >  178
>> > 17:55:06 [incubator-hudi]$
>> >
>> >
>> >
>> >
>> > On Mon, Nov 18, 2019 at 5:48 PM lamberken  wrote:
>> >
>> > > +1, it’s a hard work but meaningful.
>> > >
>> > >
>> > > | |
>> > > lamberken
>> > > IT
>> > > |
>> > > |
>> > > ly.com
>> > > lamber...@163.com
>> > > |
>> > > (Signature customized by NetEase Mail Master)
>> > >
>> > >
>> > > On 11/19/2019 07:27,leesf wrote:
>> > > Hi vino,
>> > >
>> > > Thanks for bringing ths discussion up.
>> > > +1 on all. the third one seems a bit too strict and usually requires
>> > manual
>> > > processing of the import order, but I also agree and think it makes
>> our
>> > > project more professional. And I learned that the calcite community is
>> > also
>> > > applying this rule.
>> > >
>> > > Best,
>> > > Leesf
>> > >
>> > >
>> > > Pratyaksh Sharma wrote on Mon, Nov 18, 2019 at 8:53 PM:
>> > >
>> > > Having proper class level and method level comments always makes the
>> life
>> > > easier for any new user.
>> > >
>> > > +1 for points 1,2 and 4.
>> > >
>> > > On Mon, Nov 18, 2019 at 5:59 PM vino yang 
>> wrote:
>> > >
>> > > Hi guys,
>> > >
>> > > Currently, Hudi's comment and code styles do not have a uniform
>> > > specification on certain rules. I will list them below. With the rapid
>> > > development of the community, the inconsistent comment specification
>> will
>> > > bring a lot of problems. I am here to assume that everyone is aware of
>> > > its
>> > > importance, so I will not spend too much time emphasizing it.
>> > >
>> > > In short, I want to add more detection rules to the current warehouse
>> to
>> > > force everyone to follow a more "strict" code specification.
>> > >
>> > > These rules are listed below:
>> > >
>> > > 1) All public classes must add class-level comments;
>> > >
>> > > 2) All comments must end with a clear "."
>> > >
>> > > 3) In the import statement of the class, clearly distinguish (by blank
>> > > lines) the import of Java SE and the import of non-java SE. Currently,
>> I
>> > > saw at least two projects(Spark and Flink) that implement this rule.
>> > > Flink
>> > > implements stricter rules than Spark. It is divided into several
>> blocks
>> > > from top to bottom(owner import -> non-owner and non-JavaSE import ->
>> > > Java
>> > > SE import -> static import), each block are sorted according to the
>> > > natural
>> > > sequence of letters;
>> > >
>> > > 4) Reconfirm the method and whether the comment is consistency;
>> > >
>> > > The first, second, and third points can be checked by setting the
>> > > check-style rule. The fourth point requires human confirmation.
>> > >
>> > > Regarding the third point, everyone can express their views. According
>> to
>> > > my personal experience, this strict model used by Flink also brings
>> the
>> > > best reading experience. But this is a subjective feeling.
>> > >
>> > > Additionally, I want to collect more ideas about this topic through
>> this
>> > > thread and discuss the feasibility of them.
>> > >
>> > > Any comments and feedback are commendable.
>> > >
>> > > Best,
>> > > Vino
>> > >
>> > >
>> > >
>> >
>>
>>


Re: [DISCUSS] Introduce stricter comment and code style validation rules

2019-11-18 Thread lamberken
+1, it's hard work, but meaningful.


lamberken
IT, ly.com
lamber...@163.com
(Signature customized by NetEase Mail Master)


On 11/19/2019 07:27,leesf wrote:
Hi vino,

Thanks for bringing ths discussion up.
+1 on all. the third one seems a bit too strict and usually requires manual
processing of the import order, but I also agree and think it makes our
project more professional. And I learned that the calcite community is also
applying this rule.

Best,
Leesf


Pratyaksh Sharma wrote on Mon, Nov 18, 2019 at 8:53 PM:

Having proper class level and method level comments always makes the life
easier for any new user.

+1 for points 1,2 and 4.

On Mon, Nov 18, 2019 at 5:59 PM vino yang  wrote:

Hi guys,

Currently, Hudi's comment and code styles do not have a uniform
specification on certain rules. I will list them below. With the rapid
development of the community, the inconsistent comment specification will
bring a lot of problems. I am here to assume that everyone is aware of
its
importance, so I will not spend too much time emphasizing it.

In short, I want to add more detection rules to the current warehouse to
force everyone to follow a more "strict" code specification.

These rules are listed below:

1) All public classes must add class-level comments;

2) All comments must end with a clear "."

3) In the import statement of the class, clearly distinguish (by blank
lines) the import of Java SE and the import of non-java SE. Currently, I
saw at least two projects(Spark and Flink) that implement this rule.
Flink
implements stricter rules than Spark. It is divided into several blocks
from top to bottom(owner import -> non-owner and non-JavaSE import ->
Java
SE import -> static import), each block are sorted according to the
natural
sequence of letters;

4) Reconfirm the method and whether the comment is consistency;

The first, second, and third points can be checked by setting the
check-style rule. The fourth point requires human confirmation.

Regarding the third point, everyone can express their views. According to
my personal experience, this strict model used by Flink also brings the
best reading experience. But this is a subjective feeling.

Additionally, I want to collect more ideas about this topic through this
thread and discuss the feasibility of them.

Any comments and feedback are commendable.

Best,
Vino