Re: New site/docs navigation

2021-10-28 Thread Rajesh Mahindra
This is awesome! Well done Kyle.

Thanks
Rajesh

On Thu, Oct 28, 2021 at 7:35 PM sagar sumit  wrote:

> This is awesome!
> I really like the separation of `Concepts` and `Services`. Very helpful for
> Hudi users in my opinion.
>
> One suggestion:
> Thinking from a newbie perspective who is thinking of adopting data lake,
> would it be better to move the `Use Cases` section right below `Overview`
> for better visibility?
> For e.g. Presto/Trino have mentioned use cases right under the Overview
> section.
>
> Regards,
> Sagar
>
> On Thu, Oct 28, 2021 at 6:55 PM Vinoth Chandar  wrote:
>
> > Awesome!
> > I think Kyle has already fixed some issues around cn docs in the PR above.
> > Could you review that?
> > Kyle, if you are here, please chime in. We can organize all the work under
> > a single umbrella JIRA.
> > https://issues.apache.org/jira/browse/HUDI-270 so it's easier for any
> > volunteers to pick up?
> >
> > On Thu, Oct 28, 2021 at 6:21 AM Shawy Geng 
> > wrote:
> >
> > > Hi Vinoth,
> > >
> > > I volunteer to update the Chinese docs. I have already commented on
> > > https://issues.apache.org/jira/browse/HUDI-2628.
> > > Are there any other volunteers who want to work together on the translation?
> > > Please contact me.
> > >
> > > > On 2021-10-28 at 20:35, Vinoth Chandar wrote:
> > > >
> > > > Hi all,
> > > >
> > > > https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
> > > > content that can showcase all of the Hudi capabilities. Please chime in
> > > > and help merge the PR.
> > > >
> > > > As a follow-on, we can also fix the Chinese site docs after this?
> > > >
> > > > Thanks
> > > > Vinoth
> > >
> > >
> >
>


-- 
Take Care,
Rajesh Mahindra


Re: feature request/proposal: leverage bloom indexes for reading

2021-10-28 Thread Nicolas Paris
I tested the HoodieReadClient. It's a great start indeed. It looks like
this client is meant for testing purposes and needs some enhancement. I
will try to produce general-purpose code around it and, who knows,
contribute it.

I guess the datasource API is not the best candidate, since hudi keys
cannot reasonably be passed as options, only as an RDD or DataFrame:

spark.read.format('hudi').option('hudi.filter.keys',
'a,flat,list,of,keys,not,really,cool').load(...)

There is also the option of introducing a new hudi operation such as
"select", but again, a write is not supposed to return a DataFrame but to
write to the hudi table:

df_hudi_keys.write.format('hudi').options(**hudi_options).save(...)

So a full-featured, documented hoodie client is maybe the best option.


Thoughts?
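
For illustration only, here is a minimal sketch in plain Python of the idea
discussed in this thread: keep one bloom filter per data file and use it to
skip files that cannot contain the requested keys. This is not Hudi's
implementation and does not use any Hudi API; the BloomFilter class, the
file names, the filter sizes and the user_id column are all assumptions
made up for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_file_filters(files):
    """Map each (hypothetical) file name to a filter over its record keys."""
    filters = {}
    for name, rows in files.items():
        bloom = BloomFilter()
        for row in rows:
            bloom.add(row["user_id"])
        filters[name] = bloom
    return filters


def lookup(files, filters, wanted_keys):
    """Return matching rows, scanning only files whose filter may hold a key."""
    wanted = set(wanted_keys)
    scanned, result = [], []
    for name, bloom in filters.items():
        if not any(bloom.might_contain(k) for k in wanted):
            continue  # definitely no match in this file: skip the scan
        scanned.append(name)
        # Re-check the actual rows, since the filter can give false positives.
        result.extend(r for r in files[name] if r["user_id"] in wanted)
    return result, scanned


# Toy data standing in for data files of a table keyed by user_id.
files = {
    "file_a": [{"user_id": "u1", "v": 1}, {"user_id": "u2", "v": 2}],
    "file_b": [{"user_id": "u3", "v": 3}],
    "file_c": [{"user_id": "u1", "v": 4}],
}
filters = build_file_filters(files)
rows, scanned = lookup(files, filters, ["u1"])
```

The point of the sketch is that the cost of a key lookup depends on the
number of files whose filter matches, not on the size of the table. Rows
are still re-checked after the filter because of possible false positives.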


On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
> Sounds great!
>
> On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris 
> wrote:
>
> > Hi Vinoth,
> >
> > Thanks for the starter. Definitely, once the new way to manage indexes
> > is available and we have migrated our data lake to hudi, I'd be glad to
> > give this a shot.
> >
> >
> > Regards, Nicolas
> >
> > On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > > Hi Nicolas,
> > >
> > > Thanks for raising this! I think it's a very valid ask.
> > > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> > >
> > > As a proof of concept, would you be able to give filterExists() a shot
> > > and see if the filtering time improves?
> > > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> > >
> > > In the upcoming 0.10.0 release, we are planning to move the bloom
> > > filters out to a partition on the metadata table, to even speed this up
> > > for very large tables.
> > > https://issues.apache.org/jira/browse/HUDI-1295
> > >
> > > Please let us know if you are interested in testing that when the PR is
> > > up.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris 
> > > wrote:
> > >
> > > > hi !
> > > >
> > > > In my use case, for GDPR, I have to export all information about a given
> > > > user from several HUGE hudi tables. Filtering the tables results in a
> > > > full scan of around 10 hours, and this will get worse year after year.
> > > >
> > > > Since the filter criterion is based on the bloom key (user_id), it would
> > > > be handy to exploit the bloom index and produce a temporary table (in the
> > > > metastore, for example) with the resulting rows.
> > > >
> > > > So far, bloom indexing is only used for update/delete operations on a
> > > > hudi table.
> > > >
> > > > 1. There is an opportunity to exploit the bloom index for select operations.
> > > > The hudi options would be:
> > > > operation: select
> > > > result-table: 
> > > > result-path: 
> > > > result-schema:  (optional ; when empty no
> > > > sync with the hms, only raw path)
> > > >
> > > >
> > > > 2. It could be implemented as predicate pushdown in the Spark
> > > > datasource API, when filtering with an IN statement.
> > > >
> > > >
> > > > Thoughts?
> > > >
> >
> >



Re: New site/docs navigation

2021-10-28 Thread Vinoth Chandar
Awesome!
I think Kyle has already fixed some issues around cn docs in the PR above.
Could you review that?
Kyle, if you are here, please chime in. We can organize all the work under
a single umbrella JIRA.
https://issues.apache.org/jira/browse/HUDI-270 so it's easier for any
volunteers to pick up?

On Thu, Oct 28, 2021 at 6:21 AM Shawy Geng  wrote:

> Hi Vinoth,
>
> I volunteer to update the Chinese docs. I have already commented on
> https://issues.apache.org/jira/browse/HUDI-2628.
> Are there any other volunteers who want to work together on the translation?
> Please contact me.
>
> > On 2021-10-28 at 20:35, Vinoth Chandar wrote:
> >
> > Hi all,
> >
> > https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
> > content that can showcase all of the Hudi capabilities. Please chime in
> > and help merge the PR.
> >
> > As a follow-on, we can also fix the Chinese site docs after this?
> >
> > Thanks
> > Vinoth
>
>


Re: New site/docs navigation

2021-10-28 Thread Shawy Geng
Hi Vinoth,

I volunteer to update the Chinese docs. I have already commented on
https://issues.apache.org/jira/browse/HUDI-2628.
Are there any other volunteers who want to work together on the translation?
Please contact me.

> On 2021-10-28 at 20:35, Vinoth Chandar wrote:
> 
> Hi all,
> 
> https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
> content that can showcase all of the Hudi capabilities. Please chime in
> and help merge the PR.
>
> As a follow-on, we can also fix the Chinese site docs after this?
> 
> Thanks
> Vinoth



New site/docs navigation

2021-10-28 Thread Vinoth Chandar
Hi all,

https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
content that can showcase all of the Hudi capabilities. Please chime in
and help merge the PR.

As a follow-on, we can also fix the Chinese site docs after this?

Thanks
Vinoth


Re: feature request/proposal: leverage bloom indexes for reading

2021-10-28 Thread Vinoth Chandar
Sounds great!

On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris 
wrote:

> Hi Vinoth,
>
> Thanks for the starter. Definitely, once the new way to manage indexes
> is available and we have migrated our data lake to hudi, I'd be glad to
> give this a shot.
>
>
> Regards, Nicolas
>
> On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > Hi Nicolas,
> >
> > Thanks for raising this! I think it's a very valid ask.
> > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> >
> > As a proof of concept, would you be able to give filterExists() a shot
> > and see if the filtering time improves?
> > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> >
> > In the upcoming 0.10.0 release, we are planning to move the bloom
> > filters out to a partition on the metadata table, to even speed this up
> > for very large tables.
> > https://issues.apache.org/jira/browse/HUDI-1295
> >
> > Please let us know if you are interested in testing that when the PR is
> > up.
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris 
> > wrote:
> >
> > > hi !
> > >
> > > In my use case, for GDPR, I have to export all information about a given
> > > user from several HUGE hudi tables. Filtering the tables results in a
> > > full scan of around 10 hours, and this will get worse year after year.
> > >
> > > Since the filter criterion is based on the bloom key (user_id), it would
> > > be handy to exploit the bloom index and produce a temporary table (in the
> > > metastore, for example) with the resulting rows.
> > >
> > > So far, bloom indexing is only used for update/delete operations on a
> > > hudi table.
> > >
> > > 1. There is an opportunity to exploit the bloom index for select operations.
> > > The hudi options would be:
> > > operation: select
> > > result-table: 
> > > result-path: 
> > > result-schema:  (optional ; when empty no
> > > sync with the hms, only raw path)
> > >
> > >
> > > 2. It could be implemented as predicate pushdown in the Spark
> > > datasource API, when filtering with an IN statement.
> > >
> > >
> > > Thoughts?
> > >
>
>


Re: Limitations of non unique keys

2021-10-28 Thread Vinoth Chandar
Hi,

Are you asking if there are advantages to allowing duplicates or not having
keys in your table?

Having keys helps with other practical scenarios, in addition to what you
called out.
E.g., you would often want to backfill an insert-only table without
introducing duplicates when doing so.
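
As a rough illustration of the backfill scenario (a toy model in plain
Python, not Hudi code), the sketch below appends only the rows whose key is
not already present. The backfill() helper and the event_id and user_id
fields are hypothetical names chosen for the example.

```python
def backfill(existing_rows, backfill_rows, key="event_id"):
    """Append only the backfill rows whose key is not already in the table."""
    seen = {row[key] for row in existing_rows}
    merged = list(existing_rows)
    for row in backfill_rows:
        if row[key] in seen:
            continue  # already ingested: skip instead of duplicating
        seen.add(row[key])
        merged.append(row)
    return merged


# Toy insert-only table and a replayed batch that overlaps with it.
table = [{"event_id": 1, "user_id": "u1"}, {"event_id": 2, "user_id": "u2"}]
replay = [{"event_id": 2, "user_id": "u2"}, {"event_id": 3, "user_id": "u1"}]

# With a unique key, the replayed event 2 is skipped and event 3 is added.
deduped = backfill(table, replay)

# With a non-unique key (user_id), this toy model cannot tell event 3 apart
# from event 1, so the new row is dropped as if it were a duplicate.
lossy = backfill(table, replay, key="user_id")
```

In this toy model, deduplicating on a unique key keeps the backfill
idempotent, while deduplicating on a non-unique field such as user_id
silently loses legitimate rows.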

Thanks
Vinoth

On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris 
wrote:

> Hi devs,
>
> AFAIK, hudi has been designed to use a primary key as the hudi key.
> However, it is also possible to choose a non-unique field. I have listed
> several problems with such a design.
>
> A non-unique key means that you:
> - cannot delete / update a single record
> - cannot apply a primary key for the new SQL tables feature
>
> Are there other downsides to choosing a non-unique key that you have in mind?
>
> In my case, having user_id as the hudi key will help to apply deletions at
> the user level in any user table. The tables are insert-only, so the
> drawbacks listed above do not really apply. In case of errors in the
> tables, I have several options:
>
> - roll back to a previous commit
> - read the partition, filter it, and overwrite the partition
>
> Thanks
>