Re: [VOTE] Release 0.14.1, release candidate #1

2023-12-25 Thread Nicolas Paris
-1 (non binding)

Ran our internal test suite on 0.14.1-rc1 and found 2 issues with Hudi
third-party integrations:

- datadog: https://github.com/apache/hudi/issues/10403
- dynamodb lock provider: https://github.com/apache/hudi/issues/10394

Proposed a PR for each.

On Sun, 2023-12-24 at 07:01 -0800, Sivabalan wrote:
> Please review and vote on the *release candidate #1* for the version
> 0.14.1, as follows:
> 
> [ ] +1, Approve the release
> 
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
> 
> The complete staging area is available for your review, which
> includes:
> 
> * JIRA release notes [1],
> 
> * the official Apache source release and binary convenience releases
> to be
> deployed to dist.apache.org
>  [2],
> which
> are signed with the key with
> fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
> 
> * all artifacts to be deployed to the Maven Central Repository [4],
> 
> * source code tag "0.14.1-rc1" [5],
> 
> 
> The vote will be open for at least 72 hours. It is adopted by
> majority
> approval, with at least 3 PMC affirmative votes.
> 
> 
> Thanks,
> Release Manager (Sivabalan Narayanan)
> 
> 
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12353493
> 
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.1-rc1/
> 
> [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> 
> [4]
> https://repository.apache.org/content/repositories/orgapachehudi-1132/
> 
> [5] https://github.com/apache/hudi/releases/tag/release-0.14.1-rc1



Re: [External] Current state of parquet zstd OOM with hudi

2023-11-21 Thread Nicolas Paris
We fixed the Hudi memory leak by patching Parquet 1.12 and relying on Gradle to 
override Parquet's transitive dependencies with that patched version.

I would say an entry on this issue in the Hudi FAQ would be great, since it is 
hard to spot and is marked as fixed on the Spark side.

Also, we didn't notice the issue on EMR and only hit it when we migrated to 
Kubernetes, which did not help in identifying the zstd leak.


Re: [External] Current state of parquet zstd OOM with hudi

2023-11-20 Thread Nicolas Paris
Following up on this: only Spark 3.5.x ships with the fixed Parquet version 1.13.x, 
and it is supported by the latest Hudi 0.14 only.

If I replace Parquet in a previous version of Spark, it will likely break the 
readers/writers, since methods have changed in Parquet.

Right now I will experiment with Spark 3.5 and Hudi 0.14, but I'd be happy to hear 
about a fix for previous Spark/Hudi versions.

Zstd and this error have been out there for a while now, so I would be surprised 
if only a very recent release fixed it.


Re: [External] Current state of parquet zstd OOM with hudi

2023-11-20 Thread Nicolas Paris
Hi, thanks for your answer. Do you mean upgrading to Parquet 1.13.x? BTW, Spark 
introduced a workaround in 3.2.4. Do you mean Hudi bypasses the workaround?
Thanks.

Nov 20, 2023 13:37:58 管梓越 :

> hi Nicolas
> This problem is caused by a historical parquet version. To fix it, you need
> to ensure the parquet version in your spark runtime is upgraded to the latest
> one. In most cases, the parquet version is determined by the spark version by
> default. Though hudi depends on parquet, such a fix did not happen on the
> parquet interface used by hudi. You can simply upgrade spark to the latest
> version and check if it is fixed w/o changing anything in hudi
> From: "nicolas paris"
> Date: Mon, Nov 20, 2023, 20:07
> Subject: [External] Current state of parquet zstd OOM with hudi
> To: "Hudi Dev List"
> hey month ago someone spotted memory leak while reading zstd files with
> hudi https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280
> since then spark has merged fixes for 3.2.4, 3.3.3, 3.4.0
> https://issues.apache.org/jira/browse/SPARK-41952 we are currently on spark
> 3.2.4, hudi 0.13.1 and having similar issue (massive off-heap usage) while
> scanning very large hudi tables backed with zstd What is the state of this
> issue? is there any patch to apply on hudi side as well or can I consider
> it fixed by using spark 3.2.4 ? I attach a graph from the uber jvm-profiler
> to illustrate our current troubles. thanks by advance


Current state of parquet zstd OOM with hudi

2023-11-20 Thread nicolas paris
hey

a month ago, someone spotted a memory leak while reading zstd files with
hudi
https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280

since then spark has merged fixes for 3.2.4, 3.3.3, 3.4.0
https://issues.apache.org/jira/browse/SPARK-41952

we are currently on Spark 3.2.4 and Hudi 0.13.1, and are having a similar
issue (massive off-heap usage) while scanning very large Hudi tables
backed by zstd

What is the state of this issue? Is there any patch to apply on the Hudi
side as well, or can I consider it fixed by using Spark 3.2.4?

I attach a graph from the uber jvm-profiler to illustrate our current
troubles.

Thanks in advance


Tuning guide question about off-heap

2023-11-20 Thread nicolas paris
hi everyone,

from the tuning guide:

> Off-heap memory : Hudi writes parquet files and that needs good
> amount of off-heap memory proportional to schema width. Consider
> setting something like spark.executor.memoryOverhead or
> spark.driver.memoryOverhead, if you are running into such failures.


Can you elaborate on whether this off-heap usage is specific to Hudi when
writing parquet files, or whether it is general Parquet behavior? Any
details on this would help.
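For reference, Spark's documented default for `spark.executor.memoryOverhead` is max(10% of executor memory, 384 MiB); a small sketch of that rule (the function name is mine, not a Spark API):

```python
def default_memory_overhead_mb(executor_memory_mb, factor=0.10, minimum_mb=384):
    """Spark's documented default for spark.executor.memoryOverhead:
    max(10% of executor memory, 384 MiB)."""
    return max(int(executor_memory_mb * factor), minimum_mb)

# e.g. an 8 GiB executor gets 819 MiB of overhead by default,
# while a 1 GiB executor falls back to the 384 MiB floor
print(default_memory_overhead_mb(8192), default_memory_overhead_mb(1024))
```

If writes fail with off-heap OOMs, raising this setting above the default is usually the first knob to try.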

Thanks a lot


Re: Improved MOR spark reader

2023-07-24 Thread Nicolas Paris
>Jon is working on new Hudi Spark integration relying on a new
>implementation of the ParquetFileFormat

Sounds good, thanks for the pointer


On July 24, 2023 5:54:55 AM UTC, Y Ethan Guo  wrote:
>Hi Nicolas,
>
>Thanks for bringing up the discussion.  Spark's MOR snapshot relation
>provides different readers for different splits such as base-file-only
>split and regular split with base and log files.
>
>https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala#L124
>https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L93
>
>Jon is working on new Hudi Spark integration relying on a new
>implementation of the ParquetFileFormat, so Spark optimizations can kick in
>for MOR; see draft RFC here: https://github.com/apache/hudi/pull/9235.
>Feel free to give feedback there.
>
>Best,
>- Ethan
>
>On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris 
>wrote:
>
>> Just to clarify: the read path described is all about RT views here only,
>> not related to RO.
>>
>> On July 22, 2023 8:14:09 PM UTC, Nicolas Paris 
>> wrote:
>> >I have been playing with the starrocks MOR hudi reader recently and it
>> does an amazing work: it has two read paths:
>> >
>> >1. For partitions with log files, use the merging logic
>> >2. For partitions with only parquet files, use the cow read logic
>> >
>> >As you know, the first path is slow bcoz it has merging overhead and
>> can't provide any parquet benefit (pushdown, blooms...). In contrast, the
>> second path is blazing fast.
>> >
>> >MOR comes with tons of compaction rules, and  having such behavior makes
>> possible hot/cold partition management.
>> >
>> >One particular case is GDPR where usually old records are deleted/masked
>> on a random distribution , while new partitions are free of changes.
>> >
>> >So far spark does not make distinction between log / log free partitions
>> and I suspect adding such improvement would make MOR table more performant.
>> >
>> >I would be glad to work on such feature so please give early feedback if
>> there is some blocker.
>>


Re: Improved MOR spark reader

2023-07-22 Thread Nicolas Paris
Just to clarify: the read path described here concerns RT views only, not 
RO views.

On July 22, 2023 8:14:09 PM UTC, Nicolas Paris  wrote:
>I have been playing with the starrocks MOR hudi reader recently and it does an 
>amazing work: it has two read paths:
>
>1. For partitions with log files, use the merging logic
>2. For partitions with only parquet files, use the cow read logic
>
>As you know, the first path is slow bcoz it has merging overhead and can't 
>provide any parquet benefit (pushdown, blooms...). In contrast, the second 
>path is blazing fast.
>
>MOR comes with tons of compaction rules, and  having such behavior makes 
>possible hot/cold partition management.
>
>One particular case is GDPR where usually old records are deleted/masked on a 
>random distribution , while new partitions are free of changes.
>
>So far spark does not make distinction between log / log free partitions and I 
>suspect adding such improvement would make MOR table more performant.
>
>I would be glad to work on such feature so please give early feedback if there 
>is some blocker.


Improved MOR spark reader

2023-07-22 Thread Nicolas Paris
I have been playing with the StarRocks MOR Hudi reader recently and it does an 
amazing job. It has two read paths:

1. For partitions with log files, use the merging logic
2. For partitions with only parquet files, use the COW read logic

As you know, the first path is slow because it has merging overhead and can't 
provide any parquet benefit (pushdown, blooms...). In contrast, the second path 
is blazing fast.

MOR comes with tons of compaction rules, and having such behavior makes 
hot/cold partition management possible.

One particular case is GDPR, where usually old records are deleted/masked with 
a random distribution, while new partitions are free of changes.

So far Spark does not make a distinction between log and log-free partitions, 
and I suspect adding such an improvement would make MOR tables more performant.

I would be glad to work on such a feature, so please give early feedback if 
there is some blocker.
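A minimal sketch of that dispatch, in plain Python with hypothetical names (not Hudi's or StarRocks' actual API):

```python
def choose_read_path(file_slices):
    """Per file slice, only use the (slow) merge reader when log files are
    actually present; otherwise fall back to the plain parquet (COW-style)
    reader, which keeps pushdown and bloom filter benefits."""
    plans = []
    for fs in file_slices:
        if fs.get("log_files"):
            plans.append(("merge_reader", fs["base_file"], fs["log_files"]))
        else:
            plans.append(("parquet_reader", fs["base_file"], []))
    return plans

slices = [
    {"base_file": "part=1/base.parquet", "log_files": [".log.1"]},
    {"base_file": "part=2/base.parquet", "log_files": []},
]
# only the first slice pays the merge overhead
print(choose_read_path(slices))
```

The point is that the choice is per partition (or per file slice), so compacted "cold" partitions get the fast path automatically.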


Re: Discuss fast copy on write rfc-68

2023-07-21 Thread Nicolas Paris
Definitely can't see a benefit to using 30MB row groups over just creating 30MB 
parquet files.

I would add that stats indexes are at the file level, which argues for using 
row group size = file size.

The only context where it would help is when clustering is set up and targets 
1GB files, with 128MB row groups.

I would love to be contradicted on this. But in a sense the fast COW already 
exists: it consists of reducing the parquet file size for faster writes. That 
comes with the same read-performance drawback as smaller row groups would, 
while benefiting better from stats indexes.


On July 20, 2023 9:28:07 PM UTC, Nicolas Paris  wrote:
>Spliting parquet file into 5 row groups, leads to same benefit as creating 5 
>parquet files each 1 row group instead.
>
>Also the later can involve more parallelism for writes.
>
>Am I missing something?
>
>On July 20, 2023 12:38:54 PM UTC, sagar sumit  wrote:
>>Good questions! The idea is to be able to skip rowgroups based on index.
>>But, if we have to do a full snapshot load, then our wrapper should actually
>>be doing batch GET on S3. Why incur 5x more calls.
>>As for the update, I think this is in the context of COW. So, the footer
>>will be
>>recomputed anyways, so handling updates should not be that tricky.
>>
>>Regards,
>>Sagar
>>
>>On Thu, Jul 20, 2023 at 3:26 PM nicolas paris 
>>wrote:
>>
>>> Hi,
>>>
>>> Multiple idenpendant initiatives for fast copy on write have emerged
>>> (correct me if I am wrong):
>>> 1.
>>>
>>> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
>>> 2.
>>> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>>>
>>>
>>> The idea is to rely on RLI index to target only some row groups in a
>>> given parquet file, and only serde that one when copying the file
>>>
>>> Currently hudi generates one row group per parquet file (and having
>>> large row group is what parquet and other advocates).
>>>
>>> The FCOW feature then need to use several row group per parquet to
>>> provide some benefit, let's say 30MB as mentionned in the rfc68
>>> discussion.
>>>
>>> I have concerns about using small row groups for read performances such
>>> as :
>>> - more s3 throttle: if we have 5x more row group in a parquet files,
>>> then it leads to 5x GET call
>>> - worst read performances: since largest row group leads to better
>>> performances overall
>>>
>>>
>>> As a side question, I wonder how the writer can keep statistics within
>>> parquet footer correct. If updates occurs somewhere, then the below
>>> stuff present in the footer shall be updated accordingly:
>>> - parquet row group/pages stats
>>> - parquet dictionary
>>> - parquet bloom filters
>>>
>>> Thanks for your feedback on those
>>>


Re: Discuss fast copy on write rfc-68

2023-07-20 Thread Nicolas Paris
Splitting a parquet file into 5 row groups leads to the same benefit as creating 
5 parquet files of 1 row group each.

Also, the latter can involve more parallelism for writes.

Am I missing something?

On July 20, 2023 12:38:54 PM UTC, sagar sumit  wrote:
>Good questions! The idea is to be able to skip rowgroups based on index.
>But, if we have to do a full snapshot load, then our wrapper should actually
>be doing batch GET on S3. Why incur 5x more calls.
>As for the update, I think this is in the context of COW. So, the footer
>will be
>recomputed anyways, so handling updates should not be that tricky.
>
>Regards,
>Sagar
>
>On Thu, Jul 20, 2023 at 3:26 PM nicolas paris 
>wrote:
>
>> Hi,
>>
>> Multiple idenpendant initiatives for fast copy on write have emerged
>> (correct me if I am wrong):
>> 1.
>>
>> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
>> 2.
>> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>>
>>
>> The idea is to rely on RLI index to target only some row groups in a
>> given parquet file, and only serde that one when copying the file
>>
>> Currently hudi generates one row group per parquet file (and having
>> large row group is what parquet and other advocates).
>>
>> The FCOW feature then need to use several row group per parquet to
>> provide some benefit, let's say 30MB as mentionned in the rfc68
>> discussion.
>>
>> I have concerns about using small row groups for read performances such
>> as :
>> - more s3 throttle: if we have 5x more row group in a parquet files,
>> then it leads to 5x GET call
>> - worst read performances: since largest row group leads to better
>> performances overall
>>
>>
>> As a side question, I wonder how the writer can keep statistics within
>> parquet footer correct. If updates occurs somewhere, then the below
>> stuff present in the footer shall be updated accordingly:
>> - parquet row group/pages stats
>> - parquet dictionary
>> - parquet bloom filters
>>
>> Thanks for your feedback on those
>>


Discuss fast copy on write rfc-68

2023-07-20 Thread nicolas paris
Hi,

Multiple independent initiatives for fast copy-on-write have emerged
(correct me if I am wrong):
1.
https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
2.
https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/


The idea is to rely on the RLI index to target only some row groups in a
given parquet file, and to rewrite (serde) only those when copying the
file.

Currently Hudi generates one row group per parquet file (and large row
groups are what Parquet and others advocate).

The FCOW feature then needs to use several row groups per parquet file to
provide some benefit, say 30MB as mentioned in the rfc-68 discussion.

I have concerns about using small row groups for read performance, such
as:
- more S3 throttling: if we have 5x more row groups in a parquet file,
then it leads to 5x more GET calls
- worse read performance: since larger row groups lead to better
performance overall


As a side question, I wonder how the writer can keep the statistics in
the parquet footer correct. If updates occur somewhere, the following
footer contents must be updated accordingly:
- parquet row group/page stats
- parquet dictionary
- parquet bloom filters

Thanks for your feedback on those
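To make the S3 concern concrete, a deliberately simplified cost model, assuming roughly one ranged GET per row group (real readers may coalesce adjacent ranges):

```python
import math

def ranged_gets_per_file(file_size_mb, row_group_mb):
    """Simplified model: a full scan issues about one ranged GET per
    row group, so shrinking row groups multiplies the request count."""
    return math.ceil(file_size_mb / row_group_mb)

# a 150MB file: one big row group = 1 GET, 30MB row groups = 5 GETs
print(ranged_gets_per_file(150, 150), ranged_gets_per_file(150, 30))
```

Under this model, 5x smaller row groups mean 5x more GET calls on a full scan, which is exactly the throttling worry raised above; the trade-off only pays off when the index lets the reader skip most row groups.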


Re: Record level index with not unique keys

2023-07-13 Thread nicolas paris
Hello Prashant, thanks for your time.


> With non unique keys how would tagging of records (for updates /
deletes) work?

Currently both GLOBAL_SIMPLE and GLOBAL_BLOOM work out of the box in the
mentioned context; see the pyspark script and results below. As for the
implementation, tagLocationBacktoRecords returns an RDD of HoodieRecord
with (key/part/location), which can contain duplicate keys (and thus
multiple records for the same key).

```
tableName = "test_global_bloom"
basePath = f"/tmp/{tableName}"

hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.metadata.enable": "true",
    "hoodie.index.type": "GLOBAL_BLOOM",  # GLOBAL_SIMPLE works as well
}

# LET'S GEN DUPLS
mode = "overwrite"
df = spark.sql("""select '1' as event_id, '2' as ts, '2' as part UNION
 select '1' as event_id, '3' as ts, '3' as part UNION
 select '1' as event_id, '2' as ts, '3' as part UNION
 select '2' as event_id, '2' as ts, '3' as part""")
df.write.format("hudi").options(**hudi_options) \
    .option("hoodie.datasource.write.operation", "BULK_INSERT") \
    .mode(mode).save(basePath)
spark.read.format("hudi").load(basePath).select("event_id", "ts", "part").show()
# ++---++
# |event_id| ts|part|
# ++---++
# |   1|  3|   3|
# |   1|  2|   3|
# |   2|  2|   3|
# |   1|  2|   2|
# ++---++

# UPDATE
mode = "append"
spark.sql("select '1' as event_id, '20' as ts, '4' as part") \
    .write.format("hudi").options(**hudi_options) \
    .option("hoodie.datasource.write.operation", "UPSERT") \
    .mode(mode).save(basePath)
spark.read.format("hudi").load(basePath).select("event_id", "ts", "part").show()
# ++---++
# |event_id| ts|part|
# ++---++
# |   1| 20|   4|
# |   1| 20|   4|
# |   1| 20|   4|
# |   2|  2|   3|
# ++---++

# DELETE
mode = "append"
spark.sql("select 1 as event_id") \
    .write.format("hudi").options(**hudi_options) \
    .option("hoodie.datasource.write.operation", "DELETE") \
    .mode(mode).save(basePath)
spark.read.format("hudi").load(basePath).select("event_id", "ts", "part").show()
# ++---++
# |event_id| ts|part|
# ++---++
# |   2|  2|   3|
# ++---++
```


> How would record Index know which mapping of the array to
return for a given record key?

Like GLOBAL_SIMPLE/BLOOM, for a given record key the RLI would
return a list of mappings. The operation (update, delete, FCOW ...)
would then apply to each location.

To illustrate, we could get something like this in the MDT:

|event_id:1|[
 {part=2, -5811947225812876253, -6812062179961430298, 0, 
1689147210233}, 
 {part=3, -711947225812876253, -8812062179961430298, 1, 
1689147210233},
 {part=3, -1811947225812876253, -2812062179961430298, 0, 
1689147210233} 
     ]|
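A toy sketch of what tagging could look like if RLI stored a list of locations per key (hypothetical structure, not the actual MDT schema):

```python
def tag_locations(record_index, keys):
    """Sketch of a multi-location RLI lookup: each key maps to a *list* of
    (partition, file) locations instead of a single struct; an unknown key
    maps to an empty list (insert)."""
    return {k: record_index.get(k, []) for k in keys}

record_index = {
    "1": [{"part": "2", "fileIndex": 0},
          {"part": "3", "fileIndex": 1},
          {"part": "3", "fileIndex": 0}],
    "2": [{"part": "3", "fileIndex": 0}],
}
# an upsert/delete on key "1" would then be applied to all three locations
print(len(tag_locations(record_index, ["1"])["1"]))
```

This mirrors what GLOBAL_SIMPLE/BLOOM already produce via tagLocationBacktoRecords, just materialized in the metadata table.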


On Thu, 2023-07-13 at 10:17 -0700, Prashant Wason wrote:
> Hi Nicolas,
> 
> The RI feature is designed for max performance as it is at a record-
> count
> scale. Hence, the schema is simplified and minimized.
> 
> With non unique keys how would tagging of records (for updates /
> deletes)
> work? How would record Index know which mapping of the array to
> return for
> a given record key?
> 
> Thanks
> Prashant
> 
> 
> 
> On Wed, Jul 12, 2023 at 2:02 AM nicolas paris
> 
> wrote:
> 
> > hi there,
> > 
> > Just tested preview of RLI (rfc-08), amazing feature. Soon the fast
> > COW
> > (rfc-68) will be based on RLI to get the parquet offsets and allow
> > targeting parquet row groups.
> > 
> > RLI is a global index, therefore it assumes the hudi key is present
> > in
> > at most one parquet file. As a result in the MDT, the RLI is of
> > type
> > struct, and there is a 1:1 mapping w/ a given file.
> > 
> > Type:
> >    |-- recordIndexMetadata: struct (nullable = true)
> >    |    |-- partition: string (nullable = false)
> >    |    |-- fileIdHighBits: long (nullable = false)
> >    |    |-- fileIdLowBits: long (nullable = false)
> >    |    |-- fileIndex: integer (nullable = false)
> >    |    |-- instantTime: long (nullable = false)
> > 
> > Content:
> >    |event_id:1    |{part=3, -6811947225812876253,
> > -7812062179961430298, 0, 1689147210233}|
> > 
> > We would love to use both RLI and FCOW features, but I'm afraid our
> > keys are not unique in our kafka archives. Same key might be
> > present
> > in multiple partitions, and even in multiple slices within
> > partitions.
> > 
> > I wonder if the future, RLI could support multiple parquet files
> > (by
> > storing an array of struct for eg). This would enable to leverage
> > LRI
> > in more contexts
> > 
> > Thx
> > 
> > 
> > 
> > 
> > 



Record level index with not unique keys

2023-07-12 Thread nicolas paris
hi there,

Just tested a preview of RLI (RFC-08), an amazing feature. Soon the fast
COW (RFC-68) will be based on RLI to get the parquet offsets and allow
targeting parquet row groups.

RLI is a global index; it therefore assumes a Hudi key is present in at
most one parquet file. As a result, in the MDT the RLI entry is of type
struct, and there is a 1:1 mapping with a given file.

Type:
   |-- recordIndexMetadata: struct (nullable = true)
   |    |-- partition: string (nullable = false)
   |    |-- fileIdHighBits: long (nullable = false)
   |    |-- fileIdLowBits: long (nullable = false)
   |    |-- fileIndex: integer (nullable = false)
   |    |-- instantTime: long (nullable = false)

Content:
   |event_id:1|{part=3, -6811947225812876253, -7812062179961430298, 0, 
1689147210233}|
   
We would love to use both the RLI and FCOW features, but I'm afraid our
keys are not unique in our Kafka archives. The same key might be present
in multiple partitions, and even in multiple slices within a partition.

I wonder if, in the future, RLI could support multiple parquet files (by
storing an array of structs, for example). This would make it possible to
leverage RLI in more contexts.

Thx






Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Nicolas Paris
Hi, are there any RFCs or ongoing efforts on the reverse DeltaStreamer? We have
a use case to go Hudi => Kafka and would enjoy building a more general tool.

However, we need an RFC as a basis to start the effort in the right way.

On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar 
 wrote:
>Cool. lets draw up a RFC for this? @pratyaksh - do you want to start one,
>given you expressed interest?
>
>On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi  wrote:
>
>> +1
>> This would be great!
>>
>> Cheers,
>>
>> On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma 
>> wrote:
>>
>> > Hi Vinoth,
>> >
>> > I am aligned with the first reason that you mentioned. Better to have a
>> > separate tool to take care of this.
>> >
>> > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
>> > mail.vinoth.chan...@gmail.com>
>> > wrote:
>> >
>> > > +1
>> > >
>> > > I was thinking that we add a new utility and NOT extend DeltaStreamer
>> by
>> > > adding a Sink interface, for the following reasons
>> > >
>> > > - It will make it look like a generic Source => Sink ETL tool, which is
>> > > actually not our intention to support on Hudi. There are plenty of good
>> > > tools for that out there.
>> > > - the config management can get bit hard to understand, since we
>> overload
>> > > ingest and reverse ETL into a single tool. So break it off at use-case
>> > > level?
>> > >
>> > > Thoughts?
>> > >
>> > > David:  PMC does not have control over that. Please see unsubscribe
>> > > instructions here. https://hudi.apache.org/community/get-involved
>> > > Love to keep this thread about reverse streamer discussion. So kindly
>> > fork
>> > > another thread if you want to discuss unsubscribing.
>> > >
>> > > On Fri, Mar 31, 2023 at 1:47 AM Davidiam 
>> > wrote:
>> > >
>> > > > Hello Vinoth,
>> > > >
>> > > > Can you please unsubscribe me?  I have been trying to unsubscribe for
>> > > > months without success.
>> > > >
>> > > > Kind Regards,
>> > > > David
>> > > >
>> > > > Sent from Outlook for Android
>> > > > 
>> > > > From: Vinoth Chandar 
>> > > > Sent: Friday, March 31, 2023 5:09:52 AM
>> > > > To: dev 
>> > > > Subject: [DISCUSS] Hudi Reverse Streamer
>> > > >
>> > > > Hi all,
>> > > >
>> > > > Any interest in building a reverse streaming tool, that does the
>> > reverse
>> > > of
>> > > > what the DeltaStreamer tool does? It will read Hudi table
>> incrementally
>> > > > (only source) and write out the data to a variety of sinks - Kafka,
>> > JDBC
>> > > > Databases, DFS.
>> > > >
>> > > > This has come up many times with data warehouse users. Often times,
>> > they
>> > > > want to use Hudi to speed up or reduce costs on their data ingestion
>> > and
>> > > > ETL (using Spark/Flink), but want to move the derived data back into
>> a
>> > > data
>> > > > warehouse or an operational database for serving.
>> > > >
>> > > > What do you all think?
>> > > >
>> > > > Thanks
>> > > > Vinoth
>> > > >
>> > >
>> >
>>
>>
>> --
>> *Léo Biscassi*
>> Blog - https://leobiscassi.com
>>
>>-
>>


Re: Calling for 0.13.1 Release

2023-05-04 Thread nicolas paris
Hi, any timeline for the 0.13.1 bugfix release?
Could that one be added to the prep branch:
https://github.com/apache/hudi/pull/8432


On Thu, 2023-03-09 at 11:21 -0600, Shiyan Xu wrote:
> thanks for volunteering! let's collab on the release work
> 
> On Sun, Mar 5, 2023 at 8:16 PM Forward Xu 
> wrote:
> 
> > +1, Thanks for Yue Zhang to be the RM for the next 0.13.
> > ForwardXu
> > 
> > Yue Zhang  于2023年3月3日周五 16:31写道:
> > 
> > > Hi Hudiers,
> > >     I volunteer to be the RM for the next 0.13.1 if u don’t mind
> > > :)
> > > 
> > > 
> > > > > 
> > > Yue Zhang
> > > > 
> > > > 
> > > zhangyue921...@163.com
> > > > 
> > > 
> > > 
> > > On 03/3/2023 16:23,Y Ethan Guo wrote:
> > > Hi folks,
> > > 
> > > Given that we have already found a few critical issues affecting
> > > 0.13.0
> > > release, such as the following, I suggest that we, as a
> > > community, follow
> > > up with 0.13.1 release in a month to address reliability issues
> > > in
> > 0.13.0.
> > > Any volunteer for 0.13.1 Release Manager is welcome.
> > > 
> > > https://github.com/apache/hudi/pull/8026
> > > https://github.com/apache/hudi/pull/8079
> > > https://github.com/apache/hudi/pull/8080
> > > 
> > > Thanks,
> > > - Ethan
> > > 
> > 
> 
> 



Re: [REVERT] [VOTE] Release 0.12.0, release candidate #1

2022-10-07 Thread Nicolas Paris
Hi dev team,

I take this opportunity to also propose landing this tiny fix, which has
led us not to use the spark-bundle due to conflicts with other libs:
https://github.com/apache/hudi/pull/6874

In any case, thanks !


On Fri, 2022-10-07 at 18:43 +0800, Shiyan Xu wrote:
> Thank you, Zhaojing, for handling this. Agree on the decision. the
> fix
> <
> https://github.com/apache/hudi/commit/a51181726ce6efb57459285a66868e9d
> 3687bd60>
> was landed
> 
> On Fri, Oct 7, 2022 at 4:56 PM zhaojing yu 
> wrote:
> 
> > Update 0.12.0 to 0.12.1, sorry about that.
> > 
> > zhaojing yu  于2022年10月7日周五 16:55写道:
> > 
> > > After careful consideration, we decided to restore the voting
> > > results due
> > > to the serious issue found with bulkinsert/row-writing path
> > > (affecting
> > > persisted data and resulting in duplicates). fix is here
> > > releasing 0.12.1 without the fix would make 0.12.0 and 0.12.1
> > > basically
> > > crippled (have to ask users to avoid bulkinsert with row-writing)
> > > Therefore RC1 will be canceled and I'll start preparing for RC2.
> > > 
> > 
> 
> 


Re: Updates on 0.11.1 release

2022-06-10 Thread Nicolas Paris
Thanks to the community support, I have closed that issue and
commented with the reason.

Glad to see 0.11.1 soon



On Fri Jun 10, 2022 at 11:33 AM CEST, Nicolas Paris wrote:
> Hi team
>
> I likely spotted a blocker issue with the incremental cleaning service
> which is a blocker on our side to scale cleaning on large tables.
>
> See https://github.com/apache/hudi/issues/5835
>
> Please tell me if my email does not respect the release process
>
> On Wed Jun 8, 2022 at 1:39 AM CEST, Y Ethan Guo wrote:
> > Hi folks,
> >
> > All the 0.11.1 release blockers are landed. I'm going to cut RC1 and
> > start
> > the release candidate process.
> >
> > Thanks,
> > - Ethan
> >
> > On Thu, Jun 2, 2022 at 9:48 PM Y Ethan Guo  wrote:
> >
> > > Hi folks,
> > >
> > > There are still a few critical PRs on bridging the performance gaps
> > > between 0.11.0 and 0.10.1 that are pending, e.g., HUDI-4176
> > > <https://github.com/apache/hudi/pull/5733>, HUDI-4178
> > > <https://github.com/apache/hudi/pull/5737>, etc.  We should get those
> > > landed for 0.11.1.  In this case, I'm going to postpone the code freeze.
> > > If you have more fixes you'd like to merge for the release, please let me
> > > know in this thread.
> > >
> > > The exact code freeze date will be updated soon.  Please stay tuned.
> > >
> > > Thanks,
> > > - Ethan
> > >



Re: Updates on 0.11.1 release

2022-06-10 Thread Nicolas Paris
Hi team

I likely spotted a blocker issue with the incremental cleaning service
which is a blocker on our side to scale cleaning on large tables.

See https://github.com/apache/hudi/issues/5835

Please tell me if my email does not respect the release process 

On Wed Jun 8, 2022 at 1:39 AM CEST, Y Ethan Guo wrote:
> Hi folks,
>
> All the 0.11.1 release blockers are landed. I'm going to cut RC1 and
> start
> the release candidate process.
>
> Thanks,
> - Ethan
>
> On Thu, Jun 2, 2022 at 9:48 PM Y Ethan Guo  wrote:
>
> > Hi folks,
> >
> > There are still a few critical PRs on bridging the performance gaps
> > between 0.11.0 and 0.10.1 that are pending, e.g., HUDI-4176
> > , HUDI-4178
> > , etc.  We should get those
> > landed for 0.11.1.  In this case, I'm going to postpone the code freeze.
> > If you have more fixes you'd like to merge for the release, please let me
> > know in this thread.
> >
> > The exact code freeze date will be updated soon.  Please stay tuned.
> >
> > Thanks,
> > - Ethan
> >



Re: spark 3.2.1 built-in bloom filters

2022-05-19 Thread Nicolas Paris
Now that Hudi 0.11 provides multi-column bloom indexes through
`hoodie.metadata.index.bloom.filter.column.list`, the question is whether
those bloom filters are used by the query planner, e.g. for id=19.

The Spark built-in blooms are used in this case; maybe that is the purpose
of the Hudi multi-column blooms as well? (there is no mention of their use)
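On the fpp-target sizing Vinoth mentions below, the standard false-positive formula for a bloom filter with m bits, k hash functions and n entries can be sketched in a few lines (plain math, not Hudi's implementation):

```python
import math

def bloom_fpp(num_entries, num_bits, num_hashes):
    """Classic bloom filter false-positive probability:
    fpp = (1 - e^(-k*n/m))^k, for n entries, m bits, k hashes."""
    return (1 - math.exp(-num_hashes * num_entries / num_bits)) ** num_hashes

# ~1000 keys in ~9.6k bits with 7 hashes lands around a 1% fpp;
# adding bits always lowers the fpp for fixed n and k
print(round(bloom_fpp(1000, 9586, 7), 3))
```

Dynamic bloom sizing is essentially solving this equation for m given a target fpp and the observed record count.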


thanks




On Wed Mar 30, 2022 at 11:36 PM CEST, Vinoth Chandar wrote:
> Hi,
>
> I noticed that it finally landed. We actually began tracking that JIRA
> while initially writing Hudi at Uber.. Parquet + Bloom Filters has taken
> just a few years :)
> I think we could switch out to reading the built-in bloom filters as
> well.
> it could make the footer reading lighter potentially.
>
> Few things that Hudi has built on top would be missing
>
> - Dynamic bloom filter support, where we auto size current bloom filters
> based on number of records, given a fpp target
> - Our current DAG that optimizes for checking records against bloom
> filters
> is still needed on writer side. Checking bloom filters for a given
> predicate e.g id=19, is much simpler compared to matching say a 100k ids
> against 1000 files. We need to be able to amortize the cost of these
> 100M
> comparisons.
>
> On the future direction, with 0.11, we are enabling storing of bloom
> filters and column ranges inside the Hudi metadata table (MDT). *(what
> we
> call multi modal indexes).
> This helps us make the access more resilient towards cloud storage
> throttling and also more performant (we need to read much fewer files)
>
> Over time, when this mechanism is stable, we plan to stop writing out
> bloom
> filters in parquet and also integrate the Hudi MDT with different query
> engines for point-ish lookups.
>
> Hope that helps
>
> Thanks
> Vinoth
>
>
>
>
> On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris 
> wrote:
>
> > Hi,
> >
> > spark 3.2 ships parquet 1.12 which provides built-in bloom filters on
> > arbirtrary columns. I wonder if:
> >
> > - hudi can benefit from them ? (likely in 0.11, but not with MOR tables)
> > - would make sense to replace the hudi blooms with them ?
> > - what would be the advantage of storing our blooms in hfiles (AFAIK
> >   this is the future expected implementation) over the parquet built-in.
> >
> >
> > here is the syntax:
> >
> > .option("parquet.bloom.filter.enabled#favorite_color", "true")
> > .option("parquet.bloom.filter.expected.ndv#favorite_color", "100")
> >
> >
> > and here some code to illustrate :
> >
> >
> > https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
> >
> >
> >
> > thx
> >



Re: spark 3.2.1 built-in bloom filters

2022-04-02 Thread Nicolas Paris
Hi Vinoth,

Thanks for your in-depth explanations. I think those details could be
of interest in the documentation. I can work on this if agreed.

On Wed, 2022-03-30 at 14:36 -0700, Vinoth Chandar wrote:
> Hi,
> 
> I noticed that it finally landed. We actually began tracking that
> JIRA
> while initially writing Hudi at Uber.. Parquet + Bloom Filters has
> taken
> just a few years :)
> I think we could switch out to reading the built-in bloom filters as
> well.
> it could make the footer reading lighter potentially.
> 
> Few things that Hudi has built on top would be missing
> 
> - Dynamic bloom filter support, where we auto size current bloom
> filters
> based on number of records, given a fpp target
> - Our current DAG that optimizes for checking records against bloom
> filters
> is still needed on writer side. Checking bloom filters for a given
> predicate e.g id=19, is much simpler compared to matching say a 100k
> ids
> against 1000 files. We need to be able to amortize the cost of these
> 100M
> comparisons.
> 
> On the future direction, with 0.11, we are enabling storing of bloom
> filters and column ranges inside the Hudi metadata table (MDT).
> *(what we
> call multi modal indexes).
> This helps us make the access more resilient towards cloud storage
> throttling and also more performant (we need to read much fewer
> files)
> 
> Over time, when this mechanism is stable, we plan to stop writing out
> bloom
> filters in parquet and also integrate the Hudi MDT with different
> query
> engines for point-ish lookups.
> 
> Hope that helps
> 
> Thanks
> Vinoth
> 
> 
> 
> 
> On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris
> 
> wrote:
> 
> > Hi,
> > 
> > spark 3.2 ships parquet 1.12 which provides built-in bloom filters
> > on
> > arbirtrary columns. I wonder if:
> > 
> > - hudi can benefit from them ? (likely in 0.11, but not with MOR
> > tables)
> > - would make sense to replace the hudi blooms with them ?
> > - what would be the advantage of storing our blooms in hfiles
> > (AFAIK
> >   this is the future expected implementation) over the parquet
> > built-in.
> > 
> > 
> > here is the syntax:
> > 
> >     .option("parquet.bloom.filter.enabled#favorite_color", "true")
> >     .option("parquet.bloom.filter.expected.ndv#favorite_color",
> > "100")
> > 
> > 
> > and here some code to illustrate :
> > 
> > 
> > https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
> > 
> > 
> > 
> > thx
> > 


spark 3.2.1 built-in bloom filters

2022-03-28 Thread Nicolas Paris
Hi,

spark 3.2 ships parquet 1.12, which provides built-in bloom filters on
arbitrary columns. I wonder:

- can hudi benefit from them? (likely in 0.11, but not with MOR tables)
- would it make sense to replace the hudi blooms with them?
- what would be the advantage of storing our blooms in hfiles (AFAIK
  this is the future expected implementation) over the parquet built-ins?


here is the syntax:

.option("parquet.bloom.filter.enabled#favorite_color", "true")
.option("parquet.bloom.filter.expected.ndv#favorite_color", "100")


and here is some code to illustrate:

https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654



thx
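To make the "can hudi benefit from them" question concrete, the planner-side win is file skipping: for a point predicate, only files whose bloom answers "maybe" need to be read. A toy sketch of that pruning (my own simplified bloom, standing in for the real per-column-chunk parquet filters):

```python
import hashlib

M, K = 256, 3  # toy bloom parameters

def positions(key: str):
    """K bit positions for a key, derived from sha256."""
    return [int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def build_bloom(keys):
    """Build one bitmap over all keys in a file."""
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits

def might_contain(bits: int, key: str) -> bool:
    return all(bits >> p & 1 for p in positions(key))

# One bloom per data file (hypothetical file names for illustration).
files = {
    "f1.parquet": build_bloom(["1", "2", "3"]),
    "f2.parquet": build_bloom(["19", "20"]),
    "f3.parquet": build_bloom(["40", "41"]),
}

# A predicate like id=19 only has to scan the "maybe" files.
candidates = [name for name, bits in files.items() if might_contain(bits, "19")]
print(candidates)  # always includes f2.parquet; f1/f3 skipped barring a false positive
```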


Re: [ANNOUNCE] Apache Hudi 0.10.1 released

2022-01-29 Thread Nicolas Paris
congrats

what about also posting releases to the apache announce mailing list
annou...@apache.org


On Fri Jan 28, 2022 at 1:39 PM CET, Sivabalan wrote:
> The Apache Hudi team is pleased to announce the release of Apache
>
> Hudi 0.10.1.
>
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
>
> and Incrementals. Apache Hudi manages storage of large analytical
>
> datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> storage)
>
> and provides the ability to query them.
>
>
> This release comes 1.5 months after 0.10.0. This release is purely
> intended to fix stability and bugs, which includes more than 120+
> resolved
> issues. Fixes span many areas ranging from key generators to timeline,
> engine specific fixes, table services, etc.
>
>
> For details on how to use Hudi, please look at the quick start page
> located
> at https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
>
> https://github.com/apache/hudi/releases/tag/release-0.10.1
>
> You can read more about the release (including release notes) here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12351135
>
>
> We welcome your help and feedback. For more information on how to
> report problems, and to get involved, visit the project website at:
>
> http://hudi.apache.org/
>
> Thanks to everyone involved!
>
>
> --
> Regards,
> -Sivabalan



Re: Limitations of non unique keys

2021-11-03 Thread Nicolas Paris


> In another words, we are generalizing this so hudi feels more like
> MySQL and not HBase/Cassandra (key value store). Thats the direction
> we are approaching.

wow this is amazing. I haven't yet found an RFC about this, nor a PR
ready to test.

This answers my initial question: with the secondary index options
coming, the hudi key should be a primary key (if one exists). There is
no reason to choose anything else.
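A toy illustration of the secondary-index idea (my sketch, not hudi's actual metadata-table layout): an inverted map from a non-key column's values to record keys lets a predicate on that column resolve to keys without a full scan.

```python
# Primary-keyed records, as in a hudi table with a proper record key.
records = {
    "k1": {"user_id": "u1", "color": "red"},
    "k2": {"user_id": "u2", "color": "blue"},
    "k3": {"user_id": "u3", "color": "red"},
}

# Secondary index: non-key column value -> set of record keys.
secondary: dict[str, set[str]] = {}
for key, row in records.items():
    secondary.setdefault(row["color"], set()).add(key)

# A "MySQL-like" lookup on a non-primary column touches only matching keys.
hits = sorted(secondary.get("red", set()))
print(hits)  # ['k1', 'k3']
```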

On Wed Nov 3, 2021 at 9:03 PM CET, Vinoth Chandar wrote:
> Hi.
>
> With the indexing approach we are taking, you should be able to add
> secondary indexes on any column. not just the key.
> In another words, we are generalizing this so hudi feels more like MySQL
> and not HBase/Cassandra (key value store). Thats the direction we are
> approaching.
>
> love to hear more feedback.
>
> On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris 
> wrote:
>
> > for example does the move of blooms into hfiles (0.10.0 feature) makes
> > unique bloom keys mandatory ?
> >
> >
> >
> > On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
> > >
> > > > Are you asking if there are advantages to allowing duplicates or not
> > having keys in your table?
> > > it's all about allowing duplicates
> > >
> > > use case is say an Order table and choosing key = customer_id
> > > then being able to do indexed delete without need of prescanning the
> > > dataset
> > >
> > > I wonder if there will be trouble I am unaware of with such trick
> > >
> > > On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > > > Hi,
> > > >
> > > > Are you asking if there are advantages to allowing duplicates or not
> > > > having
> > > > keys in your table?
> > > >
> > > > Having keys, helps with othe practical scenarios, in addition to what
> > > > you
> > > > called out.
> > > > e.g: Oftentimes, you would want to backfill an insert-only table and
> > you
> > > > don't want to introduce duplicates when doing so.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris <
> > nicolas.pa...@riseup.net>
> > > > wrote:
> > > >
> > > > > Hi devs,
> > > > >
> > > > > AFAIK, hudi has been designed to have primary keys in the hudi's key.
> > > > > However it is possible to also choose a non unique field. I have
> > listed
> > > > > several trouble with such design:
> > > > >
> > > > > Non unique key yield to :
> > > > > - cannot delete / update a unique record
> > > > > - cannot apply primary key for new sql tables feature
> > > > >
> > > > > Is there other downsides to choose a non unique key you have in mind
> > ?
> > > > >
> > > > > In my case, having user_id as a hudi key will help to apply deletion
> > on
> > > > > the user level in any user table. The table are insert only, so the
> > > > > drawbacks listed above do not really apply. In case of error in the
> > > > > tables I have several options:
> > > > >
> > > > > - rollback to a previous commit
> > > > > - read partition/filter overwrite partition
> > > > >
> > > > > Thanks
> > > > >
> >
> >



Re: Limitations of non unique keys

2021-11-02 Thread Nicolas Paris
for example, does the move of blooms into hfiles (a 0.10.0 feature) make
unique bloom keys mandatory?



On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
>
> > Are you asking if there are advantages to allowing duplicates or not having 
> > keys in your table?
> it's all about allowing duplicates
>
> use case is say an Order table and choosing key = customer_id
> then being able to do indexed delete without need of prescanning the
> dataset
>
> I wonder if there will be trouble I am unaware of with such trick
>
> On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > Hi,
> >
> > Are you asking if there are advantages to allowing duplicates or not
> > having
> > keys in your table?
> >
> > Having keys, helps with othe practical scenarios, in addition to what
> > you
> > called out.
> > e.g: Oftentimes, you would want to backfill an insert-only table and you
> > don't want to introduce duplicates when doing so.
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris 
> > wrote:
> >
> > > Hi devs,
> > >
> > > AFAIK, hudi has been designed to have primary keys in the hudi's key.
> > > However it is possible to also choose a non unique field. I have listed
> > > several trouble with such design:
> > >
> > > Non unique key yield to :
> > > - cannot delete / update a unique record
> > > - cannot apply primary key for new sql tables feature
> > >
> > > Is there other downsides to choose a non unique key you have in mind ?
> > >
> > > In my case, having user_id as a hudi key will help to apply deletion on
> > > the user level in any user table. The table are insert only, so the
> > > drawbacks listed above do not really apply. In case of error in the
> > > tables I have several options:
> > >
> > > - rollback to a previous commit
> > > - read partition/filter overwrite partition
> > >
> > > Thanks
> > >



Re: feature request/proposal: leverage bloom indexes for reading

2021-10-28 Thread Nicolas Paris
I tested the HoodieReadClient. It's a great start indeed. It looks like
this client is meant for testing purposes and needs some enhancement. I
will try to produce general-purpose code around this and, who knows,
contribute it.

I guess the datasource api is not the best candidate, since hudi keys
cannot be passed as options, only as an rdd or df:

spark.read.format('hudi').option('hudi.filter.keys',
'a,flat,list,of,keys,not,really,cool').load(...)

there is also the option to introduce a new hudi operation such as
"select". but again, that path is not supposed to return a dataframe,
only write to the hudi table:

df_hudi_keys.options(**hudi_options).save(...)

so a full-featured / documented hoodie client is maybe the best option


thoughts?


On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
> Sounds great!
>
> On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris 
> wrote:
>
> > Hi Vinoth,
> >
> > Thanks for the starter. Definitely once the new way to manage indexes
> > and we get migrated on hudi on our datalake, I d'be glad to give this a
> > shot.
> >
> >
> > Regards, Nicolas
> >
> > On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > > Hi Nicolas,
> > >
> > > Thanks for raising this! I think it's a very valid ask.
> > > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> > >
> > > As a proof of concept, would you be able to give filterExists() a shot
> > > and
> > > see if the filtering time improves?
> > >
> > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> > >
> > > In the upcoming 0.10.0 release, we are planning to move the bloom
> > > filters
> > > out to a partition on the metadata table, to even speed this up for very
> > > large tables.
> > > https://issues.apache.org/jira/browse/HUDI-1295
> > >
> > > Please let us know if you are interested in testing that when the PR is
> > > up.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris 
> > > wrote:
> > >
> > > > hi !
> > > >
> > > > In my use case, for GDPR I have to export all informations of a given
> > > > user from several hudi HUGE tables. Filtering the table results in a
> > > > full scan of around 10 hours and this will get worst year after year.
> > > >
> > > > Since the filter criteria is based on the bloom key (user_id) it would
> > > > be handy to exploit the bloom and produce a temporary table (in the
> > > > metastore for eg) with the resulting rows.
> > > >
> > > > So far the bloom indexing is used for update/delete operations on a
> > hudi
> > > > table.
> > > >
> > > > 1. There is a oportunity to exploit the bloom for select operations.
> > > > the hudi options would be:
> > > > operation: select
> > > > result-table: 
> > > > result-path: 
> > > > result-schema:  (optional ; when empty no
> > > > sync with the hms, only raw path)
> > > >
> > > >
> > > > 2. It could be implemented as predicate push down in the spark
> > > > datasource API. When filtering with a IN statement.
> > > >
> > > >
> > > > Thought ?
> > > >
> >
> >



Re: feature request/proposal: leverage bloom indexes for reading

2021-10-26 Thread Nicolas Paris
Hi Vinoth,

Thanks for the starter. Definitely: once the new index management lands
and our datalake is migrated to hudi, I'd be glad to give this a shot.


Regards, Nicolas

On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> Hi Nicolas,
>
> Thanks for raising this! I think it's a very valid ask.
> https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
>
> As a proof of concept, would you be able to give filterExists() a shot
> and
> see if the filtering time improves?
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
>
> In the upcoming 0.10.0 release, we are planning to move the bloom
> filters
> out to a partition on the metadata table, to even speed this up for very
> large tables.
> https://issues.apache.org/jira/browse/HUDI-1295
>
> Please let us know if you are interested in testing that when the PR is
> up.
>
> Thanks
> Vinoth
>
> On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris 
> wrote:
>
> > hi !
> >
> > In my use case, for GDPR I have to export all informations of a given
> > user from several hudi HUGE tables. Filtering the table results in a
> > full scan of around 10 hours and this will get worst year after year.
> >
> > Since the filter criteria is based on the bloom key (user_id) it would
> > be handy to exploit the bloom and produce a temporary table (in the
> > metastore for eg) with the resulting rows.
> >
> > So far the bloom indexing is used for update/delete operations on a hudi
> > table.
> >
> > 1. There is a oportunity to exploit the bloom for select operations.
> > the hudi options would be:
> > operation: select
> > result-table: 
> > result-path: 
> > result-schema:  (optional ; when empty no
> > sync with the hms, only raw path)
> >
> >
> > 2. It could be implemented as predicate push down in the spark
> > datasource API. When filtering with a IN statement.
> >
> >
> > Thought ?
> >



Limitations of non unique keys

2021-10-26 Thread Nicolas Paris
Hi devs,

AFAIK, hudi has been designed with the hudi key acting as a primary key.
However it is also possible to choose a non-unique field. I have listed
several problems with such a design.

A non-unique key means:
- you cannot delete / update a single record
- you cannot declare a primary key for the new sql tables feature

Are there other downsides to choosing a non-unique key that you have in
mind?

In my case, having user_id as the hudi key will help apply deletion at
the user level in any user table. The tables are insert only, so the
drawbacks listed above do not really apply. In case of an error in the
tables I have several options:

- rollback to a previous commit
- read a partition, filter it, and overwrite the partition

Thanks
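The first limitation above can be made concrete with a toy index keyed by a non-unique field: every keyed operation addresses all records sharing the key, which suits a GDPR-style user deletion but rules out record-level updates/deletes. (A hypothetical sketch, not hudi's actual index.)

```python
from collections import defaultdict

# Index keyed by customer_id (non-unique): each key maps to many records.
index: defaultdict[str, list[str]] = defaultdict(list)
for customer_id, order_id in [("c1", "o1"), ("c1", "o2"), ("c2", "o3")]:
    index[customer_id].append(order_id)

def delete_by_key(key: str) -> list[str]:
    """Keyed delete removes EVERY record sharing the key."""
    return index.pop(key, [])

removed = delete_by_key("c1")
print(removed)        # ['o1', 'o2'] -- both of c1's orders are gone
print("c1" in index)  # False; deleting only one of c1's orders was never expressible
```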


feature request/proposal: leverage bloom indexes for reading

2021-10-19 Thread Nicolas Paris
hi !

In my use case, for GDPR I have to export all information about a given
user from several HUGE hudi tables. Filtering a table results in a
full scan of around 10 hours, and this will get worse year after year.

Since the filter criterion is based on the bloom key (user_id), it would
be handy to exploit the bloom and produce a temporary table (in the
metastore for e.g.) with the resulting rows.

So far the bloom indexing is used for update/delete operations on a hudi
table.

1. There is an opportunity to exploit the bloom for select operations.
the hudi options would be:
operation: select
result-table: 
result-path: 
result-schema:  (optional ; when empty no
sync with the hms, only raw path)


2. It could be implemented as a predicate pushdown in the spark
datasource API, when filtering with an IN statement.


Thoughts?