Re: [DISCUSS] Write failed records

2020-06-03 Thread Vinoth Chandar
Thanks! Will review and get back to you

On Tue, Jun 2, 2020 at 10:37 AM Shiyan Xu 
wrote:

> Thank you for the feedback, Vinoth. Agreed with your points. Also created a
> small RFC for easy alignment on the changes:
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records
>
> On Sun, May 24, 2020 at 12:06 AM Vinoth Chandar  wrote:
>
> > Hi Raymond,
> >
> > Thanks for starting this discussion.
> >
> > Agree on 1.. (we may also need some CLI support for inspecting bad records
> > and also code samples to consume them etc.?)
> >
> > On 2, these places seem appropriate. We can figure it out in more detail
> > when we get to implementation?
> >
> > On 3. +1 on logs.. We should also define a standard schema for error
> > records.. I see some tricky issues to handle here, for schema mismatch
> > errors. E.g., if the core problem was a schema mismatch, then
> > serializing/deserializing the error record without a working schema
> > specific to that record may not be possible? Maybe we need the record
> > data itself in some format like JSON, that is schemaless?
> > I also wonder if we should write the error table as another internal
> > HoodieTable (we are abstracting out HoodieTable, FileGroupIO etc anyway)?
> >
> > On 4, +1 again.
> >
> > On Fri, May 22, 2020 at 7:47 PM Shiyan Xu 
> > wrote:
> >
> > > Hi all,
> > >
> > > I'd like to bring up this discussion around handling errors in Hudi
> write
> > > paths.
> > > https://issues.apache.org/jira/browse/HUDI-648
> > >
> > > Trying to gather some feedback about the implementation details:
> > > 1. Error location
> > > I'm thinking of writing the failed records to `.hoodie/errors/` to
> > > a) encapsulate data within the Hudi table for ease of management
> > > b) make use of an existing dedicated directory
> > >
> > > 2. Write path
> > > org.apache.hudi.client.HoodieWriteClient#postWrite
> > > org.apache.hudi.client.HoodieWriteClient#completeCompaction
> > > These two methods should be the places to persist the failed records from
> > > `org.apache.hudi.table.action.HoodieWriteMetadata#writeStatuses`
> > > to the designated location.
> > >
> > > 3. Format
> > > Records should be written as log files (Avro).
> > >
> > > 4. Metric
> > > After writing failed records, we should emit a metric with the count of
> > > errors written, so a monitoring system can easily pick it up and alert.
> > >
> > > Foreseeably, some details may need to be adjusted during development. To
> > > begin with, we can agree on a feasible plan at a high level.
> > >
> > > Please kindly share your thoughts and feedback. Thank you.
> > >
> > >
> > >
> > > Regards,
> > > Raymond
> > >
> >
>
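
For illustration, one possible shape for the standard error-record schema discussed
above (point 3), keeping the original payload as a schemaless JSON string so it can
still be persisted when the failure was itself a schema mismatch. This is only a
sketch for discussion, not the RFC-20 design; all field names are hypothetical.

```scala
import org.apache.avro.SchemaBuilder

// Hypothetical standard schema for failed records: the payload travels as a
// JSON string, so no per-record Avro schema is required even when the error
// being recorded is a schema mismatch.
val errorRecordSchema = SchemaBuilder.record("HoodieErrorRecord").fields()
  .requiredString("instantTime")   // commit/deltacommit on which the write failed
  .requiredString("recordKey")
  .optionalString("partitionPath")
  .requiredString("errorType")     // e.g. SCHEMA_MISMATCH, WRITE_FAILURE
  .requiredString("errorMessage")
  .requiredString("payloadJson")   // original record as schemaless JSON
  .endRecord()
```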


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

2020-06-03 Thread Vinoth Chandar
This is a good conversation. The ask for support of bucketed tables has not
actually come up much, since if you are looking up things at that
granularity, it almost feels like you are doing OLTP/database-like queries?

Assuming you hash the primary key into a value that denotes the partition,
then a simple workaround is to always add a where clause using a UDF in
Presto, i.e. where key = 123 and partition = hash_udf(123).

But of course the downside is that your ops team needs to remember to add
the second partition clause (which is not very different from querying
large time-partitioned tables today).

Our mid-term plan is to build out column indexes (RFC-15 has the details,
if you are interested)
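
To make the workaround concrete, here is a rough Spark-side sketch of the same idea.
Paths, column names and the bucket count are hypothetical, and the hash function used
at query time must match whatever is used at write time (a Presto UDF would have to
reimplement it):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("hash-bucket-sketch").getOrCreate()
val numBuckets = 1000  // illustrative bucket count

// Write side (not shown in full): derive bucket = pmod(hash(id), numBuckets)
// and configure "bucket" as hoodie.datasource.write.partitionpath.field.

// Query side: pair the key predicate with the matching bucket predicate so
// only one partition is scanned -- the Spark equivalent of adding
// "and partition = hash_udf(123)" in Presto.
val df = spark.read.format("hudi").load("s3://bucket/hudi_table/*")  // path/glob per table layout
val hit = df.where(col("id") === 123 &&
                   col("bucket") === pmod(hash(lit(123)), lit(numBuckets)))
hit.show()
```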

On Wed, Jun 3, 2020 at 2:54 AM tanu dua  wrote:

> If I need to plug in this hashing algorithm to resolve the partitions in
> Presto and Hive, what code should I look into?
>
> On Wed, Jun 3, 2020, 12:04 PM tanu dua  wrote:
>
> > Yes, that’s also on the cards, and for developers that’s ok, but we need to
> > provide an interface for our ops people to execute queries from Presto. So I
> > need to find out, if they fire a query on the primary key, how I can
> > calculate the hash. They can fire a query including the primary key with
> > other fields. So that is the only problem I see with hash partitions, and to
> > get it to work I believe I need to go deeper into the Presto Hudi plugin.
> >
> > On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah 
> > wrote:
> >
> >> Hi Tanu,
> >>
> >> If your primary key is an integer you can add one more field as a hash of
> >> the integer and partition based on the hash field. It will add some
> >> complexity to read and write because the hash has to be computed prior to
> >> each read or write.
> >> Not sure whether the overhead of doing this exceeds the performance gains
> >> due to fewer partitions. I wonder why Hudi doesn't directly support
> >> hash-based partitions?
> >>
> >> Thanks
> >> Jaimin
> >>
> >> On Wed, 3 Jun 2020 at 10:07, tanu dua  wrote:
> >>
> >> > Thanks Vinoth for the detailed explanation. Even I was thinking along
> >> > the same lines and I will relook. We can reduce the 2nd and 3rd partition
> >> > but it’s very difficult to reduce the 1st partition, as that is the basic
> >> > primary key of our domain model on which analysts and developers need to
> >> > query almost 90% of the time, and it’s an integer primary key that can’t
> >> > be decomposed further.
> >> >
> >> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar 
> >> wrote:
> >> >
> >> > > Hi tanu,
> >> > >
> >> > > For good query performance, it’s recommended to write optimally sized
> >> > > files. Hudi already ensures that.
> >> > >
> >> > > Generally speaking, if you have too many partitions, then it also means
> >> > > too many files. Mostly people limit to 1000s of partitions in their
> >> > > datasets, since queries typically crunch data based on time or a
> >> > > business_domain (e.g. city for Uber).. Partitioning too granularly - say
> >> > > based on user_id - is not very useful unless your queries only crunch
> >> > > per user.. If you are using the Hive metastore, then 25M partitions mean
> >> > > 25M rows in your backing MySQL metastore db as well - not very scalable.
> >> > >
> >> > > What I am trying to say is: even outside of Hudi, if analytics is your
> >> > > use case, it might be worth partitioning at a lower granularity and
> >> > > increasing the rows per parquet file.
> >> > >
> >> > > Thanks
> >> > > Vinoth
> >> > >
> >> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj  wrote:
> >> > >
> >> > > > Hi,
> >> > > > We have a requirement to ingest 30M records into S3 backed by Hudi.
> >> > > > I am figuring out the partition strategy and ending up with a lot of
> >> > > > partitions, like 25M partitions (primary partition) --> 2.5M
> >> > > > (secondary partition) --> 2.5M (third partition), and each parquet
> >> > > > file will have less than 10 rows of data.
> >> > > >
> >> > > > Our dataset will be ingested at once in full and then it will be
> >> > > > incremental daily with less than 1k updates. So it’s more read-heavy
> >> > > > than write-heavy.
> >> > > >
> >> > > > So what should be the suggestion in terms of Hudi performance - go
> >> > > > ahead with the above partition strategy, or shall I reduce my
> >> > > > partitions and increase the no. of rows in each parquet file?
> >> > > >
> >> > >
> >> >
> >>
> >
>
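
As a concrete example of the "optimally sized files" point above, a sketch of the
write-side knobs involved when choosing a coarser partition column. The table name,
field names and size values are illustrative; the option keys are standard Hudi
datasource/storage configs:

```scala
import org.apache.spark.sql.DataFrame

// Sketch: with a coarse partition column (e.g. an ingestion date) Hudi packs
// records into fewer, larger files; the parquet configs steer the file sizes.
def writeCoarselyPartitioned(df: DataFrame, path: String): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", "my_table")                               // hypothetical
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "ingest_date")  // coarse partition
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)  // ~120 MB target files
    .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString)
    .mode("append")
    .save(path)
}
```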


Re: How to extend the timeline server schema to accommodate business metadata

2020-06-03 Thread Vinoth Chandar
Hi Mario,

We actually started with the idea of making the timeline server a
long-running service. We have a module, if you notice, that builds out a
bundle that you could deploy. Maybe you can play with it and see if that
sounds interesting to you. It will definitely have some rough edges given
it’s not been widely used.

Thanks
Vinoth

On Wed, Jun 3, 2020 at 2:33 AM Mario de Sá Vera  wrote:

> Hi Vinoth, thanks for your comments on this. I spent some time thinking over
> another possibility, which would be externalising the Hudi timeline service
> itself to an external server holding both operational (i.e. Hudi) and
> business metadata.
>
> Would you guys have any opinion on that? Would that be easy? I do not seem
> to see a way yet, except reading about RocksDB, but that is still not
> quite clear.
>
> best regards,
>
> Mario.
>
> On Mon, Jun 1, 2020 at 16:01, Vinoth Chandar <
> mail.vinoth.chan...@gmail.com> wrote:
>
> > Hi Mario,
> >
> > Thanks for the detailed explanation. Hudi already allows extra metadata to
> > be written atomically with each commit, i.e. write operation. In fact, that
> > is how we track checkpoints for our delta streamer tool.. It may not solve
> > the need for querying the data together with this information, but gives
> > you the ability to do some basic tagging.. if that’s useful.
> >
> > >>If we enable the timeline service metadata model to be extended we
> could
> > use the service instance itself to support specialised queries that
> involve
> > business qualifiers in order to return a proper set of metadata pointing
> to
> > the related commits
> >
> > This is a good idea actually.. There is another active discuss thread on
> > making the metadata queryable.. there is also
> > https://issues.apache.org/jira/browse/HUDI-309 which we paused for now..
> > But that's more in line with what you are thinking IIUC
> >
> >
> > Thanks
> > vinoth
> >
> > On Mon, Jun 1, 2020 at 4:41 AM Mario de Sá Vera 
> > wrote:
> >
> > > Hi Balaji,
> > >
> > > Business metadata are all types of info related to the business where the
> > > Hudi solution is being used... from a COB (i.e. close of business date)
> > > related to that commit, to any qualifier related to that commit that might
> > > be useful to associate with that commit id. If we enable the timeline
> > > service metadata model to be extended, we could use the service instance
> > > itself to support specialised queries that involve business qualifiers in
> > > order to return a proper set of metadata pointing to the related commits
> > > that answer a business query.
> > >
> > > If we do not have that flexibility we might end up creating an external
> > > transaction log, and then comes the hard task of keeping that service in
> > > sync with the timeline service.
> > >
> > > let me know if that makes sense to you,
> > >
> > > Mario.
> > >
> > > On Mon, Jun 1, 2020 at 06:55, Balaji Varadarajan
> > >  wrote:
> > >
> > > >  Hi Mario,
> > > > The Timeline Server was designed to serve Hudi metadata for Hudi writers
> > > > and readers. It may not be suitable for serving arbitrary data. But it is
> > > > an interesting thought. Can you elaborate more on what kind of business
> > > > metadata you are looking for? Is this something you are planning to store
> > > > in commit files?
> > > > Balaji.V
> > > >
> > > > On Sunday, May 31, 2020, 04:22:27 PM PDT, Mario de Sá Vera <
> > > > desav...@gmail.com> wrote:
> > > >
> > > > I see a need for extending the current timeline server schema so that a
> > > > flexible model could be achieved in order to accommodate business
> > > > metadata.
> > > >
> > > > let me know if that makes sense to anyone here...
> > > >
> > > > Regards,
> > > >
> > > > Mario.
> > > >
> > >
> >
>
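
As a concrete illustration of the "extra metadata with each commit" capability
mentioned above, a minimal sketch against the write-client API. It assumes the
commit overload that accepts an extraMetadata map (the mechanism the delta streamer
uses for its checkpoints); the key name business.cob.date is purely hypothetical and
exact signatures may differ across versions:

```scala
import scala.collection.JavaConverters._
import org.apache.hudi.client.HoodieWriteClient
import org.apache.hudi.common.model.{HoodieRecord, HoodieRecordPayload}
import org.apache.hudi.common.util.Option
import org.apache.spark.api.java.JavaRDD

// Sketch: upsert a batch and atomically attach a business qualifier (e.g. a
// close-of-business date) to the commit as extra metadata.
def upsertWithBusinessMetadata[T <: HoodieRecordPayload[T]](
    client: HoodieWriteClient[T],
    records: JavaRDD[HoodieRecord[T]],
    instantTime: String,
    cobDate: String): Boolean = {
  val writeStatuses = client.upsert(records, instantTime)
  val extra = Map("business.cob.date" -> cobDate).asJava  // hypothetical key
  client.commit(instantTime, writeStatuses, Option.of(extra))
}
```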


Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-03 Thread Vinoth Chandar
Hi Raymond,

I am not sure generalizing this to all metadata - like errors and metrics -
would be a good idea. We can certainly implement logging errors to a common
errors Hudi table, with a certain schema. But these can be just regular
“hudi” format tables.

Unlike the timeline metadata, these are really external data, not related
to a given table’s core functioning.. we don’t necessarily want to keep one
error table per Hudi table..

Thoughts?
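
For illustration, a rough sketch of what such a regular "hudi"-format errors table
could look like when written through the Spark datasource. The table name, error
fields and partitioning choice are hypothetical, not an agreed design:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: append failed records, already flattened to a simple error schema
// (e.g. recordKey / partitionPath / errorMessage / payloadJson / instantTime),
// into a dedicated errors table that is itself just a normal Hudi table.
def writeErrorTable(errors: DataFrame, errorTablePath: String): Unit = {
  errors.write.format("hudi")
    .option("hoodie.table.name", "my_table_errors")                        // hypothetical name
    .option("hoodie.datasource.write.recordkey.field", "recordKey")
    .option("hoodie.datasource.write.partitionpath.field", "instantTime")  // partition by failed commit
    .option("hoodie.datasource.write.precombine.field", "instantTime")
    .mode(SaveMode.Append)
    .save(errorTablePath)
}
```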

On Tue, Jun 2, 2020 at 5:34 PM Shiyan Xu 
wrote:

> I also encountered use cases where I'd like to programmatically query
> metadata.
> +1 on the idea of format(“hudi-timeline”)
>
> I also feel that the metadata can be extended further to include more info
> like errors, metrics/write statistics, etc. Like the newly proposed error
> handling, we could also store all metrics or write stats there too, and
> relate them to the timeline actions.
>
> A potential use case could be: with all this info encapsulated within the
> metadata, we may be able to derive some insightful results (by checking
> against some benchmarks) and answer questions like: does table A need more
> tuning? Does table B exceed its error budget?
>
> Programmatic queries on this metadata can help in managing many tables for
> diagnosis and inspection. We may need different read formats like
> format("hudi-errors") or format("hudi-metrics").
>
> Sorry, this sidetracked from the original question.. These are really rough
> high-level thoughts, and may have signs of over-engineering. Would like to
> hear some feedback. Thanks.
>
>
>
>
> On Mon, Jun 1, 2020 at 9:28 PM Satish Kotha 
> wrote:
>
> > Got it. I'll look into implementation choices for creating a new data
> > source. Appreciate all the feedback.
> >
> > On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar  wrote:
> >
> > > >Is it to separate data and metadata access?
> > > Correct. We already have modes for querying data using format("hudi"). I
> > > feel it will get very confusing to mix data and metadata in the same
> > > source.. e.g. a lot of options we support for data may not even make
> > > sense for the TimelineRelation.
> > >
> > > >This class seems like a list of static methods, I'm not seeing where
> > these
> > > are accessed from
> > > That's the public API for obtaining this information for Scala/Java
> > > Spark. If you have a way of calling this from Python without painful
> > > bridges (e.g. Jython), it might be a tactical solution that can meet
> > > your needs.
> > >
> > > On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha
>  > >
> > > wrote:
> > >
> > > > Thanks for the feedback.
> > > >
> > > > What is the advantage of doing
> > > > spark.read.format(“hudi-timeline”).load(basepath) as opposed to adding
> > > > a new relation? Is it to separate data and metadata access?
> > > >
> > > > Are you looking for similar functionality as HoodieDatasourceHelpers?
> > > > >
> > > > This class seems like a list of static methods, I'm not seeing where
> > > these
> > > > are accessed from. But, I need a way to query metadata details easily
> > > > in pyspark.
> > > >
> > > >
> > > > On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar 
> > wrote:
> > > >
> > > > > Also please take a look at
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HUDI-309
> > > > > .
> > > > >
> > > > > This was an effort to make the timeline more generalized for
> querying
> > > > (for
> > > > > a different purpose).. but good to revisit now..
> > > > >
> > > > > On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org <
> > > vbal...@apache.org>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > I strongly recommend using a separate datasource relation (option 1)
> > > > > > to query the timeline. It is elegant and fits well with Spark APIs.
> > > > > > Thanks, Balaji.V
> > > > > >
> > > > > > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar  wrote:
> > > > > >
> > > > > >  Hi satish,
> > > > > >
> > > > > > Are you looking for similar functionality as
> > HoodieDatasourceHelpers?
> > > > > >
> > > > > > We have historically relied on the CLI to inspect the table, which
> > > > > > does not lend itself well to programmatic access.. Overall I like
> > > > > > option 1 - allowing the timeline to be queryable with a standard
> > > > > > schema does seem way nicer.
> > > > > >
> > > > > > I am wondering, though, if we should introduce a new view. Instead,
> > > > > > we can use a different data source name -
> > > > > > spark.read.format(“hudi-timeline”).load(basepath). We can start by
> > > > > > just allowing querying of the active timeline and expand this to the
> > > > > > archived timeline?
> > > > > >
> > > > > > What do others think?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 
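
For reference, a sketch contrasting what exists today (the HoodieDataSourceHelpers
class mentioned above, callable from Scala/Java) with the proposed - and not yet
implemented - "hudi-timeline" datasource. Method names are from memory of the helper
class, and the timeline schema/columns shown for the proposed source are purely
illustrative:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hudi.HoodieDataSourceHelpers
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("timeline-sketch").getOrCreate()
val basePath = "s3://bucket/hudi_table"  // hypothetical table path
val fs = FileSystem.get(new Path(basePath).toUri, spark.sparkContext.hadoopConfiguration)

// Today: programmatic access to completed instants via the helper class.
val latestInstant = HoodieDataSourceHelpers.latestCommit(fs, basePath)
val newerInstants = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "20200601000000")
println(s"latest=$latestInstant, since=$newerInstants")

// Proposed (does not exist yet): query the timeline with a standard schema,
// which would also be reachable from pyspark without a Java bridge.
val timelineDf = spark.read.format("hudi-timeline").load(basePath)
timelineDf.filter("action = 'commit'").show()  // "action" column is illustrative
```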
