Re: Weekly sync notes 20200225

2020-02-25 Thread Shiyan Xu
Link:
https://cwiki.apache.org/confluence/display/HUDI/20200225+Weekly+Sync+Minutes

On Tue, Feb 25, 2020 at 9:39 PM vbal...@apache.org 
wrote:

> Please find the weekly sync notes here
> 20200225 Weekly Sync Minutes - HUDI - Apache Software Foundation
>
> Thanks,
> Balaji.V


Weekly sync notes 20200225

2020-02-25 Thread vbal...@apache.org
Please find the weekly sync notes here
20200225 Weekly Sync Minutes - HUDI - Apache Software Foundation

Thanks,
Balaji.V

Re: Updating COW Table

2020-02-25 Thread leesf
You would pass it via option, like
option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(),
EmptyHoodieRecordPayload.class.getName())
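
[Editor's note: for context, a fuller sketch of such a DataSource write,
assuming the Hudi 0.5.x Java/Spark APIs; the record-key field, partition
field, table name, and base path below are made up for illustration.]

import org.apache.hudi.DataSourceWriteOptions;
import org.apache.hudi.common.model.EmptyHoodieRecordPayload;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PayloadOptionExample {
  // Upserts df into a Hudi table, overriding the merge behaviour by
  // passing a payload class through DataSourceWriteOptions.
  static void upsert(Dataset<Row> df, String basePath) {
    df.write()
      .format("org.apache.hudi")
      .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(),
          EmptyHoodieRecordPayload.class.getName())
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "id")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "dt")
      .option("hoodie.table.name", "example_table")
      .mode(SaveMode.Append)
      .save(basePath);
  }
}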

selvaraj periyasamy wrote on Wed, Feb 26, 2020 at 2:24 AM:

> OverwriteWithLatestAvroPayload is used for Delta Streamer. Is there a way
> for DataSource Writer?
>
> Please correct me if I am wrong.
>
> Thanks,
> Selva
>
>
> On Mon, Feb 24, 2020 at 1:15 PM Gary Li  wrote:
>
> > Hi, in this case you need to design your own logic to handle merging.
> > Please check the OverwriteWithLatestAvroPayload class. You can write your
> own
> > one and pass it as DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY to Hudi.
> >
> > On Mon, Feb 24, 2020 at 12:25 PM selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> > > Hi, I am experimenting with Hudi 0.5.0 for some of the update use
> > > cases.
> > >
> > > Our flow is as below
> > >
> > > RDBMS -> CDC Log -> Hive -> COW table.
> > >
> > > The CDC log for an update has values only for the primary key columns +
> > > the updated columns; the remaining column values are null. While
> > > upserting into the COW table, we need to update only the columns that
> > > changed and retain the values of the other columns. When I tested, Hudi
> > > set the remaining column values to null since the log has null values.
> > >
> > > Is there a way to merge rows so that only the columns which have
> > > values are updated?
> > >
> > > Thanks,
> > > Selva
> > >
> >
>


Re: Updating COW Table

2020-02-25 Thread selvaraj periyasamy
OverwriteWithLatestAvroPayload is used for Delta Streamer. Is there a way
for DataSource Writer?

Please correct me if I am wrong.

Thanks,
Selva


On Mon, Feb 24, 2020 at 1:15 PM Gary Li  wrote:

> Hi, in this case you need to design your own logic to handle merging.
> Please check the OverwriteWithLatestAvroPayload class. You can write your own
> one and pass it as DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY to Hudi.
>
> On Mon, Feb 24, 2020 at 12:25 PM selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
> > Hi, I am experimenting with Hudi 0.5.0 for some of the update use
> > cases.
> >
> > Our flow is as below
> >
> > RDBMS -> CDC Log -> Hive -> COW table.
> >
> > The CDC log for an update has values only for the primary key columns +
> > the updated columns; the remaining column values are null. While
> > upserting into the COW table, we need to update only the columns that
> > changed and retain the values of the other columns. When I tested, Hudi
> > set the remaining column values to null since the log has null values.
> >
> > Is there a way to merge rows so that only the columns which have values
> > are updated?
> >
> > Thanks,
> > Selva
> >
>
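
[Editor's note: a minimal sketch of the custom payload Gary Li suggests --
merge field by field, keeping the stored value wherever the incoming CDC
row is null. The class name is hypothetical, and the Option/signature
details follow the 0.5.x hudi-common API as best recalled here; verify them
against your Hudi version.]

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

public class PartialUpdateAvroPayload extends OverwriteWithLatestAvroPayload {

  public PartialUpdateAvroPayload(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
      Schema schema) throws IOException {
    Option<IndexedRecord> incomingOpt = getInsertValue(schema);
    if (!incomingOpt.isPresent()) {
      return Option.empty();
    }
    GenericRecord incoming = (GenericRecord) incomingOpt.get();
    GenericRecord current = (GenericRecord) currentValue;
    // Field-by-field merge: keep the stored value when the CDC row has null.
    for (Schema.Field field : schema.getFields()) {
      if (incoming.get(field.pos()) == null) {
        incoming.put(field.pos(), current.get(field.pos()));
      }
    }
    return Option.of(incoming);
  }
}

Passing it would then be a matter of
option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(),
PartialUpdateAvroPayload.class.getName()), as in the reply above.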


Re: HudiDeltaStreamer on EMR

2020-02-25 Thread Raghvendra Dhar Dubey


Got it, Shiyan. Thanks.
On 2020/02/24 19:15:52, Shiyan Xu  wrote: 
> It's likely that the source parquet data has a column of Spark Timestamp
> type, which is not convertible to Avro.
> By the way, ParquetDFSSource is not available in 0.5.0; it was only added
> in 0.5.1. You'll probably need to add a custom class which follows its
> existing implementation, and get rid of it once EMR upgrades the Hudi
> version.
> 
> On Mon, Feb 24, 2020 at 10:41 AM Raghvendra Dhar Dubey
>  wrote:
> 
> > Hi Team,
> >
> > I was trying to use HudiDeltaStreamer on EMR, which reads parquet data
> > from S3 and writes it into a Hudi dataset, but I am running into an issue
> > where AvroSchemaConverter is not able to convert INT96 ("INT96 not yet
> > implemented"). The spark-submit command that I am using is:
> >
> > spark-submit \
> >   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> >   --packages org.apache.spark:spark-avro_2.11:2.4.4 \
> >   --master yarn --deploy-mode client \
> >   /usr/lib/hudi/hudi-utilities-bundle-0.5.0-incubating.jar \
> >   --storage-type COPY_ON_WRITE \
> >   --source-ordering-field action_date \
> >   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> >   --target-base-path s3://xxx/hudi_table \
> >   --target-table hudi_table \
> >   --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
> >   --hoodie-conf hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://xxx/Hoodi/
> >
> > The error I am getting is:
> >
> > Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> > to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
> > Lost task 0.3 in stage 0.0 (TID 3, ip-172-30-37-9.ec2.internal, executor 1):
> > java.lang.IllegalArgumentException: INT96 not yet implemented.
> >   at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
> >   at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
> >   at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:297)
> >   at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263)
> >   at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241)
> >   at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231)
> >   at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
> >   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
> >   at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
> >   at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> >   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:199)
> >   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:196)
> >   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:151)
> >   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> >   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> >   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >   at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> >   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > Please help me with this.
> >
> > Thanks
> > Raghvendra
> >
> 
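
[Editor's note: one possible workaround, sketched under the assumption that
Spark itself reads the INT96 timestamps fine and the failure is confined to
the parquet-avro conversion inside the DeltaStreamer source: pre-convert the
timestamp columns with a small Spark job and point DeltaStreamer at the
rewritten data. The paths and column name are hypothetical.]

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Int96Workaround {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().appName("int96-workaround").getOrCreate();
    // Spark understands INT96 natively; cast the timestamp to a string
    // (or a long) so the parquet-avro converter downstream never sees INT96.
    Dataset<Row> df = spark.read().parquet("s3://bucket/source/");
    df.withColumn("action_date", col("action_date").cast("string"))
      .write().mode("overwrite").parquet("s3://bucket/converted/");
    spark.stop();
  }
}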


Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

2020-02-25 Thread Balaji Varadarajan
+1. Let's do it :)

Balaji.V

On Mon, Feb 24, 2020 at 6:36 PM Shiyan Xu 
wrote:

> +1, great reading and valuable!
>
> On Mon, 24 Feb 2020, 15:31 nishith agarwal,  wrote:
>
> > +100
> > - Reduces index lookup time hence improves job runtime
> > - Paves the way for streaming style ingestion
> > - Eliminates the dependency on HBase (the alternate "global index" support
> > at the moment)
> >
> > -Nishith
> >
> > On Mon, Feb 24, 2020 at 10:56 AM Vinoth Chandar 
> wrote:
> >
> > > +1 from me as well. This will be a product-defining feature, if we can
> > > do it.
> > >
> > > On Sun, Feb 23, 2020 at 6:27 PM vino yang 
> wrote:
> > >
> > > > Hi Sivabalan,
> > > >
> > > > Thanks for your proposal.
> > > >
> > > > Big +1 from my side; indexing at record granularity is really good
> > > > for performance. It is also a step towards streaming processing.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > Sivabalan wrote on Sun, Feb 23, 2020 at 12:52 AM:
> > > >
> > > > > As Apache Hudi is getting widely adopted, performance has become
> > > > > the need of the hour. This RFC focuses on improving the performance
> > > > > of the Hudi index by introducing a record-level index. The proposal
> > > > > is to implement a new index format that is a mapping of
> > > > > (recordKey <-> partition, fileId) or ((recordKey, partitionPath) →
> > > > > fileId). This mapping will be stored and maintained by Hudi as
> > > > > another implementation of HoodieIndex. This record-level indexing
> > > > > will definitely give a boost to both read and write performance.
> > > > >
> > > > > Here is the link to the RFC:
> > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets
> > > > >
> > > > > Appreciate your review and thoughts.
> > > > >
> > > > > --
> > > > > Regards,
> > > > > -Sivabalan
> > > > >
> > > >
> > >
> >
>
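
[Editor's note: a toy illustration of the mapping the RFC proposes,
(recordKey, partitionPath) -> fileId, kept in a plain in-memory map purely
to show the lookup shape; this is not Hudi's HoodieIndex API or the RFC's
proposed storage format.]

import java.util.HashMap;
import java.util.Map;

public class RecordLevelIndexSketch {
  // key: "partitionPath|recordKey", value: fileId of the file group that
  // currently holds the record.
  private final Map<String, String> keyToFileId = new HashMap<>();

  void register(String recordKey, String partitionPath, String fileId) {
    keyToFileId.put(partitionPath + "|" + recordKey, fileId);
  }

  // A record tagged this way goes straight to the right file group on
  // upsert, skipping the bloom-filter/range pruning the current index does.
  String locate(String recordKey, String partitionPath) {
    return keyToFileId.get(partitionPath + "|" + recordKey);
  }
}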


Re: [DISCUSS] Support for complex record keys with TimestampBasedKeyGenerator

2020-02-25 Thread Balaji Varadarajan
See if you can have a generic implementation where individual fields in the 
partition-path can be configured with their own key-generator class. Currently, 
TimestampBasedKeyGenerator is the only type specific custom generator. If we 
are anticipating more such classes for specialized types, you can use a generic 
way to support overriding key-generator for individual partition-fields once 
and for all.
Balaji.V

On Monday, February 24, 2020, 03:09:02 AM PST, Pratyaksh Sharma wrote:

Hi,

We have TimestampBasedKeyGenerator for defining custom partition paths and
we have ComplexKeyGenerator for supporting having combination of fields as
record key or partition key.

However, we do not have support for the case where one wants a combination
of fields as the record key along with the ability to define custom
partition paths. This use case recently came up at my organisation.

How about having a CustomTimestampBasedKeyGenerator which supports the above
use case? This class can simply extend TimestampBasedKeyGenerator and allow
users to have a combination of fields as the record key.

Open to hearing others' opinions.
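
[Editor's note: a rough sketch of the proposed class, assuming the 0.5.x
KeyGenerator contract (getKey(GenericRecord) returning a HoodieKey). The
class itself is the hypothetical one proposed above; the import paths and
the "field:value" record-key format mirror ComplexKeyGenerator from memory
and may differ in your release.]

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator;

public class CustomTimestampBasedKeyGenerator extends TimestampBasedKeyGenerator {

  private final List<String> recordKeyFields;

  public CustomTimestampBasedKeyGenerator(TypedProperties props) {
    super(props);
    this.recordKeyFields = Arrays.asList(
        props.getString("hoodie.datasource.write.recordkey.field").split(","));
  }

  @Override
  public HoodieKey getKey(GenericRecord record) {
    // Partition path: reuse the timestamp-based logic from the parent class.
    String partitionPath = super.getKey(record).getPartitionPath();
    // Record key: combine several fields, ComplexKeyGenerator-style.
    String recordKey = recordKeyFields.stream()
        .map(f -> f + ":" + record.get(f))
        .collect(Collectors.joining(","));
    return new HoodieKey(recordKey, partitionPath);
  }
}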
  

Weekly sync on Zoom

2020-02-25 Thread Vinoth Chandar
Hello all,

Given the woes we face with Hangouts now and then, we are going to try Zoom
for today's meeting.

Details here:
https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+Community+Weekly+Sync


Thanks
Vinoth


Re: [DISCUSS] Adding common errors and solutions to FAQs

2020-02-25 Thread Pratyaksh Sharma
Sure, will take a look whenever I get time.

On Tue, Feb 25, 2020 at 12:24 PM Vinoth Chandar  wrote:

> +1
>
> Thanks Pratyaksh! Will take a look and structure it accordingly. Might be
> worth looking at the last 30 days of Slack and the mailing lists to see if
> we can add more.
>
>
> On Mon, Feb 24, 2020 at 4:34 AM Pratyaksh Sharma 
> wrote:
>
> > Hi Vinoth,
> >
> > Have added a few more issues which I faced while adopting Hudi. Please
> > have a look.
> >
> > I guess everyone in the community should make it a habit of adding the
> > errors/issues they face to this page. It would be really useful for
> > others too. Further, we will have a consolidated page for all errors,
> > and like Siva mentioned, it will be easier to fix common issues.
> >
> > On Fri, Feb 21, 2020 at 11:33 PM Vinoth Chandar 
> wrote:
> >
> > > Thanks Pratyaksh! Do you have any suggestions on priming this page with
> > > many more common issues?
> > >
> > > On Thu, Feb 20, 2020 at 12:54 AM Pratyaksh Sharma <
> pratyaks...@gmail.com
> > >
> > > wrote:
> > >
> > > > Here is the link to troubleshooting guide -
> > > >
> https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide
> > .
> > > >
> > > > Suggestions are welcome.
> > > >
> > > > On Wed, Feb 19, 2020 at 12:36 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > you have it now! Thanks for driving this, Pratyaksh!
> > > > >
> > > > > On Mon, Feb 17, 2020 at 5:51 AM Pratyaksh Sharma <
> > > pratyaks...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I can help do the initial setup of the troubleshooting guide if
> > > > > > no one else is willing to drive this. Need perms for the same.
> > > > > >
> > > > > > On Mon, Feb 10, 2020 at 12:48 PM Vinoth Chandar <
> vin...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > > We have a Tuning guide. How about a separate troubleshooting
> > > > > > > guide? (This way we can keep the FAQ high level.)
> > > > > > > Any volunteers to drive this? We can keep it updated as new
> > > > > > > issues come up here or elsewhere.
> > > > > > >
> > > > > > > On Sun, Feb 9, 2020 at 3:25 AM Pratyaksh Sharma <
> > > > pratyaks...@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > This would be a valuable addition to our FAQs page. I also
> > > > > > > > have a few errors that I faced while adopting Hudi; I can
> > > > > > > > help add all of them.
> > > > > > > >
> > > > > > > > On Sun, Feb 9, 2020 at 4:12 PM vino yang <
> > yanghua1...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 from my side, it is valuable.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Vino
> > > > > > > > >
> > > > > > > > > > leesf wrote on Sun, Feb 9, 2020 at 11:25 AM:
> > > > > > > > >
> > > > > > > > > > Hi Sivabalan,
> > > > > > > > > >
> > > > > > > > > > Thanks for bringing up the discussion.
> > > > > > > > > >
> > > > > > > > > > +1 to adding a section (Common errors and solutions) to
> > > > > > > > > > the FAQ. It will be more convenient to find and solve
> > > > > > > > > > problems once they are summarized on the FAQ page; also,
> > > > > > > > > > the slack channel (#general) will be forwarded to the ML
> > > > > > > > > > as discussed.
> > > > > > > > > >
> > > > > > > > > > Sivabalan wrote on Sun, Feb 9, 2020 at 1:40 AM:
> > > > > > > > > >
> > > > > > > > > > > Hi folks,
> > > > > > > > > > > We come across some common errors in our slack channel
> > > > > > > > > > > which someone might have encountered before, and I
> > > > > > > > > > > really appreciate the folks in our community helping
> > > > > > > > > > > them actively. Can we add a section to our FAQ
> > > > > > > > > > > documenting common errors and how to debug them, so
> > > > > > > > > > > that new folks can ping in slack if they don't find
> > > > > > > > > > > them already in that section? Also, it gives us an
> > > > > > > > > > > opportunity to think about improving hudi when we look
> > > > > > > > > > > back at some of the common errors that users encounter.
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > > -Sivabalan
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Consider defaultValue of field when writing to Hudi dataset

2020-02-25 Thread Pratyaksh Sharma
Hi Vinoth,

> in avro you define it as an optional field (union of type and null)..
Yes, that is correct. But imagine that someone does not want to populate
null and instead wants to populate the field's default value, which is a
very common case.

> seems like it's being copied over?
When creating the writer schema, we are copying the Avro fields properly.
However, at the time of actually populating values as per the new writer
schema, we are not taking care of default values, like I pointed out in my
previous mail.

> Nonetheless, we should copy over the default values, if that code is not
> doing so.
Yeah, will raise a jira for the same.


On Tue, Feb 25, 2020 at 12:21 PM Vinoth Chandar  wrote:

> IIUC the link between backwards compatibility and default values for fields
> in schema is that, in avro you define it as an optional field (union of
> type and null).. Not sure if it has anything to do with default values.
>
> Nonetheless, we should copy over the default values, if that code is not
> doing so.
>
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L124
> seems like it's being copied over?
>
> On Mon, Feb 24, 2020 at 4:21 AM Pratyaksh Sharma 
> wrote:
>
> > Hi,
> >
> > Currently we recommend users to evolve schemas in a backwards-compatible
> > way. When one is trying to do that, one of the most significant things
> > to do is to define a default value for newly added columns, so that
> > records published with the previous schema can also be consumed properly.
> >
> > However, just before actually writing a record to a Hudi dataset, we try
> > to rewrite the record with a new Avro schema which has the Hudi metadata
> > columns [1]. In this function, we only try to get the values from the
> > record, without considering the field's default value. As a result,
> > schema validation fails. In essence, this feels like we are not even
> > respecting backwards-compatible schema changes.
> > IMO, this piece of code should take the default value into account as
> > well when the field's actual value is null.
> >
> > Open to hearing others' thoughts.
> >
> > [1]
> >
> >
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205
> > .
> >
>
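
[Editor's note: a small standalone sketch of the fix being discussed --
when rewriting a record against a new writer schema, fall back to the
field's declared Avro default instead of writing null. Only plain Avro
generic APIs are used; this illustrates the idea and is not the
HoodieAvroUtils code itself.]

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class DefaultValueRewrite {

  // Copy values from oldRecord into a record of newSchema; where the old
  // record has no value, resolve the new field's declared default.
  static GenericRecord rewrite(GenericRecord oldRecord, Schema newSchema) {
    GenericRecord newRecord = new GenericData.Record(newSchema);
    for (Schema.Field field : newSchema.getFields()) {
      Schema.Field oldField = oldRecord.getSchema().getField(field.name());
      Object value = (oldField != null) ? oldRecord.get(field.name()) : null;
      if (value == null && field.defaultVal() != null) {
        // getDefaultValue materializes the default declared in the schema.
        value = GenericData.get().getDefaultValue(field);
      }
      newRecord.put(field.name(), value);
    }
    return newRecord;
  }
}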