20200721 Weekly Sync Minutes

2020-07-21 Thread vbal...@apache.org
It was a very short meeting. The major highlight: AWS Athena now officially
supports Apache Hudi as a queryable source.
https://cwiki.apache.org/confluence/display/HUDI/20200721+Weekly+Sync+Minutes

Thanks,
Balaji.V


Re: Date handling in HUDI

2020-07-21 Thread Mehrotra, Udit
Hi Tanu/Balaji,

I have not really faced the issue mentioned here. AFAIK, the Date and Timestamp
types should work fine. The logical Date type is represented as INT in Avro,
which is why you see an integer ingested there
(https://avro.apache.org/docs/current/spec.html#Date). But it should not have
any impact on querying, and Spark should be able to determine the Date from
that.
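
For reference, a minimal sketch of that round trip (Scala, spark-shell; the
path and column names are illustrative), showing a DateType column surviving a
plain parquet write even though it is physically stored as an integer:

import java.sql.Date
import spark.implicits._

val df = Seq((1, Date.valueOf("2020-07-21"))).toDF("id", "event_date")
df.write.mode("overwrite").parquet("/tmp/date_roundtrip")

// event_date should come back as DateType, not IntegerType
spark.read.parquet("/tmp/date_roundtrip").printSchema()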

In addition to the information requested by Gary, could you possibly open a
GitHub issue with details about the environment where you are running
Hudi/Spark, and maybe a small example that can reproduce this issue?

Thanks,
Udit

On 7/21/20, 11:06 AM, "Gary Li"  wrote:

Hi Tanu,

This seems like a Spark-Parquet type conversion issue. I use the timestamp
type and don't have any issue with it.

Would you try the following and provide more context?
- Save your dataframe as plain parquet instead of Hudi to see if the issue
still persists.
- Try the timestamp type instead of date.
- Are you querying a MOR table using Spark SQL?
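
A minimal sketch of the first two checks (Scala; the paths are illustrative,
assuming a DataFrame df with the problematic date column):

// 1. Write plain parquet, bypassing Hudi, to isolate where the issue appears.
df.write.mode("overwrite").parquet("/tmp/plain_parquet_check")
spark.read.parquet("/tmp/plain_parquet_check").printSchema()

// 2. Try timestamp instead of date.
import org.apache.spark.sql.functions.col
val dfTs = df.withColumn("event_ts", col("event_date").cast("timestamp"))
dfTs.printSchema()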

Thanks,
Gary

On Tue, Jul 21, 2020 at 1:23 AM tanu dua  wrote:

> Thanks, and even I am struggling with all data types except String, with
> the same decode exception. For example, I got the exception for both double
> and int, and when I convert to String everything works fine in Spark SQL.
>
> On Tue, 21 Jul 2020 at 1:38 PM, Balaji Varadarajan
>  wrote:
>
> >
> > Gary/Udit,
> > As you are familiar with this part, can you please answer this question?
> >
> > Thanks, Balaji.V
> >
> > On Monday, July 20, 2020, 08:18:16 AM PDT, tanu dua <
> > tanu.dua...@gmail.com> wrote:
> >
> > Hi Guys,
> > May I know how you handle date and timestamp types in Hudi?
> > When I set the DataType as Date in my StructType, it gets ingested as int,
> > but when I query using Spark SQL I get the following:
> >
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-17557
> >
> > So I am not sure if it's only me who faces this. Do I need to change to
> > String?
>



Re: the contributor permission

2020-07-21 Thread vbal...@apache.org
 
Welcome to Hudi. I have added your JIRA id.

Balaji.V

On Tuesday, July 21, 2020, 10:19:21 AM PDT, zjing...@sina.com wrote:
 
Hi, I want to contribute to Apache Hudi. Would you please give me
contributor permission? My JIRA ID is AndyZhang0419.

the contributor permission

2020-07-21 Thread zjing627
Hi, I want to contribute to Apache Hudi. Would you please give me
contributor permission? My JIRA ID is AndyZhang0419.

Re: Kafka Hudi pipeline design

2020-07-21 Thread Balaji Varadarajan
Please see answers inline...

On Sunday, July 19, 2020, 10:08:09 PM PDT, Lian Jiang wrote:
 
Hi,

I have a Kafka topic using a Kafka S3 connector to dump data into S3 hourly in
parquet format. These parquet files are partitioned by ingestion time, and
each record has fields which are deeply nested JSONs. Each record is a
monolithic blob containing multiple events, each with its own event time. This
causes two issues: 1. slow queries by event time; 2. hard to use due to many
levels of exploding. I plan to use the design below to solve these problems.

In this design, I still use the S3 parquet dumped by the Kafka S3 connector as
a backfill for the Hudi pipeline. This is because the S3 connector pipeline is
easier than the Hudi pipeline to set up and will work before the Hudi pipeline
is working. Also, the S3 connector pipeline may be more reliable than the Hudi
pipeline due to potential bugs in DeltaStreamer.

The DeltaStreamer will decompose the monolithic Kafka record into multiple
event streams. Each event stream is written into one Hudi dataset partition
and sorted by its corresponding event time. Such Hudi datasets are synced with
Hive, which is exposed for user queries so that users don't need to care
whether the underlying table format is parquet or Hudi. Hopefully, such a
design improves query performance, since the dataset is partitioned and sorted
by event time as opposed to Kafka ingest time. The user experience is also
improved by querying the extracted events.
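
A hypothetical sketch of that decomposition step (Scala; the record layout,
column names, and event type are illustrative, not the actual schema):

import org.apache.spark.sql.functions.{col, explode}

// Explode one event type out of the monolithic record into its own stream,
// sorted by its event time before writing to the corresponding Hudi dataset.
val clickEvents = raw.
  select(col("record_id"), explode(col("payload.click_events")).as("evt")).
  select(col("record_id"), col("evt.*")).
  sort("event_time")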

Let us know if there are any issues with DeltaStreamer for it to be used in
the first stage. If you want to faithfully append event stream logs to S3
before you materialize them in a different order, you can try the "insert"
mode in Hudi, which gives you small-file handling.
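
A minimal sketch of such an insert-mode write (Scala; the table name, key
fields, and path are illustrative):

df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.operation", "insert").
  option("hoodie.datasource.write.recordkey.field", "event_id").
  option("hoodie.datasource.write.partitionpath.field", "event_type").
  option("hoodie.table.name", "events_raw").
  mode("append").
  save("s3://bucket/path/events_raw")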

Questions:

1. Do you see any issue with DeltaStreamer handling both streaming and
backfill at the same time? I know a Hudi dataset cannot be written by multiple
writing clients simultaneously. Also, I don't want the DeltaStreamer to stop
handling the streaming data while doing backfill. The DeltaStreamer will use
dynamic allocation; assuming the cluster has enough capacity, the load caused
by backfill should not be an issue.

With 0.6, we are planning to allow multiple writers as long as there is a
guarantee that the writers will be writing to different partitions. I think
this will fit your requirement and also keep one timeline.

2. If I want to time travel to a previous point (e.g. 11:00:00 AM PST on the
first day of last month), how can I keep Hudi 1 and Hudi 2 (... Hudi n) in
sync? AFAIK, Hudi time travel is done by commit instead of timestamp. Should I
do the below:
 a. list the commits of these Hudi datasets,
 b. find the commits that are close to each other and closest to the desired
timestamp,
 c. apply time travel to each Hudi dataset.
Is there an easier and more accurate way? Will Hudi support time travel by
timestamp in the future as Delta Lake does?


Commit time is like a timestamp, although in a specific format (seconds
granularity). It should be straightforward to reformat a timestamp into a
commit time and then use it in the WHERE clause. But I have opened a ticket,
https://issues.apache.org/jira/browse/HUDI-1116, to track this request. My
initial thinking is this should not be hard to support.
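
A minimal sketch of that reformatting (Scala; this assumes Hudi's
yyyyMMddHHmmss commit-time format, and the table path/glob is illustrative):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Reformat the desired wall-clock instant into a commit time...
val asOf = LocalDateTime.of(2020, 7, 1, 11, 0, 0)
val commitTime = asOf.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))

// ...and filter on the _hoodie_commit_time meta column in the WHERE clause.
spark.read.format("org.apache.hudi").load("s3://bucket/path/hudi_1/*").
  where(s"_hoodie_commit_time <= '$commitTime'").
  show()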

Balaji.V  

Re: Date handling in HUDI

2020-07-21 Thread tanu dua
Thanks, and even I am struggling with all data types except String, with the
same decode exception. For example, I got the exception for both double and
int, and when I convert to String everything works fine in Spark SQL.

On Tue, 21 Jul 2020 at 1:38 PM, Balaji Varadarajan wrote:

>
> Gary/Udit,
> As you are familiar with this part, can you please answer this question?
>
> Thanks, Balaji.V
>
> On Monday, July 20, 2020, 08:18:16 AM PDT, tanu dua <
> tanu.dua...@gmail.com> wrote:
>
> Hi Guys,
> May I know how you handle date and timestamp types in Hudi?
> When I set the DataType as Date in my StructType, it gets ingested as int,
> but when I query using Spark SQL I get the following:
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-17557
>
> So I am not sure if it's only me who faces this. Do I need to change to
> String?


Re: Date handling in HUDI

2020-07-21 Thread Balaji Varadarajan
 
Gary/Udit,
As you are familiar with this part, can you please answer this question?

Thanks, Balaji.V

On Monday, July 20, 2020, 08:18:16 AM PDT, tanu dua wrote:
 
Hi Guys,
May I know how you handle date and timestamp types in Hudi?
When I set the DataType as Date in my StructType, it gets ingested as int, but
when I query using Spark SQL I get the following:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-17557

So I am not sure if it's only me who faces this. Do I need to change to String?
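
For reference, a minimal sketch of the kind of schema being described (Scala;
field names are illustrative):

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, DateType}

// A StructType with an explicit DateType column, as in the question above.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("event_date", DateType, nullable = true)
))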