Re: Question

2020-03-18 Thread Vinoth Chandar
Hi Syed,

Please join the mailing list, so your responses make it here without needing
approval.

I am sure there is something odd going on here. A few things to check:

- Hudi does use memory for caching inputs and computing heuristics. I have
seen slowness being caused by insufficient executor memory. Can you try a
larger heap size and tuning GC? (explained in
https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide)
- There is also a performance bug we fixed in 0.5.2. Can you try
setting hoodie.memory.merge.max.size=2147483648 (2GB of merge memory)?
(The initial load should be just doing an insert, so this may be unrelated;
still something to keep in mind. A sketch of passing these configs follows
below.)

If you can open a GitHub issue with the Spark UI screenshot, data size,
etc., happy to take a look.

thanks
vinoth




On Wed, Mar 18, 2020 at 4:37 PM Syed Zaidi 
wrote:

> Hi Udit,
>
> Thanks for your recommendation. I was able to get the jars for 0.5.1. As a
> test we ran Hudi against a small dataset (~2 million rows with 80 columns)
> in a Parquet file on 10 executors (m5.xlarge). The initial load itself
> is taking 2+ hours. Do you have any suggestions on the settings I can
> update to speed up the process?
>
> Thanks
> Syed Zaidi
>
> 
> From: Mehrotra, Udit 
> Sent: Tuesday, March 17, 2020 8:08 PM
> To: dev@hudi.apache.org ; Syed Zaidi <
> syedmusaza...@hotmail.com>
> Subject: Re: Question
>
> Hi Zaidi,
>
> You should be able to use Hudi 0.5.1 in the next EMR release, which should
> be out fairly soon, but we can't give you an ETA. Meanwhile, there is nothing
> really stopping you from building your Hudi 0.5.1 jars and replacing the ones
> on the EMR cluster. The jars are located on the master node at /usr/lib/hudi/.
> Just replace the 0.5.0 jars there and have the symlink jars point to your
> 0.5.1 jars.
>
> Thanks,
> Udit Mehrotra
> SDE | AWS EMR
>
> On 3/17/20, 5:34 PM, "Syed Zaidi"  wrote:
>
>
> Hi,
>
> AWS EMR emr-5.29.0 comes with Spark 2.4.4 and Hudi 0.5.0 (
> hudi-hadoop-mr-bundle-0.5.0-incubating.jar). Version 0.5.1 adds new
> options for reading AWS DMS change logs using DeltaStreamer. Do you
> have any idea when AWS will support the newer version of Hudi? What
> options do I have to upgrade Hudi to the latest version while creating the
> EMR cluster, so the AWS DMS payload is supported out of the box?
>
> Would appreciate your feedback in this regard.
>
> Thanks
> Syed Zaidi
>
>
>


Re: Question

2020-03-18 Thread Syed Zaidi
Hi Udit,

Thanks for your recommendation. I was able to get the jars for 0.5.1. As a test
we ran Hudi against a small dataset (~2 million rows with 80 columns) in a
Parquet file on 10 executors (m5.xlarge). The initial load itself is
taking 2+ hours. Do you have any suggestions on the settings I can update to
speed up the process?

Thanks
Syed Zaidi


From: Mehrotra, Udit 
Sent: Tuesday, March 17, 2020 8:08 PM
To: dev@hudi.apache.org ; Syed Zaidi 

Subject: Re: Question

Hi Zaidi,

You should be able to use Hudi 0.5.1 in the next EMR release, which should be
out fairly soon, but we can't give you an ETA. Meanwhile, there is nothing really
stopping you from building your Hudi 0.5.1 jars and replacing the ones on the EMR
cluster. The jars are located on the master node at /usr/lib/hudi/. Just
replace the 0.5.0 jars there and have the symlink jars point to your 0.5.1 jars.

Thanks,
Udit Mehrotra
SDE | AWS EMR

On 3/17/20, 5:34 PM, "Syed Zaidi"  wrote:


Hi,

AWS EMR emr-5.29.0 comes with Spark 2.4.4 and Hudi 0.5.0 (
hudi-hadoop-mr-bundle-0.5.0-incubating.jar). Version 0.5.1 adds new
options for reading AWS DMS change logs using DeltaStreamer. Do you have
any idea when AWS will support the newer version of Hudi? What options do I
have to upgrade Hudi to the latest version while creating the EMR cluster, so
the AWS DMS payload is supported out of the box?

Would appreciate your feedback in this regard.

Thanks
Syed Zaidi




Re: [NOTIFICATION] Hudi 0.5.2 Release Daily Report-20200318

2020-03-18 Thread vino yang
Hi Vinoth,

>>We may need to revert HUDI-676[3]: Address issues towards removing use of
WIP Disclaimer
>>I think we should address the feedback and ensure VOTE passed with "non
>>WIP" disclaimer.. the WIP disclaimer cannot be retained forever and needs
>>to be fixed before graduation IIUC.

So you mean we should keep the standard disclaimer? Then we can discuss it
during the rc2 vote, if someone raises this issue again. WDYT?

>>As you bundled several ASF projects that have NOTICE files,
>>Again, we are back to binary vs source releases.. the source releases we
>>make DO NOT bundle anything.. So we should not have to add anything to
>>NOTICE. The same link mentioned also states, "don't include stuff
>>unnecessarily" (not an exact quote). Do maven jars count as binary distribution
>>of software? I suggest we file a LEGAL issue, get it stamped by the right
>>folks and do that.  This can be a good strategy, in addition to looking at
>>other projects.

So, the question remains how to interpret the word "*bundle*".

We have some weird statements in the LICENSE file; one of them is excerpted
as follows:

    This product includes code from Apache Hive.
    org.apache.hadoop.hive.ql.io.CombineHiveInputFormat copied to
    org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
    Copyright: 2011-2019 The Apache Software Foundation
    Home page: http://hive.apache.org/
    License: http://www.apache.org/licenses/LICENSE-2.0

Regarding "*This product includes code from *", I don't seem to see this
pattern from other projects.

So the question here is: even if we only publish the source code, is this
behavior considered bundling other projects (we do copy source code from
other projects, and that source code is obviously one of their assets)?
Unfortunately, I did not find a clear clarification on the official Apache
website. Obviously, there are two possibilities here:


   - This is a kind of bundling, but we are not clear on it or there is a
   deviation in our understanding;
   - This is not bundling, but there was a misunderstanding during voting,
   and we have not clarified it;


In short, our LICENSE contains a relatively uncommon description, "*This
product includes code from *". The main problem here is the
information asymmetry between us and the IPMC. I don't think it makes much
sense to keep referring to other projects or interpreting the official
documents. A more efficient way is to communicate directly with Justin to
clarify: understand his thoughts and express our confusion.

What do you think?

Best,
Vino


On Thu, Mar 19, 2020 at 12:05 AM Vinoth Chandar  wrote:

> Thanks for the update, vino!
>
> here's the -1 vote feedback for everyone's context..
>
> As you bundled several ASF projects that have NOTICE files, their NOTICE
> > files need to be examined and parts added to your NOTICE file. [1]
> > License is missing information for this file copyright Twitter [3]
> > Perhaps you should consider using the work in progress disclaimer. [2]
> > Thanks,
> > Justin
> > 1. http://www.apache.org/dev/licensing-howto.html#alv2-dep
> > 2.
> >
> https://incubator.apache.org/policy/incubation.html#work_in_progress_disclaimer
>
>
>
> >>We may need to revert HUDI-676[3]: Address issues towards removing use of
> WIP Disclaimer
> I think we should address the feedback and ensure VOTE passed with "non
> WIP" disclaimer.. the WIP disclaimer cannot be retained forever and needs
> to be fixed before graduation IIUC.
>
> >>As you bundled several ASF projects that have NOTICE files,
> Again, we are back to binary vs source releases.. the source releases we
> make DO NOT bundle anything.. So we should not have to add anything to
> NOTICE. The same link mentioned also states, "don't include stuff
> unnecessarily" (not an exact quote). Do maven jars count as binary distribution
> of software? I suggest we file a LEGAL issue, get it stamped by the right
> folks and do that.  This can be a good strategy, in addition to looking at
> other projects.
>
> On Wed, Mar 18, 2020 at 3:27 AM vino yang  wrote:
>
> > Hi all,
> >
> > We encountered some issues while voting RC1 on general@[1], so we
> canceled
> > the vote for rc1. The blocker issues we are currently addressing are:
> >
> > * HUDI-720: NOTICE file needs to add more content based on the NOTICE files
> > of the ASF projects that hudi bundles (I have opened a PR[2] to fix it)
> > * We may need to revert HUDI-676[3]: Address issues towards removing use of
> > WIP Disclaimer
> >
> > I will prepare 0.5.2 RC2 ASAP after fixing them. Please be patient.
> >
> > Best,
> > Vino
> >
> > [1]:
> > http://mail-archives.apache.org/mod_mbox/incubator-general/202003.mbox/%3C932F44A0-1CEE-4549-896B-70FB61EAA034%40classsoftware.com%3E
> > [2]: https://github.com/apache/incubator-hudi/pull/1417
> > [3]: https://github.com/apache/incubator-hudi/pull/1386

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Balajee Nagasubramaniam
Hi Prashant,

Regarding clean vs rollback/restoreToInstant: if you think of all the
commits/datafiles in the active timeline as a queue of items,
rollback/restoreToInstant works on the head of the queue, whereas
clean works on the tail of the queue. They should be treated as
two independent operations on the queue. At the datafile/file-slice level, if
the cleaner is configured to maintain 3 versions of a file, then you can
roll back at most the 2 most recent versions. Hope this helps.

Thanks,
Balajee

On Wed, Mar 18, 2020 at 11:54 AM Prashant Wason 
wrote:

> Thanks for the info Vinoth / Balaji.
>
> To me it feels like a split between the easier-to-understand design and
> the current implementation. I feel it is simpler to reason (based on how file
> systems work in general) that restoreToInstant is a complete point-in-time
> shift to the past (like restoring a file system from a snapshot/backup).
>
> If I have restored the table to commitTime=005, then having any instants
> with commitTime > 005 is confusing, as it implies that even though my table
> is at an older time, some future operations will be applied onto it at some
> point.
>
> I will have to read more about incremental timeline syncing and timeline
> server to understand how it uses the clean instants. BTW, the comment on
> the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
> requires all writers (ingest and compact) to a table to be stopped before
> proceeding". So probably the embedded timeline server can recreate the view
> next time it comes back up?
>
> Thanks
> Prashant
>
>
> On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan
>  wrote:
>
> > Prashant,
> > I think we should not be reverting clean operations here. Cleans are done
> > on the oldest file slices, and a restore/rollback does not completely undo
> > the work of cleans that happened before it.
> > For incremental timeline syncing, the embedded timeline server needs to read
> > this clean metadata to sync its cached file-system view.
> > Let me know your thoughts.
> > Balaji.V
> > On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason
> >  wrote:
> >
> > Hi Team,
> >
> > I noticed that when a table is restored to a previous commit (
> > HoodieWriteClient::restoreToInstant
> > <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dhudi_blob_master_hudi-2Dclient_src_main_java_org_apache_hudi_client_HoodieWriteClient.java-23L735&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=c89AU9T1AVhM4r2Xi3ctZA&m=ASTWkm7UUMnhZ7sBzpXGPkTc1PhNTJeO7q5IXlBCprY&s=43rqua7SdhvO91hA0ZhOPNQw8ON1nL3bAsCue5o8aYw&e=
> > >),
> > only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
> > their corresponding files are deleted from the timeline. If there are
> some
> > CLEAN instants, they are left over.
> >
> > Is there a reason why CLEAN instants are not removed? Won't they be
> > referring to files which are no longer present and hence not useful?
> >
> > Thanks
> > Prashant
> >
>


Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread vbal...@apache.org
Prashant,
My concern was that we should not lose metadata about the clean operation.

But there is a way: as long as we faithfully copy the clean metadata
that tracks the files which got cleaned, and store it in the restore metadata,
we should be able to keep the metadata in sync.
Balaji.V



On Wednesday, March 18, 2020, 11:54:11 AM PDT, Prashant Wason 
 wrote:  
 
 Thanks for the info Vinoth / Balaji.

To me it feels like a split between the easier-to-understand design and
the current implementation. I feel it is simpler to reason (based on how file
systems work in general) that restoreToInstant is a complete point-in-time
shift to the past (like restoring a file system from a snapshot/backup).

If I have restored the table to commitTime=005, then having any instants
with commitTime > 005 is confusing, as it implies that even though my table
is at an older time, some future operations will be applied onto it at some
point.

I will have to read more about incremental timeline syncing and timeline
server to understand how it uses the clean instants. BTW, the comment on
the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
requires all writers (ingest and compact) to a table to be stopped before
proceeding". So probably the embedded timeline server can recreate the view
next time it comes back up?

Thanks
Prashant


On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan
 wrote:

> Prashant,
> I think we should not be reverting clean operations here. Cleans are done
> on the oldest file slices, and a restore/rollback does not completely undo
> the work of cleans that happened before it.
> For incremental timeline syncing, the embedded timeline server needs to read
> this clean metadata to sync its cached file-system view.
> Let me know your thoughts.
> Balaji.V
>    On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason
>  wrote:
>
>  Hi Team,
>
> I noticed that when a table is restored to a previous commit (
> HoodieWriteClient::restoreToInstant
> <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dhudi_blob_master_hudi-2Dclient_src_main_java_org_apache_hudi_client_HoodieWriteClient.java-23L735&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=c89AU9T1AVhM4r2Xi3ctZA&m=ASTWkm7UUMnhZ7sBzpXGPkTc1PhNTJeO7q5IXlBCprY&s=43rqua7SdhvO91hA0ZhOPNQw8ON1nL3bAsCue5o8aYw&e=
> >),
> only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
> their corresponding files are deleted from the timeline. If there are some
> CLEAN instants, they are left over.
>
> Is there a reason why CLEAN instants are not removed? Won't they be referring
> to files which are no longer present and hence not useful?
>
> Thanks
> Prashant
>  

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Prashant Wason
Thanks for the info Vinoth / Balaji.

To me it feels like a split between the easier-to-understand design and
the current implementation. I feel it is simpler to reason (based on how file
systems work in general) that restoreToInstant is a complete point-in-time
shift to the past (like restoring a file system from a snapshot/backup).

If I have restored the table to commitTime=005, then having any instants
with commitTime > 005 is confusing, as it implies that even though my table
is at an older time, some future operations will be applied onto it at some
point.

I will have to read more about incremental timeline syncing and timeline
server to understand how it uses the clean instants. BTW, the comment on
the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
requires all writers (ingest and compact) to a table to be stopped before
proceeding". So probably the embedded timeline server can recreate the view
next time it comes back up?

Thanks
Prashant


On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan
 wrote:

> Prashant,
> I think we should not be reverting clean operations here. Cleans are done
> on the oldest file slices, and a restore/rollback does not completely undo
> the work of cleans that happened before it.
> For incremental timeline syncing, the embedded timeline server needs to read
> this clean metadata to sync its cached file-system view.
> Let me know your thoughts.
> Balaji.V
> On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason
>  wrote:
>
>  Hi Team,
>
> I noticed that when a table is restored to a previous commit (
> HoodieWriteClient::restoreToInstant
> <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dhudi_blob_master_hudi-2Dclient_src_main_java_org_apache_hudi_client_HoodieWriteClient.java-23L735&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=c89AU9T1AVhM4r2Xi3ctZA&m=ASTWkm7UUMnhZ7sBzpXGPkTc1PhNTJeO7q5IXlBCprY&s=43rqua7SdhvO91hA0ZhOPNQw8ON1nL3bAsCue5o8aYw&e=
> >),
> only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
> their corresponding files are deleted from the timeline. If there are some
> CLEAN instants, they are left over.
>
> Is there a reason why CLEAN instants are not removed? Won't they be referring
> to files which are no longer present and hence not useful?
>
> Thanks
> Prashant
>


Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Balaji Varadarajan
Prashant,
I think we should not be reverting clean operations here. Cleans are done on
the oldest file slices, and a restore/rollback does not completely undo the
work of cleans that happened before it.
For incremental timeline syncing, the embedded timeline server needs to read
this clean metadata to sync its cached file-system view.
Let me know your thoughts.
Balaji.V
On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason 
 wrote:  
 
Hi Team,

I noticed that when a table is restored to a previous commit (
HoodieWriteClient::restoreToInstant
),
only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
their corresponding files are deleted from the timeline. If there are some
CLEAN instants, they are left over.

Is there a reason why CLEAN instants are not removed? Won't they be referring to
files which are no longer present and hence not useful?

Thanks
Prashant
  

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Vinoth Chandar
Hi Prashant,

Not sure if there is a specific reason. Mostly, it is because until recently,
the clean metadata was not actually used.
Currently, incremental cleaning will use it, but even then, it only relies
on the partition paths being touched there.. So it should be fine..

+100 though on consistently cleaning all of this up. Some of these
inconsistencies actually exist to ensure the old timelines for old users
(e.g. Uber) continue to work.
So I would like to actually have a conversation on streamlining all this,
so the system implementation stays as simple/close to the design..

On Wed, Mar 18, 2020 at 11:23 AM Prashant Wason 
wrote:

> Hi Team,
>
> I noticed that when a table is restored to a previous commit (
> HoodieWriteClient::restoreToInstant
> <
> https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L735
> >),
> only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
> their corresponding files are deleted from the timeline. If there are some
> CLEAN instants, they are left over.
>
> Is there a reason why CLEAN instants are not removed? Won't they be referring
> to files which are no longer present and hence not useful?
>
> Thanks
> Prashant
>


Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Prashant Wason
Hi Team,

I noticed that when a table is restored to a previous commit (
HoodieWriteClient::restoreToInstant
),
only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
their corresponding files are deleted from the timeline. If there are some
CLEAN instants, they are left over.

Is there a reason why CLEAN instants are not removed? Won't they be referring to
files which are no longer present and hence not useful?
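For concreteness, a minimal sketch of the restore call I am referring to,
assuming the 0.5.x client API; the base path and table name are placeholders,
and all writers are assumed to be stopped first:

    import org.apache.hudi.client.HoodieWriteClient
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.spark.api.java.JavaSparkContext

    // spark: an existing SparkSession
    val jsc = JavaSparkContext.fromSparkContext(spark.sparkContext)
    val cfg = HoodieWriteConfig.newBuilder()
      .withPath("s3://my-bucket/hudi/orders") // placeholder base path
      .forTable("orders")                     // placeholder table name
      .build()
    val client = new HoodieWriteClient(jsc, cfg)
    client.restoreToInstant("005") // shift the table back to commit 005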

Thanks
Prashant


Re: deltastreamer group.id No effect after setting

2020-03-18 Thread Vinoth Chandar
DeltaStreamer actually just uses the same mechanism as Spark Streaming to
manage offsets. So I am wondering if you see the same behavior with a plain
Spark Streaming job?

It manages the offset checkpoints manually by itself within the hoodie
commit metadata, to do exactly-once ingestion of data..
On Wed, Mar 18, 2020 at 3:07 AM 965147...@qq.com <965147...@qq.com> wrote:

>
> Hello all, when using DeltaStreamer to consume Kafka data, I want to specify
> group.id, but the problem is that after specifying it, I cannot find it on
> the Kafka side: there are no consumer groups under my topic. Why is that?
> I also manually set enable.auto.commit = true at the same time, but it
> didn't seem to work. In the fixKafkaParams method in KafkaUtils.scala,
>    kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false:
> java.lang.Boolean)
> forcibly rewrites it to false. I think that is one of the reasons why the
> group cannot be found; but even when auto-commit is off and offsets should
> be committed manually instead, I did not observe that happening either.
>
>
> Please help answer.
> thanks
> liujinhui
>
>
> 965147...@qq.com
>


Re: Question on DeltaStreamer

2020-03-18 Thread Vinoth Chandar
>>Let's say I have a source table in Oracle in the format below; will my
Avro schema for source and target be the same?

Yes. And if you do any transformations in between, DeltaStreamer can derive
the target schema automatically.

In the upcoming 0.5.2 release, we also have
org.apache.hudi.utilities.schema.JdbcbasedSchemaProvider, which should be
able to generate the source Avro schema from the table metadata
automatically.
https://github.com/apache/incubator-hudi/pull/1200

>>Our plan is to use AWS DMS for initial load & CDC.
For DMS, you get Parquet source files, which are self-describing..
DeltaStreamer does not interact with Oracle directly. DMS handles the
mapping of the Oracle table schema to the Parquet schema.. It's much simpler.

On Wed, Mar 18, 2020 at 10:14 AM Shiyan Xu 
wrote:

> To answer your question regarding the properties file:
> It is a way to manage a bunch of hoodie configurations; those confs will be
> merged with other confs passed from --hoodie-conf. See this line
> <
> https://github.com/apache/incubator-hudi/blob/779edc068865898049569da0fe750574f93a0dca/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L362
> >.
> So any hoodie conf can be put there. Usually we put "configurations for
> hoodie client, schema provider, key generator and data source" (per the
> docs).
>
> On Wed, Mar 18, 2020 at 6:50 AM Syed Zaidi 
> wrote:
>
> > Hi,
> >
> > I hope things are good. We are planning on using DeltaStreamer as a client
> > for Hudi. Our plan is to use AWS DMS for initial load & CDC. The question I
> > have is around the documentation for the properties files that I need for
> > dfs, source & target. Where can I find more information on the properties
> > files needed for the client?
> >
> > Let's say I have a source table in Oracle in the format below; will my
> > Avro schema for source and target be the same?
> >
> > CREATE TABLE orders
> >   (
> >     order_id NUMBER GENERATED BY DEFAULT AS IDENTITY START WITH 106
> >       PRIMARY KEY,
> >     customer_id NUMBER( 6, 0 ) NOT NULL,
> >     status VARCHAR( 20 ) NOT NULL,
> >     salesman_id NUMBER( 6, 0 ),
> >     order_date TIMESTAMP NOT NULL
> >   );
> >
> > I would appreciate your help in this regard.
> >
> > We are on this stack:
> >
> > EMR : emr-5.29.0
> > Spark: Spark 2.4.4, spark-avro_2.11:2.4.4
> >
> > Thanks
> > Syed Zaidi
> >
>


Re: Question on DeltaStreamer

2020-03-18 Thread Shiyan Xu
To answer your question regarding the properties file:
It is a way to manage a bunch of hoodie configurations; those confs will be
merged with other confs passed from --hoodie-conf. See this line
<https://github.com/apache/incubator-hudi/blob/779edc068865898049569da0fe750574f93a0dca/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L362>.
So any hoodie conf can be put there. Usually we put "configurations for
hoodie client, schema provider, key generator and data source" (per the
docs).
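A rough sketch of that merge, under the assumption that command-line values
take precedence (this is illustrative, not the actual DeltaStreamer code;
mergedProps is a hypothetical helper):

    import java.io.FileInputStream
    import java.util.Properties

    def mergedProps(propsFile: String, hoodieConfs: Seq[String]): Properties = {
      val props = new Properties()
      props.load(new FileInputStream(propsFile)) // file-based confs loaded first
      hoodieConfs.foreach { kv =>                // each "key=value" from --hoodie-conf
        val Array(k, v) = kv.split("=", 2)
        props.setProperty(k, v)                  // command-line conf overrides
      }
      props
    }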

On Wed, Mar 18, 2020 at 6:50 AM Syed Zaidi 
wrote:

> Hi,
>
> I hope things are good. We are planning on using DeltaStreamer as a client
> for Hudi. Our plan is to use AWS DMS for initial load & CDC. The question I
> have is around the documentation for the properties files that I need for
> dfs, source & target. Where can I find more information on the properties
> files needed for the client?
>
> Let's say I have a source table in Oracle in the format below; will my
> Avro schema for source and target be the same?
>
> CREATE TABLE orders
>   (
>     order_id NUMBER GENERATED BY DEFAULT AS IDENTITY START WITH 106
>       PRIMARY KEY,
>     customer_id NUMBER( 6, 0 ) NOT NULL,
>     status VARCHAR( 20 ) NOT NULL,
>     salesman_id NUMBER( 6, 0 ),
>     order_date TIMESTAMP NOT NULL
>   );
>
> I would appreciate your help in this regard.
>
> We are on this stack:
>
> EMR : emr-5.29.0
> Spark: Spark 2.4.4, spark-avro_2.11:2.4.4
>
> Thanks
> Syed Zaidi
>


Re: [NOTIFICATION] Hudi 0.5.2 Release Daily Report-20200318

2020-03-18 Thread Vinoth Chandar
Thanks for the update, vino!

here's the -1 vote feedback for everyone's context..

As you bundled several ASF projects that have NOTICE files, their NOTICE
> files need to be examined and parts added to your NOTICE file. [1]
> License is missing information for this file copyright Twitter [3]
> Perhaps you should consider using the work in progress disclaimer. [2]
> Thanks,
> Justin
> 1. http://www.apache.org/dev/licensing-howto.html#alv2-dep
> 2.
> https://incubator.apache.org/policy/incubation.html#work_in_progress_disclaimer



>>We may need to revert HUDI-676[3]: Address issues towards removing use of
WIP Disclaimer
I think we should address the feedback and ensure VOTE passed with "non
WIP" disclaimer.. the WIP disclaimer cannot be retained forever and needs
to be fixed before graduation IIUC.

>>As you bundled several ASF projects that have NOTICE files,
Again, we are back to binary vs source releases.. the source releases we
make DO NOT bundle anything.. So we should not have to add anything to
NOTICE. The same link mentioned also states, "don't include stuff
unnecessarily" (not an exact quote). Do maven jars count as binary distribution
of software? I suggest we file a LEGAL issue, get it stamped by the right
folks and do that.  This can be a good strategy, in addition to looking at
other projects.

On Wed, Mar 18, 2020 at 3:27 AM vino yang  wrote:

> Hi all,
>
> We encountered some issues while voting RC1 on general@[1], so we canceled
> the vote for rc1. The blocker issues we are currently addressing are:
>
> * HUDI-720: NOTICE file needs to add more content based on the NOTICE files
> of the ASF projects that hudi bundles (I have opened a PR[2] to fix it)
> * We may need to revert HUDI-676[3]: Address issues towards removing use of
> WIP Disclaimer
>
> I will prepare 0.5.2 RC2 ASAP after fixing them. Please be patient.
>
> Best,
> Vino
>
> [1]:
>
> http://mail-archives.apache.org/mod_mbox/incubator-general/202003.mbox/%3C932F44A0-1CEE-4549-896B-70FB61EAA034%40classsoftware.com%3E
> [2]: https://github.com/apache/incubator-hudi/pull/1417
> [3]: https://github.com/apache/incubator-hudi/pull/1386
>


Question on DeltaStreamer

2020-03-18 Thread Syed Zaidi
Hi,

I hope things are good. We are planning on using DeltaStreamer as a client for
Hudi. Our plan is to use AWS DMS for initial load & CDC. The question I have is
around the documentation for the properties files that I need for dfs, source &
target. Where can I find more information on the properties files needed for the
client?

Let's say I have a source table in Oracle in the format below; will my Avro
schema for source and target be the same?

CREATE TABLE orders
  (
    order_id NUMBER GENERATED BY DEFAULT AS IDENTITY START WITH 106
      PRIMARY KEY,
    customer_id NUMBER( 6, 0 ) NOT NULL,
    status VARCHAR( 20 ) NOT NULL,
    salesman_id NUMBER( 6, 0 ),
    order_date TIMESTAMP NOT NULL
  );

I would appreciate your help in this regard.

We are on this stack:

EMR : emr-5.29.0
Spark: Spark 2.4.4, spark-avro_2.11:2.4.4

Thanks
Syed Zaidi


Re: contributor permission

2020-03-18 Thread 965147...@qq.com
This is the YARN log:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [172.16.16.2:9092, 172.16.16.3:9092]
check.crcs = true
client.dns.lookup = default
client.id = 
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = t3_lingqu.t3_trip.t_route_plan_flink_group
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class 
org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
..
...20/03/18 18:35:13 WARN KafkaUtils: overriding enable.auto.commit to 
false for executor
20/03/18 18:35:13 WARN KafkaUtils: overriding auto.offset.reset to none for 
executor
20/03/18 18:35:13 WARN KafkaUtils: overriding executor group.id to 
spark-executor-t3_lingqu.t3_trip.t_route_plan_flink_group
20/03/18 18:35:13 WARN KafkaUtils: overriding receive.buffer.bytes to 65536 see 
KAFKA-3135...



965147...@qq.com
 
From: 965147...@qq.com
Date: 2019-12-27 15:10
To: dev
Subject: contributor permission
Hi hudi,
I want to contribute to Apache Hudi. Would you please give me the contributor 
permission? My JIRA ID is liujinhui.

thanks  
liujinhui



965147...@qq.com


[NOTIFICATION] Hudi 0.5.2 Release Daily Report-20200318

2020-03-18 Thread vino yang
Hi all,

We encountered some issues while voting RC1 on general@[1], so we canceled
the vote for rc1. The blocker issues we are currently addressing are:

* HUDI-720: NOTICE file needs to add more content based on the NOTICE files
of the ASF projects that hudi bundles (I have opened a PR[2] to fix it)
* We may need to revert HUDI-676[3]: Address issues towards removing use of
WIP Disclaimer

I will prepare 0.5.2 RC2 ASAP after fixing them. Please be patient.

Best,
Vino

[1]:
http://mail-archives.apache.org/mod_mbox/incubator-general/202003.mbox/%3C932F44A0-1CEE-4549-896B-70FB61EAA034%40classsoftware.com%3E
[2]: https://github.com/apache/incubator-hudi/pull/1417
[3]: https://github.com/apache/incubator-hudi/pull/1386


deltastreamer group.id No effect after setting

2020-03-18 Thread 965147...@qq.com

Hello all, when using DeltaStreamer to consume Kafka data, I want to specify
group.id, but the problem is that after specifying it, I cannot find it on the
Kafka side: there are no consumer groups under my topic. Why is that? I also
manually set enable.auto.commit = true at the same time, but it didn't seem to
work. In the fixKafkaParams method in KafkaUtils.scala,
   kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false:
java.lang.Boolean)
forcibly rewrites it to false. I think that is one of the reasons why the group
cannot be found; but even when auto-commit is off and offsets should be
committed manually instead, I did not observe that happening either.


Please help answer.
thanks
liujinhui  


965147...@qq.com