Re: Reg: Hudi Jira Ticket Conventions

2019-08-28 Thread Vinoth Chandar
+1. Can we add this to the contributing/community pages as well?

On Wed, Aug 28, 2019 at 2:33 PM vbal...@apache.org 
wrote:

> To all contributors of Hudi:
> Dear folks,
> When filing or updating a JIRA for Apache Hudi, kindly make sure the issue
> type and versions (when resolving the ticket) are set correctly. Also, the
> summary needs to be descriptive enough to capture the essence of the
> problem or feature. This greatly helps in generating release notes.
> Thanks,
> Balaji.V


Reg: Hudi Jira Ticket Conventions

2019-08-28 Thread vbal...@apache.org
To all contributors of Hudi:
Dear folks,
When filing or updating a JIRA for Apache Hudi, kindly make sure the issue type
and versions (when resolving the ticket) are set correctly. Also, the summary
needs to be descriptive enough to capture the essence of the problem or feature.
This greatly helps in generating release notes.
Thanks,
Balaji.V

Re: Upsert after Delete

2019-08-28 Thread vbal...@apache.org
 
Hi Kabeer,
I have requested some information in the GitHub ticket.
Balaji.V
 

[For Mentors] Readiness for IP Clearance

2019-08-28 Thread vbal...@apache.org
Dear Mentors,

We have been able to set up nightly snapshot builds. At this moment, we have
the following steps done (master JIRA:
https://jira.apache.org/jira/browse/HUDI-121):
   - Software Grant: The software grant from Uber to Apache has been completed.
   - Contributor CLAs: Done.
   - License Conformance: All dependencies have been verified to conform per
https://apache.org/legal/resolved.html.
   - Apache-Style Packages: We have renamed the source code to follow the
"org.apache.hudi" package namespace. Migration guide for developers and
customers:
https://cwiki.apache.org/confluence/display/HUDI/Migration+Guide+From+com.uber.hoodie+to+org.apache.hudi
   - KEYS: Uploaded to dist.apache.org.
   - Nightly snapshot builds: Set up at
https://builds.apache.org/job/hudi-snapshot-deployment-0.5/

I am working on getting the release branch cut and built, and will soon send
the first release candidate for voting. While this is happening, can one of
you file the IP clearance request with the ASF in parallel
(https://incubator.apache.org/ip-clearance/ip-clearance-template.html)?
Thanks,
Balaji.V


Re: Upsert after Delete

2019-08-28 Thread Kabeer Ahmed
Thanks for the quick response, Vinoth. That is what I would have thought: there
is nothing complex or different about an upsert after a delete. Yes, I can
reproduce the issue with the simple example that I have written in the email.

I have dug into the issue in detail and it seems to be a bug. I have filed it
at https://github.com/apache/incubator-hudi/issues/859. Let me know if more
information is required.
Thank you,

On Aug 23 2019, at 1:37 am, Vinoth Chandar  wrote:
> Yes, I was asking about the Hudi storage type.
>
> There is nothing complex about upsert() after delete(). It is almost as if a
> delete() for (2, vinoth) happened in between.
>
> Are you able to repro this literally with this tiny example of 3 records?
> Some things to check:
>
> - This sequence would have created 3 commits. You can look at the commit
> files and see if the numbers of records updated, inserted, and deleted match
> expectations.
> - If they do, then you can use spark.read.parquet(...) on the individual
> parquet files and see what records they actually contain.
>
> This should shed some light on the pattern of failure and when exactly (2,
> vinoth) disappeared.
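
A minimal spark-shell sketch of the inspection described above; the base path
and partition glob are hypothetical placeholders, not taken from this thread:

// Sketch only: inspect what each commit actually wrote.
val basePath = "s3://bucket/hudi/table"   // hypothetical

// Each completed write leaves an <instant>.commit file under .hoodie/ whose
// JSON body reports the records written/updated/deleted per partition.
spark.read.textFile(s"$basePath/.hoodie/*.commit").show(false)

// Read the individual parquet files (not the whole table) to see which
// record keys each file version really contains; adjust the glob to the
// table's partition layout.
spark.read.parquet(s"$basePath/*/*.parquet")
  .select("_hoodie_commit_time", "_hoodie_record_key")
  .show(false)
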
>
> Alternatively, if you can give a small snippet that reproduces this, we can
> debug from there.
>
> On Thu, Aug 22, 2019 at 3:06 PM Kabeer Ahmed  wrote:
> > And if you meant the Hudi storage type, I have left it at the default,
> > COW (Copy On Write).
> >
> > If anyone has tried this, please let me know if you have hit a similar
> > issue. Any experience would be greatly helpful.
> > On Aug 22 2019, at 11:01 pm, Kabeer Ahmed  wrote:
> > > Hi Vinoth - thanks for the quick response.
> > >
> > > I have followed the mail thread for deletes ->
> > http://mail-archives.apache.org/mod_mbox/hudi-commits/201904.mbox/<16722511.2660.9583626796839453...@gitbox.apache.org>
> > >
> > > For your convenience, the code that I use is below at the end of the
> > email. An EmptyHoodieRecordPayload is inserted for the relevant records that
> > need to be deleted. After the delete, I can query from Hive and confirm that
> > the rows intended to be deleted are no longer present, and the records not
> > deleted can be seen in the Hive table via Hive and Presto.
> > > The issue starts when the upsert is done after a delete.
> > > The underlying storage is S3, and I don't think there is any eventual
> > consistency in play, as the upserted record is visible but the old records
> > that weren't deleted are not.
> > > And for the sake of completeness, my insert and upsert logic is based on
> > the code below:
> > https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L43
> > > Thanks
> > > Kabeer.
> > >
> > > > /**
> > > >  * Empty payload used for deletions
> > > >  */
> > > > public class EmptyHoodieRecordPayload
> > > >     implements HoodieRecordPayload<EmptyHoodieRecordPayload> {
> > > >
> > > >   public EmptyHoodieRecordPayload(GenericRecord record, Comparable orderingVal) { }
> > > >
> > > >   @Override
> > > >   public EmptyHoodieRecordPayload preCombine(EmptyHoodieRecordPayload another) {
> > > >     return another;
> > > >   }
> > > >
> > > >   @Override
> > > >   public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
> > > >       Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > >
> > > >   @Override
> > > >   public Optional<IndexedRecord> getInsertValue(Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > > }
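
For reference, a hedged sketch of wiring such a payload into a delete through
the Spark datasource writer. The option keys assume the 0.5.x-era
DataSourceWriteOptions; the path, table name, field names, and the payload
class's package are hypothetical placeholders:

import spark.implicits._  // in spark-shell; spark is the SparkSession

// Rows whose keys should be deleted (hypothetical schema: id, name, ts).
val toDeleteDf = Seq((1, "kabeer", 2L)).toDF("id", "name", "ts")

// Deletes are issued as a normal upsert whose payload resolves to empty.
toDeleteDf.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.payload.class",
    "com.example.EmptyHoodieRecordPayload")  // hypothetical package
  .option("hoodie.table.name", "my_table")
  .mode("append")
  .save("s3://bucket/hudi/table")
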
> > >
> > > ----- Forwarded Message -----
> > >
> > > From: Vinoth Chandar 
> > > Subject: Re: Upsert after Delete
> > > Date: Aug 22 2019, at 8:38 pm
> > > To: dev@hudi.apache.org
> > >
> > > That’s interesting. Can you also share details on the storage type, how
> > > you are issuing the deletes, and the table/view (ro, rt) that you are
> > > querying?
> > >
> > > On Thu, Aug 22, 2019 at 9:49 AM Kabeer Ahmed 
> > wrote:
> > > > Hudi experts and Users,
> > > > Has anyone attempted an upsert after a delete? Here is a weird thing
> > > > that I have bumped into, and it is a shame that this came up when
> > > > someone in the team tested it whilst I had failed to run this test.
> > > > Use case:
> > > > Insert data into a table, say records (1, kabeer | 2, vinoth).
> > > >
> > > > Delete a record (1, kabeer). Data in the table is (2, vinoth), and it
> > > > is visible via SQL through Presto/Hive.
> > > >
> > > > Upsert a new record into the same table (3, balaji). Query the table,
> > > > and the only record visible is (3, balaji). The record (2, vinoth) is
> > > > not displayed in the results.
> > > >
> > > > Any ideas on what could be at play here? Has someone done upsert after
> > > > delete?
> > > >
> > > > 

Re: [Hudi Improvement]: Introduce secondary source-ordering-field for breaking ties while writing

2019-08-28 Thread vbal...@apache.org
Sure Pratyaksh, whatever field works for your use case is good enough. You do
have the flexibility to generate a derived field or to use one of the source
fields.
Balaji.V
 

Re: [Hudi Improvement]: Introduce secondary source-ordering-field for breaking ties while writing

2019-08-28 Thread Pratyaksh Sharma
Hi Balaji,

Sure, I can do that. However, after a considerable amount of time, the
binlog position will get exhausted. To handle this, we can make the secondary
ordering field the ingestion_timestamp (the time when I push the event to
Kafka to be consumed by DeltaStreamer), which will always work.

Please suggest.

On Thu, Aug 22, 2019 at 9:49 PM vbal...@apache.org 
wrote:

>  Hi Pratyaksh,
> The usual way we support this is to make use of the
> com.uber.hoodie.utilities.transform.Transformer plugin in
> HoodieDeltaStreamer. You can implement your own Transformer to add a new
> derived field, which could be a combination of the timestamp and the
> binlog position. You can then configure this new field to be used as the
> source ordering field.
> Balaji.V
>
> On Wednesday, August 21, 2019, 07:35:40 AM PDT, Pratyaksh Sharma <
> pratyaks...@gmail.com> wrote:
>
>  Hi,
>
> While building a CDC pipeline for capturing data changes in MySQL using
> HoodieDeltaStreamer, I came across the following problem. We need to read
> MySQL's binlog file to fetch all the modifications made to a particular
> table. However, in a production environment where we are handling hundreds
> of transactions per second (TPS), it is possible for the same table row to
> be modified multiple times within a second.
>
> Here comes the problem with the MySQL binlog: it has a 32-bit timestamp with
> seconds resolution. If we build a CDC pipeline on top of such a table with
> high TPS, then breaking ties between records with the same Hoodie key
> will not be possible with a single source-ordering-field (specified in
> HoodieDeltaStreamer.Config), which is the binlog timestamp in this case.
>
> Example -  https://github.com/zendesk/maxwell/issues/925.
>
> Hence, as part of a Hudi improvement, the proposal is to add a
> secondary-source-ordering-field for breaking ties among incoming records in
> such cases. For example, we could have ingestion_timestamp or
> binlog_position as the secondary field.
>
> Please suggest. I have raised the issue here.
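
A sketch of the Transformer approach Balaji suggests above, assuming the era's
com.uber.hoodie.utilities.transform.Transformer interface takes
(JavaSparkContext, SparkSession, Dataset[Row], TypedProperties); the column
names ts and binlog_pos are hypothetical:

import com.uber.hoodie.common.util.TypedProperties
import com.uber.hoodie.utilities.transform.Transformer
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Sketch only: derive a single ordering field that breaks timestamp ties by
// binlog position (assumes positions stay below 10^9 so the packing cannot
// collide). The package names above are assumptions based on the classes
// named in the thread.
class BinlogOrderingTransformer extends Transformer {
  override def apply(jsc: JavaSparkContext, spark: SparkSession,
      rows: Dataset[Row], props: TypedProperties): Dataset[Row] = {
    // High bits: seconds-resolution binlog timestamp; low bits: binlog position.
    rows.withColumn("derived_order",
      col("ts").cast("long") * 1000000000L + col("binlog_pos").cast("long"))
  }
}

The derived field would then be passed to HoodieDeltaStreamer as its
--source-ordering-field.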
>