Hi Kabeer,
I have requested some information in the github ticket. 
Balaji.V

On Wednesday, August 28, 2019, 10:46:04 AM PDT, Kabeer Ahmed <kab...@linuxmail.org> wrote:
 
Thanks for the quick response, Vinoth. I would also have thought that
there is nothing complex or different about an upsert after a delete. Yes, I can
reproduce the issue with the simple example that I wrote in the email.

I have dug into the issue in detail and it appears to be a bug. I have filed it
at: https://github.com/apache/incubator-hudi/issues/859 
Let me know if more information is required.
Thank you,

On Aug 23 2019, at 1:37 am, Vinoth Chandar <vin...@apache.org> wrote:
> yes. I was asking about the HUDI storage type..
>
> There is nothing complex about upsert() after delete(). It is almost as if a
> delete() for (2, vinoth) happened in between.
>
> Are you able to repro this literally with this tiny example with 3 records?
> Some things to check
>
> - This sequence would have created 3 commits. You can look at the commit
> files and see if the numbers of records updated, inserted, and deleted match
> expectations.
> - If they do, then you can use spark.read.parquet(...) on the individual
> parquet files and see what records they actually contain.
>
> This should shed some light on the pattern of failure and when exactly (2,
> vinoth) disappeared.
>
> Alternatively, if you can give a small snippet that reproduces this, we can
> debug from there.
>
> On Thu, Aug 22, 2019 at 3:06 PM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > And if you meant HUDI storage type, I have left it to default COW - Copy
> > On Write.
> >
> > If anyone has tried this, please let me know if you have hit a similar issue.
> > Any experience would be greatly helpful.
> > On Aug 22 2019, at 11:01 pm, Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > Hi Vinoth - thanks for the quick response.
> > >
> > > I have followed the mail thread for deletes ->
> > > http://mail-archives.apache.org/mod_mbox/hudi-commits/201904.mbox/<155556722511.2660.9583626796839453...@gitbox.apache.org>
> > >
> > > For your convenience, the code that I use is at the end of this email.
> > > EmptyHoodieRecordPayload is inserted for the relevant records that need to
> > > be deleted. After the delete, I can query from Hive and confirm that the
> > > rows intended to be deleted are no longer present and that the records not
> > > deleted can be seen in the table via Hive and Presto.
> > > The issue starts when the upsert is done after a delete.
> > > The storage type is S3 and I don't think there is any eventual
> > > consistency in play, as the upserted record is visible but the old records
> > > that weren't deleted are not visible.
> > > And for the sake of completeness, my insert and upsert logic is based on
> > > the code below:
> > > https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L43
> > > Thanks
> > > Kabeer.
> > >
> > > > /**
> > > >  * Empty payload used for deletions.
> > > >  */
> > > > public class EmptyHoodieRecordPayload implements HoodieRecordPayload<EmptyHoodieRecordPayload> {
> > > >
> > > >   public EmptyHoodieRecordPayload(GenericRecord record, Comparable orderingVal) { }
> > > >
> > > >   @Override
> > > >   public EmptyHoodieRecordPayload preCombine(EmptyHoodieRecordPayload another) {
> > > >     return another;
> > > >   }
> > > >
> > > >   // An empty result means there is no value to write for this key.
> > > >   @Override
> > > >   public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > >
> > > >   @Override
> > > >   public Optional<IndexedRecord> getInsertValue(Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > > }
> > >
> > > ---------- Forwarded Message ---------
> > >
> > > From: Vinoth Chandar <vin...@apache.org>
> > > Subject: Re: Upsert after Delete
> > > Date: Aug 22 2019, at 8:38 pm
> > > To: dev@hudi.apache.org
> > >
> > > That’s interesting. Can you also share details on storage type and how you
> > > are issuing the deletes and also the table/view (ro, rt) that you are
> > > querying?
> > >
> > > On Thu, Aug 22, 2019 at 9:49 AM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > > Hudi experts and Users,
> > > > Has anyone attempted an upsert after a delete? Here is a weird thing that
> > > > I have bumped into, and it is a shame that this has come up when someone in
> > > > the team tested this whilst I failed to run this test.
> > > > Use case:
> > > > Insert data into a table. Say records (1, kabeer | 2, vinoth)
> > > >
> > > > Delete a record (1, kabeer). Data in the table is: (2, vinoth) and it is
> > > > visible via sql through Presto/Hive.
> > > >
> > > > Upsert a new record into the same table (3, balaji). Query the table and
> > > > the only record that is visible is: (3, balaji). The record (2, vinoth) is
> > > > not displayed in the results.
> > > >
> > > > Any ideas on what could be at play here? Has someone done upsert after
> > > > delete?
> > > >
> > > > Thanks,
> > > > Kabeer
> > > >
> > > > PS: Please note that upsert functionality is well tested and if we do (1,
> > > > vinoth) insert followed by upsert of (2, balaji) both the records are
> > > > visible. So something else is at play, and I would appreciate any insight
> > > > that you experts can provide.
> > >
> >
>
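
For reference, the expected outcome of the insert / delete / upsert sequence described in the original message can be sketched with a toy in-memory table. This is not Hudi code: the class name UpsertAfterDeleteSketch and its methods are hypothetical and only model the semantics one would expect from a keyed table, namely that a delete removes one key and a later upsert of a different key leaves the surviving rows in place.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy in-memory model of a keyed table. Hypothetical illustration, NOT Hudi code. */
final class UpsertAfterDeleteSketch {
    private final Map<Integer, String> rows = new LinkedHashMap<>();

    /** Upsert: insert the key if it is absent, otherwise overwrite its value. */
    void upsert(int key, String value) {
        rows.put(key, value);
    }

    /** Delete: remove the given key; every other key is left untouched. */
    void delete(int key) {
        rows.remove(key);
    }

    /** Returns a copy of the current table contents. */
    Map<Integer, String> snapshot() {
        return new LinkedHashMap<>(rows);
    }

    public static void main(String[] args) {
        UpsertAfterDeleteSketch table = new UpsertAfterDeleteSketch();
        table.upsert(1, "kabeer");  // first write: insert two records
        table.upsert(2, "vinoth");
        table.delete(1);            // second write: delete (1, kabeer)
        table.upsert(3, "balaji");  // third write: upsert a new key
        System.out.println(table.snapshot());  // prints {2=vinoth, 3=balaji}
    }
}
```

The behaviour reported in the thread corresponds to the final snapshot containing only (3, balaji), which is why the missing (2, vinoth) row is unexpected.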
>
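
The first check suggested above (confirming that the sequence produced three commits) can be sketched as a small listing of the commit files in the table's metadata folder. This is only a sketch under stated assumptions: it assumes the table base path, or a copy of its .hoodie folder, is reachable as a local filesystem path (the table in this thread lives on S3, so that would need a copy or a mounted path), and that completed commits of a Copy On Write table appear as files ending in ".commit" whose names begin with the commit timestamp. CommitFileLister and listCommitFiles are hypothetical names, not part of Hudi.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Hudi: lists commit files under a local .hoodie folder. */
final class CommitFileLister {

    /** Returns the names of "*.commit" files directly under basePath/.hoodie, sorted by name. */
    static List<String> listCommitFiles(Path basePath) throws IOException {
        Path metaDir = basePath.resolve(".hoodie");  // assumed metadata folder name
        try (Stream<Path> entries = Files.list(metaDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(".commit"))
                    .sorted()  // timestamp-prefixed names sort chronologically
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: CommitFileLister <table-base-path>");
            return;
        }
        for (String name : listCommitFiles(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

For the three-step example in this thread one would expect three entries; each commit file can then be opened to compare the recorded insert, update, and delete counts against expectations.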
  
