Here are more details about the ORC content, showing that we have duplicate
rows:
/delta_0011365_0011365_0000/bucket_00003
{"operation":0,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11365,"row":{"TS":1574156027915254212,"cle":5218,...}}
{"operation":0,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11365,"row":{"TS":1574156027915075038,"cle":5216,...}}
/delta_0011368_0011368_0000/bucket_00003
{"operation":2,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11368,"row":null}
{"operation":2,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11368,"row":null}
/delta_0011369_0011369_0000/bucket_00003
{"operation":0,"originalTransaction":11369,"bucket":3,"rowId":1,"currentTransaction":11369,"row":{"TS":1574157407855174144,"cle":5216,...}}
{"operation":0,"originalTransaction":11369,"bucket":3,"rowId":0,"currentTransaction":11369,"row":{"TS":1574157407855265906,"cle":5218,...}}
+-------------------------------------------------+--------+--+
|                     row__id                     |  cle   |
+-------------------------------------------------+--------+--+
| {"transactionid":11367,"bucketid":0,"rowid":0}  | 5209   |
| {"transactionid":11369,"bucketid":0,"rowid":0}  | 5211   |
| {"transactionid":11369,"bucketid":1,"rowid":0}  | 5210   |
| {"transactionid":11369,"bucketid":2,"rowid":0}  | 5214   |
| {"transactionid":11369,"bucketid":2,"rowid":1}  | 5215   |
| {"transactionid":11365,"bucketid":3,"rowid":0}  | *5218* |
| {"transactionid":11365,"bucketid":3,"rowid":1}  | *5216* |
| {"transactionid":11369,"bucketid":3,"rowid":1}  | *5216* |
| {"transactionid":11369,"bucketid":3,"rowid":0}  | *5218* |
| {"transactionid":11369,"bucketid":4,"rowid":0}  | 5217   |
| {"transactionid":11369,"bucketid":4,"rowid":1}  | 5213   |
| {"transactionid":11369,"bucketid":7,"rowid":0}  | 5212   |
+-------------------------------------------------+--------+--+
As you can see, we have duplicate rows for the "cle" values 5216 and 5218
(marked with asterisks above).
Do we have to keep the rowids in order within a delta file? That is the only
difference I have noticed in my tests with beeline.
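To make the ordering difference concrete, here is a small Python sketch of
the bucket 3 events from the dump. The helper names (is_sorted_by_key,
replay) are my own and are not Hive APIs. It assumes that operation 0 is an
insert and operation 2 a delete, and that a row is identified by
(originalTransaction, bucket, rowId). Only the "cle" column is kept from each
row; the other columns are omitted. Whether the Hive reader actually relies
on sorted input is an assumption I have not confirmed.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int, int]  # (originalTransaction, bucket, rowId)


def key(event: dict) -> Key:
    """Row identity as it appears in the ORC delta events."""
    return (event["originalTransaction"], event["bucket"], event["rowId"])


def is_sorted_by_key(events: List[dict]) -> bool:
    """True if the events of one delta file are in ascending key order."""
    keys = [key(e) for e in events]
    return keys == sorted(keys)


def replay(deltas: List[List[dict]]) -> Dict[Key, dict]:
    """Order-insensitive replay: 0 adds a row, 2 removes the row it targets."""
    live: Dict[Key, dict] = {}
    for events in deltas:
        for e in events:
            if e["operation"] == 0:
                live[key(e)] = e["row"]
            elif e["operation"] == 2:
                live.pop(key(e), None)
    return live


def ev(op: int, orig_txn: int, row_id: int, cur_txn: int, row) -> dict:
    """Build one event for bucket 3 (hypothetical helper for brevity)."""
    return {"operation": op, "originalTransaction": orig_txn, "bucket": 3,
            "rowId": row_id, "currentTransaction": cur_txn, "row": row}


# Events in the same order as in the dump above.
delta_11365 = [ev(0, 11365, 0, 11365, {"cle": 5218}),
               ev(0, 11365, 1, 11365, {"cle": 5216})]
delta_11368 = [ev(2, 11365, 1, 11368, None),
               ev(2, 11365, 0, 11368, None)]
delta_11369 = [ev(0, 11369, 1, 11369, {"cle": 5216}),
               ev(0, 11369, 0, 11369, {"cle": 5218})]

for name, delta in [("delta_0011365", delta_11365),
                    ("delta_0011368", delta_11368),
                    ("delta_0011369", delta_11369)]:
    print(name, "sorted by key:", is_sorted_by_key(delta))

print(replay([delta_11365, delta_11368, delta_11369]))
```

With the events above, the first delta is in key order, while the delete
delta and the second insert delta are not. An order-insensitive replay leaves
only the two rows written by transaction 11369, which is the result I
expected from the table.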
Thanks
On Tue, 19 Nov 2019 at 00:18, David Morin <[email protected]> wrote:
> Hello,
>
> I'm trying to understand the purpose of the rowid column inside an ORC
> delta file:
> {"transactionid":11359,"bucketid":5,"*rowid*":0}
> ORC view: {"operation":0,"originalTransaction":11359,"bucket":5,"*rowId*":0,"currentTransaction":11359,"row":...}
> I use HDP 2.6 => Hive 2
>
> If I want an INSERT / DELETE / INSERT sequence to be idempotent, do we
> have to keep the same rowid?
> It seems that when the rowid changes during the second INSERT, I get a
> duplicate row.
> I expected to be able to create a new rowid for the new transaction during
> the second INSERT, but that seems to generate duplicate records.
>
> Regards,
> David