kazdy commented on PR #7640: URL: https://github.com/apache/hudi/pull/7640#issuecomment-1408800988
> Let me know if this makes sense. happy to jam to see if we can really pull this off by a row Id sort of generating rather than based on record payload.

@nsivabalan I did some reading and found that Oracle and Postgres both use pseudo/system columns to imitate a PK when one is not defined. Do you think it would be possible to do something similar to the Oracle [ROWID](https://docs.oracle.com/database/121/SQLRF/sql_elements001.htm#SQLRF00213) pseudo column or the Postgres [ctid](https://www.postgresql.org/docs/current/ddl-system-columns.html) system column?

Per the Oracle docs, rowids contain the following information:
- The data block of the data file containing the row. The length of this string depends on your operating system.
- The row in the data block.
- The database file containing the row. The first data file has the number 1. The length of this string depends on your operating system.
- The data object number, which is an identification number assigned to every database segment. You can retrieve the data object number from the data dictionary views USER_OBJECTS, DBA_OBJECTS, and ALL_OBJECTS. Objects that share the same segment (for example, clustered tables in the same cluster) have the same object number.

It seems like this would be doable with the vectorized Parquet reader (rowId / ColumnVector, etc.). Instead of "row in data block", the file name is already known and saved in the meta columns. I don't know how this would handle the first write, though, since at that point there is no column vector and the hash cannot be generated. So it might be impossible to use in an upsert partitioner, and the whole idea may not make sense :).

I see some restrictions:
- only supported in CoW (because the Parquet vectorized reader needs to be used?),
- only available with Virtual Keys,
- no incremental queries allowed (I think CDC from a non-PK table is not supported in the Oracle RDBMS),
- no support for datasource writes with upsert when a ROWID/record key is not provided (this should not be a problem with Spark SQL, since it first queries Hudi and would therefore be able to get the ROWID) (?)

But it would allow DS + SQL inserts, SQL updates, and SQL deletes without the need to define a PK on the table. Does it make any sense?
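To make the analogy concrete, here is a rough sketch of how such a pseudo key could be composed, mirroring Oracle's ROWID layout with Hudi-side equivalents: the base file name (which Hudi already stores in the `_hoodie_file_name` meta column) in place of "data file + block", and the row's position from the vectorized reader in place of "row in data block". The function name and key layout below are purely illustrative assumptions, not anything Hudi actually provides:

```python
import hashlib

def synthetic_row_id(file_name: str, row_index: int) -> str:
    """Compose a hypothetical pseudo record key: <file-name-hash>-<row-position>.

    file_name : base file name, as kept in the _hoodie_file_name meta column
    row_index : row position within the file, as reported by a vectorized reader
    """
    # Hash the file name so the key stays a fixed, compact width
    file_part = hashlib.md5(file_name.encode("utf-8")).hexdigest()[:12]
    return f"{file_part}-{row_index}"

# Illustrative usage: row at position 42 of an existing base file
key = synthetic_row_id("a1b2c3d4-0_0-1-1_20230101000000.parquet", 42)
print(key)
```

Note this sketch also makes the first-write problem visible: before any base file exists there is no `file_name` to hash, which is exactly why the scheme could not serve an upsert partitioner on the initial write.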
