kazdy commented on PR #7640: URL: https://github.com/apache/hudi/pull/7640#issuecomment-1408800988
> Let me know if this makes sense. happy to jam to see if we can really pull this off by a row Id sort of generating rather than based on record payload.

@nsivabalan I did some reading and found that Oracle and Postgres both use pseudo/system columns to imitate a PK when one is not defined. Do you think it would be possible to do something similar to the Oracle [ROWID](https://docs.oracle.com/database/121/SQLRF/sql_elements001.htm#SQLRF00213) pseudo column or the Postgres [ctid](https://www.postgresql.org/docs/current/ddl-system-columns.html) system column?

Per the Oracle docs, rowids contain the following information:
- The data block of the data file containing the row. The length of this string depends on your operating system.
- The row in the data block.
- The database file containing the row. The first data file has the number 1. The length of this string depends on your operating system.
- The data object number, which is an identification number assigned to every database segment. You can retrieve the data object number from the data dictionary views USER_OBJECTS, DBA_OBJECTS, and ALL_OBJECTS. Objects that share the same segment (for example, clustered tables in the same cluster) have the same object number.

It seems like this would be doable with the vectorized Parquet reader (rowId / ColumnVector, etc.). Instead of "row in data block", the file name is already known and saved in the meta columns. I don't know how this would handle the first write, though, since at that point there is no column vector and the hash cannot be generated. So it might be impossible to use in an upsert partitioner, and the whole idea may not make sense :).

I see some restrictions:
- only supported in CoW (because the Parquet vectorized reader needs to be used?),
- only available with Virtual Keys,
- no incremental queries allowed (I think CDC from a non-PK table is not supported in the Oracle RDBMS),
- no support for datasource writes with upsert when a ROWID/record key is not provided (this should not be a problem with Spark SQL, since it first queries Hudi and would therefore be able to get the ROWID) (?)

But it would allow DS + SQL inserts, SQL updates, and SQL deletes without the need to define a PK on the table. Does it make any sense?
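To make the analogy concrete, here is a rough sketch of how such a pseudo key could be composed, mirroring Oracle's ROWID layout with Hudi-side equivalents: the base file name (which Hudi already stores in the `_hoodie_file_name` meta column) in place of "data file + block", and the row's position from the vectorized reader in place of "row in data block". The function name and key layout below are purely illustrative assumptions, not anything Hudi actually provides:

```python
import hashlib

def synthetic_row_id(file_name: str, row_index: int) -> str:
    """Compose a hypothetical pseudo record key: <file-name-hash>-<row-position>.

    file_name : base file name, as kept in the _hoodie_file_name meta column
    row_index : row position within the file, as reported by a vectorized reader
    """
    # Hash the file name so the key stays a fixed, compact width
    file_part = hashlib.md5(file_name.encode("utf-8")).hexdigest()[:12]
    return f"{file_part}-{row_index}"

# Illustrative usage: row at position 42 of an existing base file
key = synthetic_row_id("a1b2c3d4-0_0-1-1_20230101000000.parquet", 42)
print(key)
```

Note this sketch also makes the first-write problem visible: before any base file exists there is no `file_name` to hash, which is exactly why the scheme could not serve an upsert partitioner on the initial write.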
