vinothchandar commented on issue #1480: [SUPPORT] Backwards Incompatible Schema 
Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
 
 
   >>we would like the instant timestamps to be the same in the new target 
tables after the transformation so that downstream clients can continue to use 
their existing instant values while performing incremental pull queries. 
   
   IIUC, the current initialization process hands you a single commit for the 
first ingest, but you basically want a physical copy of the old data as the 
new data, with just renamed fields/a new schema. In general, this may be worth 
adding support for in the new exporter tool. cc @xushiyan, wdyt? Essentially, 
something that will preserve file names and just transform the data. 
   
   For now, even if you create those commit timeline files yourself in 
`.hoodie`, it may not work, since the metadata inside will point to files that 
no longer exist in the new table. Here's an approach that could work: write a 
small program that will 
   
   - First, copy the `.hoodie` folder to the new table location
   - Then list all files (directly using fs.listStatus()) and filter them such 
that their commit time < latest commit time in the `.hoodie` folder you copied 
above
   - Read all files out using AvroParquetReader to get an RDD[GenericRecord] 
(if it's MOR, we need more work), then do your schema adjusting to derive a new 
RDD[GenericRecord]
   - Write this out using HoodieAvroParquetWriter back into the same file 
names
   
   Essentially, you will have the same file names and the same timeline 
(`.hoodie`) metadata, just with a different schema. 
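
   A rough sketch of the filtering step above, in plain Python rather than the 
Hadoop/Spark APIs, just to show the logic. It assumes base files are named 
`<fileId>_<writeToken>_<instantTime>.parquet` and that completed commits show 
up in `.hoodie` as `<instantTime>.commit`; please verify both against your 
table. The function names and the `inclusive` flag are made up for 
illustration and are not Hudi APIs.

```python
from pathlib import PurePosixPath

PARQUET_EXT = ".parquet"
COMMIT_EXT = ".commit"


def commit_time_of(path):
    """Return the instant time encoded in a base file name, or None.

    Assumes the name looks like <fileId>_<writeToken>_<instantTime>.parquet.
    """
    name = PurePosixPath(path).name
    if not name.endswith(PARQUET_EXT):
        return None
    parts = name[: -len(PARQUET_EXT)].split("_")
    if len(parts) != 3:
        return None
    return parts[2]


def latest_completed_commit(hoodie_entries):
    """Return the largest instant among <instantTime>.commit entries, or None."""
    instants = [
        entry[: -len(COMMIT_EXT)]
        for entry in hoodie_entries
        if entry.endswith(COMMIT_EXT)
    ]
    # Instant times are fixed-width timestamps, so string order is time order.
    return max(instants) if instants else None


def files_to_rewrite(data_files, latest_commit, inclusive=False):
    """Keep base files whose commit time is before the latest commit.

    inclusive=True (hypothetical flag) also keeps files written by the
    latest commit itself.
    """
    selected = []
    for path in data_files:
        instant = commit_time_of(path)
        if instant is None:
            continue  # not a base file, e.g. partition metadata
        if instant < latest_commit or (inclusive and instant == latest_commit):
            selected.append(path)
    return selected
```

   Each selected file would then be read, transformed to the new schema, and 
written to the same relative path under the new table location, so the copied 
timeline metadata still points at files that exist.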
   
   Let's also wait to hear from @xushiyan. Maybe the exporter tool could be 
reused here.
