Thank you for your response. I understand... Just a few points before I accept that this is too complicated :)
The main idea is to keep different versions of the data in the same table, similar to HBase, but at the row level. The older versions don't have to be accessible from Hive, only the most recent one; you just need an access layer that works on the most recent version of each row. If you can think of a different way of uniquely identifying a row (to find its versions) and of ordering those versions (a timestamp, a counter, or a version number) to find the most recent one, it doesn't have to be the columns that I specified before. It can be a separate file that you create in the background (which could also be the index file!).

Oracle has ROWID for the physical location of a row and locks the row before manipulating the data. Hadoop has the advantage of storage and map-reduce, so why not use it: keep all versions of the changed data and access the most recent one via map-reduce? Access can get slower over time as versions accumulate, and that can be fixed by a flush or a full replication of the data from time to time, done by the end user in a maintenance window.

Hive is a great tool for accessing and manipulating Hadoop files. You are doing an amazing job. I have no idea what complications you face each day. Just disregard this if I am talking nonsense, and keep up the good work!

Cheers!
Atreju,

>
> Your work is great. Personally I would not get too tied up in the
> transactional side of hive. Once you start dealing with locking and
> concurrency the problem becomes tricky.
>
> We hivers have a long time tradition on 'punting' on complicated stuff we
> do not want to deal with. :) Thus we only have 'Insert Overwrite' no 'insert
> update' :)
>
> Again, I think you wrote a really cool application. It would make a great
> use case, blog post, or a stand alone application. Call it HiveMysqlRsync or
> something :). However you mention several requirements that are specific to
> your application timestamp and primary key. If you can abstract all your
> application specific logic it could make its way into hive. But it might be
> a stand alone program because hive to rdbms replication might be a little
> out of scope.
>
> Edward
>
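
For reference, the "most recent version per row" idea from the top of this message can be sketched in a few lines of Python. This is only an illustration under assumptions: the names RowVersion, row_id, version, latest_versions and compact are hypothetical, everything runs in memory rather than as a map-reduce job, and none of it is something Hive provides.

```python
from typing import Dict, Iterable, List, NamedTuple


class RowVersion(NamedTuple):
    row_id: str    # hypothetical unique row identifier
    version: int   # hypothetical timestamp, counter, or version number
    payload: dict  # column values for this version of the row


def latest_versions(rows: Iterable[RowVersion]) -> Dict[str, RowVersion]:
    """Access layer: keep only the highest-versioned record per row_id."""
    latest: Dict[str, RowVersion] = {}
    for row in rows:
        current = latest.get(row.row_id)
        if current is None or row.version > current.version:
            latest[row.row_id] = row
    return latest


def compact(rows: Iterable[RowVersion]) -> List[RowVersion]:
    """'Flush': rewrite the table with only the latest version of each row."""
    return sorted(latest_versions(rows).values(), key=lambda r: r.row_id)


# An "update" is just another appended version of the same row.
table = [
    RowVersion("a", 1, {"qty": 10}),
    RowVersion("b", 1, {"qty": 5}),
    RowVersion("a", 2, {"qty": 12}),
]

view = latest_versions(table)
print(view["a"].payload)     # {'qty': 12}
print(len(compact(table)))   # 2
```

Reads go through latest_versions, so older versions stay in storage but are never visible; compact corresponds to the occasional maintenance-window rewrite that keeps reads from slowing down as versions pile up.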
