InvisibleProgrammer commented on PR #6545: URL: https://github.com/apache/hive/pull/6545#issuecomment-4751215777
Hi, firstly, I have to say I love seeing maybe the first idea that accepts the fact that different databases have their different needs and considers individual implementations for each database engine that Hive supports. I have two high-level questions in my mind about the solution: Firstly, the HMS connection is just a connection string. There is no guarantee that the user uses that database only for Hive. Fix me if I'm wrong but I think that change can potentially have huge impact: especially for large databases, rebuilding all the indexes takes time. Honestly, it can take a lot of time. And if there is only one customer that shares the database with any kind of other software, Hive will impact their behaviour as well. I know it is unlikely but as it is possible I bet there will be at least one user whose production will be impacted. But on the other side, as I haven't seen such a setup so far, I would just suggesting to add a warning about the behaviour and also the PR title as well: it doesn't rebuild the Metastore indexes. It rebuilds all the indexes in the database that hosts Metastore objects. Secondly, I have little knowledge about other databases. But for example, for MSSQL, a nonclustered index points to a clustered index (exception is when the table itself a heap). For performance reasons, I would consider having an order in executing the rebuilds: clustered indexes first, nonclustered indexes second. And my +1 question is about execution time: I assume rebuilding indexes can be a long-running process. I haven't checked that part of the code so far so please excuse me if I ask trivial questions: what kind of user interaction we have? Is there a progress bar to show the progress? What if the customer stops the process in the middle? I don't know if all the supported databases are supporting rebuilding indexes. Do we have any of them that actually requires to drop the existing index? If yes, what happens if the process stops after dropping the old one? Will the customer have any kind of feedback about a missing index? And lastly, an other performance related question that came into my mind. I'm not a DBA so I don't know the answer, just curious: assuming the user has multiple instances, like a primary that accepts writes and multiple replicas. According to my knowledge, in that kind of setup the changes are synchronized with transaction log. I have a fear if we rebuild all the indexes in one step, it can potentially create a huge amount of transaction log. And if it happens, it can cause significant delay and/or network issues between the primary and the replicas. Honestly, I have no answer for this question right now: as a usual practise, I would recommend to do such a thing with waits so that the replicas can catch up with the changes. But to be able to fine tune that process, it is required to know the given setup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
