InvisibleProgrammer commented on PR #6545:
URL: https://github.com/apache/hive/pull/6545#issuecomment-4751215777

   Hi, firstly, I have to say I love seeing maybe the first idea that accepts 
the fact that different databases have their different needs and considers 
individual implementations for each database engine that Hive supports. 
   
   I have two high-level questions in my mind about the solution: 
   
   Firstly, the HMS connection is just a connection string. There is no 
guarantee that the user uses that database only for Hive. Fix me if I'm wrong 
but I think that change can potentially have huge impact: especially for large 
databases, rebuilding all the indexes takes time. Honestly, it can take a lot 
of time. And if there is only one customer that shares the database with any 
kind of other software, Hive will impact their behaviour as well. 
   I know it is unlikely but as it is possible I bet there will be at least one 
user whose production will be impacted. 
   But on the other side, as I haven't seen such a setup so far, I would just 
suggesting to add a warning about the behaviour and also the PR title as well: 
it doesn't rebuild the Metastore indexes. It rebuilds all the indexes in the 
database that hosts Metastore objects.
   
   Secondly, I have little knowledge about other databases. But for example, 
for MSSQL, a nonclustered index points to a clustered index (exception is when 
the table itself a heap). For performance reasons, I would consider having an 
order in executing the rebuilds: clustered indexes first, nonclustered indexes 
second. 
   
   And my +1 question is about execution time: I assume rebuilding indexes can 
be a long-running process. I haven't checked that part of the code so far so 
please excuse me if I ask trivial questions: what kind of user interaction we 
have? Is there a progress bar to show the progress? What if the customer stops 
the process in the middle? I don't know if all the supported databases are 
supporting rebuilding indexes. Do we have any of them that actually requires to 
drop the existing index? If yes, what happens if the process stops after 
dropping the old one? Will the customer have any kind of feedback about a 
missing index? 
   
   And lastly, an other performance related question that came into my mind. 
I'm not a DBA so I don't know the answer, just curious: assuming the user has 
multiple instances, like a primary that accepts writes and multiple replicas. 
According to my knowledge, in that kind of setup the changes are synchronized 
with transaction log. 
   I have a fear if we rebuild all the indexes in one step, it can potentially 
create a huge amount of transaction log. And if it happens, it can cause 
significant delay and/or network issues between the primary and the replicas. 
   Honestly, I have no answer for this question right now: as a usual practise, 
I would recommend to do such a thing with waits so that the replicas can catch 
up with the changes. But to be able to fine tune that process, it is required 
to know the given setup. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to