Michael Chirico created HIVE-22702:
--------------------------------------

             Summary: ALTER TABLE REMOVE PARTITION is inefficient
                 Key: HIVE-22702
                 URL: https://issues.apache.org/jira/browse/HIVE-22702
             Project: Hive
          Issue Type: Improvement
          Components: Database/Schema
            Reporter: Michael Chirico


I recently realized the poor partitioning of a table of mine was becoming a 
major bottleneck and endeavored to reset the partitioning.

At this point, the table had about 56K partitions (year|month|day|city) 
combinations; moving to the more efficient year|month partitions means there's 
about 24.

In the process, I was having trouble fixing the registration of the table 
because of the size of its partition DB; I happened upon this SO Q&A which 
addresses the same issue:

https://stackoverflow.com/questions/50715939/drop-table-in-hive-via-spark-hangs/50814566#comment105440563_50814566

I set about batching through ALTER TABLE x DROP PARTITION (...), PARTITION 
(...) 200 at a time; it would run for about 2 hours to accomplish this, which 
strikes me as being quite inefficient.

(apologies that I haven't done a fully proper analysis of the scaling 
efficiency in this ticket)

If I were designing it from scratch, I would:

* Keep the database of existing partitions sorted
* Sort the incoming partitions to remove
* Iterate via "shrinking binary search" (each partition is searched with binary 
search, and we can eliminate from the existing DB anything "less than" the 
current index when moving to the next iteration)

Is there something preventing this from being achieved?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to