This would be a great improvement (and long overdue). Thanks for working on it. I lean toward option #2, perhaps with a Drillbit startup option that forces partitioning of all existing profiles (the default can be the 1000 profiles that you proposed). Making this an explicit option tells the user that startup could take longer. A separate thread isn't really needed, since once the initial partitioning is done, new profiles are written directly to the sub-directories anyway.
Aman

On Tue, Apr 16, 2019 at 4:57 PM Kunal Khatua <[email protected]> wrote:

> Hi guys
>
> I'm working on a draft PR to improve the management of Drill's query
> profiles.
> https://github.com/apache/drill/pull/1750
>
> The design basically partitions existing profiles into sub-directories
> based on the structure 'yyyy/MM/dd' (can be customized).
> All new profiles are written directly into partitioned directories.
> For existing profiles in the `profiles` directory, the Drillbit will
> partition the k-most-recent profiles (configurable) into the
> sub-directories, but only once (during startup), so that the Drillbit
> does not spend too long starting up.
> This substantially improves the response time for profile listing in the
> WebUI, especially when the number of profiles is in the hundreds of
> thousands.
>
> However, I have the challenge of figuring out what to do for users who
> might want to dump a profile in the same directory for the purpose of
> rendering it in the WebUI.
>
> I have two options at the moment (and am open to others):
>
> 1. Create a thread that periodically checks whether there is a profile in
>    the root of the `profiles` directory that needs to be 'indexed' into its
>    correct partition.
> 2. Avoid the need for a thread by creating an unpartitioned sub-directory
>    within the `profiles` directory that is only meant for hosting profiles
>    for WebUI rendering.
>    E.g., a developer would dump a profile into `profiles/tmp` and view it.
>
> I'm inclined towards option #1 because it guarantees that eventually all
> profiles will be 'indexed' into their partitions, and that we don't need
> to do it only during startup.
>
> With option #2, e.g., if I have 100,000 profiles and my Drillbit is
> configured to partition only the 1000 most recent profiles at startup,
> I'll eventually get all profiles partitioned after 100 restarts!
> However, #2 would ensure that profiles that are only for the purpose of
> rendering remain accessible (for sharing again) and do not get indexed.
> Plus, there is no need for an additional thread to be added to the
> Drillbit.
>
> Which one should I go for? Or is there a third alternative?
>
> Thanks in advance!
>
> ~ Kunal
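For readers following the thread, the one-time startup partitioning described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the PR: the function name `partition_recent_profiles`, the `.sys.drill` suffix filter, and the use of file modification time (in UTC) to pick the 'yyyy/MM/dd' directory are all assumptions made for the example. Passing `k=None` models the suggested "force partitioning of all existing profiles" option.

```python
import calendar
import os
import shutil
import tempfile
import time
from pathlib import Path


def partition_recent_profiles(root, k=1000, suffix=".sys.drill", layout="%Y/%m/%d"):
    """Move the k most recently modified profiles in `root` into dated sub-directories.

    Hypothetical helper: `k=None` moves every profile found in the root.
    Returns the new paths, newest profile first.
    """
    root = Path(root)
    # Only files sitting directly in the root are candidates; sub-directories
    # (already-partitioned dates, or something like `tmp`) are left alone.
    candidates = [p for p in root.iterdir() if p.is_file() and p.name.endswith(suffix)]
    # Newest first, so a bounded k covers the most recent profiles.
    candidates.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    limit = len(candidates) if k is None else k

    moved = []
    for profile in candidates[:limit]:
        # Assumption: the modification time stands in for the query date.
        stamp = time.gmtime(profile.stat().st_mtime)
        target_dir = root / time.strftime(layout, stamp)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / profile.name
        shutil.move(str(profile), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Three fake profiles dated 14, 15 and 16 April 2019 (UTC).
        for name, day in (("a", 14), ("b", 15), ("c", 16)):
            path = root / f"{name}.sys.drill"
            path.write_text("{}")
            ts = calendar.timegm((2019, 4, day, 12, 0, 0))
            os.utime(path, (ts, ts))
        # Bounded pass, as at a normal startup: only the 2 newest are moved.
        for new_path in partition_recent_profiles(root, k=2):
            print(new_path.relative_to(root).as_posix())
```

With a bounded `k`, older profiles stay in the root until a later pass, which is the "100 restarts" concern raised in the quoted message. With `k=None` a single (possibly slow) pass clears the root. Because only files directly in the root are considered, a directory such as `profiles/tmp` would not be touched by this sketch.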
