Hi Joel,

Sounds reasonable. But if Drill checks this TTL property from the metadata cache file for every query and for every file, instead of checking the file timestamp, it will not give any benefit. I suppose we can add this TTL property only to the root metadata cache file and check it only once per query.
Could you clarify the details: what is the TTL time? How would the TTL info be used to determine whether a refresh is needed for the query?

Kind regards
Vitalii

On Thu, Jul 12, 2018 at 4:40 PM Joel Pfaff <joel.pf...@gmail.com> wrote:
> Hello,
>
> Today, on a table for which we have created statistics (through the REFRESH
> TABLE METADATA <path to table> command), Drill validates the timestamp of
> every file or directory involved in the scan.
>
> If the timestamps of the files are greater than the one of the metadata
> file, then a regeneration of the metadata file is triggered.
> In the case the timestamp of the metadata file is the greatest, then the
> planning continues without regenerating the metadata.
>
> When the number of files to be queried increases, this operation can take a
> significant amount of time.
> We have seen cases where this validation step alone is taking 3 to 5
> seconds (just checking the timestamps), meaning the planning time was
> taking way more time than the querying time.
> And this can be problematic in some use cases where the response time is
> favored compared to the `accuracy` of the data.
>
> What would you think about adding an option to the metadata generation, so
> that the metadata is trusted for a configurable time period?
> Example : REFRESH TABLE METADATA <path to table> WITH TTL='15m'
> The exact syntax, of course, needs to be thought through.
>
> This TTL would be stored in the metadata file, and used to determine if a
> refresh is needed at each query. And this would significantly decrease the
> planning time when the number of files represented in the metadata file is
> important.
>
> Of course, this means that there could be cases where the metadata would be
> wrong, so cases like the one below would need to be solved (since they may
> happen much more frequently):
> https://issues.apache.org/jira/browse/DRILL-6194
> But my feeling is that since we already do have a kind of race condition
> between the view of the file system at the planning time, and the state
> that will be found during the execution, we could gracefully accept that
> some files may have disappeared between the planning and the execution.
>
> In the case the TTL would need to be changed, or be removed completely,
> this could be done by re-issuing a REFRESH TABLE METADATA, either with a
> new TTL, or without TTL at all.
>
> What do you think?
>
> Regards, Joel
>
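
To make the discussion concrete, here is a minimal sketch in Python of how a TTL stored once in the root metadata cache file could short-circuit the per-file timestamp validation. This is not Drill's code: `RootMetadata`, `parse_ttl`, `needs_refresh`, and the `'15m'` unit grammar are all hypothetical names and assumptions for illustration, and the fallback branch only approximates the timestamp comparison described in the quoted message.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Assumed unit suffixes for the TTL string; the thread leaves the syntax open.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_ttl(ttl: str) -> int:
    """Convert a TTL string such as '15m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", ttl.strip())
    if match is None:
        raise ValueError(f"unsupported TTL: {ttl!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


@dataclass
class RootMetadata:
    """Hypothetical view of the root metadata cache file."""
    created_at: float            # epoch seconds when the cache was written
    ttl_seconds: Optional[int]   # None means REFRESH was issued without a TTL


def needs_refresh(
    meta: RootMetadata,
    list_mtimes: Callable[[], Iterable[float]],
    now: Optional[float] = None,
) -> bool:
    """Decide, once per query, whether the metadata must be regenerated."""
    now = time.time() if now is None else now
    if meta.ttl_seconds is not None:
        # TTL path: a single comparison, no file system scan at all.
        return now - meta.created_at > meta.ttl_seconds
    # No TTL: fall back to comparing every file or directory timestamp
    # against the metadata file, which is the expensive step.
    return any(mtime > meta.created_at for mtime in list_mtimes())
```

Passing the file listing as a callable means the scan only runs on the fallback path. While the TTL has not expired, files changed after `created_at` are ignored, which is the staleness trade-off raised in the thread.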