If your use case can be addressed by adding a session option to skip the
checks, that would be simpler to do and could be done much faster.
Adding TTL support would be more complex.
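
For illustration, here is a rough sketch of what such a skip could look
like in the planning code. The option name and the helper methods are
hypothetical, not actual Drill APIs:

// If the session option is set, trust the existing metadata cache and
// skip the per-directory modification time validation entirely.
boolean skipCheck = session.getOption("planner.skip_metadata_check");
if (skipCheck) {
  return readCachedMetadata(rootMetadataPath);  // no timestamp checks
}
// Otherwise keep today's behavior: check the root directory and all
// subdirectories underneath.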

I will let someone else comment on long-term plans as I don't know the
details.

Thanks
Padma

On Thu, Aug 2, 2018 at 8:54 AM, Joel Pfaff <joel.pf...@gmail.com> wrote:

> Hello,
>
> "I think the simplest thing that should be done first is to provide option
> to skip the check"
> I agree that whatever we do, we should not introduce any change in user
> experience by default.
> But since the default's behaviour is to not set any TTL in the meta-data, I
> have conflicted feelings about that.
> But in our usecase, we will mostly try to optimize a single application, so
> we could add this session option at querying time as well.
> Long story short: I do not have strong opinions about what the default
> should be.
>
> Concerning the overall change (the introduction of TTL): should we submit
> a design document, or would you prefer to invest in the longer-term
> metadata repository?
>
> Regards, Joel
>
>
> On Thu, Aug 2, 2018 at 6:28 AM, Padma Penumarthy <penumarthy.pa...@gmail.com> wrote:
>
> > I think the simplest thing that should be done first is to provide an
> > option to skip the check.
> > The default behavior for that option will be what we do today, i.e. check
> > the root directory and all subdirectories underneath.
> >
> > Thanks
> > Padma
> >
> >
> >
> > On Mon, Jul 30, 2018 at 3:01 AM, Joel Pfaff <joel.pf...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > Thanks a lot for all this feedback; trying to respond to everything
> > > below:
> > >
> > > @Parth:
> > > "I don't think we would want to maintain a TTL for the metadata store
> so
> > > introducing one now would mean that we might break backward
> compatibility
> > > down the road."
> > > Yes, I am aware of this activity starting, and I agree that whatever
> > > solution is decided later on for the new metadata store, it most probably
> > > won't support a concept of TTL. This means that we would either have to
> > > break support for the `WITH TTL` extension of the SQL command, or ignore
> > > it down the road. Neither of these options seems particularly appealing
> > > to me.
> > >
> > > @Padma:
> > > "What issues we need to worry about if different directories in the
> > > hierarchy are checked last at different times ?"
> > > Knowing that the refresh is always recursive, we can only have two cases:
> > > the parent-level cache is refreshed at the same time as the child-level
> > > cache, or the parent-level cache is older than the child-level cache
> > > (because a query has run in a sub-directory, which triggered a refresh of
> > > the metadata in that sub-directory). In both cases, checking the timestamp
> > > of the cache file at the root directory of the query is enough to know if
> > > the TTL criterion is respected.
> > > In the case where the cache files are not refreshed at the same time
> > > between parent and child directories, and the parent's cache is still
> > > valid with regard to its TTL, Drill would trust the parent cache and
> > > issue an execution plan with this set of files. The same query on a child
> > > folder would use the child's cache, which would have been refreshed more
> > > recently, and this could result in an execution plan with a different set
> > > of files. So basically, this TTL feature could create discrepancies in
> > > the results, and these discrepancies could last up to the TTL value.
> > >
> > > "Do we have to worry about coordinating against multiple drillbits ?"
> > > Indeed, that would be better. Since the problem already exists today (I
> > > have not found any locking mechanism on the metadata file), I am not sure
> > > this change would make it worse. So the reply is yes, we should worry,
> > > but I think the fix for that would be independent of this change.
> > >
> > > "Another option is if the difference between modification time of
> > directory
> > > and metadata cache file is within TTL limit, do not do anything. If we
> do
> > > that, do we get errors during execution (I think so) ?"
> > > We would get errors if files were removed between the time the metadata
> > > was last generated and the time of execution. As in the case above, this
> > > can already happen, since there is currently no guarantee that the files
> > > seen at planning time will still be there at execution time. The
> > > timeframe would increase from a few milliseconds to several minutes, so
> > > this kind of problem would occur much more frequently.
> > > I would recommend quietly ignoring missing files by treating them as
> > > empty files.
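> > >
> > > To illustrate (a rough sketch only; RecordBatch is Drill's batch
> > > interface, but openParquetFile and emptyBatch are hypothetical helpers):
> > >
> > > import java.io.FileNotFoundException;
> > > import java.io.IOException;
> > >
> > > RecordBatch scanFile(String path) throws IOException {
> > >   try {
> > >     // File listed in the cached metadata at planning time.
> > >     return openParquetFile(path);
> > >   } catch (FileNotFoundException e) {
> > >     // The file disappeared after the metadata was generated:
> > >     // treat it as empty instead of failing the whole query.
> > >     return emptyBatch();
> > >   }
> > > }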
> > >
> > > "Also, how to reset that state and do metadata cache refresh eventually
> > ?"
> > > We could reuse the REFRESH TABLE METADATA command to force the refresh.
> > > This would allow collaborative ingestion jobs to force the refresh when
> > > the datasets have changed.
> > > Non-collaborative jobs would then rely on the TTL to get the new dataset
> > > available.
> > >
> > > "Instead of TTL, I think having a system/session option that will let
> us
> > > skip this check altogether would be a good thing to have. So, if we
> know
> > we
> > > are not adding new data, we can set that option."
> > > I see the need to set the TTL per table, since different tables will
> > > have different update frequencies.
> > > I agree on a session option to bypass the TTL check, so that such a user
> > > will always see the latest dataset.
> > > The question then becomes: what should the default value for this option
> > > be?
> > >
> > > Regards, Joel
> > >
> > >
> > > On Fri, Jul 13, 2018 at 9:06 AM, Padma Penumarthy <penumarthy.pa...@gmail.com> wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > This is my understanding:
> > > > We have a list of all directories (i.e. all subdirectories and their
> > > > subdirectories, etc.) in the metadata cache file of each directory. We
> > > > go through that list of directories and check each directory's
> > > > modification time against the modification time of the metadata cache
> > > > file in that directory.
> > > > If these do not match for any of the directories, we rebuild the
> > > > metadata cache for the whole hierarchy.
> > > > The reason we have to do this is that adding new files will only update
> > > > the modification time of the immediate parent directory, not the whole
> > > > hierarchy.
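> > > >
> > > > In rough Java, the check as I understand it (simplified; the cache file
> > > > name and this standalone method are my own sketch, not Drill's actual
> > > > code):
> > > >
> > > > import java.io.IOException;
> > > > import java.util.List;
> > > > import org.apache.hadoop.fs.FileSystem;
> > > > import org.apache.hadoop.fs.Path;
> > > >
> > > > boolean cacheIsStale(FileSystem fs, List<Path> dirs) throws IOException {
> > > >   for (Path dir : dirs) {
> > > >     long dirMtime = fs.getFileStatus(dir).getModificationTime();
> > > >     long cacheMtime = fs.getFileStatus(
> > > >         new Path(dir, ".drill.parquet_metadata")).getModificationTime();
> > > >     if (dirMtime != cacheMtime) {
> > > >       return true;  // something changed under this directory
> > > >     }
> > > >   }
> > > >   return false;  // every directory matches its cache file
> > > > }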
> > > >
> > > > Regarding your proposal, some random thoughts:
> > > > How will you get a current time that can be compared against the last
> > > > modification time set by the file system?
> > > > I think you mean comparing the current system time of the running Java
> > > > process (i.e. the drillbit) against the last time we checked whether
> > > > the metadata cache needs to be updated for that directory.
> > > > What issues do we need to worry about if different directories in the
> > > > hierarchy are checked last at different times?
> > > > Do we have to worry about coordinating against multiple drillbits?
> > > >
> > > > Another option: if the difference between the modification time of the
> > > > directory and that of the metadata cache file is within the TTL limit,
> > > > do not do anything. If we do that, do we get errors during execution (I
> > > > think so)?
> > > > Also, how do we reset that state and do a metadata cache refresh
> > > > eventually?
> > > > We would not be saving the time spent on modification time checks here.
> > > >
> > > > Instead of a TTL, I think having a system/session option that lets us
> > > > skip this check altogether would be a good thing to have. So, if we know
> > > > we are not adding new data, we can set that option.
> > > >
> > > > Instead of saving this TTL in the metadata cache file for each table
> > > > (directory), is it better to have this TTL as a global system or
> > > > session option?
> > > > In that case, we cannot have a different TTL for each table, but it
> > > > makes things much simpler.
> > > > Otherwise, there are some complications to think about.
> > > > We have a root metadata file per directory, with each of the
> > > > subdirectories underneath having its own metadata file.
> > > > So, if we update the TTL of the root directory, do we update it for all
> > > > the subdirectories or just the top-level directory?
> > > > What issues do we need to think about if the TTLs of the root directory
> > > > and the subdirectories are different?
> > > >
> > > >
> > > > Thanks
> > > > Padma
> > > >
> > > > On Thu, Jul 12, 2018 at 8:07 AM, Joel Pfaff <joel.pf...@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Thanks for the feedback.
> > > > >
> > > > > The logic I had in mind was to add the TTL as a refresh_interval
> > > > > field in the root metadata file.
> > > > >
> > > > > At each query, the current time would be compared to the sum of the
> > > > > modification time of the root metadata file and the refresh_interval.
> > > > > If the current time is greater, it would mean the metadata may be
> > > > > invalid, so the regular process would apply: recursively going through
> > > > > the files to check for updates, then either triggering a full metadata
> > > > > cache refresh if any change is detected, or just touching the metadata
> > > > > file to align its modification time with the current time if no change
> > > > > is detected.
> > > > > If the current time is smaller, the root metadata would be trusted
> > > > > (without additional checks) and the planning would continue.
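> > > > >
> > > > > As a minimal sketch of that decision (assuming a refreshIntervalMillis
> > > > > value parsed from the root metadata file; fs and the helper names are
> > > > > hypothetical):
> > > > >
> > > > > long cacheMtime =
> > > > >     fs.getFileStatus(rootMetadataPath).getModificationTime();
> > > > > long now = System.currentTimeMillis();
> > > > > if (now <= cacheMtime + refreshIntervalMillis) {
> > > > >   // Within the TTL: trust the root metadata without any
> > > > >   // per-directory timestamp checks.
> > > > >   return readCachedMetadata(rootMetadataPath);
> > > > > }
> > > > > // TTL expired: fall back to the recursive timestamp validation, then
> > > > > // rebuild the cache or just touch the metadata file.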
> > > > >
> > > > > So in most cases, only the timestamp of the root metadata file would
> > > > > be checked. In the worst case (at most once per TTL), all the
> > > > > timestamps would be checked.
> > > > >
> > > > > Regards, Joel
> > > > >
> > > > > On Thu, Jul 12, 2018 at 4:47 PM, Vitalii Diravka <vitalii.dira...@gmail.com> wrote:
> > > > >
> > > > > > Hi Joel,
> > > > > >
> > > > > > Sounds reasonable.
> > > > > > But if Drill checks this TTL property from the metadata cache file
> > > > > > for every query and for every file, instead of the file timestamp,
> > > > > > it will not give any benefit.
> > > > > > I suppose we can add this TTL property to the root metadata cache
> > > > > > file only, and check it just once per query.
> > > > > >
> > > > > > Could you clarify the details: what is the TTL time?
> > > > > > How could the TTL info be used to determine whether a refresh is
> > > > > > needed for the query?
> > > > > >
> > > > > > Kind regards
> > > > > > Vitalii
> > > > > >
> > > > > >
> > > > > > On Thu, Jul 12, 2018 at 4:40 PM Joel Pfaff <joel.pf...@gmail.com> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Today, on a table for which we have created statistics (through
> > > > > > > the REFRESH TABLE METADATA <path to table> command), Drill
> > > > > > > validates the timestamp of every file and directory involved in
> > > > > > > the scan.
> > > > > > >
> > > > > > > If the timestamps of the files are greater than that of the
> > > > > > > metadata file, then a regeneration of the metadata file is
> > > > > > > triggered.
> > > > > > > In the case where the timestamp of the metadata file is the
> > > > > > > greatest, the planning continues without regenerating the
> > > > > > > metadata.
> > > > > > >
> > > > > > > When the number of files to be queried increases, this operation
> > > > > > > can take a significant amount of time.
> > > > > > > We have seen cases where this validation step alone takes 3 to 5
> > > > > > > seconds (just checking the timestamps), meaning the planning time
> > > > > > > was far longer than the query execution time.
> > > > > > > This can be problematic in some use cases where response time is
> > > > > > > favored over the `accuracy` of the data.
> > > > > > >
> > > > > > > What would you think about adding an option to the metadata
> > > > > > > generation, so that the metadata is trusted for a configurable
> > > > > > > time period?
> > > > > > > Example: REFRESH TABLE METADATA <path to table> WITH TTL='15m'
> > > > > > > The exact syntax, of course, needs to be thought through.
> > > > > > >
> > > > > > > This TTL would be stored in the metadata file and used to
> > > > > > > determine, at each query, whether a refresh is needed. This would
> > > > > > > significantly decrease the planning time when the number of files
> > > > > > > represented in the metadata file is large.
> > > > > > >
> > > > > > > Of course, this means that there could be cases where the
> > > > > > > metadata would be wrong, so cases like the one below would need
> > > > > > > to be solved (since they may happen much more frequently):
> > > > > > > https://issues.apache.org/jira/browse/DRILL-6194
> > > > > > > But my feeling is that since we already have a kind of race
> > > > > > > condition between the view of the file system at planning time
> > > > > > > and the state that will be found during execution, we could
> > > > > > > gracefully accept that some files may have disappeared between
> > > > > > > planning and execution.
> > > > > > >
> > > > > > > If the TTL needs to be changed or removed completely, this could
> > > > > > > be done by re-issuing a REFRESH TABLE METADATA, either with a new
> > > > > > > TTL or without a TTL at all.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > Regards, Joel
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
