Agree with Steven here. Pruning could be added to the query profile for post-execution verification.
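For concreteness, a rough sketch of what such a post-execution check could look like from a functional test, using EXPLAIN PLAN output as a stand-in until pruning details land in the query profile. The JDBC URL and the "numFiles" attribute here are assumptions based on current planner output, not a committed API; a profile-based check would inspect the executed fragments instead.

// Sketch of a functional test that verifies pruning after the fact by
// inspecting the plan text. The connection URL and the "numFiles" marker
// are assumptions, not a committed API.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PruningCheck {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "EXPLAIN PLAN FOR SELECT count(*) FROM dfs.`/data/table` WHERE `date` = '2015-07-01'")) {
      StringBuilder plan = new StringBuilder();
      while (rs.next()) {
        plan.append(rs.getString(1)).append('\n');
      }
      // If pruning happened, the scan should reference far fewer files
      // than the full table.
      boolean pruned = plan.toString().contains("numFiles=1");
      System.out.println("partition pruning " + (pruned ? "verified" : "NOT verified"));
    }
  }
}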
On Oct 29, 2015 11:33 AM, "Steven Phillips" <[email protected]> wrote:

I agree that this would present a small challenge for testing, but I don't think ease of testing should be the primary motivator in designing the software. Once we've decided what we want the software to do, we can work together to figure out how to test it.

On Thu, Oct 29, 2015 at 11:09 AM, rahul challapalli <[email protected]> wrote:

@steven If we end up pushing the partition pruning to the execution phase, how would we know that partition pruning even took place? I am thinking from the standpoint of adding functional tests around partition pruning.

- Rahul

On Wed, Oct 28, 2015 at 10:53 AM, Parth Chandra <[email protected]> wrote:

And ideally, I suppose, the merged schema would correspond to the information that we want to keep in a .drill file.

On Tue, Oct 27, 2015 at 4:55 PM, Aman Sinha <[email protected]> wrote:

@Steven, w.r.t. your suggestion about doing the metadata operation during the execution phase, see the related discussion in DRILL-3838.

A couple more thoughts:

- Parth and I were discussing keeping track of the merged schema as part of the refresh metadata and storing the merged schema once for all files that have an identical schema (currently this is repeated per file and is a huge contributor to the size of the cache file). To Jacques' point about keeping the minimum information needed for planning purposes, we could certainly do a better job of keeping it lean. The row count of the table could be computed at the time the REFRESH METADATA command runs. Similarly, the single-value analysis could be done at that time instead of on a per-query basis.

- We should revisit DRILL-2517 (https://issues.apache.org/jira/browse/DRILL-2517). Consider the following two queries and their total elapsed times against a table with 310,000 files:

(A) SELECT count(*) FROM table WHERE `date` = '2015-07-01';
    elapsed time: 980 secs

(B) SELECT count(*) FROM `table/20150701`;
    elapsed time: 54 secs

From the user's perspective, both queries should perform nearly the same, which was essentially the intent of DRILL-2517.

On Tue, Oct 27, 2015 at 12:04 PM, Steven Phillips <[email protected]> wrote:

I think we need to come up with a way to push partition pruning to execution time. The other solutions may relieve the problem in some cases, but they won't solve the fundamental problem.

For example, even if we figure out how to use multiple threads for reading the metadata, that may be fine for a couple hundred thousand files, but what about when we have millions or tens of millions of files? It will still be a huge bottleneck.

I actually think we should use the Drill execution engine to probe the metadata and generate the work assignments. We could have an additional fragment or fragments of the query that would recursively probe the filesystem, read the metadata, and make assignments, and then pipe the results into the Scanners, which would create readers on the fly. This way the query could actually begin doing work before the metadata has even been fully read.
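As a purely illustrative sketch of the producer/consumer shape being described here: one task walks the filesystem and hands off files as it finds them, while a downstream consumer starts creating readers before the walk finishes. All class names and paths are invented for the sketch; none of this is existing Drill code.

// Illustrative shape only: a metadata-probing task feeding scanners
// incrementally, so scan work overlaps the filesystem walk.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class StreamingProbe {
  private static final Path DONE = Paths.get("__done__");  // poison pill

  public static void main(String[] args) throws Exception {
    BlockingQueue<Path> work = new LinkedBlockingQueue<>();
    ExecutorService pool = Executors.newFixedThreadPool(2);

    // "Probe" fragment: recursively walk the table directory, handing off
    // each parquet file as soon as it is discovered.
    pool.submit(() -> {
      try (Stream<Path> files = Files.walk(Paths.get("/data/table"))) {
        files.filter(p -> p.toString().endsWith(".parquet")).forEach(work::add);
      } finally {
        work.add(DONE);
      }
      return null;
    });

    // "Scan" fragment: creates readers on the fly, so useful work can start
    // before the metadata has been fully read.
    pool.submit(() -> {
      for (Path p = work.take(); p != DONE; p = work.take()) {
        System.out.println("opening reader for " + p);  // stand-in for reader creation
      }
      return null;
    });

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }
}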
On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <[email protected]> wrote:

My first thought is that we've gotten too generous in what we're storing in the Parquet metadata file. Early implementations were very lean and it seems far larger today. For example, early implementations didn't keep statistics and ignored row groups (files, schema, and block locations only). If we need multiple levels of information, we may want to stagger (or normalize) them in the file. Also, we may want to think about what is the minimum that must be done in planning. We could do the file pruning at execution time rather than single-tracking these things (though that makes stats harder).

I also think we should be cautious about jumping to a conclusion until DRILL-3973 provides more insight.

In terms of caching, I'd be more inclined to rely on file system caching and make sure serialization/deserialization is as efficient as possible, as opposed to implementing an application-level cache. (We already have enough problems managing memory without having to figure out when we should drop a metadata cache :D)

Aside, I always liked this post for entertainment and the thoughts on virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <[email protected]> wrote:

One more thing: for workloads running queries over subsets of the same parquet files, we could consider maintaining an in-memory cache as well, assuming the metadata memory footprint per file is low and the parquet files are static, so we would not need to invalidate the cache often.

H+

On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:

I am not familiar with the contents of the stored metadata, but if the deserialization workload fits any of afterburner's claimed improvement points [1], it could well be worth trying, given that the claimed gain in throughput is substantial.

It could also be a good idea to partition the cache over a number of files for better parallelization, given that the number of cache files generated is *significantly* less than the number of parquet files. Maintaining global statistics seems an improvement point too.

-H+

1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
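For what it's worth, wiring in Afterburner is a one-line change on the ObjectMapper, so it is cheap to try; whether it pays off for this particular deserialization shape is exactly what the DRILL-3973 profiling should tell us. A minimal sketch (ParquetTableMetadata in the comment is a stand-in name for whatever type the cache file deserializes into):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;

public class MapperFactory {
  // Afterburner swaps Jackson's reflection-based (de)serializers for
  // generated bytecode; everything else about reading the cache file
  // stays the same.
  public static ObjectMapper newMapper() {
    ObjectMapper mapper = new ObjectMapper();
    mapper.registerModule(new AfterburnerModule());
    return mapper;
  }
  // usage (hypothetical type name):
  // ParquetTableMetadata metadata =
  //     newMapper().readValue(cacheFile, ParquetTableMetadata.class);
}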
On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:

Forgot to include the link for Jackson's AfterBurner module: https://github.com/FasterXML/jackson-module-afterburner

On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:

I was going to file an enhancement JIRA but thought I would discuss here first:

The parquet metadata cache file is a JSON file that contains a subset of the metadata extracted from the parquet files. The cache file can get really large: a few GBs for a few hundred thousand files. I have filed a separate JIRA, DRILL-3973, for profiling the various aspects of planning, including metadata operations. In the meantime, the timestamps in the drillbit.log output indicate a large chunk of time is spent in creating the drill table to begin with, which points to a bottleneck in reading the metadata. (I can provide performance numbers later once we confirm through profiling.)

A few thoughts around improvements:

- The jackson deserialization of the JSON file is very slow. Can this be sped up? For instance, the AfterBurner module of jackson claims to improve performance by 30-40% by avoiding the use of reflection.

- The cache file read is a single-threaded process. If we were directly reading from parquet files, we would use a default of 16 threads. What can be done to parallelize the read?

- Is there any operation that can be done one time during the REFRESH METADATA command? For instance, examining the min/max values to determine single-value for a partition column could be eliminated if we do this computation during the REFRESH METADATA command and store the summary one time.

- A pertinent question: should the cache file be stored in a more efficient format such as Parquet instead of JSON?

Aman
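On the parallelization point: if the cache were split into per-directory files, as suggested upthread, the read could be fanned out the same way the 16-thread direct footer read works today. A minimal sketch under that assumption; DirectoryMetadata and readCacheFile are hypothetical stand-ins, not existing Drill types:

// Sketch: fan the deserialization of per-directory cache files across a
// fixed-size pool, then merge the results.
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCacheRead {
  private static final int THREADS = 16;  // mirrors the default for direct parquet reads

  // Stubs so the sketch is self-contained; the real types would come from
  // the parquet metadata code.
  static class DirectoryMetadata {}
  static DirectoryMetadata readCacheFile(Path p) {
    return new DirectoryMetadata();  // deserialize one per-directory cache file here
  }

  static List<DirectoryMetadata> readAll(List<Path> cacheFiles) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    try {
      List<Future<DirectoryMetadata>> pending = new ArrayList<>();
      for (Path p : cacheFiles) {
        pending.add(pool.submit(() -> readCacheFile(p)));  // one task per cache file
      }
      List<DirectoryMetadata> all = new ArrayList<>();
      for (Future<DirectoryMetadata> f : pending) {
        all.add(f.get());  // merge in submission order
      }
      return all;
    } finally {
      pool.shutdown();
    }
  }
}

How much this buys depends on how evenly the metadata is spread across directories, which is another argument for Hanifi's point about partitioning the cache deliberately rather than per directory.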
