I definitely think we should subdivide into a couple different spaces:

XX files where we only need to query N (via partition pruning)
XX files where we need to query all or most of the files

where XX is [10, 10M, 10MM] and N is [100%, 0.0001%]

Is that a reasonable matrix?
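
Spelled out, that would be six cases:

  10 files   x {scan all files, prune to 0.0001%}
  10M files  x {scan all files, prune to 0.0001%}
  10MM files x {scan all files, prune to 0.0001%}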

I definitely don't want to make everything more complicated for the
10MM/0.0001% case at the cost of the other cases, as I'm not sure how
common that case is.

I also think that Steven's parallelization comments are more applicable to
some of these than others.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Nov 4, 2015 at 3:59 PM, Parth Chandra <[email protected]> wrote:

> It's been on my list of things to try. I'll give it a shot. At the moment, I've got
> the file size reduced to 40% of the previous. With some other changes this
> might give us good enough performance, but it would be far from ideal.
>
> Any other thoughts are welcome. I'll run some experiments and then write up
> a proposal.
>
>
>
> On Wed, Nov 4, 2015 at 9:15 AM, Aman Sinha <[email protected]> wrote:
>
> > It would be good to try; however I recall that we encountered a
> > SchemaChangeException when querying the JSON cache file.  Parth might have
> > more success once he has simplified the metadata.
> >
> > Aman
> >
> > On Wed, Nov 4, 2015 at 8:31 AM, Jacques Nadeau <[email protected]> wrote:
> >
> > > I've been thinking more about this and I think Aman's suggestion of
> > > Parquet files is worth a POC.
> > >
> > > What we could do:
> > >
> > > 1. Run a select * order by partCol1, partCol2, ... , partColN query
> > >    against the existing large JSON partition file and create a new
> > >    Parquet version of the file.
> > > 2. Hand-write a partition-type read against the Parquet APIs using the
> > >    filter APIs and see what performance looks like.
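> > >
> > > For step 1, roughly (untested sketch; the paths, table names, and
> > > partCol columns are placeholders):
> > >
> > > CREATE TABLE dfs.tmp.`metadata_cache_parquet` AS
> > > SELECT * FROM dfs.`/path/to/metadata_cache.json`
> > > ORDER BY partCol1, partCol2;
> > >
> > > And for step 2, something along these lines against the parquet-mr
> > > filter2 API (again, just a sketch):
> > >
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.parquet.example.data.Group;
> > > import org.apache.parquet.filter2.compat.FilterCompat;
> > > import org.apache.parquet.filter2.predicate.FilterApi;
> > > import org.apache.parquet.filter2.predicate.FilterPredicate;
> > > import org.apache.parquet.hadoop.ParquetReader;
> > > import org.apache.parquet.hadoop.example.GroupReadSupport;
> > > import org.apache.parquet.io.api.Binary;
> > >
> > > // keep only the entries matching one partition value
> > > FilterPredicate pred = FilterApi.eq(
> > >     FilterApi.binaryColumn("partCol1"), Binary.fromString("v1"));
> > > ParquetReader<Group> reader = ParquetReader
> > >     .builder(new GroupReadSupport(),
> > >              new Path("/path/to/metadata_cache_parquet"))
> > >     .withFilter(FilterCompat.get(pred))
> > >     .build();
> > > Group row;
> > > while ((row = reader.read()) != null) {
> > >   // collect the surviving file/row-group entries
> > > }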
> > >
> > > Thoughts?
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Fri, Oct 30, 2015 at 3:36 PM, Parth Chandra <[email protected]> wrote:
> > >
> > > > Thanks Steven for the link.
> > > > Your suggestion of storing only the single-valued columns is a good one.
> > > > It might be OK to have some of the count(*) queries run a little slower,
> > > > as reading the cache itself is taking way too long.  I'm also looking at
> > > > squashing the column datatype info as there is a lot of redundancy there.
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Oct 30, 2015 at 3:22 PM, Steven Phillips <[email protected]> wrote:
> > > >
> > > > > My view on storing it in some other format is that, yes, it will probably
> > > > > reduce the size of the file, but if we gzip the JSON file, it should be
> > > > > pretty compact. As for deserialization cost, other formats would be
> > > > > faster, but not dramatically faster. Certainly not the order of magnitude
> > > > > faster that we really need it to be. The reason we chose JSON was because
> > > > > it is readable and easier to deal with.
> > > > >
> > > > > As for the old code, I can point you at a branch, but it's probably not
> > > > > very helpful. Unless we want to essentially disable value-based partition
> > > > > pruning when using the cache, the old code will not work.
> > > > >
> > > > > My recommendation would be to come up with a new version of the format
> > > > > which stores only the name and value of columns which are single-valued
> > > > > for each file or row group. This will allow partition pruning to work,
> > > > > but some count queries may not be as fast any more, because the cache
> > > > > won't have column value counts on a per-rowgroup basis any more.
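> > > > >
> > > > > Concretely, each slimmed-down entry could look something like this
> > > > > (just a sketch; the class and field names are made up):
> > > > >
> > > > > import java.util.List;
> > > > >
> > > > > // one entry per file (or per row group)
> > > > > class ParquetFileMetadata {
> > > > >   String path;
> > > > >   long rowCount;
> > > > >   // only the columns with a single distinct value, since those are
> > > > >   // the only ones partition pruning can act on
> > > > >   List<ColumnValue> singleValueColumns;
> > > > > }
> > > > >
> > > > > class ColumnValue {
> > > > >   String name;
> > > > >   Object value;
> > > > > }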
> > > > >
> > > > > Anyway, here is the link to the original branch.
> > > > >
> > > > > https://github.com/StevenMPhillips/drill/tree/meta
> > > > >
> > > > > On Fri, Oct 30, 2015 at 3:01 PM, Parth Chandra <[email protected]> wrote:
> > > > >
> > > > > > Hey Jacques, Steven,
> > > > > >
> > > > > >   Do we have a branch somewhere which has the initial prototype code?
> > > > > > I'd like to prune the file a bit as it looks like reducing the size of
> > > > > > the metadata cache file might yield the best results.
> > > > > >
> > > > > >   Also, did we have a particular reason for going with JSON as opposed
> > > > > > to a more compact binary format? Are there any arguments against saving
> > > > > > this as a protobuf/BSON/Parquet file?
> > > > > >
> > > > > > Parth
> > > > > >
> > > > > > On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <[email protected]> wrote:
> > > > > >
> > > > > > > My first thought is we've gotten too generous in what we're storing
> > > > > > > in the Parquet metadata file. Early implementations were very lean
> > > > > > > and it seems far larger today. For example, early implementations
> > > > > > > didn't keep statistics and ignored row groups (files, schema and
> > > > > > > block locations only). If we need multiple levels of information, we
> > > > > > > may want to stagger (or normalize) them in the file. Also, we may
> > > > > > > think about what is the minimum that must be done in planning. We
> > > > > > > could do the file pruning at execution time rather than
> > > > > > > single-tracking these things (makes stats harder though).
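> > > > > > >
> > > > > > > By stagger/normalize I mean roughly a two-level layout along these
> > > > > > > lines (a sketch only; the names are made up):
> > > > > > >
> > > > > > > import java.util.List;
> > > > > > > import java.util.Map;
> > > > > > >
> > > > > > > // level 1: the lean part, always read at planning time
> > > > > > > class MetadataSummary {
> > > > > > >   List<String> files;
> > > > > > >   String schema;                        // merged table schema
> > > > > > >   Map<String, List<String>> blockHosts; // file -> block locations
> > > > > > > }
> > > > > > >
> > > > > > > // level 2: per-row-group detail, read lazily only when needed
> > > > > > > class RowGroupDetail {
> > > > > > >   String file;
> > > > > > >   long rowCount;
> > > > > > >   Map<String, Object> columnMinMax;     // stats for pruning
> > > > > > > }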
> > > > > > >
> > > > > > > I also think we should be cautious around jumping to a conclusion
> > > > > > > until DRILL-3973 provides more insight.
> > > > > > >
> > > > > > > In terms of caching, I'd be more inclined to rely on file system
> > > > > > > caching and make sure serialization/deserialization is as efficient
> > > > > > > as possible as opposed to implementing an application-level cache.
> > > > > > > (We already have enough problems managing memory without having to
> > > > > > > figure out when we should drop a metadata cache :D).
> > > > > > >
> > > > > > > Aside, I always liked this post for entertainment and the thoughts
> > > > > > > on virtual memory:
> > > > > > > https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jacques Nadeau
> > > > > > > CTO and Co-Founder, Dremio
> > > > > > >
> > > > > > > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes <[email protected]> wrote:
> > > > > > >
> > > > > > > > One more thing: for workloads that run queries over subsets of the
> > > > > > > > same parquet files, we can consider maintaining an in-memory cache
> > > > > > > > as well. Assuming the metadata memory footprint per file is low and
> > > > > > > > the parquet files are static, we would rarely need to invalidate
> > > > > > > > the cache.
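> > > > > > > >
> > > > > > > > For example, with something like Guava's LoadingCache (a sketch;
> > > > > > > > ParquetTableMetadata and readMetadataFile are placeholders):
> > > > > > > >
> > > > > > > > import com.google.common.cache.CacheBuilder;
> > > > > > > > import com.google.common.cache.CacheLoader;
> > > > > > > > import com.google.common.cache.LoadingCache;
> > > > > > > >
> > > > > > > > LoadingCache<String, ParquetTableMetadata> cache =
> > > > > > > >     CacheBuilder.newBuilder()
> > > > > > > >         .maximumSize(10000)  // bound the in-memory footprint
> > > > > > > >         .build(new CacheLoader<String, ParquetTableMetadata>() {
> > > > > > > >           @Override
> > > > > > > >           public ParquetTableMetadata load(String path) throws Exception {
> > > > > > > >             return readMetadataFile(path);  // parse once, reuse
> > > > > > > >           }
> > > > > > > >         });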
> > > > > > > >
> > > > > > > > H+
> > > > > > > >
> > > > > > > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > I am not familiar with the contents of the metadata stored, but
> > > > > > > > > if the deserialization workload fits any of Afterburner's claimed
> > > > > > > > > improvement points [1], it could well be worth trying, given that
> > > > > > > > > the claimed gain in throughput is substantial.
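> > > > > > > > >
> > > > > > > > > Wiring it in should be a one-liner on the mapper (a sketch; the
> > > > > > > > > metadata class and cacheFile here are illustrative):
> > > > > > > > >
> > > > > > > > > import com.fasterxml.jackson.databind.ObjectMapper;
> > > > > > > > > import com.fasterxml.jackson.module.afterburner.AfterburnerModule;
> > > > > > > > >
> > > > > > > > > ObjectMapper mapper = new ObjectMapper();
> > > > > > > > > // Afterburner swaps reflection for generated bytecode accessors
> > > > > > > > > mapper.registerModule(new AfterburnerModule());
> > > > > > > > > ParquetTableMetadata metadata =
> > > > > > > > >     mapper.readValue(cacheFile, ParquetTableMetadata.class);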
> > > > > > > > >
> > > > > > > > > It could also be a good idea to partition caching over a number
> > > > > > > > > of files for better parallelization, given that the number of
> > > > > > > > > cache files generated is *significantly* less than the number of
> > > > > > > > > parquet files. Maintaining global statistics seems an improvement
> > > > > > > > > point too.
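> > > > > > > > >
> > > > > > > > > The read side could then fan out over the shards, e.g. (a sketch;
> > > > > > > > > the shard layout, mapper, and ParquetTableMetadata are placeholders):
> > > > > > > > >
> > > > > > > > > import java.io.File;
> > > > > > > > > import java.util.ArrayList;
> > > > > > > > > import java.util.List;
> > > > > > > > > import java.util.concurrent.Callable;
> > > > > > > > > import java.util.concurrent.ExecutorService;
> > > > > > > > > import java.util.concurrent.Executors;
> > > > > > > > > import java.util.concurrent.Future;
> > > > > > > > >
> > > > > > > > > ExecutorService pool = Executors.newFixedThreadPool(16);
> > > > > > > > > List<Future<ParquetTableMetadata>> parts = new ArrayList<>();
> > > > > > > > > for (final File shard : cacheShards) {  // one shard per N files
> > > > > > > > >   parts.add(pool.submit(new Callable<ParquetTableMetadata>() {
> > > > > > > > >     public ParquetTableMetadata call() throws Exception {
> > > > > > > > >       return mapper.readValue(shard, ParquetTableMetadata.class);
> > > > > > > > >     }
> > > > > > > > >   }));
> > > > > > > > > }
> > > > > > > > > // then merge parts into a single table-level view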
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -H+
> > > > > > > > >
> > > > > > > > > 1: https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > > > > > > >
> > > > > > > > >> On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > >> Forgot to include the link for Jackson's AfterBurner module:
> > > > > > > > >>   https://github.com/FasterXML/jackson-module-afterburner
> > > > > > > > >>
> > > > > > > > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha <[email protected]> wrote:
> > > > > > > > >>
> > > > > > > > >> > I was going to file an enhancement JIRA but thought I would
> > > > > > > > >> > discuss here first:
> > > > > > > > >> >
> > > > > > > > >> > The parquet metadata cache file is a JSON file that contains a
> > > > > > > > >> > subset of the metadata extracted from the parquet files.  The
> > > > > > > > >> > cache file can get really large .. a few GBs for a few hundred
> > > > > > > > >> > thousand files.
> > > > > > > > >> > I have filed a separate JIRA: DRILL-3973 for profiling the
> > > > > > > > >> > various aspects of planning including metadata operations.  In
> > > > > > > > >> > the meantime, the timestamps in the drillbit.log output indicate
> > > > > > > > >> > a large chunk of time spent in creating the drill table to begin
> > > > > > > > >> > with, which indicates a bottleneck in reading the metadata.  (I
> > > > > > > > >> > can provide performance numbers later once we confirm through
> > > > > > > > >> > profiling).
> > > > > > > > >> >
> > > > > > > > >> > A few thoughts around improvements:
> > > > > > > > >> >  - The jackson deserialization of the JSON file is very slow..
> > > > > > > > >> > can this be sped up? For instance, the AfterBurner module of
> > > > > > > > >> > jackson claims to improve performance by 30-40% by avoiding the
> > > > > > > > >> > use of reflection.
> > > > > > > > >> >  - The cache file read is a single-threaded process.  If we were
> > > > > > > > >> > directly reading from parquet files, we use a default of 16
> > > > > > > > >> > threads.  What can be done to parallelize the read?
> > > > > > > > >> >  - Any operation that can be done one time during the REFRESH
> > > > > > > > >> > METADATA command?  For instance, examining the min/max values to
> > > > > > > > >> > determine single-value for a partition column could be eliminated
> > > > > > > > >> > if we do this computation during the REFRESH METADATA command and
> > > > > > > > >> > store the summary one time.
> > > > > > > > >> >
> > > > > > > > >> >  - A pertinent question is: should the cache file be stored in a
> > > > > > > > >> > more efficient format such as Parquet instead of JSON?
> > > > > > > > >> >
> > > > > > > > >> > Aman
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
