Re: Caching metastore objects

Sivaramakrishnan Narayanan Tue, 26 May 2015 22:34:47 -0700

Awesome!!

On Wed, May 27, 2015 at 10:55 AM, Ashutosh Chauhan <hashut...@apache.org>
wrote:


> Siva / Scott,
>
> Such a framework exists in some form:
> https://issues.apache.org/jira/browse/HIVE-2038
> To make it even more generic there was a proposal,
> https://issues.apache.org/jira/browse/HIVE-2147, but there was resistance
> from the community. Maybe the community is ready for it now :)
>
> Ashutosh
>
> On Tue, May 26, 2015 at 10:12 PM, Sivaramakrishnan Narayanan <
> tarb...@gmail.com> wrote:
>
> > Thanks for the replies.
> >
> > @Ashutosh - thanks for the pointer! Yes, I was running the 0.11 metastore.
> > Let me try with the 0.13 metastore! Maybe my woes will be gone. If they
> > aren't, then I'll continue working along these lines.
> >
> > @Alan - agreed. Caching MTables seems like a better approach if 0.13
> > metastore perf is not as good as I'd like.
> >
> > @Scott - a pluggable hook for metastore calls would be super useful. If
> > you want to generate events for client-side actions, I suppose you could
> > just implement a dynamic proxy class over the metastore client class which
> > does whatever you need it to. A similar technique could work on the server
> > side - I believe there is already a RetryingMetaStoreClient proxy class in
> > place.
> >
> >
> > On Wed, May 27, 2015 at 7:32 AM, Ashutosh Chauhan <hashut...@apache.org>
> > wrote:
> >
> > > Are you running pre-0.12, or with hive.metastore.try.direct.sql = false?
> > >
> > > Work done on https://issues.apache.org/jira/browse/HIVE-4051 should
> > > alleviate some of your problems.
> > >
> > >
> > > On Mon, May 25, 2015 at 8:19 PM, Sivaramakrishnan Narayanan <
> > > tarb...@gmail.com> wrote:
> > >
> > > > Apologies if this has been discussed in the past - my searches did not
> > > > pull up any relevant threads. If there are better solutions available
> > > > out of the box, please let me know!
> > > >
> > > > Problem statement
> > > > --------------------------
> > > >
> > > > We have a setup where a single metastoredb is used by Hive, Presto and
> > > > SparkSQL. In addition, there are 1000s of hive queries submitted in
> > > > batch form from multiple machines. Oftentimes, the metastoredb ends up
> > > > being remote (in a different region in AWS etc) and round-trip latency
> > > > is high. We've seen single thrift calls getting translated into lots of
> > > > small SQL calls by datanucleus and the round-trip latency ends up
> > > > killing performance. Furthermore, any of these systems may create /
> > > > modify a hive table and this should be reflected in the other systems.
> > > > For example, I may create a table in hive and query it using Presto or
> > > > vice versa. In our setup, there may be multiple thrift metastore servers
> > > > pointing to the same metastore db.
> > > >
> > > > Investigation
> > > > -------------------
> > > >
> > > > Basically, we've been looking at caching to solve this problem (will
> > > > come to invalidation in a bit). I looked briefly at DN's support for
> > > > caching - these two parameters seem to be switched off by default.
> > > >
> > > >     METASTORE_CACHE_LEVEL2("datanucleus.cache.level2", false),
> > > >     METASTORE_CACHE_LEVEL2_TYPE("datanucleus.cache.level2.type", "none"),
> > > >
> > > > Furthermore, my reading of
> > > > http://www.datanucleus.org/products/datanucleus/jdo/cache.html suggests
> > > > that there is no sophistication in invalidation - seems like only
> > > > time-based invalidation is supported and it can't work across multiple
> > > > PMFs (therefore, multiple thrift metastore servers).
> > > >
> > > > Solution Outline
> > > > -----------------------
> > > >
> > > >    - Every table / partition will have an additional property called
> > > >    'version'
> > > >    - Any call that modifies a table or partition will bump up the version
> > > >    of the table / partition
> > > >    - Guava based cache of thrift objects that come from metastore calls
> > > >    - We fire a single SQL matching versions before returning from cache
> > > >    - It is conceivable to have a mode wherein invalidation based on
> > > >    version happens in a background thread (for higher performance, lower
> > > >    fidelity)
> > > >    - Not proposing any locking (not shooting for world peace here :) )
> > > >    - We could extend HiveMetaStore class or create a new server
> > > >    altogether
> > > > Is this something that would be interesting to the community? Is this
> > > > problem already solved and should I spend my time watching GoT instead?
> > > >
> > > > Thanks
> > > > Siva
> > > >
> > >
> >
>
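
The dynamic-proxy idea from the reply to Scott above can be sketched roughly as follows. This is only an illustration using java.lang.reflect.Proxy from the JDK: TableClient and ClientProxies are hypothetical stand-ins invented for the sketch, not Hive's metastore client interface or any existing Hive class, and the "event" here is nothing more than the method name appended to a list.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.List;

// Hypothetical stand-in for a metastore client interface (not the real Hive API).
interface TableClient {
    String getTable(String db, String table);
}

final class ClientProxies {
    private ClientProxies() {}

    // Wraps a client so that every call is recorded in `events` before it is
    // forwarded to the real delegate.
    static TableClient recording(TableClient delegate, List<String> events) {
        InvocationHandler handler = (proxy, method, args) -> {
            events.add(method.getName());
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                // Rethrow the delegate's own exception rather than the reflection wrapper.
                throw e.getCause();
            }
        };
        return (TableClient) Proxy.newProxyInstance(
                TableClient.class.getClassLoader(),
                new Class<?>[] {TableClient.class},
                handler);
    }
}
```

Because the wrapper implements the same interface as the delegate, callers would not need to change; the same shape could in principle sit in front of a server-side handler interface.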

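The "Solution Outline" in the original message (a per-object version, a cache of returned objects, and one cheap version check before serving from cache) can be sketched as below. This is a minimal sketch under stated assumptions, not the proposed implementation: VersionedCache is a hypothetical class, a plain ConcurrentHashMap is used in place of the Guava cache mentioned in the outline so the example needs only the JDK, and the two function parameters stand in for the version-matching SQL and the full metastore fetch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.ToLongFunction;

// Read-through cache that revalidates every hit against a cheap version lookup.
final class VersionedCache<K, V> {
    private static final class Entry<T> {
        final long version;
        final T value;

        Entry(long version, T value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    // Assumption: stands in for the single SQL query that fetches the current version.
    private final ToLongFunction<K> currentVersion;
    // Assumption: stands in for the expensive call that loads the full object.
    private final Function<K, V> loader;

    VersionedCache(ToLongFunction<K> currentVersion, Function<K, V> loader) {
        this.currentVersion = currentVersion;
        this.loader = loader;
    }

    V get(K key) {
        // Read the version first: if a writer bumps it while we load, the stored
        // entry carries the older version and is simply reloaded on the next call.
        long version = currentVersion.applyAsLong(key);
        Entry<V> cached = entries.get(key);
        if (cached != null && cached.version == version) {
            return cached.value;
        }
        V fresh = loader.apply(key);
        entries.put(key, new Entry<>(version, fresh));
        return fresh;
    }
}
```

The background-thread mode mentioned in the outline would move the version check out of get() and onto a timer, trading freshness for fewer round trips; that variant is not shown here.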