Re: New Impala PMC member: Joe McDonnell

2018-08-21 Thread Yongjun Zhang
Congratulations Joe, great achievement!

--Yongjun

On Tue, Aug 21, 2018 at 3:30 PM, Tim Armstrong 
wrote:

>  The Project Management Committee (PMC) for Apache Impala has invited Joe
> McDonnell to become a PMC member and we are pleased to announce that they
> have accepted.
> Congratulations and welcome, Joe!
>


Re: Improving Kudu Build Support

2018-08-21 Thread Tim Armstrong
Is there a path to building a version of Kudu locally for an arbitrary
linux distro?

Personally I am less concerned about 14.04 support and more concerned about
the path to upgrading to 18.04. It would also be nice for it to be at
least possible to develop on RedHat-derived distros even if it requires
some extra effort.

On Tue, Aug 21, 2018 at 6:48 AM, Laszlo Gaal 
wrote:

> +1 for simplifying Kudu updates.
>
> I am also still on Ubuntu 14.04, but I am all for simplifying Kudu
> integration:
> I agree with Thomas that Kudu snapshots should be grouped with the other
> CDH components.
> Given that Ubuntu 14.04 will be EOL'd next spring, upgrading the
> development OS
> is a reasonably small price to pay -- especially since it will soon become
> necessary anyway.
>
> Thanks for doing this Thomas!
>
>   - Laszlo
>
> On Tue, Aug 21, 2018 at 12:34 AM Lars Volker  wrote:
>
> > I'm in favor of not spending developer time and effort to maintain
> > compatibility with 14.04. Personally I'm still developing on Ubuntu 14.04
> > so I'd be happy if we can support it without much pain. On the other hand
> > it EOLs in April 2019, so I might as well go to 18.04 now, should we
> decide
> > to drop support. Maybe not many other folks are on 14.04 after all?
> >
> >
> >
> > On Mon, Aug 20, 2018 at 10:06 AM Thomas Tauber-Marshall <
> > tmarsh...@cloudera.com> wrote:
> >
> > > Impala community,
> > >
> > > For years now, Impala has utilized tarballs built by Cloudera and
> > uploaded
> > > to S3 for running most of the Hadoop components in the testing
> > minicluster.
> > > The one exception to this is Kudu, which is instead provided by the
> > > toolchain.
> > >
> > > This was never ideal - native-toolchain makes more sense for libraries
> > > where we want to build against a fairly static version, but Kudu is
> under
> > > active development and we'd like to always build against a relatively
> > > up-to-date version. As a result, patches just bumping the version of
> Kudu
> > > make up a significant portion of the commit history of
> native-toolchain.
> > >
> > > Thanks to work I'm currently doing at Cloudera, there will soon be
> > snapshot
> > > tarballs of Kudu getting uploaded to S3 along with the other Hadoop
> > > components. I would like to propose that Impala switch to using those
> > > instead of the toolchain Kudu.
> > >
> > > One problem here is that the new Kudu tarballs will not be getting
> built
> > > for Ubuntu 14.04, only 16.04, but we still officially say we support
> > > development on 14.04.
> > >
> > > One option here would be to maintain the toolchain Kudu for now and
> hide
> > > downloading of the new tarballs behind a flag. We could also postpone
> > some
> > > of this work until 14.04 is less common. Or, given that the
> > > bootstrap_development script already only supports 16.04, we might want
> > to
> > > just drop support for building on 14.04.
> > >
> > > Thoughts?
> > >
> >
>


New Impala PMC member: Joe McDonnell

2018-08-21 Thread Tim Armstrong
 The Project Management Committee (PMC) for Apache Impala has invited Joe
McDonnell to become a PMC member and we are pleased to announce that they
have accepted.
Congratulations and welcome, Joe!


Re: Improving latency of catalog update propagation?

2018-08-21 Thread Tim Armstrong
Yeah, I think that angle makes sense to pursue. I don't feel strongly about
whether now or later is the right time, but it does seem like
it's not the immediate highest priority.

On Tue, Aug 21, 2018 at 1:57 PM, Tianyi Wang  wrote:

> GetCatalogDelta used to block catalogd from executing DDLs and the pending
> struct was yet another cache to smooth things a little.
>
> On Tue, Aug 21, 2018 at 11:28 AM Todd Lipcon  wrote:
>
> > One more parting thought: why don't we just call 'GetCatalogDelta()'
> > directly from the catalog callback in order to do a direct handoff,
> instead
> > of storing them in this 'pending' struct? Given the statestore uses a
> > dedicated thread per subscriber (right?) it seems like it would be fine
> for
> > the update callback to take a long time, no?
> >
> > -Todd
> >
> > On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon  wrote:
> >
> > > Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC
> instead
> > > of Thrift we'll alleviate some of the scalability issues and maybe we
> can
> > > look into increasing frequency or doing a "push" to the statestore,
> etc.
> > I
> > > probably won't work on this in the near term to avoid complicating the
> > > ongoing changes with catalog.
> > >
> > > -Todd
> > >
> > > On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong <
> tarmstr...@cloudera.com
> > >
> > > wrote:
> > >
> > >> This is somewhat relevant for admission control too - I had thought
> > about
> > >> some of these issues in that context, because reducing the latency of
> > >> admission control state propagation helps avoid overadmission but
> > having
> > >> a
> > >> very low statestore frequency is very inefficient and doesn't scale
> well
> > >> to
> > >> larger clusters.
> > >>
> > >> For the catalog updates I agree we could do something with long polls
> > >> since
> > >> it's a single producer so that the "idle" state of the system has a
> > thread
> > >> sitting in the update callback on catalogd waiting for an update.
> > >>
> > >> I'd also thought at one point about allowing subscribers to notify the
> > >> statestore that they had something to add to the topic. That could be
> > >> treated as a hint to the statestore to schedule the subscriber update
> > >> sooner. This would also work for admission control since coordinators
> > >> could
> > >> notify the statestore when the first query was admitted after the
> > previous
> > >> statestore update.
> > >>
> > >> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon 
> wrote:
> > >>
> > >> > Hey folks,
> > >> >
> > >> > In my recent forays into the catalog->statestore->impalad metadata
> > >> > propagation code base, I noticed that the latency of any update is
> > >> > typically between 2-4 seconds with the standard 2-second statestore
> > >> polling
> > >> > interval. That's because the code currently works as follows:
> > >> >
> > >> > 1. in the steady state with no recent metadata changes, the
> catalogd's
> > >> > state is:
> > >> > -- topic_updates_ready_ = true
> > >> > -- pending_topic_updates_ = empty
> > >> >
> > >> > 2. some metadata change happens, which modifies the version numbers
> in
> > >> the
> > >> > Java catalog but doesn't modify any of the C++ side state
> > >> >
> > >> > 3. the next statestore poll happens due to the normal interval
> > >> expiring. On
> > >> > average, this will take *1/2 the polling interval*
> > >> > -- this sees that pending_topic_updates_ is empty, so returns no
> > >> results.
> > >> > -- it sets topic_updates_ready_ = false and triggers the "gather"
> > thread
> > >> >
> > >> > 4. the "gather" thread wakes up and gathers updates, filling in
> > >> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to
> > true
> > >> > (typically subsecond in smallish catalogs, so this happens before
> the
> > >> next
> > >> > poll)
> > >> >
> > >> > 5. wait *another full statestore polling interval* (2 seconds) after
> > >> step
> > >> > #3 above, at which point we deliver the metadata update to the
> > >> statestore
> > >> >
> > >> > 6. wait on average* 1/2 the polling interval* until any particular
> > >> impalad
> > >> > gets the update from #4
> > >> >
> > >> > So, in the absolute best case, we wait one full polling interval (2
> > >> > seconds), and in the worst case we wait two polling intervals (4
> > >> seconds).
> > >> >
> > >> > Has anyone looked into optimizing this at all? It seems like we
> could
> > >> have
> > >> > metadata changes trigger an immediate "collection" into the C++
> side,
> > >> and
> > >> > have the statestore update callback wait ("long poll" style) for an
> > >> update
> > >> > rather than skip if there is nothing available.
> > >> >
> > >> > -Todd
> > >> > --
> > >> > Todd Lipcon
> > >> > Software Engineer, Cloudera
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
> --
> Tianyi Wang
>


Re: Improving latency of catalog update propagation?

2018-08-21 Thread Tianyi Wang
GetCatalogDelta used to block catalogd from executing DDLs and the pending
struct was yet another cache to smooth things a little.

On Tue, Aug 21, 2018 at 11:28 AM Todd Lipcon  wrote:

> One more parting thought: why don't we just call 'GetCatalogDelta()'
> directly from the catalog callback in order to do a direct handoff, instead
> of storing them in this 'pending' struct? Given the statestore uses a
> dedicated thread per subscriber (right?) it seems like it would be fine for
> the update callback to take a long time, no?
>
> -Todd
>
> On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon  wrote:
>
> > Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead
> > of Thrift we'll alleviate some of the scalability issues and maybe we can
> > look into increasing frequency or doing a "push" to the statestore, etc.
> I
> > probably won't work on this in the near term to avoid complicating the
> > ongoing changes with catalog.
> >
> > -Todd
> >
> > On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong  >
> > wrote:
> >
> >> This is somewhat relevant for admission control too - I had thought
> about
> >> some of these issues in that context, because reducing the latency of
> >> admission control state propagation helps avoid overadmission but
> having
> >> a
> >> very low statestore frequency is very inefficient and doesn't scale well
> >> to
> >> larger clusters.
> >>
> >> For the catalog updates I agree we could do something with long polls
> >> since
> >> it's a single producer so that the "idle" state of the system has a
> thread
> >> sitting in the update callback on catalogd waiting for an update.
> >>
> >> I'd also thought at one point about allowing subscribers to notify the
> >> statestore that they had something to add to the topic. That could be
> >> treated as a hint to the statestore to schedule the subscriber update
> >> sooner. This would also work for admission control since coordinators
> >> could
> >> notify the statestore when the first query was admitted after the
> previous
> >> statestore update.
> >>
> >> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon  wrote:
> >>
> >> > Hey folks,
> >> >
> >> > In my recent forays into the catalog->statestore->impalad metadata
> >> > propagation code base, I noticed that the latency of any update is
> >> > typically between 2-4 seconds with the standard 2-second statestore
> >> polling
> >> > interval. That's because the code currently works as follows:
> >> >
> >> > 1. in the steady state with no recent metadata changes, the catalogd's
> >> > state is:
> >> > -- topic_updates_ready_ = true
> >> > -- pending_topic_updates_ = empty
> >> >
> >> > 2. some metadata change happens, which modifies the version numbers in
> >> the
> >> > Java catalog but doesn't modify any of the C++ side state
> >> >
> >> > 3. the next statestore poll happens due to the normal interval
> >> expiring. On
> >> > average, this will take *1/2 the polling interval*
> >> > -- this sees that pending_topic_updates_ is empty, so returns no
> >> results.
> >> > -- it sets topic_updates_ready_ = false and triggers the "gather"
> thread
> >> >
> >> > 4. the "gather" thread wakes up and gathers updates, filling in
> >> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to
> true
> >> > (typically subsecond in smallish catalogs, so this happens before the
> >> next
> >> > poll)
> >> >
> >> > 5. wait *another full statestore polling interval* (2 seconds) after
> >> step
> >> > #3 above, at which point we deliver the metadata update to the
> >> statestore
> >> >
> >> > 6. wait on average* 1/2 the polling interval* until any particular
> >> impalad
> >> > gets the update from #4
> >> >
> >> > So, in the absolute best case, we wait one full polling interval (2
> >> > seconds), and in the worst case we wait two polling intervals (4
> >> seconds).
> >> >
> >> > Has anyone looked into optimizing this at all? It seems like we could
> >> have
> >> > metadata changes trigger an immediate "collection" into the C++ side,
> >> and
> >> > have the statestore update callback wait ("long poll" style) for an
> >> update
> >> > rather than skip if there is nothing available.
> >> >
> >> > -Todd
> >> > --
> >> > Todd Lipcon
> >> > Software Engineer, Cloudera
> >> >
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
-- 
Tianyi Wang


Re: Impalad JVM OOM minutes after restart

2018-08-21 Thread Brock Noland
Jeezy - yes, unfortunately I cannot share the query details at this
time. No hs_err file was generated.

Philip - Yeah that seems to be the way to go.

On Tue, Aug 21, 2018 at 1:51 PM, Philip Zeyliger  wrote:
> Hi Brock,
>
> If you want to make Eclipse MAT more usable, set JAVA_TOOL_OPTIONS="-Xmx2g
>  -XX:+HeapDumpOnOutOfMemoryError" and you should see the max heap at 2GB,
> thereby making Eclipse MAT friendlier. Folks have also been using
> http://www.jxray.com/.
>
> The query itself will also be interesting. If there's something like a
> loop in analyzing it, you could imagine that showing up as an OOM. The heap
> dump should tell us.
>
> -- Philip
>
> On Tue, Aug 21, 2018 at 11:32 AM Brock Noland  wrote:
>
>> Hi Jeezy,
>>
>> Thanks, good tip.
>>
>> The MS is quite small. Even mysqldump format is only 12MB. The largest
>> catalog-update I could find is only 1.5MB which should be easy to
>> process with 32GB of heap. Lastly, it's possible we can reproduce
>> by running the query the impalad was processing during the issue,
>> going to wait until after the users head home to try, but it doesn't
>> appear reproducible in the method you describe. When we restarted, it
>> did not reproduce until users started running queries.
>>
>> I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial
>> catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB
>>
>> Brock
>>
>> On Tue, Aug 21, 2018 at 1:18 PM, Jeszy  wrote:
>> > Hey,
>> >
>> > If it happens shortly after a restart, there is a fair chance you're
>> > crashing while processing the initial catalog topic update. Statestore
>> > logs will tell you how big that was (it takes more memory to process
>> > it than the actual size of the update).
>> > If this is the case, it should also be reproducible, i.e. the daemon
>> > will keep restarting and running OOM on initial update until you clear
>> > the metadata cache either by restarting catalog or via a (global)
>> > invalidate metadata.
>> >
>> > HTH
>> > On Tue, 21 Aug 2018 at 20:13, Brock Noland  wrote:
>> >>
>> >> Hi folks,
>> >>
>> >> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at
>> >> any one time. All of a sudden the JVM inside the Impalad started
>> >> running out of memory.
>> >>
>> >> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very
>> >> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open
>> >> it. I was able to get JHAT to open it when setting JHAT's heap to
>> >> 160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't
>> >> work.
>> >>
>> >> I am spelunking around, but really curious if there are some places I
>> >> should check.
>> >>
>> >> I am only an occasional reader of Impala source so I am just pointing
>> >> out things which felt interesting:
>> >>
>> >> * Impalad was restarted shortly before the JVM OOM
>> >> * Joining Parquet on S3 with Kudu
>> >> * Only 13  instances of org.apache.impala.catalog.HdfsTable
>> >> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels
>> >> odd to me. I remember one bug a while back in Hive when it would clone
>> >> the query tree until it ran OOM.
>> >> * 176796 of those _user fields point at the same user
>> >> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048
>> >> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at
>> >> it.
>> >> *  There is only a single instance of
>> >> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to
>> >> indicate there is only a single query running. I've tracked that query
>> >> down in CM. The users need to compute stats, but I don't feel that is
>> >> relevant to this JVM OOM condition.
>> >>
>> >> Any pointers on what I might look for?
>> >>
>> >> Cheers,
>> >> Brock
>>


Re: Impalad JVM OOM minutes after restart

2018-08-21 Thread Philip Zeyliger
Hi Brock,

If you want to make Eclipse MAT more usable, set JAVA_TOOL_OPTIONS="-Xmx2g
 -XX:+HeapDumpOnOutOfMemoryError" and you should see the max heap at 2GB,
thereby making Eclipse MAT friendlier. Folks have also been using
http://www.jxray.com/.

The query itself will also be interesting. If there's something like a
loop in analyzing it, you could imagine that showing up as an OOM. The heap
dump should tell us.

-- Philip

On Tue, Aug 21, 2018 at 11:32 AM Brock Noland  wrote:

> Hi Jeezy,
>
> Thanks, good tip.
>
> The MS is quite small. Even mysqldump format is only 12MB. The largest
> catalog-update I could find is only 1.5MB which should be easy to
> process with 32GB of heap. Lastly, it's possible we can reproduce
> by running the query the impalad was processing during the issue,
> going to wait until after the users head home to try, but it doesn't
> appear reproducible in the method you describe. When we restarted, it
> did not reproduce until users started running queries.
>
> I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial
> catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB
>
> Brock
>
> On Tue, Aug 21, 2018 at 1:18 PM, Jeszy  wrote:
> > Hey,
> >
> > If it happens shortly after a restart, there is a fair chance you're
> > crashing while processing the initial catalog topic update. Statestore
> > logs will tell you how big that was (it takes more memory to process
> > it than the actual size of the update).
> > If this is the case, it should also be reproducible, i.e. the daemon
> > will keep restarting and running OOM on initial update until you clear
> > the metadata cache either by restarting catalog or via a (global)
> > invalidate metadata.
> >
> > HTH
> > On Tue, 21 Aug 2018 at 20:13, Brock Noland  wrote:
> >>
> >> Hi folks,
> >>
> >> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at
> >> any one time. All of a sudden the JVM inside the Impalad started
> >> running out of memory.
> >>
> >> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very
> >> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open
> >> it. I was able to get JHAT to open it when setting JHAT's heap to
> >> 160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't
> >> work.
> >>
> >> I am spelunking around, but really curious if there are some places I
> >> should check.
> >>
> >> I am only an occasional reader of Impala source so I am just pointing
> >> out things which felt interesting:
> >>
> >> * Impalad was restarted shortly before the JVM OOM
> >> * Joining Parquet on S3 with Kudu
> >> * Only 13  instances of org.apache.impala.catalog.HdfsTable
> >> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels
> >> odd to me. I remember one bug a while back in Hive when it would clone
> >> the query tree until it ran OOM.
> >> * 176796 of those _user fields point at the same user
> >> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048
> >> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at
> >> it.
> >> *  There is only a single instance of
> >> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to
> >> indicate there is only a single query running. I've tracked that query
> >> down in CM. The users need to compute stats, but I don't feel that is
> >> relevant to this JVM OOM condition.
> >>
> >> Any pointers on what I might look for?
> >>
> >> Cheers,
> >> Brock
>


Re: Impalad JVM OOM minutes after restart

2018-08-21 Thread Jeszy
Hm, that's interesting because:
- I haven't yet seen query planning itself cause OOM
- if it were catalog data related to the tables involved in the query, the
initial topic update following the restart would be bigger

Can you share diagnostic data, like the query text, definitions and
stats for the tables involved, the hs_err_pid file written on crash, etc.?
On Tue, 21 Aug 2018 at 20:32, Brock Noland  wrote:
>
> Hi Jeezy,
>
> Thanks, good tip.
>
> The MS is quite small. Even mysqldump format is only 12MB. The largest
> catalog-update I could find is only 1.5MB which should be easy to
> process with 32GB of heap. Lastly, it's possible we can reproduce
> by running the query the impalad was processing during the issue,
> going to wait until after the users head home to try, but it doesn't
> appear reproducible in the method you describe. When we restarted, it
> did not reproduce until users started running queries.
>
> I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial
> catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB
>
> Brock
>
> On Tue, Aug 21, 2018 at 1:18 PM, Jeszy  wrote:
> > Hey,
> >
> > If it happens shortly after a restart, there is a fair chance you're
> > crashing while processing the initial catalog topic update. Statestore
> > logs will tell you how big that was (it takes more memory to process
> > it than the actual size of the update).
> > If this is the case, it should also be reproducible, i.e. the daemon
> > will keep restarting and running OOM on initial update until you clear
> > the metadata cache either by restarting catalog or via a (global)
> > invalidate metadata.
> >
> > HTH
> > On Tue, 21 Aug 2018 at 20:13, Brock Noland  wrote:
> >>
> >> Hi folks,
> >>
> >> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at
> >> any one time. All of a sudden the JVM inside the Impalad started
> >> running out of memory.
> >>
> >> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very
> >> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open
> >> it. I was able to get JHAT to open it when setting JHAT's heap to
> >> 160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't
> >> work.
> >>
> >> I am spelunking around, but really curious if there are some places I
> >> should check.
> >>
> >> I am only an occasional reader of Impala source so I am just pointing
> >> out things which felt interesting:
> >>
> >> * Impalad was restarted shortly before the JVM OOM
> >> * Joining Parquet on S3 with Kudu
> >> * Only 13  instances of org.apache.impala.catalog.HdfsTable
> >> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels
> >> odd to me. I remember one bug a while back in Hive when it would clone
> >> the query tree until it ran OOM.
> >> * 176796 of those _user fields point at the same user
> >> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048
> >> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at
> >> it.
> >> *  There is only a single instance of
> >> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to
> >> indicate there is only a single query running. I've tracked that query
> >> down in CM. The users need to compute stats, but I don't feel that is
> >> relevant to this JVM OOM condition.
> >>
> >> Any pointers on what I might look for?
> >>
> >> Cheers,
> >> Brock


Re: Impalad JVM OOM minutes after restart

2018-08-21 Thread Brock Noland
Hi Jeezy,

Thanks, good tip.

The MS is quite small. Even mysqldump format is only 12MB. The largest
catalog-update I could find is only 1.5MB which should be easy to
process with 32GB of heap. Lastly, it's possible we can reproduce
by running the query the impalad was processing during the issue,
going to wait until after the users head home to try, but it doesn't
appear reproducible in the method you describe. When we restarted, it
did not reproduce until users started running queries.

I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial
catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB

Brock

On Tue, Aug 21, 2018 at 1:18 PM, Jeszy  wrote:
> Hey,
>
> If it happens shortly after a restart, there is a fair chance you're
> crashing while processing the initial catalog topic update. Statestore
> logs will tell you how big that was (it takes more memory to process
> it than the actual size of the update).
> If this is the case, it should also be reproducible, i.e. the daemon
> will keep restarting and running OOM on initial update until you clear
> the metadata cache either by restarting catalog or via a (global)
> invalidate metadata.
>
> HTH
> On Tue, 21 Aug 2018 at 20:13, Brock Noland  wrote:
>>
>> Hi folks,
>>
>> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at
>> any one time. All of a sudden the JVM inside the Impalad started
>> running out of memory.
>>
>> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very
>> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open
>> it. I was able to get JHAT to open it when setting JHAT's heap to
>> 160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't
>> work.
>>
>> I am spelunking around, but really curious if there are some places I
>> should check.
>>
>> I am only an occasional reader of Impala source so I am just pointing
>> out things which felt interesting:
>>
>> * Impalad was restarted shortly before the JVM OOM
>> * Joining Parquet on S3 with Kudu
>> * Only 13  instances of org.apache.impala.catalog.HdfsTable
>> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels
>> odd to me. I remember one bug a while back in Hive when it would clone
>> the query tree until it ran OOM.
>> * 176796 of those _user fields point at the same user
>> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048
>> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at
>> it.
>> *  There is only a single instance of
>> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to
>> indicate there is only a single query running. I've tracked that query
>> down in CM. The users need to compute stats, but I don't feel that is
>> relevant to this JVM OOM condition.
>>
>> Any pointers on what I might look for?
>>
>> Cheers,
>> Brock


Re: Improving latency of catalog update propagation?

2018-08-21 Thread Todd Lipcon
One more parting thought: why don't we just call 'GetCatalogDelta()'
directly from the catalog callback in order to do a direct handoff, instead
of storing them in this 'pending' struct? Given the statestore uses a
dedicated thread per subscriber (right?) it seems like it would be fine for
the update callback to take a long time, no?
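
For illustration, a direct handoff would look roughly like the toy sketch
below (standard C++ only; GetCatalogDelta() here is a stand-in stub rather
than the real call into the Java catalog, and none of this is the actual
Impala code):

#include <iostream>
#include <string>
#include <vector>

using TopicDelta = std::vector<std::string>;

// Hypothetical stand-in for the (potentially slow) catalog delta collection
// that today runs on the separate "gather" thread.
TopicDelta GetCatalogDelta() {
  return {"functions v17", "db1.table1 v42"};
}

// Statestore update callback for the catalog topic. With a dedicated
// statestore thread per subscriber, blocking in here only delays this
// subscriber's own topic updates.
TopicDelta CatalogUpdateCallback() {
  // Direct handoff: no pending_topic_updates_ buffer in between.
  return GetCatalogDelta();
}

int main() {
  for (const std::string& entry : CatalogUpdateCallback()) {
    std::cout << entry << std::endl;
  }
  return 0;
}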

-Todd

On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon  wrote:

> Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead
> of Thrift we'll alleviate some of the scalability issues and maybe we can
> look into increasing frequency or doing a "push" to the statestore, etc. I
> probably won't work on this in the near term to avoid complicating the
> ongoing changes with catalog.
>
> -Todd
>
> On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong 
> wrote:
>
>> This is somewhat relevant for admission control too - I had thought about
>> some of these issues in that context, because reducing the latency of
>> admission control state propagation helps avoid overadmission but having
>> a
>> very low statestore frequency is very inefficient and doesn't scale well
>> to
>> larger clusters.
>>
>> For the catalog updates I agree we could do something with long polls
>> since
>> it's a single producer so that the "idle" state of the system has a thread
>> sitting in the update callback on catalogd waiting for an update.
>>
>> I'd also thought at one point about allowing subscribers to notify the
>> statestore that they had something to add to the topic. That could be
>> treated as a hint to the statestore to schedule the subscriber update
>> sooner. This would also work for admission control since coordinators
>> could
>> notify the statestore when the first query was admitted after the previous
>> statestore update.
>>
>> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon  wrote:
>>
>> > Hey folks,
>> >
>> > In my recent forays into the catalog->statestore->impalad metadata
>> > propagation code base, I noticed that the latency of any update is
>> > typically between 2-4 seconds with the standard 2-second statestore
>> polling
>> > interval. That's because the code currently works as follows:
>> >
>> > 1. in the steady state with no recent metadata changes, the catalogd's
>> > state is:
>> > -- topic_updates_ready_ = true
>> > -- pending_topic_updates_ = empty
>> >
>> > 2. some metadata change happens, which modifies the version numbers in
>> the
>> > Java catalog but doesn't modify any of the C++ side state
>> >
>> > 3. the next statestore poll happens due to the normal interval
>> expiring. On
>> > average, this will take *1/2 the polling interval*
>> > -- this sees that pending_topic_updates_ is empty, so returns no
>> results.
>> > -- it sets topic_updates_ready_ = false and triggers the "gather" thread
>> >
>> > 4. the "gather" thread wakes up and gathers updates, filling in
>> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
>> > (typically subsecond in smallish catalogs, so this happens before the
>> next
>> > poll)
>> >
>> > 5. wait *another full statestore polling interval* (2 seconds) after
>> step
>> > #3 above, at which point we deliver the metadata update to the
>> statestore
>> >
>> > 6. wait on average* 1/2 the polling interval* until any particular
>> impalad
>> > gets the update from #4
>> >
>> > So, in the absolute best case, we wait one full polling interval (2
>> > seconds), and in the worst case we wait two polling intervals (4
>> seconds).
>> >
>> > Has anyone looked into optimizing this at all? It seems like we could
>> have
>> > metadata changes trigger an immediate "collection" into the C++ side,
>> and
>> > have the statestore update callback wait ("long poll" style) for an
>> update
>> > rather than skip if there is nothing available.
>> >
>> > -Todd
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Impalad JVM OOM minutes after restart

2018-08-21 Thread Jeszy
Hey,

If it happens shortly after a restart, there is a fair chance you're
crashing while processing the initial catalog topic update. Statestore
logs will tell you how big that was (it takes more memory to process
it than the actual size of the update).
If this is the case, it should also be reproducible, i.e. the daemon
will keep restarting and running OOM on initial update until you clear
the metadata cache either by restarting catalog or via a (global)
invalidate metadata.

HTH
On Tue, 21 Aug 2018 at 20:13, Brock Noland  wrote:
>
> Hi folks,
>
> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at
> any one time. All of a sudden the JVM inside the Impalad started
> running out of memory.
>
> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very
> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open
> it. I was able to get JHAT to open it when setting JHAT's heap to
> 160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't
> work.
>
> I am spelunking around, but really curious if there are some places I
> should check.
>
> I am only an occasional reader of Impala source so I am just pointing
> out things which felt interesting:
>
> * Impalad was restarted shortly before the JVM OOM
> * Joining Parquet on S3 with Kudu
> * Only 13  instances of org.apache.impala.catalog.HdfsTable
> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels
> odd to me. I remember one bug a while back in Hive when it would clone
> the query tree until it ran OOM.
> * 176796 of those _user fields point at the same user
> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048
> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at
> it.
> *  There is only a single instance of
> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to
> indicate there is only a single query running. I've tracked that query
> down in CM. The users need to compute stats, but I don't feel that is
> relevant to this JVM OOM condition.
>
> Any pointers on what I might look for?
>
> Cheers,
> Brock


Impalad JVM OOM minutes after restart

2018-08-21 Thread Brock Noland
Hi folks,

I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at
any one time. All of a sudden the JVM inside the Impalad started
running out of memory.

I got a heap dump, but the heap was 32GB, host is 240GB, so it's very
large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open
it. I was able to get JHAT to open it when setting JHAT's heap to
160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't
work.

I am spelunking around, but really curious if there are some places I
should check.

I am only an occasional reader of Impala source so I am just pointing
out things which felt interesting:

* Impalad was restarted shortly before the JVM OOM
* Joining Parquet on S3 with Kudu
* Only 13  instances of org.apache.impala.catalog.HdfsTable
* 176836 instances of org.apache.impala.analysis.Analyzer - this feels
odd to me. I remember one bug a while back in Hive when it would clone
the query tree until it ran OOM.
* 176796 of those _user fields point at the same user
* org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048
org.apache.impala.analysis.Analyzer@GlobalState objects pointing at
it.
*  There is only a single instance of
org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to
indicate there is only a single query running. I've tracked that query
down in CM. The users need to compute stats, but I don't feel that is
relevant to this JVM OOM condition.

Any pointers on what I might look for?

Cheers,
Brock


Re: Improving latency of catalog update propagation?

2018-08-21 Thread Todd Lipcon
Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead of
Thrift we'll alleviate some of the scalability issues and maybe we can look
into increasing frequency or doing a "push" to the statestore, etc. I
probably won't work on this in the near term to avoid complicating the
ongoing changes with catalog.

-Todd

On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong 
wrote:

> This is somewhat relevant for admission control too - I had thought about
> some of these issues in that context, because reducing the latency of
> admission control state propagation helps avoid overadmission but having a
> very low statestore frequency is very inefficient and doesn't scale well to
> larger clusters.
>
> For the catalog updates I agree we could do something with long polls since
> it's a single producer so that the "idle" state of the system has a thread
> sitting in the update callback on catalogd waiting for an update.
>
> I'd also thought at one point about allowing subscribers to notify the
> statestore that they had something to add to the topic. That could be
> treated as a hint to the statestore to schedule the subscriber update
> sooner. This would also work for admission control since coordinators could
> notify the statestore when the first query was admitted after the previous
> statestore update.
>
> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon  wrote:
>
> > Hey folks,
> >
> > In my recent forays into the catalog->statestore->impalad metadata
> > propagation code base, I noticed that the latency of any update is
> > typically between 2-4 seconds with the standard 2-second statestore
> polling
> > interval. That's because the code currently works as follows:
> >
> > 1. in the steady state with no recent metadata changes, the catalogd's
> > state is:
> > -- topic_updates_ready_ = true
> > -- pending_topic_updates_ = empty
> >
> > 2. some metadata change happens, which modifies the version numbers in
> the
> > Java catalog but doesn't modify any of the C++ side state
> >
> > 3. the next statestore poll happens due to the normal interval expiring.
> On
> > average, this will take *1/2 the polling interval*
> > -- this sees that pending_topic_updates_ is empty, so returns no results.
> > -- it sets topic_updates_ready_ = false and triggers the "gather" thread
> >
> > 4. the "gather" thread wakes up and gathers updates, filling in
> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
> > (typically subsecond in smallish catalogs, so this happens before the
> next
> > poll)
> >
> > 5. wait *another full statestore polling interval* (2 seconds) after step
> > #3 above, at which point we deliver the metadata update to the statestore
> >
> > 6. wait on average* 1/2 the polling interval* until any particular
> impalad
> > gets the update from #4
> >
> > So, in the absolute best case, we wait one full polling interval (2
> > seconds), and in the worst case we wait two polling intervals (4
> seconds).
> >
> > Has anyone looked into optimizing this at all? It seems like we could
> have
> > metadata changes trigger an immediate "collection" into the C++ side, and
> > have the statestore update callback wait ("long poll" style) for an
> update
> > rather than skip if there is nothing available.
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Improving latency of catalog update propagation?

2018-08-21 Thread Tim Armstrong
This is somewhat relevant for admission control too - I had thought about
some of these issues in that context, because reducing the latency of
admission control state propagation helps avoid overadmission but having a
very low statestore frequency is very inefficient and doesn't scale well to
larger clusters.

For the catalog updates I agree we could do something with long polls since
it's a single producer so that the "idle" state of the system has a thread
sitting in the update callback on catalogd waiting for an update.

I'd also thought at one point about allowing subscribers to notify the
statestore that they had something to add to the topic. That could be
treated as a hint to the statestore to schedule the subscriber update
sooner. This would also work for admission control since coordinators could
notify the statestore when the first query was admitted after the previous
statestore update.
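
To sketch what such a hint could look like (a minimal standard-C++
illustration with entirely hypothetical names, not the actual statestore
code): the per-subscriber scheduling loop sleeps until its next planned
update, and a "topic dirty" notification simply pulls that deadline forward;
without hints the behavior stays exactly the same as today.

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

using Clock = std::chrono::steady_clock;

class SubscriberSchedule {
 public:
  explicit SubscriberSchedule(std::chrono::milliseconds interval)
      : interval_(interval), next_update_(Clock::now() + interval) {}

  // Runs on the statestore's per-subscriber thread: sleep until the next
  // scheduled update, waking early if a hint moves the deadline forward.
  void WaitForNextUpdate() {
    std::unique_lock<std::mutex> l(lock_);
    while (Clock::now() < next_update_) {
      cv_.wait_until(l, next_update_);
    }
    next_update_ = Clock::now() + interval_;  // schedule the following update
  }

  // Hint from a subscriber (e.g. catalogd with a pending delta, or a
  // coordinator that just admitted a query): update as soon as possible.
  void NotifyTopicDirty() {
    {
      std::lock_guard<std::mutex> l(lock_);
      next_update_ = Clock::now();
    }
    cv_.notify_one();
  }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  const std::chrono::milliseconds interval_;
  Clock::time_point next_update_;
};

int main() {
  SubscriberSchedule sched(std::chrono::seconds(2));
  sched.NotifyTopicDirty();  // a change arrives right after the last update
  Clock::time_point start = Clock::now();
  sched.WaitForNextUpdate();  // returns almost immediately instead of in ~2s
  std::cout << "waited "
            << std::chrono::duration_cast<std::chrono::milliseconds>(
                   Clock::now() - start).count()
            << " ms" << std::endl;
  return 0;
}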

On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon  wrote:

> Hey folks,
>
> In my recent forays into the catalog->statestore->impalad metadata
> propagation code base, I noticed that the latency of any update is
> typically between 2-4 seconds with the standard 2-second statestore polling
> interval. That's because the code currently works as follows:
>
> 1. in the steady state with no recent metadata changes, the catalogd's
> state is:
> -- topic_updates_ready_ = true
> -- pending_topic_updates_ = empty
>
> 2. some metadata change happens, which modifies the version numbers in the
> Java catalog but doesn't modify any of the C++ side state
>
> 3. the next statestore poll happens due to the normal interval expiring. On
> average, this will take *1/2 the polling interval*
> -- this sees that pending_topic_updates_ is empty, so returns no results.
> -- it sets topic_updates_ready_ = false and triggers the "gather" thread
>
> 4. the "gather" thread wakes up and gathers updates, filling in
> 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
> (typically subsecond in smallish catalogs, so this happens before the next
> poll)
>
> 5. wait *another full statestore polling interval* (2 seconds) after step
> #3 above, at which point we deliver the metadata update to the statestore
>
> 6. wait on average* 1/2 the polling interval* until any particular impalad
> gets the update from #4
>
> So, in the absolute best case, we wait one full polling interval (2
> seconds), and in the worst case we wait two polling intervals (4 seconds).
>
> Has anyone looked into optimizing this at all? It seems like we could have
> metadata changes trigger an immediate "collection" into the C++ side, and
> have the statestore update callback wait ("long poll" style) for an update
> rather than skip if there is nothing available.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Improving latency of catalog update propagation?

2018-08-21 Thread Todd Lipcon
Hey folks,

In my recent forays into the catalog->statestore->impalad metadata
propagation code base, I noticed that the latency of any update is
typically between 2-4 seconds with the standard 2-second statestore polling
interval. That's because the code currently works as follows:

1. in the steady state with no recent metadata changes, the catalogd's
state is:
-- topic_updates_ready_ = true
-- pending_topic_updates_ = empty

2. some metadata change happens, which modifies the version numbers in the
Java catalog but doesn't modify any of the C++ side state

3. the next statestore poll happens due to the normal interval expiring. On
average, this will take *1/2 the polling interval*
-- this sees that pending_topic_updates_ is empty, so returns no results.
-- it sets topic_updates_ready_ = false and triggers the "gather" thread

4. the "gather" thread wakes up and gathers updates, filling in
'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
(typically subsecond in smallish catalogs, so this happens before the next
poll)

5. wait *another full statestore polling interval* (2 seconds) after step
#3 above, at which point we deliver the metadata update to the statestore

6. wait on average* 1/2 the polling interval* until any particular impalad
gets the update from #4

So, in the absolute best case, we wait one full polling interval (2
seconds), and in the worst case we wait two polling intervals (4 seconds).

Has anyone looked into optimizing this at all? It seems like we could have
metadata changes trigger an immediate "collection" into the C++ side, and
have the statestore update callback wait ("long poll" style) for an update
rather than skip if there is nothing available.
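
To make the long-poll idea concrete, here is a rough, self-contained sketch
(standard C++ only; apart from the pending_topic_updates_ and
topic_updates_ready_ names, everything below is made up for illustration and
is not the actual Impala code). The callback blocks on a condition variable
until the gather thread publishes a delta, instead of returning empty and
waiting out another full poll, which would effectively fold step 5 into
step 3:

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct CatalogTopicState {
  std::mutex lock;
  std::condition_variable update_ready_cv;
  std::vector<std::string> pending_topic_updates_;
  bool topic_updates_ready_ = false;

  // Called by the "gather" thread once it has collected a delta.
  void PublishUpdate(std::vector<std::string> delta) {
    {
      std::lock_guard<std::mutex> l(lock);
      pending_topic_updates_ = std::move(delta);
      topic_updates_ready_ = true;
    }
    update_ready_cv.notify_all();
  }

  // Called from the statestore update callback: wait ("long poll" style) for
  // an update instead of returning immediately with nothing.
  std::vector<std::string> WaitForUpdate(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(lock);
    update_ready_cv.wait_for(l, timeout, [this] {
      return topic_updates_ready_ && !pending_topic_updates_.empty();
    });
    std::vector<std::string> result;
    result.swap(pending_topic_updates_);
    topic_updates_ready_ = false;
    return result;  // empty if the timeout expired with nothing pending
  }
};

int main() {
  CatalogTopicState state;
  // Simulated gather thread: a metadata change finishes gathering after 500ms.
  std::thread gather([&state] {
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    state.PublishUpdate({"db1.table1 v42"});
  });
  // Simulated statestore callback: the delta is delivered on this poll rather
  // than on the one after the gather finishes.
  std::vector<std::string> delta = state.WaitForUpdate(std::chrono::seconds(2));
  std::cout << "delivered " << delta.size() << " topic entries" << std::endl;
  gather.join();
  return 0;
}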

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Range partition on HDFS

2018-08-21 Thread Vuk Ercegovac
Were you thinking of something like this?
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_partitioning.html

On Tue, Aug 21, 2018 at 7:37 AM Yuming Wang  wrote:

> Hi,
>
> Only Kudu supports range partitioning. Can HDFS support this feature?
>
>
> https://kudu.apache.org/docs/kudu_impala_integration.html#basic_partitioning
>
> Thanks.
>


Re: Re: New Impala committer - Quanlong Huang

2018-08-21 Thread Zoltan Borok-Nagy
Congrats Quanlong!

On Tue, Aug 21, 2018 at 9:34 AM Gabor Kaszab 
wrote:

> Congrats!
>
> On Sat, Aug 18, 2018 at 3:11 AM Quanlong Huang 
> wrote:
>
> > Thanks! Glad to work with you all! --Quanlong
> >
> > At 2018-08-18 03:09:38, "Yongjun Zhang"  wrote:
> > >Congratulations Quanlong!
> > >
> > >--Yongjun
> > >
> > >On Fri, Aug 17, 2018 at 12:07 PM, Jeszy  wrote:
> > >
> > >> Congrats Quanlong!
> > >>
> > >> On 17 August 2018 at 19:51, Csaba Ringhofer  >
> > >> wrote:
> > >> > Congrats!
> > >> >
> > >> > On Fri, Aug 17, 2018 at 6:32 PM, Philip Zeyliger <
> phi...@cloudera.com
> > >
> > >> > wrote:
> > >> >
> > >> >> Congrats!
> > >> >>
> > >> >> On Fri, Aug 17, 2018 at 9:29 AM Tim Armstrong <
> > tarmstr...@cloudera.com>
> > >> >> wrote:
> > >> >>
> > >> >> >  The Project Management Committee (PMC) for Apache Impala has
> > invited
> > >> >> > Quanlong Huang to become a committer and we are pleased to
> announce
> > >> that
> > >> >> > they have accepted. Congratulations and welcome, Quanlong Huang!
> > >> >> >
> > >> >>
> > >>
> >
>


Range partition on HDFS

2018-08-21 Thread Yuming Wang
Hi,

Only Kudu supports range partitioning. Can HDFS support this feature?

https://kudu.apache.org/docs/kudu_impala_integration.html#basic_partitioning

Thanks.


Re: Improving Kudu Build Support

2018-08-21 Thread Laszlo Gaal
+1 for simplifying Kudu updates.

I am also still on Ubuntu 14.04, but I am all for simplifying Kudu
integration:
I agree with Thomas that Kudu snapshots should be grouped with the other
CDH components.
Given that Ubuntu 14.04 will be EOL'd next spring, upgrading the
development OS
is a reasonably small price to pay -- especially since it will soon become
necessary anyway.

Thanks for doing this Thomas!

  - Laszlo

On Tue, Aug 21, 2018 at 12:34 AM Lars Volker  wrote:

> I'm in favor of not spending developer time and effort to maintain
> compatibility with 14.04. Personally I'm still developing on Ubuntu 14.04
> so I'd be happy if we can support it without much pain. On the other hand
> it EOLs in April 2019, so I might as well go to 18.04 now, should we decide
> to drop support. Maybe not many other folks are on 14.04 after all?
>
>
>
> On Mon, Aug 20, 2018 at 10:06 AM Thomas Tauber-Marshall <
> tmarsh...@cloudera.com> wrote:
>
> > Impala community,
> >
> > For years now, Impala has utilized tarballs built by Cloudera and
> uploaded
> > to S3 for running most of the Hadoop components in the testing
> minicluster.
> > The one exception to this is Kudu, which is instead provided by the
> > toolchain.
> >
> > This was never ideal - native-toolchain makes more sense for libraries
> > where we want to build against a fairly static version, but Kudu is under
> > active development and we'd like to always build against a relatively
> > up-to-date version. As a result, patches just bumping the version of Kudu
> > make up a significant portion of the commit history of native-toolchain.
> >
> > Thanks to work I'm currently doing at Cloudera, there will soon be
> snapshot
> > tarballs of Kudu getting uploaded to S3 along with the other Hadoop
> > components. I would like to propose that Impala switch to using those
> > instead of the toolchain Kudu.
> >
> > One problem here is that the new Kudu tarballs will not be getting built
> > for Ubuntu 14.04, only 16.04, but we still officially say we support
> > development on 14.04.
> >
> > One option here would be to maintain the toolchain Kudu for now and hide
> > downloading of the new tarballs behind a flag. We could also postpone
> some
> > of this work until 14.04 is less common. Or, given that the
> > bootstrap_development script already only supports 16.04, we might want
> to
> > just drop support for building on 14.04.
> >
> > Thoughts?
> >
>


Re: Re: New Impala committer - Quanlong Huang

2018-08-21 Thread Gabor Kaszab
Congrats!

On Sat, Aug 18, 2018 at 3:11 AM Quanlong Huang 
wrote:

> Thanks! Glad to work with you all! --Quanlong
>
> At 2018-08-18 03:09:38, "Yongjun Zhang"  wrote:
> >Congratulations Quanlong!
> >
> >--Yongjun
> >
> >On Fri, Aug 17, 2018 at 12:07 PM, Jeszy  wrote:
> >
> >> Congrats Quanlong!
> >>
> >> On 17 August 2018 at 19:51, Csaba Ringhofer 
> >> wrote:
> >> > Congrats!
> >> >
> >> > On Fri, Aug 17, 2018 at 6:32 PM, Philip Zeyliger  >
> >> > wrote:
> >> >
> >> >> Congrats!
> >> >>
> >> >> On Fri, Aug 17, 2018 at 9:29 AM Tim Armstrong <
> tarmstr...@cloudera.com>
> >> >> wrote:
> >> >>
> >> >> >  The Project Management Committee (PMC) for Apache Impala has
> invited
> >> >> > Quanlong Huang to become a committer and we are pleased to announce
> >> that
> >> >> > they have accepted. Congratulations and welcome, Quanlong Huang!
> >> >> >
> >> >>
> >>
>