Re: New Impala PMC member: Joe McDonnell
Congratulations Joe, great achievement!

--Yongjun

On Tue, Aug 21, 2018 at 3:30 PM, Tim Armstrong wrote:
> The Project Management Committee (PMC) for Apache Impala has invited Joe
> McDonnell to become a PMC member and we are pleased to announce that they
> have accepted.
> Congratulations and welcome, Joe!
Re: Improving Kudu Build Support
Is there a path to building a version of Kudu locally for an arbitrary Linux distro? Personally I am less concerned about 14.04 support and more concerned about what the path to upgrading to 18.04 looks like. It would also be nice for it to be at least possible to develop on RedHat-derived distros, even if it requires some extra effort.

On Tue, Aug 21, 2018 at 6:48 AM, Laszlo Gaal wrote:
> +1 for simplifying Kudu updates.
>
> I am also still on Ubuntu 14.04, but I am all for simplifying Kudu
> integration: I agree with Thomas that Kudu snapshots should be grouped
> with the other CDH components. Given that Ubuntu 14.04 will be EOL'd next
> spring, upgrading the development OS is a reasonably small price to pay --
> especially since it will soon become necessary anyway.
>
> Thanks for doing this Thomas!
>
> - Laszlo
>
> On Tue, Aug 21, 2018 at 12:34 AM Lars Volker wrote:
>
> > I'm in favor of not spending developer time and effort to maintain
> > compatibility with 14.04. Personally I'm still developing on Ubuntu
> > 14.04, so I'd be happy if we can support it without much pain. On the
> > other hand, it EOLs in April 2019, so I might as well go to 18.04 now,
> > should we decide to drop support. Maybe not many other folks are on
> > 14.04 after all?
> >
> > On Mon, Aug 20, 2018 at 10:06 AM Thomas Tauber-Marshall <
> > tmarsh...@cloudera.com> wrote:
> >
> > > Impala community,
> > >
> > > For years now, Impala has utilized tarballs built by Cloudera and
> > > uploaded to S3 for running most of the Hadoop components in the
> > > testing minicluster. The one exception to this is Kudu, which is
> > > instead provided by the toolchain.
> > >
> > > This was never ideal - native-toolchain makes more sense for
> > > libraries where we want to build against a fairly static version,
> > > but Kudu is under active development and we'd like to always build
> > > against a relatively up-to-date version. As a result, patches just
> > > bumping the version of Kudu make up a significant portion of the
> > > commit history of native-toolchain.
> > >
> > > Thanks to work I'm currently doing at Cloudera, there will soon be
> > > snapshot tarballs of Kudu getting uploaded to S3 along with the
> > > other Hadoop components. I would like to propose that Impala switch
> > > to using those instead of the toolchain Kudu.
> > >
> > > One problem here is that the new Kudu tarballs will not be getting
> > > built for Ubuntu 14.04, only 16.04, but we still officially say we
> > > support development on 14.04.
> > >
> > > One option here would be to maintain the toolchain Kudu for now and
> > > hide downloading of the new tarballs behind a flag. We could also
> > > postpone some of this work until 14.04 is less common. Or, given
> > > that the bootstrap_development script already only supports 16.04,
> > > we might want to just drop support for building on 14.04.
> > >
> > > Thoughts?
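[Editor's note: the "hide it behind a flag" option discussed above could amount to a small toggle in the bootstrap tooling. A minimal sketch in Python, to make the proposal concrete -- the flag name USE_CDH_KUDU, the URL layout, and the function itself are all hypothetical, not Impala's actual bootstrap code:]

```python
import os

def choose_kudu_source(os_release, env=os.environ):
    """Pick where to download Kudu from: CDH snapshot tarballs on S3
    (fresher, but built only for Ubuntu 16.04+) or the native-toolchain
    build (still supports 14.04).

    The flag name and URL patterns here are illustrative only.
    """
    use_cdh = env.get("USE_CDH_KUDU", "false").lower() == "true"
    if use_cdh and os_release == "ubuntu1404":
        # New tarballs are not built for 14.04; fall back to the toolchain.
        use_cdh = False
    if use_cdh:
        return f"https://example-bucket.s3.amazonaws.com/cdh/kudu-{os_release}.tar.gz"
    return f"https://example-bucket.s3.amazonaws.com/toolchain/kudu-{os_release}.tar.gz"
```

This keeps the toolchain Kudu as the default, so 14.04 developers are unaffected unless they opt in.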
New Impala PMC member: Joe McDonnell
The Project Management Committee (PMC) for Apache Impala has invited Joe McDonnell to become a PMC member and we are pleased to announce that they have accepted. Congratulations and welcome, Joe!
Re: Improving latency of catalog update propagation?
Yeah I think the angle makes sense to pursue. I don't feel strongly about whether now or later is the right time to pursue it, but it does seem like it's not the immediate highest priority. On Tue, Aug 21, 2018 at 1:57 PM, Tianyi Wang wrote: > GetCatalogDelta used to block catalogd from executing DDLs and the pending > struct was yet another cache to smooth things a little. > > On Tue, Aug 21, 2018 at 11:28 AM Todd Lipcon wrote: > > > One more parting thought: why don't we just call 'GetCatalogDelta()' > > directly from the catalog callback in order to do a direct handoff, > instead > > of storing them in this 'pending' struct? Given the statestore uses a > > dedicated thread per subscriber (right?) it seems like it would be fine > for > > the update callback to take a long time, no? > > > > -Todd > > > > On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon wrote: > > > > > Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC > instead > > > of Thrift we'll alleviate some of the scalability issues and maybe we > can > > > look into increasing frequency or doing a "push" to the statestore, > etc. > > I > > > probably won't work on this in the near term to avoid complicating the > > > ongoing changes with catalog. > > > > > > -Todd > > > > > > On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong < > tarmstr...@cloudera.com > > > > > > wrote: > > > > > >> This is somewhat relevant for admission control too - I had thought > > about > > >> some of these issues in that context, because reducing the latency of > > >> admission controls state propagation helps avoid overadmission but > > having > > >> a > > >> very low statestore frequency is very inefficient and doesn't scale > well > > >> to > > >> larger clusters. > > >> > > >> For the catalog updates I agree we could do something with long polls > > >> since > > >> it's a single producer so that the "idle" state of the system has a > > thread > > >> sitting in the update callback on catalogd waiting for an update. 
> > >> > > >> I'd also thought at one point about allowing subscribers to notify the > > >> statestore that they had something to add to the topic. That could be > > >> treated as a hint to the statestore to schedule the subscriber update > > >> sooner. This would also work for admission control since coordinators > > >> could > > >> notify the statestore when the first query was admitted after the > > previous > > >> statestore update. > > >> > > >> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon > wrote: > > >> > > >> > Hey folks, > > >> > > > >> > In my recent forays into the catalog->statestore->impalad metadata > > >> > propagation code base, I noticed that the latency of any update is > > >> > typically between 2-4 seconds with the standard 2-second statestore > > >> polling > > >> > interval. That's because the code currently works as follows: > > >> > > > >> > 1. in the steady state with no recent metadata changes, the > catalogd's > > >> > state is: > > >> > -- topic_updates_ready_ = true > > >> > -- pending_topic_updates_ = empty > > >> > > > >> > 2. some metadata change happens, which modifies the version numbers > in > > >> the > > >> > Java catalog but doesn't modify any of the C++ side state > > >> > > > >> > 3. the next statestore poll happens due to the normal interval > > >> expiring. On > > >> > average, this will take *1/2 the polling interval* > > >> > -- this sees that pending_topic_updates_ is empty, so returns no > > >> results. > > >> > -- it sets topic_updates_ready_ = false and triggers the "gather" > > thread > > >> > > > >> > 4. the "gather" thread wakes up and gathers updates, filling in > > >> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to > > true > > >> > (typically subsecond in smallish catalogs, so this happens before > the > > >> next > > >> > poll) > > >> > > > >> > 5. 
wait *another full statestore polling interval* (2 seconds) after > > >> step > > >> > #3 above, at which point we deliver the metadata update to the > > >> statestore > > >> > > > >> > 6. wait on average* 1/2 the polling interval* until any particular > > >> impalad > > >> > gets the update from #4 > > >> > > > >> > So. in the absolute best case, we wait one full polling interval (2 > > >> > seconds), and in the worst case we wait two polling intervals (4 > > >> seconds). > > >> > > > >> > Has anyone looked into optimizing this at all? It seems like we > could > > >> have > > >> > metadata changes trigger an immediate "collection" into the C++ > side, > > >> and > > >> > have the statestore update callback wait ("long poll" style) for an > > >> update > > >> > rather than skip if there is nothing available. > > >> > > > >> > -Todd > > >> > -- > > >> > Todd Lipcon > > >> > Software Engineer, Cloudera > > >> > > > >> > > > > > > > > > > > > -- > > > Todd Lipcon > > > Software Engineer, Cloudera > > > > > > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- > Tianyi Wang >
Re: Improving latency of catalog update propagation?
GetCatalogDelta used to block catalogd from executing DDLs and the pending struct was yet another cache to smooth things a little. On Tue, Aug 21, 2018 at 11:28 AM Todd Lipcon wrote: > One more parting thought: why don't we just call 'GetCatalogDelta()' > directly from the catalog callback in order to do a direct handoff, instead > of storing them in this 'pending' struct? Given the statestore uses a > dedicated thread per subscriber (right?) it seems like it would be fine for > the update callback to take a long time, no? > > -Todd > > On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon wrote: > > > Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead > > of Thrift we'll alleviate some of the scalability issues and maybe we can > > look into increasing frequency or doing a "push" to the statestore, etc. > I > > probably won't work on this in the near term to avoid complicating the > > ongoing changes with catalog. > > > > -Todd > > > > On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong > > > wrote: > > > >> This is somewhat relevant for admission control too - I had thought > about > >> some of these issues in that context, because reducing the latency of > >> admission controls state propagation helps avoid overadmission but > having > >> a > >> very low statestore frequency is very inefficient and doesn't scale well > >> to > >> larger clusters. > >> > >> For the catalog updates I agree we could do something with long polls > >> since > >> it's a single producer so that the "idle" state of the system has a > thread > >> sitting in the update callback on catalogd waiting for an update. > >> > >> I'd also thought at one point about allowing subscribers to notify the > >> statestore that they had something to add to the topic. That could be > >> treated as a hint to the statestore to schedule the subscriber update > >> sooner. 
This would also work for admission control since coordinators > >> could > >> notify the statestore when the first query was admitted after the > previous > >> statestore update. > >> > >> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon wrote: > >> > >> > Hey folks, > >> > > >> > In my recent forays into the catalog->statestore->impalad metadata > >> > propagation code base, I noticed that the latency of any update is > >> > typically between 2-4 seconds with the standard 2-second statestore > >> polling > >> > interval. That's because the code currently works as follows: > >> > > >> > 1. in the steady state with no recent metadata changes, the catalogd's > >> > state is: > >> > -- topic_updates_ready_ = true > >> > -- pending_topic_updates_ = empty > >> > > >> > 2. some metadata change happens, which modifies the version numbers in > >> the > >> > Java catalog but doesn't modify any of the C++ side state > >> > > >> > 3. the next statestore poll happens due to the normal interval > >> expiring. On > >> > average, this will take *1/2 the polling interval* > >> > -- this sees that pending_topic_updates_ is empty, so returns no > >> results. > >> > -- it sets topic_updates_ready_ = false and triggers the "gather" > thread > >> > > >> > 4. the "gather" thread wakes up and gathers updates, filling in > >> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to > true > >> > (typically subsecond in smallish catalogs, so this happens before the > >> next > >> > poll) > >> > > >> > 5. wait *another full statestore polling interval* (2 seconds) after > >> step > >> > #3 above, at which point we deliver the metadata update to the > >> statestore > >> > > >> > 6. wait on average* 1/2 the polling interval* until any particular > >> impalad > >> > gets the update from #4 > >> > > >> > So. in the absolute best case, we wait one full polling interval (2 > >> > seconds), and in the worst case we wait two polling intervals (4 > >> seconds). 
> >> > > >> > Has anyone looked into optimizing this at all? It seems like we could > >> have > >> > metadata changes trigger an immediate "collection" into the C++ side, > >> and > >> > have the statestore update callback wait ("long poll" style) for an > >> update > >> > rather than skip if there is nothing available. > >> > > >> > -Todd > >> > -- > >> > Todd Lipcon > >> > Software Engineer, Cloudera > >> > > >> > > > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > > > > -- > Todd Lipcon > Software Engineer, Cloudera > -- Tianyi Wang
Re: Impalad JVM OOM minutes after restart
Jeezy - yes unfortunately I cannot share the query details at this time. No hs_err file was generated. Philip - Yeah that seems to be the way to go. On Tue, Aug 21, 2018 at 1:51 PM, Philip Zeyliger wrote: > Hi Brock, > > If you want to make Eclipse MAT more usable, set JAVA_TOOL_OPTIONS="-Xmx2g > -XX:+HeapDumpOnOutOfMemoryError" and you should see the max heap at 2GB, > thereby making Eclipse MAT friendlier. Folks have also been using > http://www.jxray.com/. > > The query itself will also be interesting. If there's something like an > loop in analyzing it, you could imagine that showing up as an OOM. The heap > dump should tell us. > > -- Philip > > On Tue, Aug 21, 2018 at 11:32 AM Brock Noland wrote: > >> Hi Jeezy, >> >> Thanks, good tip. >> >> The MS is quite small. Even mysqldump format is only 12MB. The largest >> catalog-update I could find is only 1.5MB which should be easy to >> process with 32GB of of heap. Lastly, it's possible we can reproduce >> by running the query the impalad was processing during the issue, >> going to wait until after the users head home to try, but it doesn't >> appear reproducible in the method you describe. When we restarted, it >> did not reproduce until users started running queries. >> >> I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial >> catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB >> >> Brock >> >> On Tue, Aug 21, 2018 at 1:18 PM, Jeszy wrote: >> > Hey, >> > >> > If it happens shortly after a restart, there is a fair chance you're >> > crashing while processing the initial catalog topic update. Statestore >> > logs will tell you how big that was (it takes more memory to process >> > it than the actual size of the update). >> > If this is the case, it should also be reproducible, ie. the daemon >> > will keep restarting and running OOM on initial update until you clear >> > the metadata cache either by restarting catalog or via a (global) >> > invalidate metadata. 
>> > >> > HTH >> > On Tue, 21 Aug 2018 at 20:13, Brock Noland wrote: >> >> >> >> Hi folks, >> >> >> >> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at >> >> any one time. All of a sudden the JVM inside the Impalad started >> >> running out of memory. >> >> >> >> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very >> >> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open >> >> it. I was able to get JHAT to opening it when setting JHAT's heap to >> >> 160GB. It's pretty unwieldy so much of the JHAT functionality doesn't >> >> work. >> >> >> >> I am spelunking around, but really curious if there is some places I >> >> should check >> >> >> >> I am only an occasional reader of Impala source so I am just pointing >> >> out things which felt interesting: >> >> >> >> * Impalad was restarted shortly before the JVM OOM >> >> * Joining Parquet on S3 with Kudu >> >> * Only 13 instances of org.apache.impala.catalog.HdfsTable >> >> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels >> >> odd to me. I remember one bug a while back in Hive when it would clone >> >> the query tree until it ran OOM. >> >> * 176796 of those _user fields point at the same user >> >> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048 >> >> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at >> >> it. >> >> * There is only a single instance of >> >> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to >> >> indicate there is only a single query running. I've tracked that query >> >> down in CM. The users need to compute stats, but I don't feel that is >> >> relevant to this JVM OOM condition. >> >> >> >> Any pointers on what I might look for? >> >> >> >> Cheers, >> >> Brock >>
Re: Impalad JVM OOM minutes after restart
Hi Brock,

If you want to make Eclipse MAT more usable, set JAVA_TOOL_OPTIONS="-Xmx2g -XX:+HeapDumpOnOutOfMemoryError" and you should see the max heap at 2GB, thereby making Eclipse MAT friendlier. Folks have also been using http://www.jxray.com/.

The query itself will also be interesting. If there's something like a loop in analyzing it, you could imagine that showing up as an OOM. The heap dump should tell us.

-- Philip

On Tue, Aug 21, 2018 at 11:32 AM Brock Noland wrote: > Hi Jeezy, > > Thanks, good tip. > > The MS is quite small. Even mysqldump format is only 12MB. The largest > catalog-update I could find is only 1.5MB which should be easy to > process with 32GB of of heap. Lastly, it's possible we can reproduce > by running the query the impalad was processing during the issue, > going to wait until after the users head home to try, but it doesn't > appear reproducible in the method you describe. When we restarted, it > did not reproduce until users started running queries. > > I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial > catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB > > Brock > > On Tue, Aug 21, 2018 at 1:18 PM, Jeszy wrote: > > Hey, > > > > If it happens shortly after a restart, there is a fair chance you're > > crashing while processing the initial catalog topic update. Statestore > > logs will tell you how big that was (it takes more memory to process > > it than the actual size of the update). > > If this is the case, it should also be reproducible, ie. the daemon > > will keep restarting and running OOM on initial update until you clear > > the metadata cache either by restarting catalog or via a (global) > > invalidate metadata. > > > > HTH > > On Tue, 21 Aug 2018 at 20:13, Brock Noland wrote: > >> > >> Hi folks, > >> > >> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at > >> any one time. All of a sudden the JVM inside the Impalad started > >> running out of memory.
> >> > >> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very > >> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open > >> it. I was able to get JHAT to opening it when setting JHAT's heap to > >> 160GB. It's pretty unwieldy so much of the JHAT functionality doesn't > >> work. > >> > >> I am spelunking around, but really curious if there is some places I > >> should check > >> > >> I am only an occasional reader of Impala source so I am just pointing > >> out things which felt interesting: > >> > >> * Impalad was restarted shortly before the JVM OOM > >> * Joining Parquet on S3 with Kudu > >> * Only 13 instances of org.apache.impala.catalog.HdfsTable > >> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels > >> odd to me. I remember one bug a while back in Hive when it would clone > >> the query tree until it ran OOM. > >> * 176796 of those _user fields point at the same user > >> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048 > >> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at > >> it. > >> * There is only a single instance of > >> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to > >> indicate there is only a single query running. I've tracked that query > >> down in CM. The users need to compute stats, but I don't feel that is > >> relevant to this JVM OOM condition. > >> > >> Any pointers on what I might look for? > >> > >> Cheers, > >> Brock >
Re: Impalad JVM OOM minutes after restart
Hm, that's interesting because:
- I haven't yet seen query planning itself cause OOM
- if it was catalog-related to the tables involved in the query, the following initial topic size would be bigger

Can you share diagnostic data, like the query text, definitions and stats for tables involved, hs_err_pid written on crash, etc?

On Tue, 21 Aug 2018 at 20:32, Brock Noland wrote: > > Hi Jeezy, > > Thanks, good tip. > > The MS is quite small. Even mysqldump format is only 12MB. The largest > catalog-update I could find is only 1.5MB which should be easy to > process with 32GB of of heap. Lastly, it's possible we can reproduce > by running the query the impalad was processing during the issue, > going to wait until after the users head home to try, but it doesn't > appear reproducible in the method you describe. When we restarted, it > did not reproduce until users started running queries. > > I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial > catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB > > Brock > > On Tue, Aug 21, 2018 at 1:18 PM, Jeszy wrote: > > Hey, > > > > If it happens shortly after a restart, there is a fair chance you're > > crashing while processing the initial catalog topic update. Statestore > > logs will tell you how big that was (it takes more memory to process > > it than the actual size of the update). > > If this is the case, it should also be reproducible, ie. the daemon > > will keep restarting and running OOM on initial update until you clear > > the metadata cache either by restarting catalog or via a (global) > > invalidate metadata. > > > > HTH > > On Tue, 21 Aug 2018 at 20:13, Brock Noland wrote: > >> > >> Hi folks, > >> > >> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at > >> any one time. All of a sudden the JVM inside the Impalad started > >> running out of memory. > >> > >> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very > >> large.
Thus I wasn't able to get Memory Analyzer Tool (MAT) to open > >> it. I was able to get JHAT to opening it when setting JHAT's heap to > >> 160GB. It's pretty unwieldy so much of the JHAT functionality doesn't > >> work. > >> > >> I am spelunking around, but really curious if there is some places I > >> should check > >> > >> I am only an occasional reader of Impala source so I am just pointing > >> out things which felt interesting: > >> > >> * Impalad was restarted shortly before the JVM OOM > >> * Joining Parquet on S3 with Kudu > >> * Only 13 instances of org.apache.impala.catalog.HdfsTable > >> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels > >> odd to me. I remember one bug a while back in Hive when it would clone > >> the query tree until it ran OOM. > >> * 176796 of those _user fields point at the same user > >> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048 > >> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at > >> it. > >> * There is only a single instance of > >> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to > >> indicate there is only a single query running. I've tracked that query > >> down in CM. The users need to compute stats, but I don't feel that is > >> relevant to this JVM OOM condition. > >> > >> Any pointers on what I might look for? > >> > >> Cheers, > >> Brock
Re: Impalad JVM OOM minutes after restart
Hi Jeezy,

Thanks, good tip.

The MS is quite small. Even mysqldump format is only 12MB. The largest catalog-update I could find is only 1.5MB, which should be easy to process with 32GB of heap. Lastly, it's possible we can reproduce by running the query the impalad was processing during the issue; going to wait until after the users head home to try, but it doesn't appear reproducible in the method you describe. When we restarted, it did not reproduce until users started running queries.

I0820 19:45:25.106437 25474 statestore.cc:568] Preparing initial catalog-update topic update for impalad@XXX:22000. Size = 1.45 MB

Brock

On Tue, Aug 21, 2018 at 1:18 PM, Jeszy wrote: > Hey, > > If it happens shortly after a restart, there is a fair chance you're > crashing while processing the initial catalog topic update. Statestore > logs will tell you how big that was (it takes more memory to process > it than the actual size of the update). > If this is the case, it should also be reproducible, ie. the daemon > will keep restarting and running OOM on initial update until you clear > the metadata cache either by restarting catalog or via a (global) > invalidate metadata. > > HTH > On Tue, 21 Aug 2018 at 20:13, Brock Noland wrote: >> >> Hi folks, >> >> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at >> any one time. All of a sudden the JVM inside the Impalad started >> running out of memory. >> >> I got a heap dump, but the heap was 32GB, host is 240GB, so it's very >> large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open >> it. I was able to get JHAT to opening it when setting JHAT's heap to >> 160GB. It's pretty unwieldy so much of the JHAT functionality doesn't >> work.
>> >> I am spelunking around, but really curious if there is some places I >> should check >> >> I am only an occasional reader of Impala source so I am just pointing >> out things which felt interesting: >> >> * Impalad was restarted shortly before the JVM OOM >> * Joining Parquet on S3 with Kudu >> * Only 13 instances of org.apache.impala.catalog.HdfsTable >> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels >> odd to me. I remember one bug a while back in Hive when it would clone >> the query tree until it ran OOM. >> * 176796 of those _user fields point at the same user >> * org.apache.impala.thrift.TQueryCt@0x7f90975297f8 has 11048 >> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at >> it. >> * There is only a single instance of >> org.apache.impala.thrift.TQueryCtx alive in the JVM which appears to >> indicate there is only a single query running. I've tracked that query >> down in CM. The users need to compute stats, but I don't feel that is >> relevant to this JVM OOM condition. >> >> Any pointers on what I might look for? >> >> Cheers, >> Brock
Re: Improving latency of catalog update propagation?
One more parting thought: why don't we just call 'GetCatalogDelta()' directly from the catalog callback in order to do a direct handoff, instead of storing them in this 'pending' struct? Given the statestore uses a dedicated thread per subscriber (right?) it seems like it would be fine for the update callback to take a long time, no? -Todd On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon wrote: > Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead > of Thrift we'll alleviate some of the scalability issues and maybe we can > look into increasing frequency or doing a "push" to the statestore, etc. I > probably won't work on this in the near term to avoid complicating the > ongoing changes with catalog. > > -Todd > > On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong > wrote: > >> This is somewhat relevant for admission control too - I had thought about >> some of these issues in that context, because reducing the latency of >> admission controls state propagation helps avoid overadmission but having >> a >> very low statestore frequency is very inefficient and doesn't scale well >> to >> larger clusters. >> >> For the catalog updates I agree we could do something with long polls >> since >> it's a single producer so that the "idle" state of the system has a thread >> sitting in the update callback on catalogd waiting for an update. >> >> I'd also thought at one point about allowing subscribers to notify the >> statestore that they had something to add to the topic. That could be >> treated as a hint to the statestore to schedule the subscriber update >> sooner. This would also work for admission control since coordinators >> could >> notify the statestore when the first query was admitted after the previous >> statestore update. 
>> >> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon wrote: >> >> > Hey folks, >> > >> > In my recent forays into the catalog->statestore->impalad metadata >> > propagation code base, I noticed that the latency of any update is >> > typically between 2-4 seconds with the standard 2-second statestore >> polling >> > interval. That's because the code currently works as follows: >> > >> > 1. in the steady state with no recent metadata changes, the catalogd's >> > state is: >> > -- topic_updates_ready_ = true >> > -- pending_topic_updates_ = empty >> > >> > 2. some metadata change happens, which modifies the version numbers in >> the >> > Java catalog but doesn't modify any of the C++ side state >> > >> > 3. the next statestore poll happens due to the normal interval >> expiring. On >> > average, this will take *1/2 the polling interval* >> > -- this sees that pending_topic_updates_ is empty, so returns no >> results. >> > -- it sets topic_updates_ready_ = false and triggers the "gather" thread >> > >> > 4. the "gather" thread wakes up and gathers updates, filling in >> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true >> > (typically subsecond in smallish catalogs, so this happens before the >> next >> > poll) >> > >> > 5. wait *another full statestore polling interval* (2 seconds) after >> step >> > #3 above, at which point we deliver the metadata update to the >> statestore >> > >> > 6. wait on average* 1/2 the polling interval* until any particular >> impalad >> > gets the update from #4 >> > >> > So. in the absolute best case, we wait one full polling interval (2 >> > seconds), and in the worst case we wait two polling intervals (4 >> seconds). >> > >> > Has anyone looked into optimizing this at all? 
It seems like we could >> have >> > metadata changes trigger an immediate "collection" into the C++ side, >> and >> > have the statestore update callback wait ("long poll" style) for an >> update >> > rather than skip if there is nothing available. >> > >> > -Todd >> > -- >> > Todd Lipcon >> > Software Engineer, Cloudera >> > >> > > > > -- > Todd Lipcon > Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
Re: Impalad JVM OOM minutes after restart
Hey,

If it happens shortly after a restart, there is a fair chance you're crashing while processing the initial catalog topic update. The statestore logs will tell you how big that was (it takes more memory to process it than the actual size of the update). If this is the case, it should also be reproducible, i.e. the daemon will keep restarting and hitting OOM on the initial update until you clear the metadata cache, either by restarting catalogd or via a (global) INVALIDATE METADATA.

HTH

On Tue, 21 Aug 2018 at 20:13, Brock Noland wrote:
> Hi folks,
>
> I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at
> any one time. All of a sudden the JVM inside the Impalad started
> running out of memory.
>
> I got a heap dump, but the heap was 32GB and the host is 240GB, so it's
> very large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open
> it. I was able to get JHAT to open it when setting JHAT's heap to
> 160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't
> work.
>
> I am spelunking around, but really curious if there are some places I
> should check.
>
> I am only an occasional reader of Impala source, so I am just pointing
> out things which felt interesting:
>
> * Impalad was restarted shortly before the JVM OOM
> * Joining Parquet on S3 with Kudu
> * Only 13 instances of org.apache.impala.catalog.HdfsTable
> * 176836 instances of org.apache.impala.analysis.Analyzer - this feels
> odd to me. I remember one bug a while back in Hive where it would clone
> the query tree until it ran OOM.
> * 176796 of those _user fields point at the same user
> * org.apache.impala.thrift.TQueryCtx@0x7f90975297f8 has 11048
> org.apache.impala.analysis.Analyzer@GlobalState objects pointing at it.
> * There is only a single instance of
> org.apache.impala.thrift.TQueryCtx alive in the JVM, which appears to
> indicate there is only a single query running. I've tracked that query
> down in CM. The users need to compute stats, but I don't feel that is
> relevant to this JVM OOM condition.
>
> Any pointers on what I might look for?
>
> Cheers,
> Brock
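To check the size of the initial catalog topic update as suggested above, you can scan the statestore log for topic update sizes. The log-line format below is hypothetical (the exact message text varies by version), so adjust the regex to match what your statestore actually logs; this is only a sketch of the approach:

```python
import re

# Hypothetical log-line pattern: tune this to your statestore's
# actual log format before relying on the results.
TOPIC_SIZE_RE = re.compile(
    r"Topic: (?P<topic>[\w-]+).*?size.*?(?P<bytes>\d+)", re.IGNORECASE)

def largest_topic_updates(log_lines):
    """Return the largest update size seen per topic, in bytes."""
    sizes = {}
    for line in log_lines:
        m = TOPIC_SIZE_RE.search(line)
        if m:
            topic = m.group("topic")
            nbytes = int(m.group("bytes"))
            sizes[topic] = max(sizes.get(topic, 0), nbytes)
    return sizes
```

A multi-gigabyte `catalog-update` entry would support the theory that the daemon is dying while applying the initial update.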
Impalad JVM OOM minutes after restart
Hi folks,

I've got an Impala CDH 5.14.2 cluster with a handful of users, 2-3, at any one time. All of a sudden the JVM inside the Impalad started running out of memory.

I got a heap dump, but the heap was 32GB and the host is 240GB, so it's very large. Thus I wasn't able to get Memory Analyzer Tool (MAT) to open it. I was able to get JHAT to open it when setting JHAT's heap to 160GB. It's pretty unwieldy, so much of the JHAT functionality doesn't work.

I am spelunking around, but really curious if there are some places I should check.

I am only an occasional reader of Impala source, so I am just pointing out things which felt interesting:

* Impalad was restarted shortly before the JVM OOM
* Joining Parquet on S3 with Kudu
* Only 13 instances of org.apache.impala.catalog.HdfsTable
* 176836 instances of org.apache.impala.analysis.Analyzer - this feels odd to me. I remember one bug a while back in Hive where it would clone the query tree until it ran OOM.
* 176796 of those _user fields point at the same user
* org.apache.impala.thrift.TQueryCtx@0x7f90975297f8 has 11048 org.apache.impala.analysis.Analyzer@GlobalState objects pointing at it.
* There is only a single instance of org.apache.impala.thrift.TQueryCtx alive in the JVM, which appears to indicate there is only a single query running. I've tracked that query down in CM. The users need to compute stats, but I don't feel that is relevant to this JVM OOM condition.

Any pointers on what I might look for?

Cheers,
Brock
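For instance counts like the ones above, you don't necessarily need to load the full dump into MAT or JHAT: `jmap -histo <pid>` prints a class histogram that a small script can rank. The column layout assumed below is the common HotSpot format (rank, #instances, #bytes, class name); treat it as an assumption and adjust if your JDK prints something different:

```python
def top_classes(histo_text, n=3):
    """Parse `jmap -histo` output and return the n classes with the
    most instances as (class_name, instance_count) pairs."""
    rows = []
    for line in histo_text.splitlines():
        parts = line.split()
        # Data rows look like:
        #    1:        176836       14146880  org.apache.impala.analysis.Analyzer
        if len(parts) == 4 and parts[0].rstrip(":").isdigit():
            _, instances, _, cls = parts
            rows.append((cls, int(instances)))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]
```

Run against a live histogram, this would surface the 176836 Analyzer instances immediately, without a 160GB JHAT heap.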
Re: Improving latency of catalog update propagation?
Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead of Thrift we'll alleviate some of the scalability issues, and maybe then we can look into increasing the frequency or doing a "push" to the statestore, etc. I probably won't work on this in the near term, to avoid complicating the ongoing changes with catalog.

-Todd

On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong wrote:
> This is somewhat relevant for admission control too - I had thought about
> some of these issues in that context, because reducing the latency of
> admission control state propagation helps avoid overadmission, but having
> a very high statestore update frequency is very inefficient and doesn't
> scale well to larger clusters.
>
> For the catalog updates I agree we could do something with long polls,
> since it's a single producer, so that the "idle" state of the system has a
> thread sitting in the update callback on catalogd waiting for an update.
>
> I'd also thought at one point about allowing subscribers to notify the
> statestore that they had something to add to the topic. That could be
> treated as a hint to the statestore to schedule the subscriber update
> sooner. This would also work for admission control, since coordinators
> could notify the statestore when the first query was admitted after the
> previous statestore update.
>
> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon wrote:
> > Hey folks,
> >
> > In my recent forays into the catalog->statestore->impalad metadata
> > propagation code base, I noticed that the latency of any update is
> > typically between 2-4 seconds with the standard 2-second statestore
> > polling interval. That's because the code currently works as follows:
> >
> > 1. In the steady state with no recent metadata changes, the catalogd's
> > state is:
> > -- topic_updates_ready_ = true
> > -- pending_topic_updates_ = empty
> >
> > 2. Some metadata change happens, which modifies the version numbers in
> > the Java catalog but doesn't modify any of the C++-side state.
> >
> > 3. The next statestore poll happens due to the normal interval expiring.
> > On average, this will take *1/2 the polling interval*.
> > -- This sees that pending_topic_updates_ is empty, so returns no results.
> > -- It sets topic_updates_ready_ = false and triggers the "gather" thread.
> >
> > 4. The "gather" thread wakes up and gathers updates, filling in
> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
> > (typically subsecond in smallish catalogs, so this happens before the
> > next poll).
> >
> > 5. Wait *another full statestore polling interval* (2 seconds) after step
> > #3 above, at which point we deliver the metadata update to the statestore.
> >
> > 6. Wait on average *1/2 the polling interval* until any particular
> > impalad gets the update from #5.
> >
> > So, in the absolute best case, we wait one full polling interval (2
> > seconds), and in the worst case we wait two polling intervals (4 seconds).
> >
> > Has anyone looked into optimizing this at all? It seems like we could
> > have metadata changes trigger an immediate "collection" into the C++
> > side, and have the statestore update callback wait ("long poll" style)
> > for an update rather than skip if there is nothing available.
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
Re: Improving latency of catalog update propagation?
This is somewhat relevant for admission control too - I had thought about some of these issues in that context, because reducing the latency of admission control state propagation helps avoid overadmission, but having a very high statestore update frequency is very inefficient and doesn't scale well to larger clusters.

For the catalog updates I agree we could do something with long polls, since it's a single producer, so that the "idle" state of the system has a thread sitting in the update callback on catalogd waiting for an update.

I'd also thought at one point about allowing subscribers to notify the statestore that they had something to add to the topic. That could be treated as a hint to the statestore to schedule the subscriber update sooner. This would also work for admission control, since coordinators could notify the statestore when the first query was admitted after the previous statestore update.

On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon wrote:
> Hey folks,
>
> In my recent forays into the catalog->statestore->impalad metadata
> propagation code base, I noticed that the latency of any update is
> typically between 2-4 seconds with the standard 2-second statestore
> polling interval. That's because the code currently works as follows:
>
> 1. In the steady state with no recent metadata changes, the catalogd's
> state is:
> -- topic_updates_ready_ = true
> -- pending_topic_updates_ = empty
>
> 2. Some metadata change happens, which modifies the version numbers in
> the Java catalog but doesn't modify any of the C++-side state.
>
> 3. The next statestore poll happens due to the normal interval expiring.
> On average, this will take *1/2 the polling interval*.
> -- This sees that pending_topic_updates_ is empty, so returns no results.
> -- It sets topic_updates_ready_ = false and triggers the "gather" thread.
>
> 4. The "gather" thread wakes up and gathers updates, filling in
> 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
> (typically subsecond in smallish catalogs, so this happens before the
> next poll).
>
> 5. Wait *another full statestore polling interval* (2 seconds) after step
> #3 above, at which point we deliver the metadata update to the statestore.
>
> 6. Wait on average *1/2 the polling interval* until any particular
> impalad gets the update from #5.
>
> So, in the absolute best case, we wait one full polling interval (2
> seconds), and in the worst case we wait two polling intervals (4 seconds).
>
> Has anyone looked into optimizing this at all? It seems like we could
> have metadata changes trigger an immediate "collection" into the C++
> side, and have the statestore update callback wait ("long poll" style)
> for an update rather than skip if there is nothing available.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
Improving latency of catalog update propagation?
Hey folks,

In my recent forays into the catalog->statestore->impalad metadata propagation code base, I noticed that the latency of any update is typically between 2-4 seconds with the standard 2-second statestore polling interval. That's because the code currently works as follows:

1. In the steady state with no recent metadata changes, the catalogd's state is:
-- topic_updates_ready_ = true
-- pending_topic_updates_ = empty

2. Some metadata change happens, which modifies the version numbers in the Java catalog but doesn't modify any of the C++-side state.

3. The next statestore poll happens due to the normal interval expiring. On average, this will take *1/2 the polling interval*.
-- This sees that pending_topic_updates_ is empty, so returns no results.
-- It sets topic_updates_ready_ = false and triggers the "gather" thread.

4. The "gather" thread wakes up and gathers updates, filling in 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true (typically subsecond in smallish catalogs, so this happens before the next poll).

5. Wait *another full statestore polling interval* (2 seconds) after step #3 above, at which point we deliver the metadata update to the statestore.

6. Wait on average *1/2 the polling interval* until any particular impalad gets the update from #5.

So, in the absolute best case, we wait one full polling interval (2 seconds), and in the worst case we wait two polling intervals (4 seconds).

Has anyone looked into optimizing this at all? It seems like we could have metadata changes trigger an immediate "collection" into the C++ side, and have the statestore update callback wait ("long poll" style) for an update rather than skip if there is nothing available.

-Todd
--
Todd Lipcon
Software Engineer, Cloudera
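The "long poll" idea in the last paragraph can be modeled with a condition variable: instead of returning an empty result when nothing is pending, the statestore callback blocks (with a timeout) until a metadata change signals that an update has been gathered. This is only a toy Python sketch of the scheme, not Impala's actual C++ implementation; the member names mirror the ones in the email:

```python
import threading

class TopicState:
    """Toy model of catalogd's pending-update state with long-poll reads."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []  # models pending_topic_updates_

    def publish(self, update):
        """Called on a metadata change: gather immediately and wake waiters."""
        with self._cond:
            self._pending.append(update)
            self._cond.notify_all()

    def poll(self, timeout):
        """Statestore callback: wait up to `timeout` seconds for an update
        instead of returning empty right away."""
        with self._cond:
            self._cond.wait_for(lambda: self._pending, timeout=timeout)
            updates, self._pending = self._pending, []
            return updates
```

With this shape, a subscriber sitting in `poll()` sees a change as soon as `publish()` runs, rather than after up to two polling intervals.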
Re: Range partition on HDFS
Were you thinking of something like this?
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_partitioning.html

On Tue, Aug 21, 2018 at 7:37 AM Yuming Wang wrote:
> Hi,
>
> Only Kudu supports range partitioning; can HDFS tables support this feature too?
>
> https://kudu.apache.org/docs/kudu_impala_integration.html#basic_partitioning
>
> Thanks.
Re: Re: New Impala committer - Quanlong Huang
Congrats Quanlong!

On Tue, Aug 21, 2018 at 9:34 AM Gabor Kaszab wrote:
> Congrats!
>
> On Sat, Aug 18, 2018 at 3:11 AM Quanlong Huang wrote:
>
> > Thanks! Glad to work with you all! --Quanlong
> >
> > At 2018-08-18 03:09:38, "Yongjun Zhang" wrote:
> > >Congratulations Quanlong!
> > >
> > >--Yongjun
> > >
> > >On Fri, Aug 17, 2018 at 12:07 PM, Jeszy wrote:
> > >
> > >> Congrats Quanlong!
> > >>
> > >> On 17 August 2018 at 19:51, Csaba Ringhofer wrote:
> > >> > Congrats!
> > >> >
> > >> > On Fri, Aug 17, 2018 at 6:32 PM, Philip Zeyliger <phi...@cloudera.com>
> > >> > wrote:
> > >> >
> > >> >> Congrats!
> > >> >>
> > >> >> On Fri, Aug 17, 2018 at 9:29 AM Tim Armstrong <tarmstr...@cloudera.com>
> > >> >> wrote:
> > >> >>
> > >> >> > The Project Management Committee (PMC) for Apache Impala has invited
> > >> >> > Quanlong Huang to become a committer and we are pleased to announce
> > >> >> > that they have accepted. Congratulations and welcome, Quanlong Huang!
Range partition on HDFS
Hi,

Only Kudu supports range partitioning; can HDFS tables support this feature too?

https://kudu.apache.org/docs/kudu_impala_integration.html#basic_partitioning

Thanks.
Re: Improving Kudu Build Support
+1 for simplifying Kudu updates.

I am also still on Ubuntu 14.04, but I am all for simplifying Kudu integration: I agree with Thomas that Kudu snapshots should be grouped with the other CDH components. Given that Ubuntu 14.04 will be EOL'd next spring, upgrading the development OS is a reasonably small price to pay -- especially as it will soon become necessary anyway.

Thanks for doing this, Thomas!

- Laszlo

On Tue, Aug 21, 2018 at 12:34 AM Lars Volker wrote:
> I'm in favor of not spending developer time and effort to maintain
> compatibility with 14.04. Personally I'm still developing on Ubuntu 14.04,
> so I'd be happy if we can support it without much pain. On the other hand,
> it EOLs in April 2019, so I might as well go to 18.04 now, should we decide
> to drop support. Maybe not many other folks are on 14.04 after all?
>
> On Mon, Aug 20, 2018 at 10:06 AM Thomas Tauber-Marshall <
> tmarsh...@cloudera.com> wrote:
>
> > Impala community,
> >
> > For years now, Impala has utilized tarballs built by Cloudera and
> > uploaded to S3 for running most of the Hadoop components in the testing
> > minicluster. The one exception to this is Kudu, which is instead provided
> > by the toolchain.
> >
> > This was never ideal - native-toolchain makes more sense for libraries
> > where we want to build against a fairly static version, but Kudu is under
> > active development and we'd like to always build against a relatively
> > up-to-date version. As a result, patches just bumping the version of Kudu
> > make up a significant portion of the commit history of native-toolchain.
> >
> > Thanks to work I'm currently doing at Cloudera, there will soon be
> > snapshot tarballs of Kudu getting uploaded to S3 along with the other
> > Hadoop components. I would like to propose that Impala switch to using
> > those instead of the toolchain Kudu.
> >
> > One problem here is that the new Kudu tarballs will not be getting built
> > for Ubuntu 14.04, only 16.04, but we still officially say we support
> > development on 14.04.
> >
> > One option here would be to maintain the toolchain Kudu for now and hide
> > downloading of the new tarballs behind a flag. We could also postpone
> > some of this work until 14.04 is less common. Or, given that the
> > bootstrap_development script already only supports 16.04, we might want
> > to just drop support for building on 14.04.
> >
> > Thoughts?
Re: Re: New Impala committer - Quanlong Huang
Congrats!

On Sat, Aug 18, 2018 at 3:11 AM Quanlong Huang wrote:
> Thanks! Glad to work with you all! --Quanlong
>
> At 2018-08-18 03:09:38, "Yongjun Zhang" wrote:
> >Congratulations Quanlong!
> >
> >--Yongjun
> >
> >On Fri, Aug 17, 2018 at 12:07 PM, Jeszy wrote:
> >
> >> Congrats Quanlong!
> >>
> >> On 17 August 2018 at 19:51, Csaba Ringhofer wrote:
> >> > Congrats!
> >> >
> >> > On Fri, Aug 17, 2018 at 6:32 PM, Philip Zeyliger <phi...@cloudera.com>
> >> > wrote:
> >> >
> >> >> Congrats!
> >> >>
> >> >> On Fri, Aug 17, 2018 at 9:29 AM Tim Armstrong <tarmstr...@cloudera.com>
> >> >> wrote:
> >> >>
> >> >> > The Project Management Committee (PMC) for Apache Impala has invited
> >> >> > Quanlong Huang to become a committer and we are pleased to announce
> >> >> > that they have accepted. Congratulations and welcome, Quanlong Huang!