On Thu, Feb 14, 2019 at 4:58 PM Michael Lewis <mle...@entrata.com> wrote:
> This is beyond my expertise except to say that if your storage is SSDs in AWS, then you definitely want random_page_cost close to the same as seq_page_cost (1 by default), assuming your data is likely to be in cache, as discussed in the documentation <https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS>. As it says, "Reducing this value relative to seq_page_cost will cause the system to prefer index scans", as you saw. Changing the value on production would again depend on the storage type used and how good the cache hit rate is.
>
> As far as I know, dropping old partitions should not be significantly impactful to the system other than no longer needing to store that data (cost, time for full backups, etc.).
>
> Again, as I understand things, there is not a big impact from having old unused tables in terms of maintenance. They should be ignored by normal processes.
>
> Glad you got your issue resolved.
>
> Michael Lewis

Thanks for the feedback. You have been giving your thoughts and suggestions since the beginning of this case, and it has been helpful. Based on your suggestion to increase the default statistics target from 100, I realized later that it was not set correctly initially: I had it set only at the session level, without setting it on the partition tables. As the next step I have set the statistics target to 1000 on all of the partition tables we use, from Nov 2018 to Feb 2019, and then ran ANALYZE (roughly as sketched below). I am currently running a load test to see how the database behaves now, and so far it is performing better than before. (I have reset all of the previous changes except the statistics change.) I will keep you posted after the test finishes.

Questions:
1) Can I further increase the setting to 3000 and see how the system behaves? How do I determine the best value for my database in terms of the sampling limit that the statistics target controls?
2) Apart from analyzing the tables, do I need to make any other changes related to the statistics setting?
3) Also, work_mem is currently set to 4 MB and we have not touched this value so far. For future needs, can I increase the work_mem setting?
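For reference, the statistics change on each partition table was along these lines; the column name below is only an example, since the actual columns differ per table:

-- raise the per-column statistics target on one partition (asset_id is an example column name)
ALTER TABLE asset_info_2019_2_part1 ALTER COLUMN asset_id SET STATISTICS 1000;
-- refresh the planner statistics for that partition
ANALYZE asset_info_2019_2_part1;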
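On question 3, if we do raise work_mem I assume we would start with a per-session change for the heavier queries rather than changing it instance-wide, something like the following (the value is only an illustration):

-- session-level only; the instance-wide default would stay at 4 MB
SET work_mem = '16MB';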
Appreciate your reply. Thanks.

> On Thu, Feb 14, 2019 at 3:11 PM github kran <githubk...@gmail.com> wrote:
>
>> On Thu, Feb 14, 2019 at 12:43 PM Michael Lewis <mle...@entrata.com> wrote:
>>
>>> How many total rows are in these tables? I am assuming these are partitions and those 100 million rows got distributed. If the data difference is significant, then you aren't guaranteed similar performance. You may want to follow more of the suggested steps on
>>> https://wiki.postgresql.org/wiki/SlowQueryQuestions
>>>
>>> Michael Lewis | Software Engineer
>>> Entrata
>>> c: 619.370.8697
>>
>> Michael - Yes, correct, the data of 100 million rows is distributed across all the partitions.
>> Finally I feel we have come to a conclusion: after we changed random_page_cost from 4 to 2 on the Non prod instance, we see improvements in the query, which used to take around 1 minute and now takes 1-2 seconds. That's correct, we have around 490 million rows in a few of our partition tables. The partition tables are created for every 7 days.
>>
>> We ran our API test, which hits the backend Aurora RDS PostgreSQL database, and our query response times and requests/sec are better than before. Do you recommend this setting on a Production instance? Right now we are planning to go ahead and implement this option of random_page_cost = 2. (We know it also works if we turn off hash_join, but we do not want to implement that change; we would rather use random_page_cost = 2.)
>>
>> Questions:
>> 1) What is your recommendation on this? Can we apply this change on the Prod instance, which is performing better today, or only keep it on the Non prod instance? (Actually we want to implement this change on the Non Prod instance.)
>> 2) Are there any downsides to changing this value?
>> 3) Another question: we have 64 TB of storage and are running on an r4.16xlarge EC2 instance. A few of our partitions are bigger and a few are smaller. We have data from the last 9 months and are planning to keep it for close to 1 year, until May. Do you see any concerns with this? Eventually we are thinking of archiving this data in the next 2 months by dropping the older partitions.
>> 4) What could be the problems of keeping the data longer, given regular maintenance such as VACUUM and other maintenance activities?
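>> For clarity, the change we tested is just the following, and I assume we would apply it the same way wherever we roll it out (the database name is only a placeholder; on Aurora it could presumably also be set through the parameter group):
>>
>> -- per session, for testing:
>> SET random_page_cost = 2;
>> -- or persisted for a specific database (placeholder name):
>> ALTER DATABASE mydb SET random_page_cost = 2;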
>>> On Thu, Feb 14, 2019 at 8:48 AM github kran <githubk...@gmail.com> wrote:
>>>
>>>> On Wed, Feb 13, 2019 at 11:38 AM Michael Lewis <mle...@entrata.com> wrote:
>>>>
>>>>> I didn't see your email yesterday, sorry about that. Index scans instead of sequential scans and nested loops instead of hash joins mean that you have bad row count estimates on the "Non prod Aurora RDS instance", as far as I can figure. Have you run commands like
>>>>>
>>>>> analyze asset_info_2019_2_part4;
>>>>> analyze asset_info_2019_2_part2;
>>>>>
>>>>> etc.? If the data are very similar, all indexes exist, and default_statistics_target is the same, then you should be getting the same plans.
>>>>>
>>>>> Michael Lewis | Software Engineer
>>>>> Entrata
>>>>> c: 619.370.8697
>>>>
>>>> 1) Yes, I did the VACUUM for all the tables such as asset_info_2019_2_part1, part2, part4 and also for location_info_xxx to remove any dead tuples, and also rebuilt the indexes.
>>>> 2) REINDEX TABLE location_data_2018_12_part4; (likewise ran this for all the tables, along with VACUUM for all the tables).
>>>> 3) The data in the Non prod instance is larger. One thing to mention here: when we built the Non prod instance we copied a SNAPSHOT from the Prod instance, inserted about 100 million rows on top of that, and then did VACUUM and re-indexed the tables.
>>>>
>>>> I can't think of anything else we can do here, but let us know if you need any more details on this problem. I am happy to share more details.
>>>>
>>>>> On Wed, Feb 13, 2019 at 8:49 AM github kran <githubk...@gmail.com> wrote:
>>>>>
>>>>>> On Tue, Feb 12, 2019 at 12:55 PM github kran <githubk...@gmail.com> wrote:
>>>>>>
>>>>>>> On Tue, Feb 12, 2019 at 12:33 PM Michael Lewis <mle...@entrata.com> wrote:
>>>>>>>
>>>>>>>> Did you update the stats by running ANALYZE on the tables involved, or perhaps on the entire database, on the 'Non prod Aurora RDS instance'? Can you share the two execution plans?
>>>>>>>>
>>>>>>>> Michael Lewis | Software Engineer
>>>>>>>> Entrata
>>>>>>>> c: 619.370.8697
>>>>>>>
>>>>>>> Here is the plan for both of the DB instances.
>>>>>>
>>>>>> Michael - Did you get any chance to look at this issue? Also, we see a sequential scan being done instead of an index scan.
>>>>>>
>>>>>>>> On Tue, Feb 12, 2019 at 11:27 AM github kran <githubk...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Feb 12, 2019 at 7:07 AM github kran <githubk...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Feb 11, 2019 at 6:00 PM github kran <githubk...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 11, 2019 at 3:29 PM Michael Lewis <mle...@entrata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Are the default statistics targets the same on both prod and AWS? Have you analyzed all tables being used in this query to ensure the stats are up to date? If the optimizer is choosing a different plan, then the stats must be different IMO.
>>>>>>>>>>>>
>>>>>>>>>>>> Michael Lewis | Software Engineer
>>>>>>>>>>>> Entrata
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply. I have verified a few of the tables and their default statistics targets and they seem to be the same, but is there anything in particular you want me to look at to differentiate the Prod and Non prod databases? (Also, the DB instance size is the same, but there is a little more data in the Non prod Aurora RDS instance compared to the Prod instance.)
>>>>>>>>>>>
>>>>>>>>>>> Query used:
>>>>>>>>>>> select * from pg_stats where tablename = 'tableName'
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 11, 2019 at 2:15 PM github kran <githubk...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Postgres Team,
>>>>>>>>>>>>> We are using the AWS Aurora RDS PostgreSQL 9.6.8 engine. We recently deleted a few million rows from the database and ran into an issue in one of our dev accounts where the DB was not normal after this deletion. We re-indexed and vacuumed the entire database, but we couldn't bring it back to the same state as earlier. So as the next step we deleted the database and recreated it by copying the snapshot from a production instance, and then did vacuuming and re-indexing on the database.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After this the dev database seems to be in a better state than earlier, but we are seeing a few of our DB calls take more than 1 minute when fetching data, and we observed this is because the query plan was executing a hash join as part of the query, whereas a similar query on the prod instance does not do any hash join and returns faster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, we did not want to experiment by modifying the DB settings, such as turning enable_hash_join off or setting random_page_cost to 1, as we don't have these settings on the Prod instance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note:
>>>>>>>>>>>>> The partition table sizes we have here are between 40 GB and 75 GB, which is our normal size range; we have a new partition table for every 7 days.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Appreciate your ideas on what we could be missing and what we can correct here to reduce the query latency.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> githubKran