RE: Could use some thoughts on a problem with a query in production

2017-06-09 Thread Bergquist, Brett
I agree, Bryan.   I am trying both routes: putting some more detailed monitoring in 
place at the customer site, and working on simulating the same access pattern in the 
hope of getting something reproducible.

Some more insight.   This table has a high insert rate and probably around 60M 
records in it, indexed by a composite key that includes a timestamp, so the table 
has an index in the correct order for the query.  The insertion rate is about 10M 
records/day.
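
For illustration only -- the real table and column names differ -- the shape is 
roughly something like this, with the composite key giving an index in time order 
per device:

    import java.sql.*;

    public class CreateSampleTable {
        public static void main(String[] args) throws SQLException {
            // Hypothetical schema for illustration only; the production names differ.
            try (Connection c = DriverManager.getConnection("jdbc:derby:sampledb;create=true");
                 Statement s = c.createStatement()) {
                s.executeUpdate(
                    "CREATE TABLE MEASUREMENT (" +
                    "  DEVICE_ID    BIGINT    NOT NULL," +
                    "  SAMPLE_TIME  TIMESTAMP NOT NULL," +
                    "  SAMPLE_VALUE DOUBLE," +
                    "  PRIMARY KEY (DEVICE_ID, SAMPLE_TIME))");
                // The index backing the composite key is already in (device, time)
                // order, which is the order the hourly queries want.
            }
        }
    }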

The problem query should be accessing about 160K records in its result set.

We have configured a large page cache (64000 pages).

In this scenario, the query happens on a Sunday morning around 1:45 AM and is 
querying for Saturday’s data from 1:00 AM to 2:00 AM.   In the meantime, several 
other accesses have queried this same table throughout Saturday and Sunday for more 
recent time intervals.   For example, one access pattern looks at data every 5 
minutes for the previous 15-minute interval.   So it seems to me that the pages 
touched by those queries would be in the cache.

Now along comes the problem query, which is much larger (one hour’s worth of data) 
and much older (yesterday’s data), so its data is probably not going to be in the 
cache, and it is going to try to populate the cache with the pages it reads, 
triggering the “clear”.
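
To make the contrast concrete, here is a rough sketch (hypothetical names and 
connection URL, not the real query) of the two access patterns: the recurring 
15-minute window that should be cache-resident, and the hour-long, day-old window 
that almost certainly is not:

    import java.sql.*;

    public class QueryPatterns {
        public static void main(String[] args) throws SQLException {
            // Hypothetical table/column names and URL; the real query differs.
            String sql = "SELECT DEVICE_ID, SAMPLE_TIME, SAMPLE_VALUE FROM MEASUREMENT "
                       + "WHERE DEVICE_ID = ? AND SAMPLE_TIME >= ? AND SAMPLE_TIME < ?";
            long now = System.currentTimeMillis();
            try (Connection c = DriverManager.getConnection("jdbc:derby://localhost:1527/proddb");
                 PreparedStatement ps = c.prepareStatement(sql)) {
                // Recurring access: every 5 minutes, the previous 15-minute window.
                // These pages were just written/read, so they should still be cached.
                run(ps, 42L, new Timestamp(now - 15 * 60_000L), new Timestamp(now));
                // Problem access: a full hour of day-old data; those pages are almost
                // certainly not cached, so the query pulls a large number of pages
                // through the page cache.
                long dayAgo = now - 24L * 3_600_000L;
                run(ps, 42L, new Timestamp(dayAgo), new Timestamp(dayAgo + 3_600_000L));
            }
        }

        static void run(PreparedStatement ps, long device, Timestamp from, Timestamp to)
                throws SQLException {
            ps.setLong(1, device);
            ps.setTimestamp(2, from);
            ps.setTimestamp(3, to);
            try (ResultSet rs = ps.executeQuery()) {
                int rows = 0;
                while (rs.next()) rows++;
                System.out.println(rows + " rows for [" + from + ", " + to + ")");
            }
        }
    }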

The CPU utilization is a key clue, I believe, as it implies some sort of “busy” 
code.   It almost looks like a busy wait being performed inside a loop.

Another key is that the other queries hitting the same table but different 
(recent) timespans do not hit this problem, even while the problem queries are 
tied up in knots.   This leads me to believe it is not a statistics problem 
with the table.  Also, the stack traces show no “bulk table scan” activity but 
rather “index read” activity.

Another clue is that a copy of this database taken while the condition was 
occurring did not show the issue.   The copy is obtained by “freezing” the 
database, performing a ZFS filesystem snapshot of the database files, and then 
“unfreezing” the database.   Firing up this copy means the database is “cold” 
and will not have cached anything.   The problem query works fine on this copy 
of the database.
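
For reference, the freeze/snapshot/unfreeze sequence looks roughly like the sketch 
below.  The connection URL and ZFS dataset name are placeholders, and it assumes 
the standard Derby freeze procedures (SYSCS_UTIL.SYSCS_FREEZE_DATABASE and 
SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE):

    import java.sql.*;

    public class SnapshotCopy {
        public static void main(String[] args) throws Exception {
            // Connection URL and ZFS dataset name are placeholders.
            try (Connection c = DriverManager.getConnection("jdbc:derby://localhost:1527/proddb");
                 CallableStatement freeze =
                         c.prepareCall("CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE()");
                 CallableStatement unfreeze =
                         c.prepareCall("CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE()")) {
                freeze.execute();                      // stop writes to the database files
                try {
                    // Filesystem-level snapshot while the files are quiescent.
                    new ProcessBuilder("zfs", "snapshot", "tank/derby@copy")
                            .inheritIO().start().waitFor();
                } finally {
                    unfreeze.execute();                // always resume, even if the snapshot fails
                }
            }
        }
    }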

The problem query is relatively new to the production system, and it is the only 
one querying for relatively “old” data.  Again, this leads me down the path of 
something to do with the cache.

I am just throwing these thoughts out with my detective hat on to see if they 
trigger any ideas.

So I have two plans of attack.

The first is to prepare a monitor script that can be executed around the time 
this issue happens on the customer’s system, in the hope that it does happen 
again and the script can capture some more detailed information.
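
A rough sketch of what that monitor might do, beyond the stack traces we already 
capture: poll the Derby diagnostic tables and append them to a log (the URL and 
file name are placeholders):

    import java.io.*;
    import java.sql.*;

    public class DerbyMonitor {
        public static void main(String[] args) throws Exception {
            // URL and output file are placeholders; interval/duration are examples.
            try (Connection c = DriverManager.getConnection("jdbc:derby://localhost:1527/proddb");
                 PrintWriter out = new PrintWriter(new FileWriter("derby-monitor.log", true))) {
                for (int i = 0; i < 12; i++) {                 // one minute at 5-second intervals
                    out.println("=== " + new java.util.Date());
                    dump(c, "SELECT * FROM SYSCS_DIAG.TRANSACTION_TABLE", out);
                    dump(c, "SELECT * FROM SYSCS_DIAG.LOCK_TABLE", out);
                    out.flush();
                    Thread.sleep(5_000);
                }
            }
        }

        // Dump any result set generically, column=value per row.
        static void dump(Connection c, String sql, PrintWriter out) throws SQLException {
            try (Statement s = c.createStatement(); ResultSet rs = s.executeQuery(sql)) {
                ResultSetMetaData md = rs.getMetaData();
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        line.append(md.getColumnName(i)).append('=')
                            .append(rs.getString(i)).append(' ');
                    }
                    out.println(line);
                }
            }
        }
    }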

The second is to write an application that can replay a couple of days’ worth of 
queries without any “think” time.  This should hit the database, from a query 
point of view, in the same manner as the customer’s system does.  Hopefully this 
will make the problem reproducible.
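
A minimal sketch of such a replay harness, assuming the captured statements end up 
one per line in a plain text file (the real capture format and connection URL will 
differ):

    import java.io.BufferedReader;
    import java.nio.file.*;
    import java.sql.*;

    public class QueryReplay {
        public static void main(String[] args) throws Exception {
            // Assumes one captured SQL statement per line in queries.log;
            // the real capture format and connection URL will differ.
            try (Connection c = DriverManager.getConnection("jdbc:derby://localhost:1527/testdb");
                 BufferedReader in = Files.newBufferedReader(Paths.get("queries.log"))) {
                String sql;
                while ((sql = in.readLine()) != null) {
                    long start = System.nanoTime();
                    int rows = 0;
                    try (Statement s = c.createStatement(); ResultSet rs = s.executeQuery(sql)) {
                        while (rs.next()) rows++;
                    }
                    long ms = (System.nanoTime() - start) / 1_000_000;
                    System.out.println(ms + " ms, " + rows + " rows: " + sql);
                    // No "think" time between statements: replay back-to-back.
                }
            }
        }
    }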

From: Bryan Pendleton [mailto:bpendleton.de...@gmail.com]
Sent: Wednesday, June 07, 2017 9:47 AM
To: derby-dev@db.apache.org
Subject: Re: Could use some thoughts on a problem with a query in production

Hi Brett,

Cache cleaning in general is a very I/O intensive activity, so I agree that it 
is odd that your system appeared to be CPU busy during that time.

It's interesting that, whatever it was, it persisted for such a long time.

In the past, I have developed small operator scripts which collect system 
information, which can be run by a simple monitoring tool and can keep a 
history about the observations they make.

If you were to build and install such a monitoring system, you might be able to 
develop more clues about the "larger picture" of activity on your machine 
during that time period.

Sorry to be so vague and non-specific, but I find that retrospective analysis 
of problems like this is VERY challenging.

Often, the best you can do is:
1) Try to figure out some way to make the problem happen on demand, so you can 
cause it and observe it.
2) Instrument everything, and ensure you are preserving a history of your 
instrumentation recordings, so that you can have a mountain of detail when 
those rare events occur.

It strikes me as being like high-energy physics, where your experiment 
generates volumes of data, and it takes weeks or months of analysis afterward 
to figure out what actually occurred.

Not that I'm much good at high-energy physics, either, I'm afraid. :)

bryan




Re: Could use some thoughts on a problem with a query in production

2017-06-07 Thread Bryan Pendleton
Hi Brett,

Cache cleaning in general is a very I/O intensive activity, so I agree that
it is odd that your system appeared to be CPU busy during that time.

It's interesting that, whatever it was, it persisted for such a long time.

In the past, I have developed small operator scripts which collect system
information, which can be run by a simple monitoring tool and can keep a
history about the observations they make.

If you were to build and install such a monitoring system, you might be
able to develop more clues about the "larger picture" of activity on your
machine during that time period.

Sorry to be so vague and non-specific, but I find that retrospective
analysis of problems like this is VERY challenging.

Often, the best you can do is:
1) Try to figure out some way to make the problem happen on demand, so you
can cause it and observe it.
2) Instrument everything, and ensure you are preserving a history of your
instrumentation recordings, so that you can have a mountain of detail when
those rare events occur.

It strikes me as being like high-energy physics, where your experiment
generates volumes of data, and it takes weeks or months of analysis
afterward to figure out what actually occurred.

Not that I'm much good at high-energy physics, either, I'm afraid. :)

bryan


Re: Could use some thoughts on a problem with a query in production

2017-06-06 Thread Bergquist, Brett
Thanks Bryan.  I am looking at all things :)   The fact that it is not 
reproducible with a copy of the database is a hindrance, but it can also indicate 
something external.

The interesting thing is that the same query with the same parameters was run 
40 times over about an 8-hour period, about 30 minutes apart.  Each run hit 
exactly the same problem.   At the same time, the same query requesting different 
time periods worked perfectly.   Insertion rates into the database were also not 
affected.

Does anyone know anything about the “cache cleaning”?   I have started to peek 
and poke at what it does, but I am wondering if there is something in there that 
might be triggering the issue.   The reason it is piquing my interest is that the 
code where the BLOCKED threads are waiting contains checks for the “cleaning”.   
It is almost as if something gets into lock step, where the code cannot progress 
because something else is holding a monitor, and the monitor is being released 
and acquired almost once per result.   Also, the fact that Derby’s CPU 
utilization went from about 5% to about 45% during this time indicates that 
something is being worked on, or a busy wait is being performed.

On Jun 6, 2017, at 9:28 AM, Bryan Pendleton wrote:

Hi Brett,

What's the underlying storage system like? In particular, is it an SSD
storage system? Is there any chance that you happened to hit a spot
in time where there was some non-Derby event which was causing the
storage system to misbehave?

I understand that certain SSD implementations, and certain file
systems, have to perform automatic maintenance activity on the
SSD media, such as "wear leveling", which can impose significant
overhead on the file system and cause normal use of the file system
to be VERY slow for a significant period.

Here's a relatively old article on the subject; you might try some web searching
to see if you can get more recent information.

http://www.computerworld.com/article/2524589/data-center/analysis--ssd-performanceis-a-slowdown-inevitable-.html

Anyway, I guess what I'm suggesting is that you consider investigating
whether a non-Derby external factor is at work on the machine which
is suffering the performance problems.

thanks,

bryan






Re: Could use some thoughts on a problem with a query in production

2017-06-06 Thread Bryan Pendleton
Hi Brett,

What's the underlying storage system like? In particular, is it an SSD
storage system? Is there any chance that you happened to hit a spot
in time where there was some non-Derby event which was causing the
storage system to misbehave?

I understand that certain SSD implementations, and certain file
systems, have to perform automatic maintenance activity on the
SSD media, such as "wear leveling", which can impose significant
overhead on the file system and cause normal use of the file system
to be VERY slow for a significant period.

Here's a relatively old article on the subject; you might try some web
searching
to see if you can get more recent information.

http://www.computerworld.com/article/2524589/data-center/analysis--ssd-performanceis-a-slowdown-inevitable-.html

Anyway, I guess what I'm suggesting is that you consider investigating
whether a non-Derby external factor is at work on the machine which
is suffering the performance problems.

thanks,

bryan


RE: Could use some thoughts on a problem with a query in production

2017-06-06 Thread Bergquist, Brett
Thanks Katherine, I will look into that.

It is interesting that this one query had the issue repeatedly over about a 12-hour 
period, whereas other clients performing the same query with different time periods 
during that time did not experience any issue.   One would think that if the 
statistics were bad or being updated, they would have had a similar issue as well.
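
One thing I can do is force a statistics refresh myself, and turn on logging for 
the automatic index statistics daemon to see whether its runs line up with the 
problem window.  A sketch, with placeholder connection URL and schema/table names:

    import java.sql.*;

    public class StatisticsCheck {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL and schema/table names.
            // Separately, derby.storage.indexStats.log=true in derby.properties makes the
            // automatic index statistics daemon log its runs to derby.log, which would
            // show whether a background statistics update coincides with the problem window.
            try (Connection c = DriverManager.getConnection("jdbc:derby://localhost:1527/proddb");
                 CallableStatement cs = c.prepareCall(
                         "CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS(?, ?, ?)")) {
                cs.setString(1, "APP");            // schema
                cs.setString(2, "MEASUREMENT");    // hypothetical table name
                cs.setNull(3, Types.VARCHAR);      // null = refresh statistics for all of its indexes
                cs.execute();
            }
        }
    }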

The stack traces seem to show that the indexes are being used; I don’t see any 
table scans being performed, which would usually be the indicator that the indexes 
are not being used.
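
To double-check the plan, I can also turn on runtime statistics for a connection 
and look at the plan text directly.  A sketch, with a placeholder URL and query:

    import java.sql.*;

    public class PlanCheck {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL and query; the point is to capture the plan text and
            // confirm index scan vs. table scan for the problem parameters.
            try (Connection c = DriverManager.getConnection("jdbc:derby://localhost:1527/proddb");
                 Statement s = c.createStatement()) {
                s.execute("CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1)");
                s.execute("CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1)");
                try (ResultSet rs = s.executeQuery(
                        "SELECT DEVICE_ID, SAMPLE_TIME, SAMPLE_VALUE FROM MEASUREMENT "
                      + "WHERE DEVICE_ID = 42 "
                      + "AND SAMPLE_TIME >= TIMESTAMP('2017-06-03 01:00:00') "
                      + "AND SAMPLE_TIME <  TIMESTAMP('2017-06-03 02:00:00')")) {
                    while (rs.next()) { /* drain the result set */ }
                }
                // Plan and timing for the statement just executed on this connection:
                try (ResultSet rs = s.executeQuery(
                        "VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS()")) {
                    while (rs.next()) System.out.println(rs.getString(1));
                }
            }
        }
    }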

From: Katherine Marsden [mailto:kmarsdende...@sbcglobal.net]
Sent: Monday, June 05, 2017 3:39 PM
To: derby-dev@db.apache.org
Subject: Re: Could use some thoughts on a problem with a query in production

Hi Brett,

I just happened to see your post on my phone, and what came to mind was to wonder 
how the automatic statistics update is implemented and whether it might somehow be 
at play here.  I may be totally off base, but I remember that when this was 
implemented there was discussion of how and when to do it, and concern that at some 
point some query was going to get hit with the cost (take one for the team, so to 
speak).  It might be worthwhile looking at what triggers the statistics update and 
whether it might be at play here.

Thanks

Katherine

Sent from my iPhone


Re: Could use some thoughts on a problem with a query in production

2017-06-05 Thread Katherine Marsden
Hi Brett,

I just happened to see your post on my phone, and what came to mind was to wonder 
how the automatic statistics update is implemented and whether it might somehow be 
at play here.  I may be totally off base, but I remember that when this was 
implemented there was discussion of how and when to do it, and concern that at some 
point some query was going to get hit with the cost (take one for the team, so to 
speak).  It might be worthwhile looking at what triggers the statistics update and 
whether it might be at play here.

Thanks

Katherine

Sent from my iPhone


Could use some thoughts on a problem with a query in production

2017-06-05 Thread Bergquist, Brett
Background:


* A large database, approximately 750G
* derby.storage.pageCacheSize=64000 (see the sketch below)
* Inserts going into the database at about 125/second
* Other database updates and deletes are being performed at a lower rate.
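
The page cache setting above is a Derby system property that has to be in place 
before the engine boots.  A minimal sketch follows (embedded boot and database name 
shown here for simplicity; in production the property lives in derby.properties 
under derby.system.home for the network server):

    import java.sql.*;

    public class BootWithLargeCache {
        public static void main(String[] args) throws SQLException {
            // The property must be visible before the Derby engine boots. In production
            // it would normally be set in derby.properties under derby.system.home
            // rather than programmatically; the database name here is a placeholder.
            System.setProperty("derby.storage.pageCacheSize", "64000");
            try (Connection c = DriverManager.getConnection("jdbc:derby:proddb")) {
                System.out.println("booted with a 64000-page cache");
            }
        }
    }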

A query run by the customer does not complete in a reasonable amount of time.   
Unfortunately, the customer's program retries this query up to 40 times, 
compounding the problem.

The query retrieves a set of records for a period of time.  It is constructed 
using the table's indexes such that the rows should come back in sorted order, 
with no external sort required.   Basically, the query fires off periodically and 
requests the records for a specific hour of the day; the next time it fires, it 
retrieves the next hour, and so on.
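
A sketch of what such a query looks like (hypothetical names, not the production 
SQL): the ORDER BY matches the composite index columns that also satisfy the range 
predicate, so the rows can come back in index order for each hour-long window:

    import java.sql.*;

    public class HourlyRangeQuery {
        public static void main(String[] args) throws SQLException {
            // Hypothetical names and URL, not the production SQL. The ORDER BY columns
            // match the composite index that also satisfies the range predicate, so the
            // rows can be returned in index order for each hour-long window.
            String sql = "SELECT DEVICE_ID, SAMPLE_TIME, SAMPLE_VALUE FROM MEASUREMENT "
                       + "WHERE DEVICE_ID = ? AND SAMPLE_TIME >= ? AND SAMPLE_TIME < ? "
                       + "ORDER BY DEVICE_ID, SAMPLE_TIME";
            try (Connection c = DriverManager.getConnection("jdbc:derby://localhost:1527/proddb");
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setLong(1, 42L);
                ps.setTimestamp(2, Timestamp.valueOf("2017-06-03 01:00:00"));
                ps.setTimestamp(3, Timestamp.valueOf("2017-06-03 02:00:00"));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* consume one hour of rows */ }
                }
            }
        }
    }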

The query has run successfully with other parameters repeatedly throughout the 
week with no issues.  Only one set of query parameters caused the issue.   The 
same problem occurred exactly one week later, again for the same time period of 
the day.

I was able to retrieve the database from the customer, configure my local server 
to the customer's time and time zone, and run the query with no issue.   So it 
does not appear to be an optimizer problem with the wrong path being chosen.

There is a script that is run at the customer's site to retrieve some data that I 
use for debugging these types of problems, since access to the customer site is 
not possible for security reasons.   One of the things it captures is stack traces 
every 5 seconds for 1 minute.
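
The capture loop is roughly equivalent to the following sketch (the real script 
differs; the Derby server's process id is passed in):

    import java.io.File;

    public class StackTraceCapture {
        public static void main(String[] args) throws Exception {
            // The Derby server's process id is passed on the command line.
            String pid = args[0];
            File log = new File("jstack-" + pid + ".log");
            for (int i = 0; i < 12; i++) {                     // one minute at 5-second intervals
                new ProcessBuilder("jstack", pid)
                        .redirectErrorStream(true)
                        .redirectOutput(ProcessBuilder.Redirect.appendTo(log))
                        .start()
                        .waitFor();
                Thread.sleep(5_000);
            }
        }
    }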

The stack traces show no deadlocks and also seem to show that the query path is 
not doing a large external sort or anything like that.   Many of the threads are 
blocked, however, and they seem to be blocked at a low level.   For example, many 
threads look like this:

"DRDAConnThread_98" #123 prio=5 os_prio=64 tid=0x0001027b3800 nid=0x94 
waiting for monitor entry [0xfffd37afe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.derby.impl.store.raw.data.FileContainer.pageValid(Unknown Source)
- waiting to lock <0xfffd70d5d598> (a 
org.apache.derby.impl.store.raw.data.AllocationCache)
at 
org.apache.derby.impl.store.raw.data.FileContainer.getUserPage(Unknown Source)
at 
org.apache.derby.impl.store.raw.data.FileContainer.getPage(Unknown Source)
at 
org.apache.derby.impl.store.raw.data.BaseContainerHandle.getPage(Unknown Source)
at 
org.apache.derby.impl.store.access.conglomerate.OpenConglomerate.latchPage(Unknown
 Source)
at 
org.apache.derby.impl.store.access.conglomerate.GenericConglomerateController.fetch(Unknown
 Source)
at 
org.apache.derby.impl.sql.execute.IndexRowToBaseRowResultSet.getNextRowCore(Unknown
 Source)
at 
org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(Unknown
 Source)
at 
org.apache.derby.impl.sql.execute.NestedLoopJoinResultSet.getNextRowCore(Unknown
 Source)
at 
org.apache.derby.impl.sql.execute.NestedLoopJoinResultSet.getNextRowCore(Unknown
 Source)
at 
org.apache.derby.impl.sql.execute.NestedLoopJoinResultSet.getNextRowCore(Unknown
 Source)
at 
org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(Unknown
 Source)
at 
org.apache.derby.impl.sql.execute.SortResultSet.getRowFromResultSet(Unknown 
Source)
at 
org.apache.derby.impl.sql.execute.SortResultSet.getNextRowFromRS(Unknown Source)
at 
org.apache.derby.impl.sql.execute.SortResultSet.loadSorter(Unknown Source)
at 
org.apache.derby.impl.sql.execute.SortResultSet.openCore(Unknown Source)
at 
org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.open(Unknown Source)
at 
org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
at 
org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
at 
org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
- locked <0xfffd95a59900> (a 
org.apache.derby.impl.jdbc.EmbedConnection40)
at 
org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(Unknown Source)
at 
org.apache.derby.iapi.jdbc.BrokeredPreparedStatement.execute(Unknown Source)
at org.apache.derby.impl.drda.DRDAStatement.execute(Unknown 
Source)
at 
org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)

and they are waiting on a thread that looks like this:

"DRDAConnThread_61" #86 prio=5 os_prio=64 tid=0x000102d4c000 nid=0x6f 
waiting for