Thanks for the input!

Running the consistency check on the customer's system is not possible, as the
database needs to be running 24x7 and cannot be taken down. As far as I can
tell at this point, the database came back up OK after the restart and is
operating normally.

I am able to get a copy of the database via a file system backup that occurs
each night. Using ZFS allows us to do this by freezing the database (via
Derby's freeze call), taking a ZFS snapshot of the file system, unfreezing the
database (via Derby's unfreeze call), and then making the file system backup
from the ZFS snapshot. It takes me a couple of days to get all of the database
transferred, but then I can stage it locally and run a consistency check on
the local copy.
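
For anyone who wants to replicate the sequence, it amounts to something like
this (a minimal sketch, not our production code: the JDBC URL and the ZFS
dataset name are placeholders, and SYSCS_UTIL.SYSCS_FREEZE_DATABASE /
SYSCS_UNFREEZE_DATABASE are Derby's standard freeze/unfreeze procedures):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class FreezeAndSnapshot {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; the real database is reached through the network server.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:derby://localhost:1527/proddb")) {
            // Freeze: Derby flushes its files and holds writes so the snapshot
            // below sees a consistent on-disk image.
            try (CallableStatement cs =
                     conn.prepareCall("CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE()")) {
                cs.execute();
            }
            try {
                // Snapshot the ZFS dataset holding the database (name is made up).
                Process p = new ProcessBuilder("zfs", "snapshot", "tank/derby@nightly")
                        .inheritIO().start();
                if (p.waitFor() != 0) {
                    throw new RuntimeException("zfs snapshot failed");
                }
            } finally {
                // Always unfreeze, even if the snapshot fails; the db runs 24x7.
                try (CallableStatement cs =
                         conn.prepareCall("CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE()")) {
                    cs.execute();
                }
            }
        }
    }
}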

I will open a JIRA on the NullPointerExceptions that were reported after Derby
did its shutdown, as Bryan suggested.

For some background: the database is used in a telecommunications environment
as the persistent storage for the configuration of about 90K pieces of network
equipment, and it receives about 10M monitoring updates per day, 24x7. The
database has been around for about 8 years, continually growing, with Derby
upgraded along the way; it is currently at 10.10.2.0. We also do a poor man's
partitioning: we have 53 database tables, one for each week of the year, and
our 10M daily inserts are directed to the table for the current week. Queries
are built on those weekly tables as well, with a VIEW created as a UNION query
across all 53 tables for queries that span weeks. We needed to do this because
there was no practical way to delete older data while simultaneously inserting
at a rate of 10M/day without performance issues and locking contention,
without the deletions taking an unreasonable amount of time, and while still
recovering and reusing the freed database space. Now we simply truncate the
tables to be purged, which is nearly instantaneous. At some point I may
investigate, and contact the group here about, how one might implement a real
partitioning scheme that would be more efficient, especially for queries, and
contribute that capability back into Derby, so if anyone has any ideas on
this, I am all ears.
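
For concreteness, the scheme boils down to something like this (a rough sketch
for illustration only; the EVENTS_Wnn and EVENTS_ALL names are made up, and the
real routing rule is whatever week numbering the application settled on):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.LocalDate;
import java.time.temporal.WeekFields;

public class WeeklyTables {
    // Route an insert to the weekly table for the given date.
    static String tableFor(LocalDate d) {
        int week = d.get(WeekFields.ISO.weekOfWeekBasedYear()); // 1..53
        return String.format("EVENTS_W%02d", week);
    }

    // Build the cross-week view once: a UNION ALL over all 53 weekly tables,
    // used only for queries that span weeks.
    static String viewDdl() {
        StringBuilder sb =
            new StringBuilder("CREATE VIEW EVENTS_ALL AS SELECT * FROM EVENTS_W01");
        for (int w = 2; w <= 53; w++) {
            sb.append(String.format(" UNION ALL SELECT * FROM EVENTS_W%02d", w));
        }
        return sb.toString();
    }

    // Purging a week is a single TRUNCATE, which is nearly instantaneous and
    // immediately gives the space back for reuse.
    static void purgeWeek(Connection conn, int week) throws SQLException {
        try (Statement s = conn.createStatement()) {
            s.executeUpdate(String.format("TRUNCATE TABLE EVENTS_W%02d", week));
        }
    }
}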

Brett

-----Original Message-----
From: mike matrigali [mailto:[email protected]]
Sent: Friday, September 04, 2015 12:37 AM
To: [email protected]
Subject: Re: Derby received an error "ERROR XSDG0: Page 
Page(1325564,Container(0, 30832)) could not be read from disk."

I agree with all of Bryan's suggestions.  If you can't get access to the actual
db there is not much to be done.  My usual customer support answer in this
situation would be to tell you to shut down the db and run a consistency check
on it, which would read every page of the table and would eventually hit the
error you got if there is a persistent problem.
Given the size of the db, and that Derby has no optimizations for dbs of this
size, that is likely to take some time.
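
For reference, the check I am describing can be driven with Derby's
SYSCS_UTIL.SYSCS_CHECK_TABLE function over every user table, roughly like this
(a sketch; the JDBC URL is a placeholder, and CHECK_TABLE throws rather than
returning if it finds corruption):

import java.sql.*;

public class CheckAllTables {
    public static void main(String[] args) throws Exception {
        // Point this at the offline/staged copy of the database.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:/staging/proddb");
             Statement s = conn.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT schemaname, tablename, " +
                 "SYSCS_UTIL.SYSCS_CHECK_TABLE(schemaname, tablename) " +
                 "FROM sys.sysschemas s JOIN sys.systables t " +
                 "ON s.schemaid = t.schemaid WHERE t.tabletype = 'T'")) {
            while (rs.next()) {
                // 1 means the table (base pages and indexes) checked out.
                System.out.printf("%s.%s -> %d%n",
                        rs.getString(1), rs.getString(2), rs.getInt(3));
            }
        }
    }
}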

 From the stack I can tell you that the problem is in a base page, not an
index, which is much harder to fix if it is persistent.  In Derby dbs the
output Container(0, 30832) is saying the container is in segment 0 (the seg0
directory) and has container id 30832 (I am impressed by the number of
containers that db has gone through).  You will also see the system catalogs
talk about conglomerate numbers; in Derby there is currently always a 1-1
mapping of conglomerate number to container number.
Ancient history: in Cloudscape we thought we might need the abstraction, but it
was a pain to do the mapping at the lowest level, so when we redid the
architecture we took the opportunity to make it 1-1 for "now" while still
allowing a map if anyone wanted to add one in the future.
And here is a note from Bryan, from six years back, on how to go from that
number in the error to a file name and table name:
http://bryanpendleton.blogspot.com/2009/09/whats-in-those-files-in-my-derby-db.html
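
In code the mapping from the error to a file is just a hex conversion (a
sketch following the naming described in that post - "c" + the container id in
hex + ".dat" under seg0):

long containerId = 30832L;  // from Container(0, 30832)
String fileName = "c" + Long.toHexString(containerId) + ".dat";
System.out.println(fileName);  // -> c7870.dat in the db's seg0 directory
// The table name can then be looked up in the catalogs, e.g.:
//   SELECT conglomeratename, isindex FROM sys.sysconglomerates
//   WHERE conglomeratenumber = 30832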

A quick check, if you could get an ls -l of the seg0 directory, would be to
look at the size of the associated file and do the math Bryan mentioned to see
if the file actually contains a full page at that offset.
Including the page size, if you can figure it out, would also help, as Derby
page size vs file system page size can be an issue - but usually only on
machine crashes.
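
The math itself is just multiplication (a sketch - 4K is an assumption, since
Derby also supports 8K/16K/32K pages, and it treats pages as numbered from 0
here):

long pageNumber = 1325564L;  // from Page(1325564, Container(0, 30832))
long pageSize   = 4096L;     // assumption: default 4K pages; use the db's real size
long minBytes   = (pageNumber + 1) * pageSize;
System.out.println("file must be at least " + minBytes + " bytes");
// ~5.4 GB at 4K pages; if ls -l shows less, the page really is past end-of-file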

I would suggest filing a JIRA for this.  If it really is the case that you got
the I/O error for a non-persistent problem, it may be that Derby can be
improved to avoid it.  Before the code was changed to use FileChannel, Derby
often had retry loops on I/O errors - especially on reads of pages from disk.
In the long past this just avoided some intermittent I/O problems that were in
most cases network related (even though we likely did not support network
disks officially).  I am not sure if the old retry code is still around in the
trunk, as it was there for running on older JVMs.

I have also seen weird timing errors from multiple processes accessing the
same file (like backup/virus scanner/... vs the server), but mostly on Windows
rather than Unix-based OSes.

Getting a partial page read is a very weird error for Derby, as it goes out of
its way to write only full pages.
On 9/3/2015 5:39 PM, Bryan Pendleton wrote:
> On 9/3/2015 3:35 PM, Bergquist, Brett wrote:
>> Reached end of file while attempting to read a whole page
>
> You should probably take a close read through all the discussion on
> this slightly old Derby JIRA Issue:
>
> https://issues.apache.org/jira/browse/DERBY-5234
>
> There are some suggestions about how to diagnose the conglomerate in
> question in more detail, and also some observations about possible
> causes and possible courses of action you can take subsequently.
>
> thanks,
>
> bryan
>
>


--
email:    Mike Matrigali - [email protected]
linkedin: https://www.linkedin.com/in/MikeMatrigali


