Re: ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))

Stanley Bradbury Tue, 01 Apr 2008 11:01:06 -0700

David Sitsky wrote:

Hi Stanley,
The problem is happening on the Page 0 [the first page] ofconglomerate 1313 (the conglomerateId) - you can see what table/indexthis corresponds to with the following query:
select CONGLOMERATENUMBER, CONGLOMERATENAME
 from sys.sysconglomerates
 where conglomeratenumber =  1313;
From the errors in your log I suspect this to be the table or one ofthe indexes of the failing query listed:
   INSERT INTO text_table (guidhigh, guid, data)
I ran ij from the command-line, and connected to one of theproblematic databases after killing the load. Sure enough 1313 refersto the primary key of text_table (there are no other indexes definedon this table). This is what is output:
CONGLOMERATENUMBER  |CONGLOMERATENAME
--------------------------------------------------------------------------------
---------------------------------------------------------------------
1313                |SQL080331145004520

This is the SQL we used to create this table:

CREATE TABLE text_table (guidhigh BIGINT NOT NULL,
                         guid BIGINT NOT NULL,
                         data BLOB (1G) NOT NULL,
                         PRIMARY KEY (guidhigh, guid))
What is interesting is if I perform a simple query on text_table, Iget the following:
ij> select count(*) from text_table;
ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313)),expected=304,608,373, on-disk version=2,462,088,751, page dumpfollows: Hex dump:
00000000: 0076 0000 0001 0000 0000 0000 27ea 0000  .v..............
00000010: 0000 0006 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0001 0000 0000 0000 0000 0000  ................
....
Same as my application. I have also noticed the connection isautomatically closed, as is the case in my application if I try andperform another operation.
ij> select count(*) from text_table;
ERROR 08003: No current connection.

Reconnecting to the database has the same result.
So I guess it is not surprising now that conglomerate 1313 is alwaysreturned since all worker processes create the same database onstartup. However the corruption is always happening on page 0. If wewere experiencing true disk problems, I'd expect the page number to berandom across the machines.
Any ideas on what to try next?

Hi Dave -

The good news is that it looks like you do not have to recover thisdatabase from backups because:1) since you can boot the database using IJ the corruption is not inthe transaction log (I assume you did not directly modify or delete anyfiles in the database directory tree in order to bypass recovery - *yes,I have seen people do this*)2) since the corruption is in an index you should be able to drop andrecreate the index thus restoring the index without any data loss.

I also have an idea that might get around it. Since the problem is withan index and during a data load (I am assuming the load is a one timedatabase initialization streaming lots of data?) then I would trycreating the index AFTER loading the data rather than declaring theindex in the table Create statement. If the problem involves handling alarge number of database pages and the index pages at the same timeseparating the activities will avoid it. Doing this also insures thatindex statistics are created (for more info on this see:http://wiki.apache.org/db-derby/CheckingForIndexStatistics)

As for the cause of the problem I can't be sure but since this happensrepeatedly and with different physical files it sounds like the problemis not with the physical disk surface (I think the RAID architecturehelps protect against this type of failure anyway).I'd like to see this problem captured in an JIRA entry. You haveprovided a lot of good information here and the results of performing aload then creating the Primary Key will narrow it down more. Mygut-level feel is that reproducing this will be hard without anenvironment similar to yours. With the multiple processors and the RAIDarray I imagine it is a pretty fast machine and this may be exposing athread synchronization issue in Derby (or not). Having your informationand stack traces may give the stores folks enough information to performa code inspection for this or other issues. Would you file a JIRAdescribing your processing, the machine environment, the JVM and attachyour derby log files?

Re: ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))

Reply via email to