[
https://issues.apache.org/jira/browse/DERBY-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Hillegas updated DERBY-5234:
---------------------------------
Attachment: derby-5234-01-aa-emptyAllocPage.diff
Attaching derby-5234-01-aa-emptyAllocPage.diff. These small changes make the
repro run correctly. Regression tests pass cleanly on this patch.
I have stumbled across at least 3 separate problems in the compression code.
However, that may simply mean that I don't understand the code. The 3 problems
are:
1) A boundary checking error which causes an allocation extent to think that it
still has pages, even though those pages have been released to the operating
system. This is what causes the repro to fail.
2) A confusion about whether a variable represents a bit position or a page
number. This causes the code not to recognize that all of the pages in an
extent have been released. Fixing this check does not change any user-visible
behavior, but I think it is a step in the right direction.
3) The inability of the compression code to release pages held by the first
allocation page. I don't understand this problem yet. Before looking into this
one, I need advice about whether I am heading in the right direction.
More information about these 3 problems follows:
-------------------
Concerning (1), the boundary check which causes the repro to fail:
In AllocExtent.compressPages(), the new_highest_page argument can be -1. This
happens if all of the pages in the extent turn out to be empty. However, if
new_highest_page is -1, then the code does not fall into the block at line 577;
that's the code which actually marks the pages as released. The value of
new_highest_page was calculated by AllocExtent.compress(). The name
new_highest_page is confusing: it holds a bit position, not a page number, and
when it is -1, it is a flag indicating that every page in the extent is empty.
AllocExtent.compress() returns new_highest_page + 1, triggering its caller to
fall into a block at line 1074 in AllocPage.compress(); that block releases
pages to the operating system. That is how we end up in the situation that the
pages are actually released but AllocExtent still thinks they are allocated.
That, in turn, is what tricks a later INSERT into trying to write onto a
non-existent page.
The fix is to make the code fall into the block at 577 if new_highest_page is
-1.
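To make the failure mode concrete, here is a small, self-contained sketch of the
pattern (this is not the Derby source; the class, method, and field names are
invented for illustration). The point is the guard: when -1 is used as the
"every page in the extent is empty" flag, a guard that only accepts non-negative
values skips the release bookkeeping entirely:

    // Illustration only -- a toy model, not AllocExtent.
    import java.util.BitSet;

    class ToyExtent {
        private final BitSet allocated = new BitSet();

        ToyExtent(int pageCount) { allocated.set(0, pageCount); }

        // Old behavior: only shrink the bit map when the argument is a real
        // bit position, so -1 ("everything is empty") is silently ignored.
        void compressPagesOld(int newHighestBit) {
            if (newHighestBit >= 0) {
                allocated.clear(newHighestBit + 1, allocated.length());
            }
        }

        // Fixed behavior: let -1 fall into the block, so every bit is cleared
        // and the bookkeeping matches the pages actually returned to the OS.
        void compressPagesFixed(int newHighestBit) {
            if (newHighestBit >= -1) {
                allocated.clear(newHighestBit + 1, allocated.length());
            }
        }

        public static void main(String[] args) {
            ToyExtent stale = new ToyExtent(200);
            stale.compressPagesOld(-1);
            System.out.println("old guard, bits still set: " + stale.allocated.cardinality());   // 200

            ToyExtent fixed = new ToyExtent(200);
            fixed.compressPagesFixed(-1);
            System.out.println("fixed guard, bits still set: " + fixed.allocated.cardinality()); // 0
        }
    }

In the toy model the stale bit map is just a wrong count; in Derby it is what
later convinces an INSERT that a truncated page is still available.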
-------------------
Concerning (2), the confusion about whether AllocExtent.compress() returns a
bit position or a page number:
At line 1080 in AllocPage.compress(), the code compares a bit position to a
page number. Bit positions are small integers, e.g., in the range 0-200. Page
numbers are potentially much larger integers in, say, the range 12000-12200. This
mismatched comparison at line 1080 prevents AllocPage.compress() from recognizing
that all of the pages in the extent have been released.
I have renamed last_valid_page to last_valid_page_bit to clarify that this is a
bit position, not a page number. And I have changed the check at 1080 to
compare the bit position to another bit position. This comparison deserves the
attention of someone who knows this code better than I do. Is this the right
comparison?
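As a concrete illustration of the unit mismatch (invented numbers in the ranges
mentioned above; this is not the real comparison at line 1080, just the general
failure mode of comparing values in different units):

    // Illustration only. Bit positions are extent-relative and tiny; page numbers
    // are absolute and large. Comparing one against the other gives the same
    // answer whether or not the extent is empty, so the "all pages released"
    // case cannot be detected that way.
    class BitVsPageNumber {
        public static void main(String[] args) {
            long extentStartPage  = 12000;   // first page number covered by the extent
            long callerPageNumber = 12050;   // a page NUMBER supplied by the caller

            int lastValidBitStillLive = 150; // extent still holds live pages
            int lastValidBitAllFreed  = -1;  // sentinel: every page was released

            // Bit vs page number: both cases compare the same way, by magnitude alone.
            System.out.println(lastValidBitStillLive < callerPageNumber);  // true
            System.out.println(lastValidBitAllFreed  < callerPageNumber);  // true

            // Same-unit comparison: convert the bit position to a page number
            // first, and the two cases become distinguishable again.
            System.out.println(extentStartPage + lastValidBitStillLive < callerPageNumber); // false
            System.out.println(lastValidBitAllFreed < 0);                                   // true: extent is empty
        }
    }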
In a follow-on cleanup issue, it might make sense to rename variables in the
allocation code to clarify which values are bit positions and which are page
numbers. That may expose other questionable code in this area.
-------------------
Concerning (3), the inability of the compression code to release empty pages
managed by the first allocation page:
I had hoped that the change for (2) would cause the compress to release more
space. But it didn't. The compress only releases the pages managed by the
second (last) allocation page. All of the pages managed by the first allocation
page are also empty, but they are not released. This seems wrong to me. I would
expect the file to shrink back to its initial size.
Before pursuing this follow-on issue, I would like advice about whether I am
headed in the right direction. Should the compress shrink the file back to its
initial size? Or should SYSCS_UTIL.SYSCS_INPLACE_COMPRESS_TABLE('APP',
'OPERATIONS', 0, 0, 1) just release empty pages managed by the second and
higher allocation pages?
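For reference, the shape of the failing sequence is roughly the following (a
minimal sketch, not the attached DbCompressErrorTester.java; the connection URL,
table definition, and row counts are placeholders):

    // Sketch only: grow APP.OPERATIONS, empty it, run in-place compression with
    // truncate_end = 1, then insert again. Before the fix for (1), that last
    // insert can be directed to a page the extent still claims to own but that
    // was truncated from the file, surfacing as ERROR XSDG0 / EOFException.
    import java.sql.*;

    public class CompressThenInsert {
        public static void main(String[] args) throws SQLException {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:derby:testdb;create=true")) {
                try (Statement s = conn.createStatement()) {
                    s.execute("CREATE TABLE APP.OPERATIONS (ID INT, PAYLOAD VARCHAR(200))");
                    try (PreparedStatement ins = conn.prepareStatement(
                             "INSERT INTO APP.OPERATIONS VALUES (?, ?)")) {
                        for (int i = 0; i < 100000; i++) {       // grow across several allocation pages
                            ins.setInt(1, i);
                            ins.setString(2, "filler text to fatten the rows");
                            ins.executeUpdate();
                        }
                    }
                    s.execute("DELETE FROM APP.OPERATIONS");     // every page becomes free
                }

                // Release empty pages at the end of the container (truncate_end = 1).
                try (CallableStatement cs = conn.prepareCall(
                         "CALL SYSCS_UTIL.SYSCS_INPLACE_COMPRESS_TABLE(?, ?, ?, ?, ?)")) {
                    cs.setString(1, "APP");
                    cs.setString(2, "OPERATIONS");
                    cs.setShort(3, (short) 0);   // purge rows
                    cs.setShort(4, (short) 0);   // defragment rows
                    cs.setShort(5, (short) 1);   // truncate end
                    cs.execute();
                }

                // The insert that fails in the repro before the patch.
                try (Statement s = conn.createStatement()) {
                    s.execute("INSERT INTO APP.OPERATIONS VALUES (1, 'after compress')");
                }
            }
        }
    }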
-------------------
Touches the following files:
M java/engine/org/apache/derby/impl/store/raw/data/AllocExtent.java
Fix for (1).
----------
M java/engine/org/apache/derby/impl/store/raw/data/AllocPage.java
Clarification for (2).
> Unable to insert data into table. Failed due to "ERROR XSDG0: Page
> Page(51919,Container(0, 1104)) could not be read from disk."
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: DERBY-5234
> URL: https://issues.apache.org/jira/browse/DERBY-5234
> Project: Derby
> Issue Type: Bug
> Components: Store
> Affects Versions: 10.5.3.0
> Environment: HP-UX 11iv2 in production environment with JDK1.6;
> Solaris 5/10 in test environment with JDK 1.6
> Reporter: Varma R
> Priority: Critical
> Labels: ERROR, XSDG0, apache, corruption, data, derby,
> derby_triage10_9
> Attachments: 5234_alloc.out, 5234_page_10219.out, 5234_summary.out,
> DataFileReader_Output.zip, DbCompressErrorTester.java,
> derby-5234-01-aa-emptyAllocPage.diff, log191.dat, log85.dat
>
>
> One of the Derby database tables "gets corrupted"/"indicates connection not
> available" while processing inserts from the Java client application, as shown
> in the trace, and the only way to recover from this error is to rebuild the DB
> by deleting the data and creating the tables again. This happens once in a
> while (thrice in a span of two months), and the Java application (run on
> multiple servers), which updates the database, processes around 100 million
> transactions per hour in total, with each transaction resulting in 4-5 updates
> to the DB.
> There are eight tables in the derby database.
> TABLE NAME ROWS COUNT (at time of corruption)
> ---------------------------------------------------------------------------------
> KPI.KPI_MERGEIN; 362917
> KPI.KPI_IN; 422508
> KPI.KPI_DROPPED; 53667
> KPI.KPI_ERROR1; 0
> KPI.KPI_ERROR2; 2686
> KPI.KPI_ERRORMERGE; 0
> KPI.KPI_MERGEOUT; 362669
> KPI.KPI_OUT; 125873
> The derby database has been started with the following parameters
> CMD="java -Dderby.system.home=$DERBY_OPTS -Dderby.locks.monitor=true
> -Dderby.locks.deadlockTrace=true -Dderby.locks.escalationThreshold=50000
> -Dderby.locks.waitTimeout=
> -1 -Dderby.storage.pageCacheSize=100000 -Xms512M -Xmx3072M -XX:NewSize=256M
> -classpath $DERBY_CLASSPATH org.apache.derby.drda.NetworkServerControl start
> -h $KPIDERBYHOST -p $DERBY_KPI_PORT"
> The corrupted database tar (filesystem) from the live environment was moved to
> a test system (Solaris), and a few checks were run on the corrupted DB as part
> of the analysis (the DB does start fine).
> Inserting a row into any table except KPI.KPI_MERGEIN succeeds. But when a new
> row is inserted into the KPI.KPI_MERGEIN table using the command line tool, it
> throws the error message below (the same message that appeared in the live
> environment):
> ij> INSERT INTO KPI.KPI_MERGEIN (A0_TXN_ID, A1_NE_ID, A2_CHU_IP_ADDR,
> A3_BATCH_DATE,A5_CODE) VALUES (-1, 'BMTDE', '192.2.1.3', 231456879, 'KSD');
> ERROR 08006: A network protocol error was encountered and the connection has
> been terminated: the requested command encountered an unarchitected and
> implementation-specific condition for which there was no architected message
> and the derby.log file shows the error stack trace below:
> ERROR XSDG0: Page Page(51919,Container(0, 1104)) could not be read from disk.
>     at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.CachedPage.readPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.CachedPage.setIdentity(Unknown Source)
>     at org.apache.derby.impl.services.cache.ConcurrentCache.find(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.FileContainer.initPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.FileContainer.newPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.BaseContainer.addPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.BaseContainerHandle.addPage(Unknown Source)
>     at org.apache.derby.impl.store.access.heap.HeapController.doInsert(Unknown Source)
>     at org.apache.derby.impl.store.access.heap.HeapController.insertAndFetchLocation(Unknown Source)
>     at org.apache.derby.impl.sql.execute.RowChangerImpl.insertRow(Unknown Source)
>     at org.apache.derby.impl.sql.execute.InsertResultSet.normalInsertCore(Unknown Source)
>     at org.apache.derby.impl.sql.execute.InsertResultSet.open(Unknown Source)
>     at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
>     at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
>     at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
>     at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>     at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(Unknown Source)
>     at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLIMM(Unknown Source)
>     at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
>     at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
> Caused by: java.io.EOFException: Reached end of file while attempting to read a whole page.
>     at org.apache.derby.impl.store.raw.data.RAFContainer4.readFull(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage0(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(Unknown Source)
>     ... 20 more
> ============= begin nested exception, level (1) ===========
> java.io.EOFException: Reached end of file while attempting to read a whole page.
>     at org.apache.derby.impl.store.raw.data.RAFContainer4.readFull(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage0(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.CachedPage.readPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.CachedPage.setIdentity(Unknown Source)
>     at org.apache.derby.impl.services.cache.ConcurrentCache.find(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.FileContainer.initPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.FileContainer.newPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.BaseContainer.addPage(Unknown Source)
>     at org.apache.derby.impl.store.raw.data.BaseContainerHandle.addPage(Unknown Source)
>     at org.apache.derby.impl.store.access.heap.HeapController.doInsert(Unknown Source)
>     at org.apache.derby.impl.store.access.heap.HeapController.insertAndFetchLocation(Unknown Source)
>     at org.apache.derby.impl.sql.execute.RowChangerImpl.insertRow(Unknown Source)
>     at org.apache.derby.impl.sql.execute.InsertResultSet.normalInsertCore(Unknown Source)
>     at org.apache.derby.impl.sql.execute.InsertResultSet.open(Unknown Source)
>     at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
>     at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
>     at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
>     at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>     at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(Unknown Source)
>     at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLIMM(Unknown Source)
>     at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
>     at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
> ============= end nested exception, level (1) ===========
> 2011-05-16 10:37:21.392 GMT:
> Shutting down instance a816c00e-012f-f85f-7892-ffff874c3ff6
> ----------------------------------------------------------------
> Cleanup action completed
> The problem is only with the INSERT statement. When I try a SELECT statement on
> the KPI.KPI_MERGEIN table, it works fine. The database file system size (in
> seg0) is 1.3 GB.
> Can anyone help me identify why this error is thrown for this one table alone?
> Would upgrading to a new version help?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira