[
https://issues.apache.org/jira/browse/CASSANDRA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271618#comment-13271618
]
Jonathan Ellis commented on CASSANDRA-4223:
-------------------------------------------
Right, so the problem is when A creates a {{<B, 1>}} Session in StreamIn, while
B simultaneously creates a {{<B, 1>>}} for an unrelated
StreamOut.transferRanges (move or unbootstrap).
> Non Unique Streaming session ID's
> ---------------------------------
>
> Key: CASSANDRA-4223
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4223
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Ubuntu 10.04.2 LTS
> java version "1.6.0_24"
> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
> "Bare metal" servers from
> https://www.stormondemand.com/servers/baremetal.html
> The servers run on a custom hypervisor.
>
> Reporter: Aaron Morton
> Assignee: Aaron Morton
> Labels: datastax_qa
> Fix For: 1.0.11, 1.1.1
>
> Attachments: 4223_counter_session_id.diff, NanoTest.java, fmm
> streaming bug.txt
>
>
> I have observed repair processes failing due to duplicate Streaming session
> ID's. In this installation it is preventing rebalance from completing. I
> believe it has also prevented repair from completing in the past.
> The attached streaming-logs.txt file contains log messages and an explanation
> of what was happening during a repair operation. it has the evidence for
> duplicate session ID's.
> The duplicate session id's were generated on the repairing node and sent to
> the streaming node. The streaming source replaced the first session with the
> second which resulted in both sessions failing when the first FILE_COMPLETE
> message was received.
> The errors were:
> {code:java}
> DEBUG [MiscStage:1] 2012-05-03 21:40:33,997 StreamReplyVerbHandler.java (line
> 47) Received StreamReply StreamReply(sessionId=26132848816442266,
> file='/var/lib/cassandra/data/FMM_Studio/PartsData-hc-1-Data.db',
> action=FILE_FINISHED)
> ERROR [MiscStage:1] 2012-05-03 21:40:34,027 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[MiscStage:1,5,main]
> java.lang.IllegalStateException: target reports current file is
> /var/lib/cassandra/data/FMM_Studio/PartsData-hc-1-Data.db but is null
> at
> org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:195)
> at
> org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:58)
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> {code}
> and
> {code:java}
> DEBUG [MiscStage:2] 2012-05-03 21:40:36,497 StreamReplyVerbHandler.java (line
> 47) Received StreamReply StreamReply(sessionId=26132848816442266,
> file='/var/lib/cassandra/data/OpsCenter/rollups7200-hc-3-Data.db',
> action=FILE_FINISHED)
> ERROR [MiscStage:2] 2012-05-03 21:40:36,497 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[MiscStage:2,5,main]
> java.lang.IllegalStateException: target reports current file is
> /var/lib/cassandra/data/OpsCenter/rollups7200-hc-3-Data.db but is null
> at
> org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:195)
> at
> org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:58)
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> {code}
> I think this is because System.nanoTime() is used for the session ID when
> creating the StreamInSession objects (driven from
> StorageService.requestRanges()) .
> From the documentation
> (http://docs.oracle.com/javase/6/docs/api/java/lang/System.html#nanoTime())
> {quote}
> This method provides nanosecond precision, but not necessarily nanosecond
> accuracy. No guarantees are made about how frequently values change.
> {quote}
> Also some info here on clocks and timers
> https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks
> The hypervisor may be at fault here. But it seems like we cannot rely on
> successive calls to nanoTime() to return different values.
> To avoid message/interface changes on the StreamHeader it would be good to
> keep the session ID a long. The simplest approach may be to make successive
> calls to nanoTime until the result changes. We could fail if a certain number
> of milliseconds have passed.
> Hashing the file names and ranges is also a possibility, but more involved.
> (We may also want to drop latency times that are 0 nano seconds.)
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira