[
https://issues.apache.org/jira/browse/CASSANDRA-21197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jon Haddad updated CASSANDRA-21197:
-----------------------------------
Summary: nodetool import silently failing resulting in data loss with
analytics jobs (was: nodetool import silently not importing resulting in data
loss with analytics jobs)
> nodetool import silently failing resulting in data loss with analytics jobs
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-21197
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21197
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Analytics Library, Sidecar
> Reporter: Jon Haddad
> Priority: Normal
>
> When evaluating the analytics bulk writer, I found that jobs were reported as
> successful but the data wasn't actually imported. I'm testing with
> C* 5.0.6, sidecar trunk (as of yesterday), and a recent checkout of the
> analytics code. I've reproduced this on both single-token and 4-token
> clusters, using the C* defaults from the tarball release except for these
> settings:
> {noformat}
> ---
> cluster_name: "test"
> num_tokens: 1
> seed_provider:
>   class_name: "org.apache.cassandra.locator.SimpleSeedProvider"
>   parameters:
>     seeds: "10.14.1.95"
> hints_directory: "/mnt/db1/cassandra/hints"
> data_file_directories:
>   - "/mnt/db1/cassandra/data"
> commitlog_directory: "/mnt/db1/cassandra/commitlog"
> concurrent_reads: 64
> concurrent_writes: 64
> trickle_fsync: true
> endpoint_snitch: "Ec2Snitch"{noformat}
> I've traced the network and filesystem calls and found the following
> sequence of events:
> 1. The Spark job runs.
> 2. Sidecar writes the data files to disk.
> 3. Import is called, and C* reports that there is nothing to import.
> 4. Sidecar then deletes the data files.
> As a result, all of my data is deleted from disk without ever being imported.
> I have tested this dozens of times a day for almost a week and it's happened
> 100% of the time.
> I haven't yet determined why Cassandra doesn't import anything, but given the
> nature of the issue I'm hoping more eyes on this will help. It's possible
> there's something specific about my setup that's causing this issue - I know
> there are quite a few tests around sidecar, so I'm surprised it's happening.
> That said, if C* isn't correctly importing data, it should have a way of
> telling sidecar that so sidecar doesn't delete the results of a bulk write
> job.
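
The safeguard requested above could look roughly like the following Python sketch. It assumes the import call can hand back a summary of what it did. `ImportResult`, its fields, and `cleanup_staging` are hypothetical names used for illustration only; they are not part of the Cassandra or Sidecar APIs.

```python
from dataclasses import dataclass
from pathlib import Path
import shutil


@dataclass
class ImportResult:
    """Hypothetical summary of one import call (not a real Cassandra/Sidecar type)."""
    failed_directories: list[str]  # directories the import reported as failed
    imported_sstables: int         # number of SSTables the import actually picked up


def cleanup_staging(staging_dir: Path, result: ImportResult) -> bool:
    """Remove the staged upload only if the import demonstrably took the data.

    Returns True if the directory was removed, False if it was kept.
    """
    if result.failed_directories or result.imported_sstables == 0:
        # The import either failed or found nothing to import. The staged
        # files may be the only copy of the data, so leave them in place.
        return False
    shutil.rmtree(staging_dir)
    return True
```

With a check like this, a "No new SSTables were found" outcome would leave the staged files on disk for inspection instead of unlinking them.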
> {*}Note{*}: the file names might not match across the captures here; I've done this
> over several days now with about a dozen clusters and 100 Spark jobs.
> [Spark job runs|https://github.com/rustyrazorblade/easy-db-lab/blob/main/bin/submit-direct-bulk-writer].
> The data files are written to disk, then renamed. I've captured this
> several ways; the easiest to read is this sysdig capture of the rename:
> {noformat}
> sudo sysdig "evt.category=file and (proc.pid=24272 or proc.pid=30444)" | grep 'cassandra/import'{noformat}
> Here's the relevant output, where the vertx process (sidecar) performs the
> rename to the expected data file name:
> {noformat}
> 2198732 01:17:36.437748828 1 vert.x-internal (30642) < rename res=0
> oldpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Index.db16346060661306473655.tmp
>
> newpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Index.db
> 2199993 01:17:36.450173069 6 vert.x-internal (30635) < rename res=0
> oldpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Filter.db4989982398684709072.tmp
>
> newpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Filter.db{noformat}
>
>
> Import is then called on the Cassandra process (PID 30528). I captured the
> getdents64 call on the import directory, which returns 10 entries:
> {noformat}
> sudo strace -p 30528 -e trace=getdents64 -y 2>&1 | grep import
> getdents64(402</mnt/db1/cassandra/import/0-0-28c91aa3-fcae-4c97-bf5a-e520f070e1f9-a0a1bdd0-176b-11f1-bc8d-55a3317257c0/bulk_test/data>,
> 0x7176a803a0c0 /* 10 entries */, 32768) = 392{noformat}
> However, the log reports that no new SSTables were found:
>
> {noformat}
> INFO [RMI TCP Connection(92)-127.0.0.1] 2026-03-04 01:44:12,773
> SSTableImporter.java:80 - [af506331-6517-4461-a10f-3846baaf30c6] Loading new
> SSTables for bulk_test/data:
> Options{srcPaths='[/mnt/db1/cassandra/import/0-0-28c91aa3-fcae-4c97-bf5a-e520f070e1f9-a0a1bdd0-176b-11f1-bc8d-55a3317257c0/bulk_test/data]',
> resetLevel=true, clearRepaired=true, verifySSTables=true, verifyTokens=true,
> invalidateCaches=true, extendedVerify=false, copyData= false,
> failOnMissingIndex= false, validateIndexChecksum= true}
> INFO [RMI TCP Connection(92)-127.0.0.1] 2026-03-04 01:44:12,781
> SSTableImporter.java:214 - [af506331-6517-4461-a10f-3846baaf30c6] No new
> SSTables were found for bulk_test/data{noformat}
> Sidecar then unlinks the files, resulting in data loss:
>
> {noformat}
> 2248856 01:17:37.778334683 1 vert.x-internal (30642) < unlink res=0
> path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-CompressionInfo.db
> 2248866 01:17:37.778345865 1 vert.x-internal (30642) < newfstatat res=0
> dirfd=-100(AT_FDCWD)
> path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db
> flags=256(AT_SYMLINK_NOFOLLOW)
> 2248868 01:17:37.778352848 1 vert.x-internal (30642) < newfstatat res=0
> dirfd=-100(AT_FDCWD)
> path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db
> flags=256(AT_SYMLINK_NOFOLLOW)
> 2248875 01:17:37.778370298 1 vert.x-internal (30642) < unlink res=0
> path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db{noformat}
>
> I haven't yet determined why Cassandra doesn't import the data. It sees the
> files in the directory listing, but there's no additional debug logging
> available to identify why it doesn't consider them valid.
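
Until more logging is available, a small diagnostic run against the import directory just before import is triggered might help narrow this down. The Python sketch below groups files by SSTable prefix and lists the components present. It assumes file names of the form `<version>-<id>-<format>-<Component>`, as seen in the traces above (for example `oa-1-big-Index.db`); `summarize_import_dir` is a hypothetical helper, not an existing tool.

```python
from collections import defaultdict
from pathlib import Path


def summarize_import_dir(directory: Path) -> dict[str, list[str]]:
    """Group files by SSTable prefix (e.g. 'oa-1-big') and list the components seen.

    Assumes names like '<version>-<id>-<format>-<Component>'. Anything else,
    including leftover '.tmp' files, is collected under 'unrecognized'.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for entry in sorted(directory.iterdir()):
        parts = entry.name.split("-", 3)
        if entry.is_file() and len(parts) == 4 and not entry.name.endswith(".tmp"):
            prefix = "-".join(parts[:3])
            groups[prefix].append(parts[3])
        else:
            groups["unrecognized"].append(entry.name)
    return dict(groups)
```

Comparing this summary with what the importer logs would show whether a component was missing or still under its temporary name at the moment import ran.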
>
>