Re: nodetool suddenly failing with "Access denied!"
Sergio and Abe, thanks so much for responding to me so quickly! I managed to figure out both the problem and the solution.

In the Terraform scripts we used to stand up the EC2 instances, we have a template file for the jmxremote.access file with this content:

monitorRole readonly
controlRole readwrite \
            create javax.management.monitor.*,javax.management.timer.* \
            unregister
${USERNAME} ${JMX_USER_ACCESSTYPE}

${JMX_USER_ACCESSTYPE} gets replaced by readwrite, and ${USERNAME}, of course, gets replaced by the user we run nodetool as.

What I'm seeing now is that the jmxremote.access file on the failing nodes just contains the default content and that, at least in one case, it has a fairly recent timestamp. This indicates that some security upgrades were indeed performed without consulting the Cassandra DBAs. If I restore the desired content, and in particular the last line, nodetool works again.

The remaining mystery is why this problem is only occurring piecemeal on some nodes when the security upgrade was performed across a much broader cross-section of nodes. But I might just need to leave that as a mystery.

On Sun, Feb 26, 2023 at 11:46 AM Sergio wrote:

> Hey!
> I would try to spin up a new node and see if the problem occurs on it.
> If it happens, I would check the history of changes on the cookbook
> recipe. If you don't find any problem on the new node, you might replace
> the nodes having problems one by one with new ones and decommission the
> affected ones. It would cost some time and money, but that's better than
> having nodetool not working.
>
> Best,
>
> Sergio
>
> Il giorno dom 26 feb 2023 alle ore 10:51 Abe Ratnofsky ha scritto:
>
>> Hey Mitch,
>>
>> The security upgrade schedule that your colleague is working on may well
>> be relevant. Is your entire cluster on 3.11.6, or are the failing hosts
>> possibly on a newer version?
>>
>> Abe
>>
>> On Feb 26, 2023, at 10:38, Mitch Gitman wrote:
>>
>> We're running Cassandra 3.11.6 on AWS EC2 instances. These clusters
>> have been running for a few years.
>>
>> We're suddenly noticing that on one of our clusters the nodetool
>> command is failing on certain nodes but not on others. The failure:
>>
>> nodetool: Failed to connect to '...:7199' - SecurityException: 'Access
>> denied! Invalid access level for requested MBeanServer operation.'.
>>
>> I suspect this stems from security upgrades recently made by a
>> colleague I'm not in coordination with, but that's a bit of an academic
>> matter for now.
>>
>> I've compared the jmxremote.access and jvm.options files on a host
>> where nodetool is not working vs. a host where it is working, and found
>> no meaningful differences.
>>
>> Any ideas? The interesting aspect of this problem is that it is
>> occurring on some nodes in the one cluster but not others.
>>
>> I'll update this thread if I find any solutions on my end.
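For reference, the substitution described above can be sanity-checked outside Terraform. A sketch using Python's string.Template as a stand-in for Terraform's template rendering — the username value and the render_access_file helper are illustrative, not part of the actual scripts:

```python
from string import Template

# Stand-in for Terraform's rendering of the jmxremote.access template
# quoted above. Variable names (USERNAME, JMX_USER_ACCESSTYPE) come from
# the template; the concrete values below are illustrative.
JMX_ACCESS_TEMPLATE = Template(
    "monitorRole readonly\n"
    "controlRole readwrite \\\n"
    "            create javax.management.monitor.*,javax.management.timer.* \\\n"
    "            unregister\n"
    "${USERNAME} ${JMX_USER_ACCESSTYPE}\n"
)

def render_access_file(username: str, access_type: str = "readwrite") -> str:
    """Render the jmxremote.access content the nodetool user needs."""
    return JMX_ACCESS_TEMPLATE.substitute(
        USERNAME=username, JMX_USER_ACCESSTYPE=access_type
    )

rendered = render_access_file("cassandra")
# The critical last line: without it, nodetool fails with
# "Access denied! Invalid access level for requested MBeanServer operation."
assert rendered.rstrip().endswith("cassandra readwrite")
```

A check like this could run in CI against the rendered file to catch a security patch quietly reverting it to the default content.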
nodetool suddenly failing with "Access denied!"
We're running Cassandra 3.11.6 on AWS EC2 instances. These clusters have been running for a few years.

We're suddenly noticing that on one of our clusters the nodetool command is failing on certain nodes but not on others. The failure:

nodetool: Failed to connect to '...:7199' - SecurityException: 'Access denied! Invalid access level for requested MBeanServer operation.'.

I suspect this stems from security upgrades recently made by a colleague I'm not in coordination with, but that's a bit of an academic matter for now.

I've compared the jmxremote.access and jvm.options files on a host where nodetool is not working vs. a host where it is working, and found no meaningful differences.

Any ideas? The interesting aspect of this problem is that it is occurring on some nodes in the one cluster but not others.

I'll update this thread if I find any solutions on my end.
Re: running repairs on insert-only tables
Jeff, good to hear from you. Based on what you're saying, we can avoid regular repairs on these tables. We can live with the read repairs, because the bulk of these tables are used strictly for analytics queries by Spark jobs or for ad-hoc queries by members of the technical team; they're not part of the application read path.

Thanks. Not having to do these repairs on a regular basis is a big win for us.

On Thu, Nov 5, 2020 at 11:33 AM Jeff Jirsa wrote:

> On Nov 5, 2020, at 10:18 AM, Mitch Gitman wrote:
> >
> > Hi!
> >
> > Now, we could comfortably run all the repairs we need to within our
> > off-hours window if we just left out all our tables that are
> > insert-only. By insert-only, I mean that we have certain classes of
> > tables that we're only inserting into; we're never updating them or
> > deleting them. Therefore, these are tables that have no tombstones,
> > and if repairs are just about clearing out tombstones, then ostensibly
> > they shouldn't need to be repaired. The question is, is that really
> > the case? Is there any reason to still run repairs on insert-only
> > tables?
> >
> > If I come up with my own answer I'm satisfied with, I'll reply to
> > myself here.
>
> A table that never does deletes does indeed have different repair
> requirements.
>
> You strictly don't need to repair it EXCEPT to guarantee consistency
> when replacing a host. If you do have a host fail, then strictly
> speaking you should repair all of the replicas of the down host before
> you stream in the replacement host. That's likely rare, it's true for
> all workloads, and almost nobody does it today, but it's the only real
> repair requirement for a table that doesn't have deletes.
>
> That said: repair does help reduce differences, which may reduce read
> repairs, but you're relying on consistency level for the time between
> insert and repair ANYWAY, so it's probably fine.
running repairs on insert-only tables
With all the years I've been working with Cassandra, I'm embarrassed that I have to ask this question.

We have some tables that are taking longer to repair than we're comfortable with. We're on Cassandra 3.11, so we have to run full repairs as opposed to incremental repairs, which to my understanding can't be counted on until Cassandra 4.0. We're running sequential repairs, as opposed to the default parallel repairs, so that the repairs can run in a low-intensity fashion while the keyspace is still able to take write and read requests, albeit preferably and primarily during off-hours.

The problem with sequential repairs is they're taking too long for us and extending into "on-hours." We could run parallel repairs to speed things up, but that would require suspending the services in our write pipeline, which we'd rather not resort to.

Now, we could comfortably run all the repairs we need to within our off-hours window if we just left out all our tables that are insert-only. By insert-only, I mean that we have certain classes of tables that we're only inserting into; we're never updating them or deleting them. Therefore, these are tables that have no tombstones, and if repairs are just about clearing out tombstones, then ostensibly they shouldn't need to be repaired. The question is, is that really the case? Is there any reason to still run repairs on insert-only tables?

If I come up with my own answer I'm satisfied with, I'll reply to myself here.
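If insert-only tables can indeed be skipped, the nightly repair run reduces to a set difference over the table list. A minimal sketch (the helper, keyspace, and table names below are hypothetical; the -seq and -full flags match the full sequential repairs described above):

```python
# Hypothetical helper: build per-table "nodetool repair" invocations for
# a keyspace, skipping tables known to be insert-only (no updates or
# deletes, hence no tombstones to reconcile).

def repair_commands(keyspace, tables, insert_only, extra_flags=("-seq", "-full")):
    """Return one nodetool command (as an argv list) per table that
    still needs routine repair."""
    return [
        ["nodetool", "repair", *extra_flags, keyspace, table]
        for table in tables
        if table not in insert_only
    ]

cmds = repair_commands(
    keyspace="metrics",
    tables=["events", "rollups", "users"],
    insert_only={"events", "rollups"},  # append-only analytics tables
)
# Only the mutable table is scheduled for the off-hours window.
assert cmds == [["nodetool", "repair", "-seq", "-full", "metrics", "users"]]
```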
Re: Restore a table with dropped columns to a new cluster fails
Fabulous tip. Thanks, Sean. I will definitely check out dsbulk. Great to see it's a Cassandra-general tool and not just limited to DataStax Enterprise.

On Fri, Jul 24, 2020 at 12:58 PM Durity, Sean R wrote:

> I would use dsbulk to unload and load. Then the schemas don't really
> matter. You define which fields in the resulting file are loaded into
> which columns. You also won't have the limitations and slowness of COPY
> TO/FROM.
>
> Sean Durity
>
> *From:* Mitch Gitman
> *Sent:* Friday, July 24, 2020 2:22 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Restore a table with dropped columns to a new
> cluster fails
>
> I'm reviving this thread because I'm looking for a non-hacky way to
> migrate data from one cluster to another using nodetool snapshot and
> sstableloader without having to preserve dropped columns in the new
> schema. In my view, that's just cruft and confusion that keeps building.
>
> The best idea I can come up with is to do the following in the source
> cluster:
>
>    1. Use the cqlsh COPY TO command to export the data in the table.
>    2. Drop the table.
>    3. Re-create the table.
>    4. Use the cqlsh COPY FROM command to import the data into the new
>    incarnation of the table.
>
> This approach is predicated on two assumptions:
>
>    - The re-created table has no knowledge of the history of the old
>    table by the same name.
>    - The amount of data in the table doesn't exceed what the COPY
>    command can handle.
>
> If the dropped columns exist in the table in an environment where
> there's a lot of data, then we'd have to use some other mechanism to
> capture and reload the data.
>
> If you see something wrong about this approach or you have a better way
> to do it, I'd be glad to hear from you.
>
> On Tue, Feb 19, 2019 at 11:31 AM Jeff Jirsa wrote:
>
> You can also manually add the dropped column to the appropriate table
> to eliminate the issue. Has to be done by a human; a new cluster would
> have no way of learning about a dropped column, and the missing metadata
> cannot be inferred.
>
> On Tue, Feb 19, 2019 at 10:58 AM Elliott Sims wrote:
>
> When a snapshot is taken, it includes a "schema.cql" file. That should
> be sufficient to restore whatever you need to restore. I'd argue that
> neither automatically resurrecting a dropped table nor silently failing
> to restore it is a good behavior, so it's not unreasonable to have the
> user re-create the table and then choose whether to re-drop it.
>
> On Tue, Feb 19, 2019 at 7:28 AM Hannu Kröger wrote:
>
> Hi,
>
> I would like to bring this issue to your attention.
>
> Link to the ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-14336
>
> Basically, if a table contains dropped columns and you try to restore a
> snapshot to a new cluster, that will fail because of an error like
> "java.lang.RuntimeException: Unknown column XXX during deserialization".
>
> I feel this is quite a serious problem for the backup and restore
> functionality of Cassandra. You cannot restore a backup to a new
> cluster if columns have been dropped.
>
> There have been other similar tickets that have apparently been closed,
> but based on my test with 3.11.4, the issue still persists.
>
> Best Regards,
> Hannu Kröger
Re: Restore a table with dropped columns to a new cluster fails
I'm reviving this thread because I'm looking for a non-hacky way to migrate data from one cluster to another using nodetool snapshot and sstableloader without having to preserve dropped columns in the new schema. In my view, that's just cruft and confusion that keeps building.

The best idea I can come up with is to do the following in the source cluster:

1. Use the cqlsh COPY TO command to export the data in the table.
2. Drop the table.
3. Re-create the table.
4. Use the cqlsh COPY FROM command to import the data into the new incarnation of the table.

This approach is predicated on two assumptions:

- The re-created table has no knowledge of the history of the old table by the same name.
- The amount of data in the table doesn't exceed what the COPY command can handle.

If the dropped columns exist in the table in an environment where there's a lot of data, then we'd have to use some other mechanism to capture and reload the data.

If you see something wrong about this approach or you have a better way to do it, I'd be glad to hear from you.

On Tue, Feb 19, 2019 at 11:31 AM Jeff Jirsa wrote:

> You can also manually add the dropped column to the appropriate table to
> eliminate the issue. Has to be done by a human; a new cluster would have
> no way of learning about a dropped column, and the missing metadata
> cannot be inferred.
>
> On Tue, Feb 19, 2019 at 10:58 AM Elliott Sims wrote:
>
>> When a snapshot is taken, it includes a "schema.cql" file. That should
>> be sufficient to restore whatever you need to restore. I'd argue that
>> neither automatically resurrecting a dropped table nor silently failing
>> to restore it is a good behavior, so it's not unreasonable to have the
>> user re-create the table and then choose whether to re-drop it.
>>
>> On Tue, Feb 19, 2019 at 7:28 AM Hannu Kröger wrote:
>>
>>> Hi,
>>>
>>> I would like to bring this issue to your attention.
>>>
>>> Link to the ticket:
>>> https://issues.apache.org/jira/browse/CASSANDRA-14336
>>>
>>> Basically, if a table contains dropped columns and you try to restore
>>> a snapshot to a new cluster, that will fail because of an error like
>>> "java.lang.RuntimeException: Unknown column XXX during deserialization".
>>>
>>> I feel this is quite a serious problem for the backup and restore
>>> functionality of Cassandra. You cannot restore a backup to a new
>>> cluster if columns have been dropped.
>>>
>>> There have been other similar tickets that have apparently been
>>> closed, but based on my test with 3.11.4, the issue still persists.
>>>
>>> Best Regards,
>>> Hannu Kröger
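For concreteness, the four steps above might look like this in cqlsh. The keyspace, table definition, and file path are placeholders; note that COPY ... TO exports the table to a file, while COPY ... FROM loads a file back into the table:

```cql
-- Sketch of the drop-and-recreate sequence, run in cqlsh on the source
-- cluster. "ks.example" and its columns are hypothetical.
COPY ks.example TO '/tmp/example.csv' WITH HEADER = TRUE;    -- export

DROP TABLE ks.example;
CREATE TABLE ks.example (          -- re-created with no dropped-column history
    id uuid PRIMARY KEY,
    payload text
);

COPY ks.example FROM '/tmp/example.csv' WITH HEADER = TRUE;  -- import
```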
quietness of full nodetool repair on large dataset
I'm on Apache Cassandra 3.10. I'm interested in moving over to Reaper for repairs, but in the meantime, I want to get nodetool repair working a little more gracefully.

What I'm noticing is that, when I'm running a repair for the first time with the --full option after a large initial load of data, the client will say it's starting on a repair job and then cease to produce any output for not just minutes but a few hours. This causes SSH inactivity timeouts.

I have tried running the repair with the --trace option, but that leads to the other extreme, where there's just a torrent of output, scarcely any of which I'll typically need.

As a literal solution to my SSH inactivity timeouts, I could extend the timeouts, or I could do some scripting jujitsu with StrictHostKeyChecking=no and a loop that spits out some arbitrary output until the command finishes. But even if the timeouts were no concern, the sheer unresponsiveness is apt to make an operator nervous. And I'd like to think there's a Goldilocks way to run a full nodetool repair on a large dataset that's just a bit more responsive without going all TMI.

Thoughts? Anyone else notice this?
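For what it's worth, the "loop that spits some arbitrary output" idea can be done without loosening SSH timeouts by wrapping the long-running command and emitting a periodic heartbeat. A sketch under the assumption that any output is enough to keep the session alive; the wrapper and its 300-second default interval are my own invention, not a nodetool feature:

```python
import subprocess
import threading
import time

def run_with_heartbeat(argv, interval=300.0, log=print):
    """Run a long-lived command (e.g. a full nodetool repair) while
    emitting periodic output so an SSH session sees activity.
    Returns (exit code, number of heartbeats emitted)."""
    proc = subprocess.Popen(argv)
    beats = 0

    def beat():
        nonlocal beats
        while proc.poll() is None:
            time.sleep(interval)
            if proc.poll() is None:
                beats += 1
                log(f"[heartbeat] still running after ~{beats * interval:.0f}s")

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    rc = proc.wait()
    t.join(timeout=interval + 1)
    return rc, beats

# Hypothetical usage on a repair host:
# rc, _ = run_with_heartbeat(["nodetool", "repair", "--full"], interval=300)
```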
Re: nodetool repair failure
Michael, thanks for the input. I don't think I'm going to need to upgrade to 3.11 to get nodetool repair working for me. Instead, I have another plausible explanation and solution for my particular situation.

First, I should say that disk usage proved to be a red herring. There was plenty of disk space available.

When I said that the error message I was seeing was no more precise than "Some repair failed," I misstated things. Just above that error message was a further detail: "Validation failed in /(IP address of host)." Of course, that's still vague. What validation failed? However, that extra information led me to this JIRA ticket: https://issues.apache.org/jira/browse/CASSANDRA-10057. In particular this comment:

"If you invoke repair on multiple node at once, this can be happen. Can you confirm? And once it happens, the error will continue unless you restart the node since some resources remain due to the hang. I will post the patch not to hang."

Now, the particular symptom to which that response refers is not what I was seeing, but the response got me thinking that perhaps the failures I was getting were on account of attempting to run "nodetool repair --partitioner-range" simultaneously on all the nodes in my cluster. These are only three-node dev clusters, and what I would see is that the repair would pass on one node but fail on the other two.

So I tried running the repairs sequentially on each of the nodes. With this change the repair works, and I have every expectation that it will continue to work: running repairs sequentially is the solution to my particular problem. If that's the case and repairs are intended to be run sequentially, then that constitutes a contract change for nodetool repair. This is the first time I'm running a repair on a multi-node cluster on Cassandra 3.10, and only with 3.10 was I seeing this problem. I'd never seen it previously running repairs on the Cassandra 2.1 clusters I was upgrading from.

The last comment in that JIRA ticket comes from someone reporting the same problem I'm seeing, and their experience indirectly corroborates mine, or at least it doesn't contradict it.

On Thu, Jul 27, 2017 at 10:26 AM, Michael Shuler <mich...@pbandjelly.org> wrote:

> On 07/27/2017 12:10 PM, Mitch Gitman wrote:
> > I'm using Apache Cassandra 3.10.
> > this is a dev cluster I'm talking about.
> > Further insights welcome...
>
> Upgrade and see if one of the many fixes for 3.11.0 helped?
>
> https://github.com/apache/cassandra/blob/cassandra-3.11.0/CHANGES.txt#L1-L129
>
> If you can reproduce on 3.11.0, hit JIRA with the steps to repro. There
> are several bug fixes committed to the cassandra-3.11 branch, pending a
> 3.11.1 release, but I don't see one that's particularly relevant to your
> trace.
>
> https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt
>
> --
> Kind regards,
> Michael
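The fix described here, driving the repairs one node at a time instead of kicking them off everywhere at once, can be sketched as a small driver. This is a hypothetical helper, not anything that ships with Cassandra; the run hook exists so the loop can be dry-run without a cluster, and the host addresses are placeholders:

```python
import subprocess

def sequential_partitioner_range_repair(hosts, run=None):
    """Issue 'nodetool repair --partitioner-range' against each host in
    turn, waiting for one to finish before starting the next."""
    run = run or (lambda argv: subprocess.run(argv, check=True))
    issued = []
    for host in hosts:  # strictly one node at a time
        argv = ["nodetool", "-h", host, "repair", "--partitioner-range"]
        run(argv)
        issued.append(argv)
    return issued

# Dry run over a three-node dev cluster (addresses illustrative):
seen = []
sequential_partitioner_range_repair(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"], run=seen.append
)
assert len(seen) == 3 and seen[0][2] == "10.0.0.1"
```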
Re: nodetool repair failure
I want to add an extra data point to this thread, having encountered much the same problem. I'm using Apache Cassandra 3.10.

I attempted to run an incremental repair that was optimized to take advantage of some downtime where the cluster is not fielding traffic and to only repair each node's primary partitioner range:

nodetool repair --partitioner-range

On a couple of nodes, I was seeing the repair fail with the vague "Some repair failed" message:

[2017-07-27 15:30:59,283] Some repair failed
[2017-07-27 15:30:59,286] Repair command #2 finished in 10 seconds
error: Repair job has failed with the error message: [2017-07-27 15:30:59,283] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2017-07-27 15:30:59,283] Some repair failed
    at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
    at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

Running with the --trace option yielded no additional relevant information.

On one node where this was arising, I was able to run the repair again with just the keyspace of interest, see that work, run the repair another time across all keyspaces, and see that work as well. On another node, just trying again did not work. What did work was running a "nodetool compact". The subsequent repair on that node succeeded, even though it took inordinately long. Strangely, another repair after that failed. But then the next couple succeeded.

I proceeded to do a "df -h" on the Ubuntu hosts and noticed that the disk usage was inordinately high. This is my hypothesis as to the underlying cause. Fortunately for me, this is a dev cluster I'm talking about.

Pertinent troubleshooting steps:
* nodetool compact
* Check disk usage. Better yet, preemptively alert on disk usage exceeding a certain threshold.

Further insights welcome...
Re: Mapping a continuous range to a discrete value
I just happened to run into a similar situation myself, and I can see it's through a bad schema design (and query design) on my part. What I wanted to do was narrow down by a range on one clustering column and then by another range on the next clustering column. Failing to adequately think through how Cassandra stores its sorted rows on disk, I just figured, hey, why not? The result? The same error message you got.

But then, going back over some old notes from a DataStax CQL webinar, I came across this (my words): "You can do selects with combinations of the different primary keys, including ranges on individual columns. A range will only work if you've narrowed things down already by equality on all the prior columns. Cassandra creates a composite type to store the column name."

My new solution in response: create two tables, one that's sorted by (in my situation) a high timestamp, the other that's sorted by (in my situation) a low timestamp. What had been two clustering columns gets broken up into one clustering column each in two different tables. Then I do two queries, one with the one range, the other with the other, and I programmatically merge the results.

The funny thing is, that was my original design, which my most recent, and failed, design was replacing. My new solution goes back to my old solution.

On Thu, Apr 7, 2016 at 1:37 AM, Peer, Oded wrote:

> I have a table mapping continuous ranges to discrete values.
>
> CREATE TABLE range_mapping (k int, lower int, upper int, mapped_value int,
> PRIMARY KEY (k, lower, upper));
> INSERT INTO range_mapping (k, lower, upper, mapped_value) VALUES (0, 0, 99, 0);
> INSERT INTO range_mapping (k, lower, upper, mapped_value) VALUES (0, 100, 199, 100);
> INSERT INTO range_mapping (k, lower, upper, mapped_value) VALUES (0, 200, 299, 200);
>
> I then want to query this table to find the mapping of a specific value.
> In SQL I would use: select mapped_value from range_mapping where k=0 and
> ? between lower and upper
>
> If the variable is bound to the value 150, then the mapped_value
> returned is 100.
>
> I can't use the same type of query in CQL. Using the query "select *
> from range_mapping where k = 0 and lower <= 150 and upper >= 150;"
> returns an error: "Clustering column "upper" cannot be restricted
> (preceding column "lower" is restricted by a non-EQ relation)"
>
> I thought of using multi-column restrictions, but they don't work as I
> expected, as the following query returns two rows instead of the one I
> expected:
>
> select * from range_mapping where k = 0 and (lower,upper) <= (150,999)
> and (lower,upper) >= (-999,150);
>
>  k | lower | upper | mapped_value
> ---+-------+-------+--------------
>  0 |     0 |    99 |            0
>  0 |   100 |   199 |          100
>
> I'd appreciate any thoughts on the subject.
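Whichever table layout you land on, the final step is a client-side interval lookup once the rows are in hand, e.g. after a SELECT lower, upper, mapped_value FROM range_mapping WHERE k = ?, which returns rows ordered by the lower clustering column. A sketch, assuming non-overlapping ranges as in the example data:

```python
from bisect import bisect_right

def lookup(rows, value):
    """Resolve a point lookup against non-overlapping ranges.
    rows: list of (lower, upper, mapped_value) tuples, sorted by lower,
    as a single-partition CQL query would return them."""
    lowers = [lower for lower, _, _ in rows]
    i = bisect_right(lowers, value) - 1  # last range starting at or below value
    if i >= 0 and rows[i][0] <= value <= rows[i][1]:
        return rows[i][2]
    return None  # value falls in a gap or outside all ranges

rows = [(0, 99, 0), (100, 199, 100), (200, 299, 200)]  # from the example inserts
assert lookup(rows, 150) == 100
assert lookup(rows, 300) is None
```

The bisect keeps the resolution O(log n) per lookup, which matters only if the partition holds many ranges; a linear scan works just as well for a handful.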
something fishy about a Metrics+Ganglia integration
I've inherited an integration of Cassandra's Codahale Metrics reporting with Ganglia that looks sensible enough on the Metrics side. The metrics-reporter-config.yaml points to a gmond.conf on the node. Excerpt:

ganglia:
  - period: 60
    timeunit: 'SECONDS'
    gmondConf: '/etc/ganglia/gmond.conf'

Now if I look at the gmond.conf on one of the nodes in the Cassandra cluster, I see that there's a UDP entry for EVERY node in the cluster, plus OpsCenter:

udp_send_channel {
  host = sandbox-cas00
  port = 8649
  ttl = 1
}
udp_send_channel {
  host = sandbox-cas01
  port = 8649
  ttl = 1
}
udp_send_channel {
  host = sandbox-cas02
  port = 8649
  ttl = 1
}
udp_send_channel {
  host = sandbox-cas03
  port = 8649
  ttl = 1
}
udp_send_channel {
  host = sandbox-cas04
  port = 8649
  ttl = 1
}
udp_send_channel {
  host = sandbox-opscenter
  port = 8649
  ttl = 1
}

What strikes me funny about this configuration is that every node in the cluster is referenced. I was expecting that each node's gmond.conf would only define a udp_send_channel for itself. This is fine for now with a five-node cluster, and we have enough deployment automation in place to mitigate the fact that every time we add a Cassandra node, every other Cassandra node's gmond.conf would have to be updated. But is it really the intention that every Ganglia host needs to be aware of every other host? I was expecting that only the Ganglia server these hosts report to would need to be aware of the full membership of the cluster.

OK, I realize at this point I'm asking more of a Ganglia question than a Cassandra or Metrics question. Apparently, we're using Ganglia in unicast mode rather than multicast.
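For comparison, here is a hedged sketch of what each node's gmond.conf might look like if the nodes only reported to a designated aggregator rather than to every peer. Whether this works depends on which gmond instance aggregates for the grid in your topology; the host name below is illustrative:

```
/* Sketch: unicast gmond.conf for a reporting-only node, assuming a
   single designated aggregator collects for the cluster. Only the
   aggregator would need a matching udp_recv_channel on this port. */
udp_send_channel {
  host = sandbox-opscenter   /* hypothetical aggregator host */
  port = 8649
  ttl = 1
}
```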
Re: sstableloader Could not retrieve endpoint ranges
I want to follow up on this thread to describe what I was able to get working.

My goal was to switch a cluster to vnodes, in the process preserving the data for a single table, endpoints.endpoint_messages. Otherwise, I could afford to start from a clean slate. As should be apparent, I could also afford to do this within a maintenance window where the cluster was down. In other words, I had the luxury of not having to add a new data center to a live cluster per DataStax's documented procedure for enabling vnodes:
http://docs.datastax.com/en/cassandra/1.2/cassandra/configuration/configVnodesProduction_t.html
http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configVnodesProduction_t.html

What I got working relies on the nodetool snapshot command to create various SSTable snapshots under endpoints/endpoint_messages/snapshots/SNAPSHOT_NAME. The snapshots represent the data being backed up and restored from. The backup and restore does not directly, literally work against the original SSTables in the various endpoints/endpoint_messages/ directories.

- endpoints/endpoint_messages/snapshots/SNAPSHOT_NAME/: These SSTables are being copied off and restored from.
- endpoints/endpoint_messages/: These SSTables are obviously the source of the snapshots but are not being copied off and restored from.

Instead of using sstableloader to load the snapshots into the re-initialized Cassandra cluster, I used the JMX StorageService.bulkLoad command after establishing a JConsole session to each node. I copied off the snapshots to load to a directory path that ends with endpoints/endpoint_messages/ to give the bulk-loader a path it expects. The directory path that is the destination for nodetool snapshot and the source for StorageService.bulkLoad is on the same host as the Cassandra node but outside the purview of the Cassandra node.

This procedure can be summarized as follows:

1. For each node, create a snapshot of the endpoint_messages table as a backup.
2. Stop the cluster.
3. On each node, wipe all the data, i.e. the contents of data_files_directories, commitlog, and saved_caches.
4. Deploy the cassandra.yaml configuration that makes the switch to vnodes and restart the cluster to apply the vnodes change.
5. Re-create the endpoints keyspace.
6. On each node, bulk-load the snapshots for that particular node.

This summary can be reduced even further:

1. On each node, export the data to preserve.
2. On each node, wipe the data.
3. On all nodes, switch to vnodes.
4. On each node, import back in the exported data.

I'm sure this process could have been streamlined. One caveat for anyone looking to emulate it: our situation might have been a little easier to reason about because our original endpoint_messages table had a replication factor of 1. We used the vnodes switch as an opportunity to up the RF to 3.

I can only speculate as to why what I was originally attempting wasn't working. But what I was originally attempting wasn't precisely the use case I care about. What I'm following up with now was.

On Fri, Jun 19, 2015 at 8:22 PM, Mitch Gitman mgit...@gmail.com wrote:

> I checked the system.log for the Cassandra node that I did the jconsole
> JMX session against and which had the data to load. Lots of log output
> indicating that it's busy loading the files. Lots of stacktraces
> indicating a broken pipe. I have no reason to believe there are
> connectivity issues between the nodes, but verifying that is beyond my
> expertise. What's indicative is this last bit of log output:
>
> INFO [Streaming to /10.205.55.101:5] 2015-06-19 21:20:45,441 StreamReplyVerbHandler.java (line 44) Successfully sent /srv/cas-snapshot-06-17-2015/endpoints/endpoint_messages/endpoints-endpoint_messages-ic-34-Data.db to /10.205.55.101
> INFO [Streaming to /10.205.55.101:5] 2015-06-19 21:20:45,457 OutputHandler.java (line 42) Streaming session to /10.205.55.101 failed
> ERROR [Streaming to /10.205.55.101:5] 2015-06-19 21:20:45,458 CassandraDaemon.java (line 253) Exception in thread Thread[Streaming to /10.205.55.101:5,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Broken pipe
>     at com.google.common.base.Throwables.propagate(Throwables.java:160)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Broken pipe
>     at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
>     at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:433)
>     at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:565)
>     at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93)
>     at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91
Re: sstableloader Could not retrieve endpoint ranges
Fabien, thanks for the reply. We do have Thrift enabled. From what I can tell, the "Could not retrieve endpoint ranges:" error crops up under various circumstances.

From further reading on sstableloader, it occurred to me that it might be a safer bet to use the JMX StorageService bulkLoad command, considering that the data to import was already on one of the Cassandra nodes, just in an arbitrary directory outside the Cassandra data directories. I was able to get this bulkLoad command to fail with a message that the directory structure did not follow the expected keyspace/table/ pattern. So I created a keyspace directory, then a table directory within that, and moved all the files under the table directory. Executed bulkLoad, passing in that directory. It succeeded. Then I went and ran a nodetool refresh on the table in question.

Only one problem. If I then went to query the table for, well, anything, nothing came back. And this was after successfully querying the table before and truncating it just prior to the bulkLoad, so I knew that only data coming from the bulkLoad could show up there. Oh, and for good measure, I stopped and started all the nodes too. No luck still.

What's puzzling is that the bulkLoad silently succeeds, even though it doesn't appear to be doing anything. I haven't yet checked the Cassandra logs.

On Fri, Jun 19, 2015 at 12:28 AM, Fabien Rousseau fabifab...@gmail.com wrote:
> Hi,
>
> I already got this error on a 2.1 cluster because thrift was disabled. So
> you should check that thrift is enabled and accessible from the
> sstableloader process.
>
> Hope this helps,
> Fabien
>
> Le 19 juin 2015 05:44, Mitch Gitman mgit...@gmail.com a écrit :
>> I'm using sstableloader to bulk-load a table from one cluster to another.
>> I can't just copy sstables because the clusters have different topologies.
>> While we're looking to upgrade soon to Cassandra 2.0.x, we're on Cassandra
>> 1.2.19. The source data comes from a nodetool snapshot.
>>
>> Here's the command I ran:
>>
>> sstableloader -d IP_ADDRESSES_OF_SEED_NODES /SNAPSHOT_DIRECTORY/
>>
>> Here's the result I got:
>>
>> Could not retrieve endpoint ranges:
>>  -pr,--principal            kerberos principal
>>  -k,--keytab                keytab location
>>  --ssl-keystore             ssl keystore location
>>  --ssl-keystore-password    ssl keystore password
>>  --ssl-keystore-type        ssl keystore type
>>  --ssl-truststore           ssl truststore location
>>  --ssl-truststore-password  ssl truststore password
>>  --ssl-truststore-type      ssl truststore type
>>
>> Not sure what to make of this, what with the hints at security arguments
>> that pop up. The source and destination clusters have no security. Hoping
>> this might ring a bell with someone out there.
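For anyone retracing the directory shuffle described above, here's a minimal sketch of the keyspace/table/ layout that the bulkLoad error message points to. All paths and file names below are stand-ins for illustration (the real keyspace and table here were endpoints/endpoint_messages), not anything prescribed by Cassandra:

```shell
# Sketch only: arrange loose sstable files into the <keyspace>/<table>/
# layout that StorageService.bulkLoad expects. /tmp paths are stand-ins;
# substitute your real import directory.
IMPORT_DIR=/tmp/demo_import
STAGE_DIR=/tmp/demo_bulkload
rm -rf "$IMPORT_DIR" "$STAGE_DIR"
mkdir -p "$IMPORT_DIR"
touch "$IMPORT_DIR/endpoints-endpoint_messages-ic-34-Data.db"  # fake sstable component

mkdir -p "$STAGE_DIR/endpoints/endpoint_messages"   # keyspace dir, then table dir
mv "$IMPORT_DIR"/*.db "$STAGE_DIR/endpoints/endpoint_messages/"

# Then, in jconsole, invoke bulkLoad on the
# org.apache.cassandra.db:type=StorageService MBean, passing this path:
echo "$STAGE_DIR/endpoints/endpoint_messages"
```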
Re: sstableloader Could not retrieve endpoint ranges
I checked the system.log for the Cassandra node that I did the jconsole JMX session against and which had the data to load. Lots of log output indicating that it's busy loading the files. Lots of stacktraces indicating a broken pipe. I have no reason to believe there are connectivity issues between the nodes, but verifying that is beyond my expertise. What's indicative is this last bit of log output:

INFO [Streaming to /10.205.55.101:5] 2015-06-19 21:20:45,441 StreamReplyVerbHandler.java (line 44) Successfully sent /srv/cas-snapshot-06-17-2015/endpoints/endpoint_messages/endpoints-endpoint_messages-ic-34-Data.db to /10.205.55.101
INFO [Streaming to /10.205.55.101:5] 2015-06-19 21:20:45,457 OutputHandler.java (line 42) Streaming session to /10.205.55.101 failed
ERROR [Streaming to /10.205.55.101:5] 2015-06-19 21:20:45,458 CassandraDaemon.java (line 253) Exception in thread Thread[Streaming to /10.205.55.101:5,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Broken pipe
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:433)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:565)
        at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93)
        at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        ... 3 more

And then right after that I see what appears to be the output from the nodetool refresh:

INFO [RMI TCP Connection(2480)-10.2.101.114] 2015-06-19 21:22:56,877 ColumnFamilyStore.java (line 478) Loading new SSTables for endpoints/endpoint_messages...
INFO [RMI TCP Connection(2480)-10.2.101.114] 2015-06-19 21:22:56,878 ColumnFamilyStore.java (line 524) No new SSTables were found for endpoints/endpoint_messages

Notice that Cassandra hasn't found any new SSTables, even though it was just so busy loading them. What's also noteworthy is that the output from the originating node shows it successfully sent endpoints-endpoint_messages-ic-34-Data.db to another node. But then in the system.log for that destination node, I see no mention of that file. What I do see on the destination node are a few INFO messages about streaming one of the .db files, and every time that's immediately followed by an error message:

INFO [Thread-108] 2015-06-19 21:20:45,453 StreamInSession.java (line 142) Streaming of file /srv/cas-snapshot-06-17-2015/endpoints/endpoint_messages/endpoints-endpoint_messages-ic-26-Data.db sections=1 progress=0/105137329 - 0% for org.apache.cassandra.streaming.StreamInSession@46c039ef failed: requesting a retry.
ERROR [Thread-109] 2015-06-19 21:20:45,456 CassandraDaemon.java (line 253) Exception in thread Thread[Thread-109,5,main]
java.lang.RuntimeException: java.nio.channels.AsynchronousCloseException
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.AsynchronousCloseException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:412)
        at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:203)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
        at org.apache.cassandra.streaming.compress.CompressedInputStream$Reader.runMayThrow(CompressedInputStream.java:151)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        ... 1 more

I don't know, I'm seeing enough flakiness here as to consider Cassandra bulk-loading a lost cause, even if there is something wrong and fixable about my particular cluster. On to exporting and re-importing data at the proprietary application level. Life is too short.

On Fri, Jun 19, 2015 at 2:40 PM, Mitch Gitman mgit...@gmail.com wrote:
> Fabien, thanks for the reply. We do have Thrift enabled. …
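One observation for the record, which may explain the "No new SSTables were found" message above: as I understand it, nodetool refresh only scans the table's live data directory (under data_file_directories in cassandra.yaml), while bulkLoad streams sstables out to the ring rather than dropping files into that directory locally. So refresh on top of bulkLoad finds nothing by design. A sketch of the copy-then-refresh alternative, with stand-in paths (verify the data directory against your own cassandra.yaml):

```shell
# Sketch only: the copy-then-refresh flow. nodetool refresh scans
# <data_dir>/<keyspace>/<table>/ and nothing else, so sstables must be
# copied there first. All paths below are stand-ins for demonstration.
DATA_DIR=/tmp/demo_cassandra_data           # stand-in for /var/lib/cassandra/data
TABLE_DIR="$DATA_DIR/endpoints/endpoint_messages"
SNAPSHOT=/tmp/demo_snapshot                 # stand-in for the snapshot directory
rm -rf "$DATA_DIR" "$SNAPSHOT"
mkdir -p "$TABLE_DIR" "$SNAPSHOT"
touch "$SNAPSHOT/endpoints-endpoint_messages-ic-26-Data.db"  # fake sstable component

# Copy every component of each sstable (Data, Index, Filter, Statistics, ...):
cp "$SNAPSHOT"/endpoints-endpoint_messages-ic-* "$TABLE_DIR/"

# Then (shown here, not executed):
echo "nodetool refresh endpoints endpoint_messages"
```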
sstableloader Could not retrieve endpoint ranges
I'm using sstableloader to bulk-load a table from one cluster to another. I can't just copy sstables because the clusters have different topologies. While we're looking to upgrade soon to Cassandra 2.0.x, we're on Cassandra 1.2.19. The source data comes from a nodetool snapshot.

Here's the command I ran:

sstableloader -d IP_ADDRESSES_OF_SEED_NODES /SNAPSHOT_DIRECTORY/

Here's the result I got:

Could not retrieve endpoint ranges:
 -pr,--principal            kerberos principal
 -k,--keytab                keytab location
 --ssl-keystore             ssl keystore location
 --ssl-keystore-password    ssl keystore password
 --ssl-keystore-type        ssl keystore type
 --ssl-truststore           ssl truststore location
 --ssl-truststore-password  ssl truststore password
 --ssl-truststore-type      ssl truststore type

Not sure what to make of this, what with the hints at security arguments that pop up. The source and destination clusters have no security. Hoping this might ring a bell with someone out there.
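For completeness, the shape of the invocation with stand-in values. My reading of the tool is that sstableloader infers the keyspace and table from the last two components of the path it's given, so the directory should be arranged as .../keyspace/table/; that detail is worth verifying against your own Cassandra version:

```shell
# Sketch only: placeholder seed addresses and a stand-in snapshot path
# arranged as <keyspace>/<table>/, from which sstableloader infers the
# target table. Nothing here touches a real cluster.
LOAD_ROOT=/tmp/demo_load
rm -rf "$LOAD_ROOT"
mkdir -p "$LOAD_ROOT/endpoints/endpoint_messages"

SEEDS="10.0.0.1,10.0.0.2"                   # placeholder seed node addresses

# The actual command (shown here, not executed):
echo "sstableloader -d $SEEDS $LOAD_ROOT/endpoints/endpoint_messages"
```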