[
https://issues.apache.org/jira/browse/CASSANDRA-15669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324337#comment-17324337
]
Alexey Zotov edited comment on CASSANDRA-15669 at 4/17/21, 8:51 PM:
--------------------------------------------------------------------
I have looked into this issue and I think I now have a fairly clear understanding of what
is going on.
[~sunhaihong]
Just out of curiosity - what values do you use for the {{sstable_size_in_mb}} and
{{fanout_size}} params in production, and how much data do you have? I'm
wondering how you managed to run into this issue.
[~marcuse]
It looks like you are the best person to discuss this issue with (as far as I can see,
you actively participated in LCS development).
First of all, I was able to reproduce the issue. I explored the code and I believe
I found a couple of minor problems.
# *Wrong estimates calculation*
There is the following comment in the code:
{code:java}
// allocate enough generations for a PB of data, with a 1-MB sstable size. (Note that if maxSSTableSize is
// updated, we will still have sstables of the older, potentially smaller size. So don't make this
// dependent on maxSSTableSize.)
static final int MAX_LEVEL_COUNT = (int) Math.log10(1000 * 1000 * 1000);
{code}
It claims a PB of data for a 1-MB sstable size, but that does not seem to be
correct: it would be correct if 10 levels were supported, whereas only 9 levels
are currently supported. Here are my calculations (1-MB sstable size and a
fanout size of 10):
{code:java}
L0: 4 * 1 MB = 4 MB
L1: 10^1 * 1 MB = 10 MB
L2: 10^2 * 1 MB = 100 MB
L3: 10^3 * 1 MB = 1000 MB
L4: 10^4 * 1 MB = 10000 MB = 9.76 GB
L5: 10^5 * 1 MB = 100000 MB = 97.65 GB
L6: 10^6 * 1 MB = 1000000 MB = 976.56 GB
L7: 10^7 * 1 MB = 10000000 MB = 9765.62 GB = 9.53 TB
L8: 10^8 * 1 MB = 100000000 MB = 97656.25 GB = 95.36 TB
L9: 10^9 * 1 MB = 1000000000 MB = 976562.50 GB = 953.67 TB <-- this level is not supported
{code}
Here is the place that clearly shows that only 9 levels (including L0) are
supported at the moment:
{code:java}
// note that since l0 is broken out, levels[0] represents L1:
private final TreeSet<SSTableReader> [] levels = new TreeSet[MAX_LEVEL_COUNT - 1];
{code}
Either the comment needs to be fixed or the number of levels needs to be
increased. I believe fixing the comment is the easier option, and the supported
amount of data would still be enough for a regular C* setup.
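To make the numbers above easy to re-check, here is a minimal standalone sketch
(not proposed code; it only reproduces the arithmetic from the table, with the
class name and constants chosen by me):
{code:java}
// Standalone sketch, not part of the codebase: reproduces the table above for
// 1-MB sstables, a fanout of 10, L0 capped at 4 sstables and MAX_LEVEL_COUNT = 9
// (as in the snippet quoted earlier).
public class LcsCapacitySketch
{
    public static void main(String[] args)
    {
        final long sstableSizeBytes = 1L << 20; // 1 MB
        final int fanout = 10;
        // Math.log10 is documented to return exactly n for arguments of 10^n
        final int maxLevelCount = (int) Math.log10(1000 * 1000 * 1000); // 9

        long totalBytes = 4 * sstableSizeBytes; // L0: 4 sstables, as in the table above
        for (int level = 1; level < maxLevelCount; level++) // L1..L8 only, L9 is not supported
        {
            long levelBytes = (long) Math.pow(fanout, level) * sstableSizeBytes;
            totalBytes += levelBytes;
            System.out.printf("L%d: %d MB%n", level, levelBytes >> 20);
        }
        // Prints roughly 106 TB in total (L8 alone is ~95 TB) - an order of
        // magnitude short of the PB promised by the comment on MAX_LEVEL_COUNT.
        System.out.printf("total: %.2f TB%n", totalBytes / Math.pow(1024, 4));
    }
}
{code}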
# *There is no proper handling of the situation where there is more data than
supported*
The issue happens when compaction for L8 is about to start. Here is the
flow: {{getCompactionCandidates}} --> {{getCandidatesFor\(i\)}} -->
{{generations.get(level + 1)}}. So while checking the compaction candidates for
L8, it tries to look at what is going on at L9 and immediately fails. And that is
fair, because we only target supporting a certain amount of data.
Currently the above flow is triggered when {{score > 1.001}} (there is more
data on the level than there should be). In fact, we should not even try to check
compaction candidates for the highest level; we should just fail fast, since
this is an impossible situation for a properly configured C* cluster. I think a
clear error should be thrown when there is an attempt to handle more data than
expected on the highest level, something like:
{code:java}
if (score > 1.001)
{
    // the highest level should not ever exceed its maximum size
    if (i == generations.levelCount() - 1)
        throw new RuntimeException("Highest level (L" + i + ") should not exceed its maximum size (" +
                                   maxBytesForLevel + "), but it has " + bytesForLevel + " bytes");

    // before proceeding with a higher level, let's see if L0 is far enough behind to warrant STCS
    if (l0Compaction != null)
        return l0Compaction;
    ...
}
{code}
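For completeness, here is a tiny standalone illustration of why the top level walks
off the end of the array (this is not the actual Cassandra code, only the indexing
implied by the {{levels}} declaration quoted in the first point):
{code:java}
// Illustration only - mirrors the declaration quoted above, where levels has
// MAX_LEVEL_COUNT - 1 slots and levels[0] represents L1.
public class TopLevelIndexingSketch
{
    static final int MAX_LEVEL_COUNT = (int) Math.log10(1000 * 1000 * 1000); // 9

    public static void main(String[] args)
    {
        Object[] levels = new Object[MAX_LEVEL_COUNT - 1]; // indices 0..7 <=> L1..L8

        int topLevel = MAX_LEVEL_COUNT - 1; // L8, the highest supported level
        int nextLevel = topLevel + 1;       // L9, what the candidate check for L8 looks at

        System.out.println("valid indices: 0.." + (levels.length - 1));
        System.out.println("L" + nextLevel + " would map to index " + (nextLevel - 1)
                           + ", which is out of bounds for length " + levels.length);
    }
}
{code}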
I'd be glad to hear your feedback on the points above. If you find the
suggestions reasonable, I'd like to come up with a patch (I have a draft, but
before polishing it I'd like to validate my understanding). I'd probably also
update the documentation to clearly state the number of supported levels and the
ways to estimate the maximum data size.
> LeveledCompactionStrategy compact last level throw an
> ArrayIndexOutOfBoundsException
> ------------------------------------------------------------------------------------
>
> Key: CASSANDRA-15669
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15669
> Project: Cassandra
> Issue Type: Bug
> Reporter: sunhaihong
> Assignee: sunhaihong
> Priority: Normal
> Attachments: cfs_compaction_info.png, error_info.png
>
>
> Cassandra will throw an ArrayIndexOutOfBoundsException when compacting the
> last level.
> My test is as follows:
> # Create a table with LeveledCompactionStrategy and its params are
> 'enabled': 'true', 'fanout_size': '2', 'max_threshold': '32',
> 'min_threshold': '4', 'sstable_size_in_mb': '2' (fanout_size and
> sstable_size_in_mb are set this small just to make it easier to reproduce the
> problem);
> # Insert data into the table by stress;
> # Cassandra throws an ArrayIndexOutOfBoundsException when compacting level 9
> sstables (this level's score is bigger than 1.001)
> ERROR [CompactionExecutor:4] 2020-03-28 08:59:00,990 CassandraDaemon.java:442
> - Exception in thread Thread[CompactionExecutor:4,1,main]
> java.lang.ArrayIndexOutOfBoundsException: 9
> at
> org.apache.cassandra.db.compaction.LeveledManifest.getLevel(LeveledManifest.java:814)
> at
> org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:746)
> at
> org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:398)
> at
> org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:131)
> at
> org.apache.cassandra.db.compaction.CompactionStrategyHolder.lambda$getBackgroundTaskSuppliers$0(CompactionStrategyHolder.java:109)
> at
> org.apache.cassandra.db.compaction.AbstractStrategyHolder$TaskSupplier.getTask(AbstractStrategyHolder.java:66)
> at
> org.apache.cassandra.db.compaction.CompactionStrategyManager.getNextBackgroundTask(CompactionStrategyManager.java:214)
> at
> org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:289)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
> at java.util.concurrent.FutureTask.run(FutureTask.java)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> I tested it on Cassandra versions 3.11.3 & 4.0-alpha3. The exception happened
> on both.
> Once it triggers, level 1 - level n compaction no longer works; level 0
> compaction still works.
>