[
https://issues.apache.org/jira/browse/CASSANDRA-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202088#comment-15202088
]
DOAN DuyHai commented on CASSANDRA-11383:
-----------------------------------------
[~jkrupan]
1. Not that large. See below the Spark script used to generate the randomized data:
{code:scala}
import java.util.UUID

import com.datastax.spark.connector._

case class Resource(dsrId: UUID, relSeq: Long, seq: Long,
                    dspReleaseCode: String,
                    commercialOfferCode: String, transferCode: String,
                    mediaCode: String, modelCode: String,
                    unicWork: String, title: String, status: String,
                    contributorsName: List[String],
                    periodEndMonthInt: Int, dspCode: String,
                    territoryCode: String,
                    payingNetQty: Long, authorizedSocietiesTxt: String,
                    relType: String)

// Value pools; duplicated entries skew the uniform draw towards frequent values
val allDsps = List("youtube", "itunes", "spotify", "deezer", "vevo",
  "google-play", "7digital", "spotify", "youtube", "spotify", "youtube",
  "youtube", "youtube")
val allCountries = List("FR", "UK", "BE", "IT", "NL", "ES", "FR", "FR")
val allPeriodsEndMonths: Seq[Int] =
  for (year <- 2013 to 2015; month <- 1 to 12)
    yield (year.toString + f"$month%02d").toInt  // e.g. 201301 ... 201512
val allModelCodes = List("PayAsYouGo", "AdFunded", "Subscription")
val allMediaCodes = List("Music", "Ringtone")
val allTransferCodes = List("Streaming", "Download")
val allCommercialOffers = List("Premium", "Free")
val status = "Declared"
val authorizedSocietiesTxt: String = "sacem sgae"
val relType = "whatever"

// Distinct (title, contributors) pairs sampled from a CSV file
val titlesAndContributors: Array[(String, String)] =
  sc.textFile("/tmp/top_100.csv")
    .map(line => { val split = line.split(";"); (split(1), split(2)) })
    .distinct
    .collect

// 100 batches of 40 million random rows each, written to Cassandra
for (batch <- 1 to 100) {
  sc.parallelize((1 to 40000000).map(_ => UUID.randomUUID)).
    map(dsrId => {
      val r = new java.util.Random(System.currentTimeMillis())
      val relSeq = r.nextLong()
      val seq = r.nextLong()
      val dspReleaseCode = seq.toString
      val dspCode = allDsps(r.nextInt(allDsps.size))
      val periodEndMonth = allPeriodsEndMonths(r.nextInt(allPeriodsEndMonths.size))
      val territoryCode = allCountries(r.nextInt(allCountries.size))
      val modelCode = allModelCodes(r.nextInt(allModelCodes.size))
      val mediaCode = allMediaCodes(r.nextInt(allMediaCodes.size))
      val transferCode = allTransferCodes(r.nextInt(allTransferCodes.size))
      val commercialOffer = allCommercialOffers(r.nextInt(allCommercialOffers.size))
      val titleAndContributor: (String, String) =
        titlesAndContributors(r.nextInt(titlesAndContributors.size))
      val title = titleAndContributor._1
      val contributorsName = titleAndContributor._2.split(",").toList
      val unicWork = title + "|" + titleAndContributor._2
      val payingNetQty = r.nextInt(100).toLong
      Resource(dsrId, relSeq, seq, dspReleaseCode, commercialOffer,
        transferCode, mediaCode, modelCode,
        unicWork, title, status, contributorsName, periodEndMonth,
        dspCode, territoryCode, payingNetQty,
        authorizedSocietiesTxt, relType)
    }).
    saveToCassandra("keyspace", "resource")
  Thread.sleep(500)  // brief pause between batches
}
{code}
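For scale: the loop above writes 100 batches of 40 million rows, i.e. 4 billion rows in total. A back-of-the-envelope sketch (not measured, just arithmetic on the figures quoted in the issue: 1.3 TB cluster-wide across 13 nodes) gives:

{code:scala}
// Back-of-the-envelope sizing from the figures in the issue description
val batches = 100
val rowsPerBatch = 40000000L
val totalRows = batches * rowsPerBatch          // 4 billion rows
val clusterBytes = 1.3e12                        // ~1.3 TB cluster-wide
val nodes = 13
val perNodeGB = clusterBytes / nodes / 1e9       // ~100 GB per node
val bytesPerRow = clusterBytes / totalRows       // ~325 bytes per row on disk
{code}

which is consistent with the ≈ 100 GB/node figure reported below.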
2. Does OOM occur if SASI indexes are created one at a time - serially, waiting
for the full index to build before moving on to the next? --> *Yes it does*, see
the log file with CMS settings attached above.
3. Do you need a 32 GB heap to build just one index? I cringe when I see a heap
larger than 14 GB. See if you can get a single SASI index build to work in 10-12 GB
or less.
--> Well, the 32 GB heap was for analytics use cases, and I was using the G1 GC. But
switching to CMS with an 8 GB heap gives the same result, OOM. See the log file with
CMS settings attached above.
> SASI index build leads to massive OOM
> -------------------------------------
>
> Key: CASSANDRA-11383
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11383
> Project: Cassandra
> Issue Type: Bug
> Components: CQL
> Environment: C* 3.4
> Reporter: DOAN DuyHai
> Attachments: CASSANDRA-11383.patch, new_system_log_CMS_8GB_OOM.log,
> system.log_sasi_build_oom
>
>
> 13 bare metal machines
> - 6-core CPU (12 HT)
> - 64 GB RAM
> - 4 SSD in RAID0
> JVM settings:
> - G1 GC
> - Xms32G, Xmx32G
> Data set:
> - ≈ 100 GB per node
> - 1.3 TB cluster-wide
> - ≈ 20 GB for all SASI indices
> C* settings:
> - concurrent_compactors: 1
> - compaction_throughput_mb_per_sec: 256
> - memtable_heap_space_in_mb: 2048
> - memtable_offheap_space_in_mb: 2048
> I created 9 SASI indices
> - 8 indices with text field, NonTokenizingAnalyser, PREFIX mode,
> case-insensitive
> - 1 index with numeric field, SPARSE mode
> After a while, the nodes just went OOM.
> I attach log files. You can see a lot of GC happening while index segments
> are flushed to disk. At some point the node OOMs ...
> /cc [~xedin]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)