I want to create something like an RDBMS index on keyPnl over pnl_type_code so that the group-by can be done efficiently. How do I achieve that? Currently the code below blows out of memory in Spark on 60GB of data; keyPnl is a very large file. We have been stuck for a week, trying Kryo, mapValues, etc., but to no avail. We want to partition on pnl_type_code but have no idea how to do that. Please advise.
val keyPnl = pnl.filter(_.rf_level == "0").keyBy(f => f.portfolio_code)
val keyPosition = positions.filter(_.pl0_code == "3").keyBy(f => f.portfolio_code)

val JoinPnlPortfolio = keyPnl.leftOuterJoin(keyPosition)

var result = JoinPnlPortfolio
  .groupBy(r => r._2._1.pnl_type_code)
  .mapValues(kv => kv.map(mapper).fold(List[Double]())(Vector.reduceVector _))
  .mapValues(kv => Var.percentile(kv, 0.99))

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-join-composite-keys-tp8696p9423.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
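One way the re-keying by pnl_type_code could be sketched: since `groupBy` materialises every value for a key as one in-memory collection, replacing it with `reduceByKey` (which combines values map-side, partition by partition) plus an explicit `HashPartitioner` may avoid the blow-up. This is only a sketch under assumptions: it reuses the poster's own helpers (`mapper`, `Vector.reduceVector`, `Var.percentile`) and record fields, and the partition count of 200 is an arbitrary illustrative value, not a recommendation.

```scala
import org.apache.spark.HashPartitioner

// Same join as in the original code.
val keyPnl      = pnl.filter(_.rf_level == "0").keyBy(_.portfolio_code)
val keyPosition = positions.filter(_.pl0_code == "3").keyBy(_.portfolio_code)
val joined      = keyPnl.leftOuterJoin(keyPosition)

val result = joined
  // Re-key each row by pnl_type_code and map it to its vector
  // immediately, so only reduced vectors ever cross the network.
  .map(r => (r._2._1.pnl_type_code, mapper(r)))
  // reduceByKey merges vectors incrementally within each partition
  // before shuffling; the Partitioner controls how keys are spread.
  .reduceByKey(new HashPartitioner(200), Vector.reduceVector _)
  // Note: this assumes Var.percentile can work on the reduced
  // vector; an exact 0.99 percentile over raw values would need a
  // different (e.g. approximate) approach.
  .mapValues(v => Var.percentile(v, 0.99))
```

The key difference from `groupBy` is that no single key's values are ever collected into one iterable; whether this preserves the original semantics depends on `Vector.reduceVector` being an associative combine, which the original `fold` usage suggests it is.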