Re: scala.MatchError while doing BinaryClassificationMetrics
Thanks, Nick. This worked for me:

val evaluator = new BinaryClassificationEvaluator().
  setLabelCol("label").
  setRawPredictionCol("ModelProbability").
  setMetricName("areaUnderROC")
val auROC = evaluator.evaluate(testResults)

On Mon, Nov 14, 2016 at 4:00 PM, Nick Pentreath wrote:

> Typically you pass in the result of a model transform to the evaluator.
>
> So:
> val model = estimator.fit(data)
> val auc = evaluator.evaluate(model.transform(testData))
>
> Check the Scala API docs for some details:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
Re: scala.MatchError while doing BinaryClassificationMetrics
Typically you pass in the result of a model transform to the evaluator.

So:
val model = estimator.fit(data)
val auc = evaluator.evaluate(model.transform(testData))

Check the Scala API docs for some details:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
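As a minimal sketch of that flow (the estimator and DataFrame names below are placeholders, not names from this thread), the evaluator is simply pointed at whichever columns the fitted model actually produced:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Placeholder estimator/data names, only to show the fit -> transform -> evaluate flow.
val model = estimator.fit(trainingData)
val scored = model.transform(testData)

// Defaults are labelCol = "label", rawPredictionCol = "rawPrediction",
// metricName = "areaUnderROC"; override them if the pipeline used other column names.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

val auROC = evaluator.evaluate(scored)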
Re: scala.MatchError while doing BinaryClassificationMetrics
Can you please suggest how I can use BinaryClassificationEvaluator? I tried:

scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

scala> val evaluator = new BinaryClassificationEvaluator()
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_0d57372b7579

Try 1:

scala> evaluator.evaluate(testScoreAndLabel.rdd)
:105: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(Double, Double)]
 required: org.apache.spark.sql.Dataset[_]
       evaluator.evaluate(testScoreAndLabel.rdd)

Try 2:

scala> evaluator.evaluate(testScoreAndLabel)
java.lang.IllegalArgumentException: Field "rawPrediction" does not exist.
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)

Try 3:

scala> evaluator.evaluate(testScoreAndLabel.select("Label","ModelProbability"))
org.apache.spark.sql.AnalysisException: cannot resolve '`Label`' given input columns: [_1, _2];
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
Re: scala.MatchError while doing BinaryClassificationMetrics
DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF.

But you may prefer to just use spark.ml evaluators, which work with DataFrames. Try BinaryClassificationEvaluator.
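A minimal sketch of that extraction, assuming Spark 2.x where the ML pipeline's probability column holds an org.apache.spark.ml.linalg.Vector (matching against the older org.apache.spark.mllib.linalg.Vector is one common way to hit exactly this kind of MatchError):

import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Pull (score, label) doubles out of each Row before handing the RDD to the
// RDD-based metrics class.
val scoreAndLabels = testResults
  .select("ModelProbability", "Label")
  .rdd
  .map { case Row(p: Vector, l: Double) => (p(1), l) }

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()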
scala.MatchError while doing BinaryClassificationMetrics
I am getting scala.MatchError in the code below. I'm not able to see why this would be happening. I am using Spark 2.0.1.

scala> testResults.columns
res538: Array[String] = Array(TopicVector, subject_id, hadm_id, isElective, isNewborn, isUrgent, isEmergency, isMale, isFemale, oasis_score, sapsii_score, sofa_score, age, hosp_death, test, ModelFeatures, Label, rawPrediction, ModelProbability, ModelPrediction)

scala> testResults.select("Label","ModelProbability").take(1)
res542: Array[org.apache.spark.sql.Row] = Array([0.0,[0.737304818744076,0.262695181255924]])

scala> val testScoreAndLabel = testResults.
     | select("Label","ModelProbability").
     | map { case Row(l:Double, p:Vector) => (p(1), l) }
testScoreAndLabel: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: double]

scala> testScoreAndLabel
res539: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: double]

scala> testScoreAndLabel.columns
res540: Array[String] = Array(_1, _2)

scala> val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel.rdd)
testMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@36e780d1

The code below gives the error:

val auROC = testMetrics.areaUnderROC() // this line gives the error

Caused by: scala.MatchError: [0.0,[0.7316583497453766,0.2683416502546234]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Re: scala.MatchError on stand-alone cluster mode
Hi Rishabh Bhardwaj, Saisai Shao,

Thanks for your help. I have found that the key reason is that I forgot to upload the jar package to all of the nodes in the cluster, so after the master distributed the job and selected one node as the driver, the driver could not find the jar package and threw an exception.

--
Mekal Zheng
Sent with Airmail

From: Rishabh Bhardwaj
Reply-To: Rishabh Bhardwaj
Date: July 15, 2016 at 17:28:43
To: Saisai Shao
Cc: Mekal Zheng, spark users
Subject: Re: scala.MatchError on stand-alone cluster mode

Hi Mekal,

It may be a Scala version mismatch error; kindly check whether you are running both (your streaming app and the Spark cluster) on Scala 2.10 or 2.11.

Thanks,
Rishabh.
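If the versions do turn out to differ, this is how the Scala binary version is usually pinned in an sbt build. A sketch only: the Spark version shown is a placeholder, not taken from this thread.

// build.sbt -- keep the app's Scala binary version aligned with the Spark build on the cluster
scalaVersion := "2.10.6"  // or a 2.11.x release if the cluster's Spark was built for 2.11

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.2"
)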
Re: scala.MatchError on stand-alone cluster mode
The error stack is thrown from your code:

Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class [Ljava.lang.String;)
        at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29)
        at com.jd.deeplog.LogAggregator.main(LogAggregator.scala)

I think you should debug the code yourself; it may not be a problem with Spark.
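For reference, the usual way this particular MatchError appears is the argument destructuring near the top of main: `val Array(a, b, ..., i) = args` only matches an array with exactly that many elements, and anything else throws scala.MatchError on the args array itself (the thread does not show which statement sits on LogAggregator.scala:29, so this is an assumption). A sketch of a safer parse:

object SafeArgs {
  val usage = "Usage: LogAggregator <zkQuorum> <group> <topics> <numThreads> " +
    "<logFormat> <logSeparator> <batchDuration> <destType> <destPath>"

  def main(args: Array[String]): Unit = args match {
    // Exactly nine arguments: destructure them safely inside a match.
    case Array(zkQuorum, group, topics, numThreads, logFormat, logSeparator,
               batchDuration, destType, destPath) =>
      println(s"Starting with zkQuorum=$zkQuorum, topics=$topics, destPath=$destPath")
      // ... build the StreamingContext here ...
    case _ =>
      // Any other argument count would have thrown scala.MatchError with
      // `val Array(...) = args`; here it just prints usage and exits.
      System.err.println(usage)
      System.exit(1)
  }
}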
scala.MatchError on stand-alone cluster mode
Hi,

I have a Spark Streaming job written in Scala. It runs well in local and client mode, but when I submit it in cluster mode, the driver reports the error shown below. Does anyone know what is wrong here? Please help me!

The job code is after the log.

16/07/14 17:28:21 DEBUG ByteBufUtil: -Dio.netty.threadLocalDirectBufferSize: 65536
16/07/14 17:28:21 DEBUG NetUtil: Loopback interface: lo (lo, 0:0:0:0:0:0:0:1%lo)
16/07/14 17:28:21 DEBUG NetUtil: /proc/sys/net/core/somaxconn: 32768
16/07/14 17:28:21 DEBUG TransportServer: Shuffle server started on port :43492
16/07/14 17:28:21 INFO Utils: Successfully started service 'Driver' on port 43492.
16/07/14 17:28:21 INFO WorkerWatcher: Connecting to worker spark://Worker@172.20.130.98:23933
16/07/14 17:28:21 DEBUG TransportClientFactory: Creating new connection to /172.20.130.98:23933
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
        at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class [Ljava.lang.String;)
        at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29)
        at com.jd.deeplog.LogAggregator.main(LogAggregator.scala)
        ... 6 more

==
Job CODE:

object LogAggregator {

  val batchDuration = Seconds(5)

  def main(args: Array[String]) {

    val usage =
      """Usage: LogAggregator
        | logFormat: fieldName:fieldRole[,fieldName:fieldRole] each field must have both name and role
        | logFormat.role: can be key|avg|enum|sum|ignore
      """.stripMargin

    if (args.length < 9) {
      System.err.println(usage)
      System.exit(1)
    }

    val Array(zkQuorum, group, topics, numThreads, logFormat, logSeparator,
      batchDuration, destType, destPath) = args

    println("Start streaming calculation...")

    val conf = new SparkConf().setAppName("LBHaproxy-LogAggregator")
    val ssc = new StreamingContext(conf, Seconds(batchDuration.toInt))

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

    val logFields = logFormat.split(",").map(field => {
      val fld = field.split(":")
      if (fld.size != 2) {
        System.err.println("Wrong parameters for logFormat!\n")
        System.err.println(usage)
        System.exit(1)
      }
      // TODO: ensure the field has both 'name' and 'role'
      new LogField(fld(0), fld(1))
    })

    val keyFields = logFields.filter(logFieldName => {
      logFieldName.role == "key"
    })
    val keys = keyFields.map(key => {
      key.name
    })

    val logsByKey = lines.map(line => {
      val log = new Log(logFields, line, logSeparator)
      log.toMap
    }).filter(log => log.nonEmpty).map(log => {
      val keys = keyFields.map(logField => {
        log(logField.name).value
      })

      val key = keys.reduce((key1, key2) => {
        key1.asInstanceOf[String] + key2.asInstanceOf[String]
      })

      val fullLog = log + ("count" -> new LogSegment("sum", 1))

      (key, fullLog)
    })

    val aggResults = logsByKey.reduceByKey((log_a, log_b) => {

      log_a.map(logField => {
        val logFieldName = logField._1
        val logSegment_a = logField._2
        val logSegment_b = log_b(logFieldName)

        val segValue = logSegment_a.role match {
          case "avg" => {
            logSegment_a.value.toString.toInt + logSegment_b.value.toString.toInt
          }
          case "sum" => {
            logSegment_a.value.toString.toInt + logSegment_b.value.toString.toInt
          }
          case "enum" => {
            val list_a = logSegment_a.value.asInstanceOf[List[(String, Int)]]
            val list_b = logSegment_b.value.asInstanceOf[List[(String, Int)]]
            list_a ++ list_b
          }
          case _ => logSegment_a.value
        }
        (logFieldName, new LogSegment(logSegment_a.role, segValue))
      })
    }).map(logRecord => {
      val log = logRecord._2
      val count = log("count").value.toString.toInt
      val logContent = log.map(logField => {
        val logFieldName = logField._1
        val logSegment = logField._2
        val fieldValue = logSegment.role match {
          case "avg" => {
            logSegment.value.toString.
Re: [MARKETING] Spark Streaming stateful transformation mapWithState function getting error scala.MatchError: [Ljava.lang.Object]
Hi Iain,

Thanks for your reply. Actually I changed my trackStateFunc and it's working now. For reference, my working code with mapWithState:

def trackStateFunc(batchTime: Time, key: String, value: Option[Array[Long]],
                   state: State[Array[Long]]): Option[(String, Array[Long])] = {
  // Check if state exists
  if (state.exists) {
    val newState: Array[Long] = Array(state.get, value.get).transpose.map(_.sum)
    state.update(newState)  // Set the new state
    Some((key, newState))
  } else {
    val initialState = value.get
    state.update(initialState)  // Set the initial state
    Some((key, initialState))
  }
}

// StateSpec[KeyType, ValueType, StateType, MappedType]
val stateSpec: StateSpec[String, Array[Long], Array[Long], (String, Array[Long])] =
  StateSpec.function(trackStateFunc _)

val state: MapWithStateDStream[String, Array[Long], Array[Long], (String, Array[Long])] =
  parsedStream.mapWithState(stateSpec)

Thanks & Regards,
Vinti
RE: [MARKETING] Spark Streaming stateful transformation mapWithState function getting error scala.MatchError: [Ljava.lang.Object]
Hi Vinti

I don't program in Scala, but I think you've changed the meaning of the current variable – look again at what is state and what is new data.

Assuming it works like the Java API, to use this function to maintain state you must call State.update, while you can return anything, not just the state.

Cheers
Iain
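A minimal sketch of the shape Iain describes, using simple Long counts rather than the poster's Array[Long] state, just to show the two separate responsibilities of the mapping function:

import org.apache.spark.streaming.{State, StateSpec}

// `value` is the new data for this batch, `state` is what was stored previously.
// The return value only determines what flows into the resulting stream; the state
// itself is persisted by calling State.update.
def countByKey(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
  val sum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
  state.update(sum)  // persist the new state
  (key, sum)         // emit whatever is useful downstream
}

val spec = StateSpec.function(countByKey _)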
Spark Streaming stateful transformation mapWithState function getting error scala.MatchError: [Ljava.lang.Object]
Hi All,

I wanted to replace my updateStateByKey function with the mapWithState function (Spark 1.6) to improve the performance of my program. I was following these two documents:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html

but I am getting the error scala.MatchError: [Ljava.lang.Object]

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 71.0 failed 4 times, most recent failure: Lost task 0.3 in stage 71.0 (TID 88, ttsv-lab-vmdb-01.englab.juniper.net): scala.MatchError: [Ljava.lang.Object;@eaf8bc8 (of class [Ljava.lang.Object;)
        at HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
        at HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
        at scala.Option.flatMap(Option.scala:170)
        at HbaseCovrageStream$.HbaseCovrageStream$$tracketStateFunc$1(HbaseCoverageStream_mapwithstate.scala:84)

Reference code:

def trackStateFunc(key: String, value: Option[Array[Long]], current: State[Array[Long]]) = {

  // either we can use this
  // current.update(value)

  value.map(_ :+ current).orElse(Some(current)).flatMap {
    case x: Array[Long] => Try(x.map(BDV(_)).reduce(_ + _).toArray).toOption
    case None => ???
  }
}

val statespec: StateSpec[String, Array[Long], Array[Long], Option[Array[Long]]] =
  StateSpec.function(trackStateFunc _)

val state: MapWithStateDStream[String, Array[Long], Array[Long], Option[Array[Long]]] =
  parsedStream.mapWithState(statespec)

My previous working code, which was using the updateStateByKey function:

val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
  (current: Seq[Array[Long]], prev: Option[Array[Long]]) => {
    prev.map(_ +: current).orElse(Some(current))
      .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
  })

Does anyone have an idea what the issue could be?

Thanks & Regards,
Vinti
Re: sql:Exception in thread "main" scala.MatchError: StringType
Spark only supports one JSON object per line. You need to reformat your file.

--
Best Regards

Jeff Zhang
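A sketch of what that reformatting looks like, assuming the goal is one DataFrame row per person (the file path below is a placeholder): each line of the input file must be a complete, self-contained JSON object.

// programmers.json -- one JSON object per line, no pretty-printing across lines:
//   {"firstName": "Brett", "lastName": "McLaughlin", "email": ""}
//   {"firstName": "Jason", "lastName": "Hunter", "email": ""}
//   {"firstName": "Elliotte", "lastName": "Harold", "email": ""}
val d = sqlContext.read.json("/path/to/programmers.json")
d.printSchema()
d.show()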
sql:Exception in thread "main" scala.MatchError: StringType
(sbt) scala:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setAppName("mytest").setMaster("spark://Master:7077")
    val sc = new SparkContext(conf)
    val sqlContext = new sql.SQLContext(sc)
    val d = sqlContext.read.json("/home/hadoop/2015data_test/Data/Data/100808cb11e9898816ef15fcdde4e1d74cbc0b/Db6Jh2XeQ.json")
    sc.stop()
  }
}

__
after sbt package:

./spark-submit --class "SimpleApp" /home/hadoop/Downloads/sbt/bin/target/scala-2.10/simple-project_2.10-1.0.jar

___
json file:

{
  "programmers": [
    { "firstName": "Brett", "lastName": "McLaughlin", "email": "" },
    { "firstName": "Jason", "lastName": "Hunter", "email": "" },
    { "firstName": "Elliotte", "lastName": "Harold", "email": "" }
  ],
  "authors": [
    { "firstName": "Isaac", "lastName": "Asimov", "genre": "sciencefiction" },
    { "firstName": "Tad", "lastName": "Williams", "genre": "fantasy" },
    { "firstName": "Frank", "lastName": "Peretti", "genre": "christianfiction" }
  ],
  "musicians": [
    { "firstName": "Eric", "lastName": "Clapton", "instrument": "guitar" },
    { "firstName": "Sergei", "lastName": "Rachmaninoff", "instrument": "piano" }
  ]
}

___
Exception in thread "main" scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
        at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58)
        at org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139)

___
why
Re: Reading JSON in Pyspark throws scala.MatchError
You are correct, that was the issue.

On Tue, Oct 20, 2015 at 10:18 PM, Jeff Zhang wrote:

> BTW, I think Json Parser should verify the json format at least when
> inferring the schema of json.
Re: Reading JSON in Pyspark throws scala.MatchError
BTW, I think the JSON parser should verify the JSON format, at least when inferring the schema.
Re: Reading JSON in Pyspark throws scala.MatchError
I think this is due to the JSON file format. DataFrame can only accept a JSON file with one valid record per line. Multiple lines per record is invalid for DataFrame.
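One workaround that avoids hand-reformatting the files, sketched here in Scala rather than PySpark and not taken from this thread (the folder path is a placeholder): read each file as a single string with wholeTextFiles, so a pretty-printed JSON document becomes one record, then let read.json parse the resulting RDD of strings.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("multiline-json"))
val sqlContext = new SQLContext(sc)

// wholeTextFiles yields (path, fileContent) pairs; keeping only the content turns each
// multi-line JSON document into a single string record that read.json can parse.
val oneRecordPerFile = sc.wholeTextFiles("/path/to/json/folder").map { case (_, content) => content }
val pageviews = sqlContext.read.json(oneRecordPerFile)
pageviews.printSchema()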
Re: Reading JSON in Pyspark throws scala.MatchError
Could you create a JIRA to track this bug?
Re: Reading JSON in Pyspark throws scala.MatchError
I got the following when parsing your input with master branch (Python version 2.6.6):
http://pastebin.com/1w8WM3tz

FYI
Reading JSON in Pyspark throws scala.MatchError
Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1. I'm trying to read in a large quantity of json data in a couple of files and I receive a scala.MatchError when I do so. Json, Python and stack trace all shown below. Json: { "dataunit": { "page_view": { "nonce": 438058072, "person": { "user_id": 5846 }, "page": { "url": "http://mysite.com/blog"; } } }, "pedigree": { "true_as_of_secs": 1438627992 } } Python: import pyspark sc = pyspark.SparkContext() sqlContext = pyspark.SQLContext(sc) pageviews = sqlContext.read.json("[Path to folder containing file with above json]") pageviews.collect() Stack Trace: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage 32.0 (TID 133, localhost): scala.MatchError: (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2) at org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49) at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201) at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263) at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) It seems like this issue has been resolved in scala per
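The reply above notes that the same input parses cleanly on the master branch. If upgrading is not an option, one way to sidestep the problem is to skip schema inference altogether: the MatchError pairs a string token with an inferred ArrayType(StructType(),true), which suggests inference settled on an empty struct/array for one of the fields, and supplying the schema explicitly avoids that path. A minimal Scala sketch against the Spark 1.4 reader API (the PySpark reader takes a schema as well); the field types below are guesses from the single sample record and the path is a placeholder.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)   // assumes a SparkContext named sc is in scope

// Field names come from the sample record in the post; LongType for the numeric fields is an assumption.
val pageViewSchema = StructType(Seq(
  StructField("nonce", LongType),
  StructField("person", StructType(Seq(StructField("user_id", LongType)))),
  StructField("page", StructType(Seq(StructField("url", StringType))))
))

val pageSchema = StructType(Seq(
  StructField("dataunit", StructType(Seq(StructField("page_view", pageViewSchema)))),
  StructField("pedigree", StructType(Seq(StructField("true_as_of_secs", LongType))))
))

// Passing the schema up front bypasses JSON schema inference entirely.
val pageviews = sqlContext.read.schema(pageSchema).json("/path/to/json/folder")
pageviews.show()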
Re: SparkSQL : using Hive UDF returning Map throws "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"
I will..that will be great if simple UDF can return complex type. Thanks! On Fri, Jun 5, 2015 at 12:17 AM, Cheng, Hao wrote: > Confirmed, with latest master, we don't support complex data type for Simple > Hive UDF, do you mind file an issue in jira? > > -Original Message- > From: Cheng, Hao [mailto:hao.ch...@intel.com] > Sent: Friday, June 5, 2015 12:35 PM > To: ogoh; user@spark.apache.org > Subject: RE: SparkSQL : using Hive UDF returning Map throws "rror: > scala.MatchError: interface java.util.Map (of class java.lang.Class) > (state=,code=0)" > > Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0? > > -Original Message- > From: ogoh [mailto:oke...@gmail.com] > Sent: Friday, June 5, 2015 10:10 AM > To: user@spark.apache.org > Subject: SparkSQL : using Hive UDF returning Map throws "rror: > scala.MatchError: interface java.util.Map (of class java.lang.Class) > (state=,code=0)" > > > Hello, > I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1). > Some udfs work fine (access array parameter and returning int or string type). > But my udf returning map type throws an error: > "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) > (state=,code=0)" > > I converted the code into Hive's GenericUDF since I worried that using > complex type parameter (array of map) and returning complex type (map) can be > supported in Hive's GenericUDF instead of simple UDF. > But SparkSQL doesn't seem supporting GenericUDF.(error message : Error: > java.lang.IllegalAccessException: Class > org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..). > > Below is my example udf code returning MAP type. > I appreciate any advice. > Thanks > > -- > > public final class ArrayToMap extends UDF { > > public Map evaluate(ArrayList arrayOfString) { > // add code to handle all index problem > > Map map = new HashMap(); > > int count = 0; > for (String element : arrayOfString) { > map.put(count + "", element); > count++; > > } > return map; > } > } > > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional > commands, e-mail: user-h...@spark.apache.org > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: SparkSQL : using Hive UDF returning Map throws "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"
Confirmed, with latest master, we don't support complex data type for Simple Hive UDF, do you mind file an issue in jira? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, June 5, 2015 12:35 PM To: ogoh; user@spark.apache.org Subject: RE: SparkSQL : using Hive UDF returning Map throws "rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)" Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0? -Original Message- From: ogoh [mailto:oke...@gmail.com] Sent: Friday, June 5, 2015 10:10 AM To: user@spark.apache.org Subject: SparkSQL : using Hive UDF returning Map throws "rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)" Hello, I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1). Some udfs work fine (access array parameter and returning int or string type). But my udf returning map type throws an error: "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)" I converted the code into Hive's GenericUDF since I worried that using complex type parameter (array of map) and returning complex type (map) can be supported in Hive's GenericUDF instead of simple UDF. But SparkSQL doesn't seem supporting GenericUDF.(error message : Error: java.lang.IllegalAccessException: Class org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..). Below is my example udf code returning MAP type. I appreciate any advice. Thanks -- public final class ArrayToMap extends UDF { public Map evaluate(ArrayList arrayOfString) { // add code to handle all index problem Map map = new HashMap(); int count = 0; for (String element : arrayOfString) { map.put(count + "", element); count++; } return map; } } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkSQL : using Hive UDF returning Map throws "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"
It is Spark 1.3.1.e (it is AWS release .. I think it is close to Spark 1.3.1 with some bug fixes). My report about GenericUDF not working in SparkSQL is wrong. I tested with open-source GenericUDF and it worked fine. Just my GenericUDF which returns Map type didn't work. Sorry about false reporting. On Thu, Jun 4, 2015 at 9:35 PM, Cheng, Hao wrote: > Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0? > > -Original Message- > From: ogoh [mailto:oke...@gmail.com] > Sent: Friday, June 5, 2015 10:10 AM > To: user@spark.apache.org > Subject: SparkSQL : using Hive UDF returning Map throws "rror: > scala.MatchError: interface java.util.Map (of class java.lang.Class) > (state=,code=0)" > > > Hello, > I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1). > Some udfs work fine (access array parameter and returning int or string type). > But my udf returning map type throws an error: > "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) > (state=,code=0)" > > I converted the code into Hive's GenericUDF since I worried that using > complex type parameter (array of map) and returning complex type (map) can be > supported in Hive's GenericUDF instead of simple UDF. > But SparkSQL doesn't seem supporting GenericUDF.(error message : Error: > java.lang.IllegalAccessException: Class > org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..). > > Below is my example udf code returning MAP type. > I appreciate any advice. > Thanks > > -- > > public final class ArrayToMap extends UDF { > > public Map evaluate(ArrayList arrayOfString) { > // add code to handle all index problem > > Map map = new HashMap(); > > int count = 0; > for (String element : arrayOfString) { > map.put(count + "", element); > count++; > > } > return map; > } > } > > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional > commands, e-mail: user-h...@spark.apache.org > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: SparkSQL : using Hive UDF returning Map throws "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"
Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0? -Original Message- From: ogoh [mailto:oke...@gmail.com] Sent: Friday, June 5, 2015 10:10 AM To: user@spark.apache.org Subject: SparkSQL : using Hive UDF returning Map throws "rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)" Hello, I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1). Some udfs work fine (access array parameter and returning int or string type). But my udf returning map type throws an error: "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)" I converted the code into Hive's GenericUDF since I worried that using complex type parameter (array of map) and returning complex type (map) can be supported in Hive's GenericUDF instead of simple UDF. But SparkSQL doesn't seem supporting GenericUDF.(error message : Error: java.lang.IllegalAccessException: Class org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..). Below is my example udf code returning MAP type. I appreciate any advice. Thanks -- public final class ArrayToMap extends UDF { public Map evaluate(ArrayList arrayOfString) { // add code to handle all index problem Map map = new HashMap(); int count = 0; for (String element : arrayOfString) { map.put(count + "", element); count++; } return map; } } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SparkSQL : using Hive UDF returning Map throws "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"
Hello, I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1). Some udfs work fine (access array parameter and returning int or string type). But my udf returning map type throws an error: "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)" I converted the code into Hive's GenericUDF since I worried that using complex type parameter (array of map) and returning complex type (map) can be supported in Hive's GenericUDF instead of simple UDF. But SparkSQL doesn't seem supporting GenericUDF.(error message : Error: java.lang.IllegalAccessException: Class org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..). Below is my example udf code returning MAP type. I appreciate any advice. Thanks -- public final class ArrayToMap extends UDF { public Map evaluate(ArrayList arrayOfString) { // add code to handle all index problem Map map = new HashMap(); int count = 0; for (String element : arrayOfString) { map.put(count + "", element); count++; } return map; } } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
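Since the thread concludes that simple Hive UDFs cannot return complex types in this Spark version, one possible workaround, not discussed above but sketched here, is to register the same logic as a native Spark SQL UDF; sqlContext.udf.register is available from Spark 1.3 and handles Map return types. The sketch assumes the HiveContext/SQLContext backing the session is in scope as sqlContext.

// The array-to-map logic from the post, registered as a native Spark SQL function.
sqlContext.udf.register("array_to_map", (arr: Seq[String]) =>
  arr.zipWithIndex.map { case (element, idx) => idx.toString -> element }.toMap)

// Once registered, it can be called from SQL on the same context, e.g.:
//   SELECT array_to_map(some_array_column) FROM some_table

Note that a Beeline session only sees the function if it was registered on the context that backs the Thrift server.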
Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)
For more details on my question http://apache-spark-user-list.1001560.n3.nabble.com/How-to-generate-Java-bean-class-for-avro-files-using-spark-avro-project-tp22413.html Thanks, Yamini On Tue, Apr 7, 2015 at 2:23 PM, Yamini Maddirala wrote: > Hi Michael, > > Yes, I did try spark-avro 0.2.0 databricks project. I am using CHD5.3 > which is based on spark 1.2. Hence I'm bound to use spark-avro 0.2.0 > instead of the latest. > > I'm not sure how spark-avro project can help me in this scenario. > > 1. I have JavaDStream of type avro generic record > :JavaDStream [This is the data being read from kafka topics] > 2. I'm able to get JavaSchemaRDD using the avro file like this > final JavaSchemaRDD schemaRDD2 = AvroUtils.avroFile(sqlContext, > "/xyz-Project/trunk/src/main/resources/xyz.avro"); > 3. I don't know how I can apply schema in step 2 to data in step 1. > I chose to do something like this >JavaSchemaRDD schemaRDD = sqlContext.applySchema(genericRecordJavaRDD, > xyz.class); > >Used avro maven plugin to generate xyz class in Java. But this is not > good because avro maven plugin creates a field SCHEMA which is not > supported in applySchema method. > > Please let me know how to deal with this. > > Appreciate your help > > Thanks, > Yamini > > > > > > > > > > > > > On Tue, Apr 7, 2015 at 1:57 PM, Michael Armbrust > wrote: > >> Have you looked at spark-avro? >> >> https://github.com/databricks/spark-avro >> >> On Tue, Apr 7, 2015 at 3:57 AM, Yamini wrote: >> >>> Using spark(1.2) streaming to read avro schema based topics flowing in >>> kafka >>> and then using spark sql context to register data as temp table. Avro >>> maven >>> plugin(1.7.7 version) generates the java bean class for the avro file but >>> includes a field named SCHEMA$ of type org.apache.avro.Schema which is >>> not >>> supported in the JavaSQLContext class[Method : applySchema]. >>> How to auto generate java bean class for the avro file and over come the >>> above mentioned problem. >>> >>> Thanks. >>> >>> >>> >>> >>> - >>> Thanks, >>> Yamini >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >> >
Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)
Hi Michael, Yes, I did try spark-avro 0.2.0 databricks project. I am using CHD5.3 which is based on spark 1.2. Hence I'm bound to use spark-avro 0.2.0 instead of the latest. I'm not sure how spark-avro project can help me in this scenario. 1. I have JavaDStream of type avro generic record :JavaDStream [This is the data being read from kafka topics] 2. I'm able to get JavaSchemaRDD using the avro file like this final JavaSchemaRDD schemaRDD2 = AvroUtils.avroFile(sqlContext, "/xyz-Project/trunk/src/main/resources/xyz.avro"); 3. I don't know how I can apply schema in step 2 to data in step 1. I chose to do something like this JavaSchemaRDD schemaRDD = sqlContext.applySchema(genericRecordJavaRDD, xyz.class); Used avro maven plugin to generate xyz class in Java. But this is not good because avro maven plugin creates a field SCHEMA which is not supported in applySchema method. Please let me know how to deal with this. Appreciate your help Thanks, Yamini On Tue, Apr 7, 2015 at 1:57 PM, Michael Armbrust wrote: > Have you looked at spark-avro? > > https://github.com/databricks/spark-avro > > On Tue, Apr 7, 2015 at 3:57 AM, Yamini wrote: > >> Using spark(1.2) streaming to read avro schema based topics flowing in >> kafka >> and then using spark sql context to register data as temp table. Avro >> maven >> plugin(1.7.7 version) generates the java bean class for the avro file but >> includes a field named SCHEMA$ of type org.apache.avro.Schema which is not >> supported in the JavaSQLContext class[Method : applySchema]. >> How to auto generate java bean class for the avro file and over come the >> above mentioned problem. >> >> Thanks. >> >> >> >> >> - >> Thanks, >> Yamini >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >
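One possible way to connect steps 1 and 2, sketched in Scala rather than the Java API used above and only for flat records: reuse the StructType that spark-avro inferred in step 2 and convert each GenericRecord to a Row by field name, so the Avro-generated bean class (and its SCHEMA$ field) never goes through reflection. genericRecordRDD below stands in for one RDD of GenericRecord taken from the stream (e.g. inside foreachRDD); the table name is made up.

import org.apache.avro.generic.GenericRecord
import org.apache.avro.util.Utf8
import org.apache.spark.sql._   // on the Spark 1.2 branch this exposes Row and the SQL data types

val avroSchema = schemaRDD2.schema   // the StructType inferred by AvroUtils.avroFile in step 2

val rowRDD = genericRecordRDD.map { rec: GenericRecord =>
  Row(avroSchema.fields.map { f =>
    rec.get(f.name) match {
      case u: Utf8 => u.toString   // Avro strings arrive as Utf8, not java.lang.String
      case other   => other        // numbers and booleans pass through as boxed Java types
    }
  }: _*)
}

val pageViews = sqlContext.applySchema(rowRDD, avroSchema)
pageViews.registerTempTable("page_views")

Nested Avro records would need the same conversion applied recursively before this becomes general.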
Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)
Have you looked at spark-avro? https://github.com/databricks/spark-avro On Tue, Apr 7, 2015 at 3:57 AM, Yamini wrote: > Using spark(1.2) streaming to read avro schema based topics flowing in > kafka > and then using spark sql context to register data as temp table. Avro maven > plugin(1.7.7 version) generates the java bean class for the avro file but > includes a field named SCHEMA$ of type org.apache.avro.Schema which is not > supported in the JavaSQLContext class[Method : applySchema]. > How to auto generate java bean class for the avro file and over come the > above mentioned problem. > > Thanks. > > > > > - > Thanks, > Yamini > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)
Using spark(1.2) streaming to read avro schema based topics flowing in kafka and then using spark sql context to register data as temp table. Avro maven plugin(1.7.7 version) generates the java bean class for the avro file but includes a field named SCHEMA$ of type org.apache.avro.Schema which is not supported in the JavaSQLContext class[Method : applySchema]. How to auto generate java bean class for the avro file and over come the above mentioned problem. Thanks. - Thanks, Yamini -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Q about Spark MLlib - Decision tree - scala.MatchError: 2.0 (of class java.lang.Double)
I am working some kind of Spark MLlib Test(Decision Tree) and I used IRIS data from Cran-R package. Original IRIS Data is not a good format for Spark MLlib. so I changed data format(change data format and features's location) When I ran sample Spark MLlib code for DT, I met the error like below How can i solve this error? == 14/12/15 14:27:30 ERROR TaskSetManager: Task 21.0:0 failed 4 times; aborting job 14/12/15 14:27:30 INFO TaskSchedulerImpl: Cancelling stage 21 14/12/15 14:27:30 INFO DAGScheduler: Failed to run aggregate at DecisionTree.scala:657 14/12/15 14:27:30 INFO TaskSchedulerImpl: Stage 21 was cancelled 14/12/15 14:27:30 WARN TaskSetManager: Loss was due to org.apache.spark.TaskKilledException org.apache.spark.TaskKilledException at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/12/15 14:27:30 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all completed, from pool org.apache.spark.SparkException: Job aborted due to stage failure: Task 21.0:0 failed 4 times, most recent failure: Exception failure in TID 34 on host krbda1anode01.kr.test.com: scala.MatchError: 2.0 (of class java.lang.Double) org.apache.spark.mllib.tree.DecisionTree$.classificationBinSeqOp$1(DecisionTree.scala:568) org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:623) org.apache.spark.mllib.tree.DecisionTree$$anonfun$4.apply(DecisionTree.scala:657) org.apache.spark.mllib.tree.DecisionTree$$anonfun$4.apply(DecisionTree.scala:657) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838) org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838) org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
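A MatchError on a label value of 2.0 inside classificationBinSeqOp typically means the tree is being trained as a binary classifier while the data contains a third class: MLlib expects class labels to be exactly 0.0, 1.0, ..., numClasses - 1. For the three iris species that means recoding the labels to 0.0/1.0/2.0 and requesting three classes explicitly. A sketch against the Spark 1.1+ API; the file path and CSV layout (label first, then the four features) are assumptions.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Assumed layout: label already recoded to 0.0, 1.0 or 2.0, followed by the four iris features.
val data = sc.textFile("/path/to/iris.csv").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}

// Arguments: input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins.
// numClasses = 3 covers the three species; a label outside 0..2 triggers exactly this MatchError.
val model = DecisionTree.trainClassifier(data, 3, Map[Int, Int](), "gini", 5, 32)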
Re: scala.MatchError on SparkSQL when creating ArrayType of StructType
All values in Hive are always nullable, though you should still not be seeing this error. It should be addressed by this patch: https://github.com/apache/spark/pull/3150 On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren wrote: > Hi, > > I am using SparkSQL on 1.1.0 branch. > > The following code leads to a scala.MatchError > at > > org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) > > val scm = StructType(inputRDD.schema.fields.init :+ > StructField("list", > ArrayType( > StructType( > Seq(StructField("date", StringType, nullable = false), > StructField("nbPurchase", IntegerType, nullable = false, > nullable = false)) > > // purchaseRDD is RDD[sql.ROW] whose schema is corresponding to scm. It is > transformed from inputRDD > val schemaRDD = hiveContext.applySchema(purchaseRDD, scm) > schemaRDD.registerTempTable("t_purchase") > > Here's the stackTrace: > scala.MatchError: ArrayType(StructType(List(StructField(date,StringType, > true ), StructField(n_reachat,IntegerType, true ))),true) (of class > org.apache.spark.sql.catalyst.types.ArrayType) > at > > org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) > at > org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) > at > org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) > at > > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) > at > > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66) > at > > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org > $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) > at > > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) > at > > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > > The strange thing is that nullable of date and nbPurchase field are set to > true while it were false in the code. If I set both to true, it works. But, > in fact, they should not be nullable. > > Here's what I find at Cast.scala:247 on 1.1.0 branch > > private[this] lazy val cast: Any => Any = dataType match { > case StringType => castToString > case BinaryType => castToBinary > case DecimalType => castToDecimal > case TimestampType => castToTimestamp > case BooleanType => castToBoolean > case ByteType => castToByte > case ShortType => castToShort > case IntegerType => castToInt > case FloatType => castToFloat > case LongType => castToLong > case DoubleType => castToDouble > } > > Any idea? Thank you. 
> > Hao > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-on-SparkSQL-when-creating-ArrayType-of-StructType-tp20459.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
scala.MatchError on SparkSQL when creating ArrayType of StructType
Hi, I am using SparkSQL on 1.1.0 branch. The following code leads to a scala.MatchError at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) val scm = StructType(inputRDD.schema.fields.init :+ StructField("list", ArrayType( StructType( Seq(StructField("date", StringType, nullable = false), StructField("nbPurchase", IntegerType, nullable = false, nullable = false)) // purchaseRDD is RDD[sql.ROW] whose schema is corresponding to scm. It is transformed from inputRDD val schemaRDD = hiveContext.applySchema(purchaseRDD, scm) schemaRDD.registerTempTable("t_purchase") Here's the stackTrace: scala.MatchError: ArrayType(StructType(List(StructField(date,StringType, true ), StructField(n_reachat,IntegerType, true ))),true) (of class org.apache.spark.sql.catalyst.types.ArrayType) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) The strange thing is that nullable of date and nbPurchase field are set to true while it were false in the code. If I set both to true, it works. But, in fact, they should not be nullable. Here's what I find at Cast.scala:247 on 1.1.0 branch private[this] lazy val cast: Any => Any = dataType match { case StringType => castToString case BinaryType => castToBinary case DecimalType => castToDecimal case TimestampType => castToTimestamp case BooleanType => castToBoolean case ByteType => castToByte case ShortType => castToShort case IntegerType => castToInt case FloatType => castToFloat case LongType => castToLong case DoubleType => castToDouble } Any idea? Thank you. Hao -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-on-SparkSQL-when-creating-ArrayType-of-StructType-tp20459.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
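For reference, here is the schema from the post with the parentheses balanced and both nested fields switched to nullable = true, the combination reported above as working on the 1.1 branch until the patch linked in the reply above is merged. This is only a sketch of that workaround; ArrayType's second argument is containsNull in this API.

import org.apache.spark.sql._   // Spark 1.1-era imports: StructType, StructField, ArrayType, StringType, IntegerType

val scm = StructType(inputRDD.schema.fields.init :+
  StructField("list",
    ArrayType(
      StructType(Seq(
        StructField("date", StringType, nullable = true),
        StructField("nbPurchase", IntegerType, nullable = true))),
      containsNull = true),
    nullable = true))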
RE: scala.MatchError
Hi, Do you mean with java, I shouldn’t have Issue class as a property (attribute) in Instrument Class? Ex : Class Issue { Int a; } Class Instrument { Issue issue; } How about scala? Does it support such user defined datatypes in classes Case class Issue . case class Issue( a:Int = 0) case class Instrument(issue: Issue = null) -Naveen From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, November 12, 2014 12:09 AM To: Xiangrui Meng Cc: Naveen Kumar Pokala; user@spark.apache.org Subject: Re: scala.MatchError Xiangrui is correct that is must be a java bean, also nested classes are not yet supported in java. On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng mailto:men...@gmail.com>> wrote: I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala mailto:npok...@spcapitaliq.com>> wrote: > Hi, > > > > This is my Instrument java constructor. > > > > public Instrument(Issue issue, Issuer issuer, Issuing issuing) { > > super(); > > this.issue = issue; > > this.issuer = issuer; > > this.issuing = issuing; > > } > > > > > > I am trying to create javaschemaRDD > > > > JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, > Instrument.class); > > > > Remarks: > > > > > > Instrument, Issue, Issuer, Issuing all are java classes > > > > distData is holding List< Instrument > > > > > > > I am getting the following error. > > > > > > > > Exception in thread "Driver" java.lang.reflect.InvocationTargetException > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:483) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) > > Caused by: scala.MatchError: class sample.spark.test.Issue (of class > java.lang.Class) > > at > org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) > > at > org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) > > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > > at > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > > at > org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) > > at > org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) > > at sample.spark.test.SparkJob.main(SparkJob.java:33) > > ... 5 more > > > > Please help me. > > > > Regards, > > Naveen. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org> For additional commands, e-mail: user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>
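On the Scala side of the question above: nested case classes are handled by the reflection-based schema inference in the Scala API, so no bean class is needed there. A small sketch against the Spark 1.1 Scala API, assuming sc and sqlContext are in scope (for example in the shell); a non-null default is used for the nested field so every row carries a value.

case class Issue(a: Int = 0)
case class Instrument(issue: Issue = Issue())

// The implicit createSchemaRDD conversion infers the nested schema from the case classes.
import sqlContext.createSchemaRDD

val instruments = sc.parallelize(Seq(Instrument(Issue(1)), Instrument(Issue(2))))
instruments.registerTempTable("instruments")
sqlContext.sql("SELECT issue.a FROM instruments").collect()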
Re: scala.MatchError
Xiangrui is correct that is must be a java bean, also nested classes are not yet supported in java. On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng wrote: > I think you need a Java bean class instead of a normal class. See > example here: > http://spark.apache.org/docs/1.1.0/sql-programming-guide.html > (switch to the java tab). -Xiangrui > > On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala > wrote: > > Hi, > > > > > > > > This is my Instrument java constructor. > > > > > > > > public Instrument(Issue issue, Issuer issuer, Issuing issuing) { > > > > super(); > > > > this.issue = issue; > > > > this.issuer = issuer; > > > > this.issuing = issuing; > > > > } > > > > > > > > > > > > I am trying to create javaschemaRDD > > > > > > > > JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, > > Instrument.class); > > > > > > > > Remarks: > > > > > > > > > > > > Instrument, Issue, Issuer, Issuing all are java classes > > > > > > > > distData is holding List< Instrument > > > > > > > > > > > > > I am getting the following error. > > > > > > > > > > > > > > > > Exception in thread "Driver" java.lang.reflect.InvocationTargetException > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > > > at > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > > > at > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > > > at java.lang.reflect.Method.invoke(Method.java:483) > > > > at > > > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) > > > > Caused by: scala.MatchError: class sample.spark.test.Issue (of class > > java.lang.Class) > > > > at > > > org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) > > > > at > > > org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) > > > > at > > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > > > at > > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > > > at > > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > > > > at > > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > > > at > > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > > > > at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > > > > at > > > org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) > > > > at > > > org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) > > > > at sample.spark.test.SparkJob.main(SparkJob.java:33) > > > > ... 5 more > > > > > > > > Please help me. > > > > > > > > Regards, > > > > Naveen. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: scala.MatchError
I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala wrote: > Hi, > > > > This is my Instrument java constructor. > > > > public Instrument(Issue issue, Issuer issuer, Issuing issuing) { > > super(); > > this.issue = issue; > > this.issuer = issuer; > > this.issuing = issuing; > > } > > > > > > I am trying to create javaschemaRDD > > > > JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, > Instrument.class); > > > > Remarks: > > > > > > Instrument, Issue, Issuer, Issuing all are java classes > > > > distData is holding List< Instrument > > > > > > > I am getting the following error. > > > > > > > > Exception in thread "Driver" java.lang.reflect.InvocationTargetException > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:483) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) > > Caused by: scala.MatchError: class sample.spark.test.Issue (of class > java.lang.Class) > > at > org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) > > at > org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) > > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > > at > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > > at > org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) > > at > org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) > > at sample.spark.test.SparkJob.main(SparkJob.java:33) > > ... 5 more > > > > Please help me. > > > > Regards, > > Naveen. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
scala.MatchError
Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer; this.issuing = issuing; } I am trying to create javaschemaRDD JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, Instrument.class); Remarks: Instrument, Issue, Issuer, Issuing all are java classes distData is holding List< Instrument > I am getting the following error. Exception in thread "Driver" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: scala.MatchError: class sample.spark.test.Issue (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at sample.spark.test.SparkJob.main(SparkJob.java:33) ... 5 more Please help me. Regards, Naveen.
RE: scala.MatchError: class java.sql.Timestamp
I have created an issue for this https://issues.apache.org/jira/browse/SPARK-4003 From: Cheng, Hao Sent: Monday, October 20, 2014 9:20 AM To: Ge, Yao (Y.); Wang, Daoyuan; user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp Seems bugs in the JavaSQLContext.getSchema(), which doesn't enumerate all of the data types supported by Catalyst. From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 11:44 PM To: Wang, Daoyuan; user@spark.apache.org<mailto:user@spark.apache.org> Subject: RE: scala.MatchError: class java.sql.Timestamp scala.MatchError: class java.sql.Timestamp (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Sunday, October 19, 2014 10:31 AM To: Ge, Yao (Y.); user@spark.apache.org<mailto:user@spark.apache.org> Subject: RE: scala.MatchError: class java.sql.Timestamp Can you provide the 
exception stack? Thanks, Daoyuan From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 10:17 PM To: user@spark.apache.org<mailto:user@spark.apache.org> Subject: scala.MatchError: class java.sql.Timestamp I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with Timestamp field. Anything wrong with my code below? public static class Event implements Serializable { private String name; private Timestamp time; public String getName() { return name; }
RE: scala.MatchError: class java.sql.Timestamp
Seems bugs in the JavaSQLContext.getSchema(), which doesn't enumerate all of the data types supported by Catalyst. From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 11:44 PM To: Wang, Daoyuan; user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp scala.MatchError: class java.sql.Timestamp (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Sunday, October 19, 2014 10:31 AM To: Ge, Yao (Y.); user@spark.apache.org<mailto:user@spark.apache.org> Subject: RE: scala.MatchError: class java.sql.Timestamp Can you provide the exception stack? Thanks, Daoyuan From: Ge, Yao (Y.) 
[mailto:y...@ford.com] Sent: Sunday, October 19, 2014 10:17 PM To: user@spark.apache.org<mailto:user@spark.apache.org> Subject: scala.MatchError: class java.sql.Timestamp I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with Timestamp field. Anything wrong with my code below? public static class Event implements Serializable { private String name; private Timestamp time; public String getName() { return name; } public void setName(String name) { this.name = name; } public Timestamp getTime() {
RE: scala.MatchError: class java.sql.Timestamp
scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
        at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
        at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
        at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
        at com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
        at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Sunday, October 19, 2014 10:31 AM To: Ge, Yao (Y.); user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack? Thanks, Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 10:17 PM To: user@spark.apache.org Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with Timestamp field. Anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Timestamp getTime() { return time; }
    public void setTime(Timestamp time) { this.time = time; }
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc
RE: scala.MatchError: class java.sql.Timestamp
Can you provide the exception stack? Thanks, Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 10:17 PM To: user@spark.apache.org Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with Timestamp field. Anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Timestamp getTime() { return time; }
    public void setTime(Timestamp time) { this.time = time; }
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc = new JavaSparkContext("local", "timestamp");
    String[] data = {"1,2014-01-01", "2,2014-02-01"};
    JavaRDD<String> input = sc.parallelize(Arrays.asList(data));
    JavaRDD<Event> events = input.map(new Function<String, Event>() {
        public Event call(String arg0) throws Exception {
            String[] c = arg0.split(",");
            Event e = new Event();
            e.setName(c[0]);
            DateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            e.setTime(new Timestamp(fmt.parse(c[1]).getTime()));
            return e;
        }
    });
    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaSchemaRDD schemaEvent = sqlCtx.applySchema(events, Event.class);
    schemaEvent.registerTempTable("event");
    sc.stop();
}
scala.MatchError: class java.sql.Timestamp
I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with Timestamp field. Anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Timestamp getTime() { return time; }
    public void setTime(Timestamp time) { this.time = time; }
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc = new JavaSparkContext("local", "timestamp");
    String[] data = {"1,2014-01-01", "2,2014-02-01"};
    JavaRDD<String> input = sc.parallelize(Arrays.asList(data));
    JavaRDD<Event> events = input.map(new Function<String, Event>() {
        public Event call(String arg0) throws Exception {
            String[] c = arg0.split(",");
            Event e = new Event();
            e.setName(c[0]);
            DateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            e.setTime(new Timestamp(fmt.parse(c[1]).getTime()));
            return e;
        }
    });
    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaSchemaRDD schemaEvent = sqlCtx.applySchema(events, Event.class);
    schemaEvent.registerTempTable("event");
    sc.stop();
}
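A side note for archive readers, not from the thread itself: the stack trace above fails inside JavaSQLContext.getSchema, which suggests the Java-bean reflection path in this Spark version simply has no case for java.sql.Timestamp. Below is a minimal sketch of one possible workaround that carries the time as epoch milliseconds so no Timestamp mapping is needed during schema inference; it uses the Scala API, the Event and column names are illustrative, and the createSchemaRDD import is assumed to be the Spark 1.1-era idiom.

import java.text.SimpleDateFormat
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical workaround: store the time as a Long (epoch millis)
case class Event(name: String, timeMillis: Long)

object TimestampWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "timestamp")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD[Product] => SchemaRDD (Spark 1.1 API)

    val events = sc.parallelize(Seq("1,2014-01-01", "2,2014-02-01")).map { line =>
      val Array(name, day) = line.split(",")
      // SimpleDateFormat is not thread-safe, so build one per record in this sketch
      Event(name, new SimpleDateFormat("yyyy-MM-dd").parse(day).getTime)
    }

    events.registerTempTable("event")
    sqlContext.sql("SELECT name, timeMillis FROM event").collect().foreach(println)
    sc.stop()
  }
}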
Re: Are "scala.MatchError" messages a problem?
Jeremy, On Mon, Jun 9, 2014 at 10:22 AM, Jeremy Lee wrote: >> When you use match, the match must be exhaustive. That is, a match error >> is thrown if the match fails. > > Ahh, right. That makes sense. Scala is applying its "strong typing" rules > here instead of "no ceremony"... but isn't the idea that type errors should > get picked up at compile time? I suppose the compiler can't tell there's not > complete coverage, but it seems strange to throw that at runtime when it is > literally the 'default case'. You can use subclasses of "sealed traits" to get a compiler warning for non-exhaustive matches: http://stackoverflow.com/questions/11203268/what-is-a-sealed-trait I don't know if it can be applied for regular expression matching, though... Tobias
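A small sketch of the sealed-trait idea above; the Command/Add/Remove names are illustrative, not from the thread.

// With a sealed trait the compiler knows every subtype, so a non-exhaustive
// match is reported at compile time instead of only failing at runtime
// with scala.MatchError.
sealed trait Command
case class Add(word: String) extends Command
case class Remove(word: String) extends Command

def describe(c: Command): String = c match {
  case Add(w) => s"adding $w"
  // compiler warning: match may not be exhaustive; it would fail on: Remove(_)
}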
Re: Are "scala.MatchError" messages a problem?
On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath wrote: > When you use match, the match must be exhaustive. That is, a match error > is thrown if the match fails. Ahh, right. That makes sense. Scala is applying its "strong typing" rules here instead of "no ceremony"... but isn't the idea that type errors should get picked up at compile time? I suppose the compiler can't tell there's not complete coverage, but it seems strange to throw that at runtime when it is literally the 'default case'. I think I need a good "Scala Programming Guide"... any suggestions? I've read and watch the usual resources and videos, but it feels like a shotgun approach and I've clearly missed a lot. On Mon, Jun 9, 2014 at 3:26 AM, Mark Hamstra wrote: > > And you probably want to push down that filter into the cluster -- > collecting all of the elements of an RDD only to not use or filter out some > of them isn't an efficient usage of expensive (at least in terms of > time/performance) network resources. There may also be a good opportunity > to use the partial function form of collect to push even more processing > into the cluster. > I almost certainly do :-) And I am really looking forward to spending time optimizing the code, but I keep getting caught up on deployment issues, uberjars, missing /mnt/spark directories, only being able to submit from the master, and being thoroughly confused about sample code from three versions ago. I'm even thinking of learning maven, if it means I never have to use sbt again. Does it mean that? -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Are "scala.MatchError" messages a problem?
> > The solution is either to add a default case which does nothing, or > probably better to add a .filter such that you filter out anything that's > not a command before matching. > And you probably want to push down that filter into the cluster -- collecting all of the elements of an RDD only to not use or filter out some of them isn't an efficient usage of expensive (at least in terms of time/performance) network resources. There may also be a good opportunity to use the partial function form of collect to push even more processing into the cluster. On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath wrote: > When you use match, the match must be exhaustive. That is, a match error > is thrown if the match fails. > > That's why you usually handle the default case using "case _ => ..." > > Here it looks like your taking the text of all statuses - which means not > all of them will be commands... Which means your match will not be > exhaustive. > > The solution is either to add a default case which does nothing, or > probably better to add a .filter such that you filter out anything that's > not a command before matching. > > Just looking at it again it could also be that you take x => x._2._1 ... > What type is that? Should it not be a Seq if you're joining, in which case > the match will also fail... > > Hope this helps. > — > Sent from Mailbox <https://www.dropbox.com/mailbox> > > > On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee > wrote: > >> >> I shut down my first (working) cluster and brought up a fresh one... and >> It's been a bit of a horror and I need to sleep now. Should I be worried >> about these errors? Or did I just have the old log4j.config tuned so I >> didn't see them? >> >> I >> >> 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job >> streaming job 1402245172000 ms.2 >> scala.MatchError: 0101-01-10 (of class java.lang.String) >> at >> SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218) >> at >> SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217) >> at >> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) >> at >> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) >> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217) >> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214) >> at >> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) >> at >> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) >> at >> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) >> at >> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) >> at >> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) >> at scala.util.Try$.apply(Try.scala:161) >> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) >> at >> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >> at java.lang.Thread.run(Thread.java:744) >> >> >> The error comes from this code, which seemed like a sensible way to match >> things: >> (The "case cmd_plus(w)" statement is generating the error,) >> >> val cmd_plus = """[+]([\w]+)""".r >> val cmd_minus = """[-]([\w]+)""".r >> // find command user tweets >> val commands = stream.map( >> status => ( 
status.getUser().getId(), status.getText() ) >> ).foreachRDD(rdd => { >> rdd.join(superusers).map( >> x => x._2._1 >> ).collect().foreach{ cmd => { >> 218: cmd match { >> case cmd_plus(w) => { >> ... >> } case cmd_minus(w) => { ... } } }} }) >> >> It seems a bit excessive for scala to throw exceptions because a regex >> didn't match. Something feels wrong. >> > >
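A sketch of the "partial function form of collect" suggestion above: RDD.collect(PartialFunction) filters and transforms on the executors in one pass, and elements that match no case are simply dropped, so there is no MatchError and no premature driver-side collect(). The tweet/command names are illustrative.

import org.apache.spark.rdd.RDD

val cmd_plus  = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

// Only matching tweets survive; the result stays distributed until you
// explicitly bring it back to the driver.
def extractCommands(tweets: RDD[String]): RDD[(String, String)] =
  tweets.collect {
    case cmd_plus(w)  => ("add", w)
    case cmd_minus(w) => ("remove", w)
  }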
Re: Are "scala.MatchError" messages a problem?
When you use match, the match must be exhaustive. That is, a match error is thrown if the match fails. That's why you usually handle the default case using "case _ => ..." Here it looks like your taking the text of all statuses - which means not all of them will be commands... Which means your match will not be exhaustive. The solution is either to add a default case which does nothing, or probably better to add a .filter such that you filter out anything that's not a command before matching. Just looking at it again it could also be that you take x => x._2._1 ... What type is that? Should it not be a Seq if you're joining, in which case the match will also fail... Hope this helps. — Sent from Mailbox On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee wrote: > I shut down my first (working) cluster and brought up a fresh one... and > It's been a bit of a horror and I need to sleep now. Should I be worried > about these errors? Or did I just have the old log4j.config tuned so I > didn't see them? > I > 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming > job 1402245172000 ms.2 > scala.MatchError: 0101-01-10 (of class java.lang.String) > at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218) > at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217) > at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > The error comes from this code, which seemed like a sensible way to match > things: > (The "case cmd_plus(w)" statement is generating the error,) > val cmd_plus = """[+]([\w]+)""".r > val cmd_minus = """[-]([\w]+)""".r > // find command user tweets > val commands = stream.map( > status => ( status.getUser().getId(), status.getText() ) > ).foreachRDD(rdd => { > rdd.join(superusers).map( > x => x._2._1 > ).collect().foreach{ cmd => { > 218: cmd match { > case cmd_plus(w) => { > ... > } case cmd_minus(w) => { ... } } }} }) > It seems a bit excessive for scala to throw exceptions because a regex > didn't match. Something feels wrong.
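A minimal sketch of the two fixes suggested above, a pre-filter plus a catch-all default case, shown on a plain Seq so it runs without a cluster; the sample strings are illustrative.

val cmd_plus  = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

val texts = Seq("+follow", "-mute", "just a normal tweet")

val commands = texts
  .filter(t => t.startsWith("+") || t.startsWith("-")) // drop non-commands before matching
  .map {
    case cmd_plus(w)  => ("add", w)
    case cmd_minus(w) => ("remove", w)
    case other        => ("ignore", other)              // default case: no MatchError
  }
// commands == List(("add", "follow"), ("remove", "mute"))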
Re: Are "scala.MatchError" messages a problem?
A match clause needs to cover all the possibilities, and not matching any regex is a distinct possibility. It's not really like 'switch' because it requires this and I think that has benefits, like being able to interpret a match as something with a type. I think it's all in order, but it's more of a Scala thing than Spark thing. You just need a "case _ => ..." to cover anything else. (You can avoid two extra levels of scope with .foreach(_ match { ... }) BTW) On Sun, Jun 8, 2014 at 12:44 PM, Jeremy Lee wrote: > > I shut down my first (working) cluster and brought up a fresh one... and > It's been a bit of a horror and I need to sleep now. Should I be worried > about these errors? Or did I just have the old log4j.config tuned so I > didn't see them? > > I > > 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming > job 1402245172000 ms.2 > scala.MatchError: 0101-01-10 (of class java.lang.String) > at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218) > at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217) > at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > > > The error comes from this code, which seemed like a sensible way to match > things: > (The "case cmd_plus(w)" statement is generating the error,) > > val cmd_plus = """[+]([\w]+)""".r > val cmd_minus = """[-]([\w]+)""".r > // find command user tweets > val commands = stream.map( > status => ( status.getUser().getId(), status.getText() ) > ).foreachRDD(rdd => { > rdd.join(superusers).map( > x => x._2._1 > ).collect().foreach{ cmd => { > 218: cmd match { > case cmd_plus(w) => { > ... > } case cmd_minus(w) => { ... } } }} }) > > It seems a bit excessive for scala to throw exceptions because a regex > didn't match. Something feels wrong.
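A sketch of the foreach tip above, matching directly in the function literal (a { case ... } block is equivalent to the _ match { ... } form mentioned, with one less level of nesting); the regexes are repeated so the snippet stands alone.

val cmd_plus  = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

// The wildcard case keeps non-command strings from throwing scala.MatchError.
Seq("+follow", "-mute", "hello world").foreach {
  case cmd_plus(w)  => println(s"plus:  $w")
  case cmd_minus(w) => println(s"minus: $w")
  case _            => () // not a command; ignore
}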
Are "scala.MatchError" messages a problem?
I shut down my first (working) cluster and brought up a fresh one... and It's been a bit of a horror and I need to sleep now. Should I be worried about these errors? Or did I just have the old log4j.config tuned so I didn't see them?

I

14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming job 1402245172000 ms.2
scala.MatchError: 0101-01-10 (of class java.lang.String)
        at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
        at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
        at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

The error comes from this code, which seemed like a sensible way to match things:
(The "case cmd_plus(w)" statement is generating the error,)

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r
// find command user tweets
val commands = stream.map(
    status => ( status.getUser().getId(), status.getText() )
).foreachRDD(rdd => {
    rdd.join(superusers).map(
        x => x._2._1
    ).collect().foreach{ cmd => {
218:    cmd match {
            case cmd_plus(w) => {
                ...
            }
            case cmd_minus(w) => {
                ...
            }
        }
    }}
})

It seems a bit excessive for scala to throw exceptions because a regex didn't match. Something feels wrong.