Regarding KryoSerialization in Spark
Hi, As per the code, KryoSerialization uses the writeClassAndObject method, which internally calls the writeClass method, which will write the class of the object during serialization. As per the documentation on the Spark tuning page, registering the class will avoid that. Am I missing something, or is there an issue with the documentation? Thanks, Twinkle
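For reference, registration looks roughly like the sketch below on the application side (a minimal sketch; MetricEvent is a made-up stand-in for whatever class is being serialized):

import org.apache.spark.SparkConf

// Hypothetical application class standing in for whatever gets shipped around.
case class MetricEvent(name: String, value: Double)

val conf = new SparkConf()
  .setAppName("kryo-registration-example")
  // Use Kryo instead of the default Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration assigns the class a small integer id, so writeClass emits
  // that id instead of the full class name for every serialized object.
  .registerKryoClasses(Array(classOf[MetricEvent]))
  // Optional: fail fast when an unregistered class slips through.
  .set("spark.kryo.registrationRequired", "true")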
RE: Is SQLContext thread-safe?
Hi, in a test on SparkSQL 1.3.0, multiple threads are doing selects on the same SQLContext instance, but the exception below is thrown, so it looks like SQLContext is NOT thread safe? I think this is not the desired behavior.

==

java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier select found

select id ,ext.d from UNIT_TEST
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)

-----Original Message-----
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Monday, March 02, 2015 9:05 PM
To: Haopu Wang; user
Subject: RE: Is SQLContext thread-safe?

Yes it is thread safe, at least it's supposed to be.

-----Original Message-----
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Monday, March 2, 2015 4:43 PM
To: user
Subject: Is SQLContext thread-safe?

Hi, is it safe to use the same SQLContext to do Select operations in different threads at the same time? Thank you very much!
Re: Is SQLContext thread-safe?
Actually this is a SQL parse exception, are you sure your SQL is right?

Sent from my iPhone

On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote:

Hi, in a test on SparkSQL 1.3.0, multiple threads are doing selects on the same SQLContext instance, but the exception below is thrown, so it looks like SQLContext is NOT thread safe? I think this is not the desired behavior.

==

java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier select found

select id ,ext.d from UNIT_TEST
^
at scala.sys.package$.error(package.scala:27)
...
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)

-----Original Message-----
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Monday, March 02, 2015 9:05 PM
To: Haopu Wang; user
Subject: RE: Is SQLContext thread-safe?

Yes it is thread safe, at least it's supposed to be.

-----Original Message-----
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Monday, March 2, 2015 4:43 PM
To: user
Subject: Is SQLContext thread-safe?

Hi, is it safe to use the same SQLContext to do Select operations in different threads at the same time? Thank you very much!
Re: withColumn is very slow with datasets with large number of columns
I have reported the issue on JIRA: https://issues.apache.org/jira/browse/SPARK-7276

On Thu, Apr 30, 2015 at 4:36 PM, alexandre Clement a.p.clem...@gmail.com wrote:

Hi all, I'm experiencing a serious performance problem when using withColumn on a dataset with a large number of columns. It is very slow: on a dataset with 100 columns it takes a few seconds. The code snippet demonstrates the problem.

val custs = Seq(
  Row(1, "Bob", 21, 80.5),
  Row(2, "Bobby", 21, 80.5),
  Row(3, "Jean", 21, 80.5),
  Row(4, "Fatime", 21, 80.5)
)

var fields = List(
  StructField("id", IntegerType, true),
  StructField("a", IntegerType, true),
  StructField("b", StringType, true),
  StructField("target", DoubleType, false))
val schema = StructType(fields)

var rdd = sc.parallelize(custs)
var df = sqlContext.createDataFrame(rdd, schema)

for (i <- 1 to 200) {
  val now = System.currentTimeMillis
  df = df.withColumn("a_new_col_" + i, df("a") + i)
  println(s"$i -> " + (System.currentTimeMillis - now))
}

df.show()
withColumn is very slow with datasets with large number of columns
Hi all, I'm experiencing a serious performance problem when using withColumn on a dataset with a large number of columns. It is very slow: on a dataset with 100 columns it takes a few seconds. The code snippet demonstrates the problem.

val custs = Seq(
  Row(1, "Bob", 21, 80.5),
  Row(2, "Bobby", 21, 80.5),
  Row(3, "Jean", 21, 80.5),
  Row(4, "Fatime", 21, 80.5)
)

var fields = List(
  StructField("id", IntegerType, true),
  StructField("a", IntegerType, true),
  StructField("b", StringType, true),
  StructField("target", DoubleType, false))
val schema = StructType(fields)

var rdd = sc.parallelize(custs)
var df = sqlContext.createDataFrame(rdd, schema)

for (i <- 1 to 200) {
  val now = System.currentTimeMillis
  df = df.withColumn("a_new_col_" + i, df("a") + i)
  println(s"$i -> " + (System.currentTimeMillis - now))
}

df.show()
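If the cost comes from re-analyzing a growing query plan once per withColumn call (an assumption, not something verified here), one workaround is to build all of the derived columns up front and add them with a single select. A sketch against the same df as above:

// Build the 200 derived columns first instead of chaining withColumn.
val newCols = (1 to 200).map(i => (df("a") + i).as("a_new_col_" + i))

// One select, so the plan is only rebuilt once.
val wide = df.select(df.columns.map(df(_)) ++ newCols: _*)
wide.show()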
Drop column/s in DataFrame
Hi All: Is there any plan to add drop column/s functionality to the data frame? One can use the select function to do so, but I find that tedious when only one or two columns in a large dataframe are to be dropped. Pandas has this functionality, which I find handy when constructing feature vectors from a DataFrame, allowing me to carry only useful information forward. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html Rakesh
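Until a built-in drop exists, a small helper along these lines avoids spelling out every kept column (a sketch against the 1.3 DataFrame API; dropColumns is a made-up name, not a Spark method):

import org.apache.spark.sql.DataFrame

// Keep every column except the ones named.
def dropColumns(df: DataFrame, names: String*): DataFrame = {
  val keep = df.columns.filterNot(names.contains)
  df.select(keep.map(df(_)): _*)
}

// e.g. carry everything forward except two scratch columns:
// val features = dropColumns(df, "raw_id", "debug_flag")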
Re: Drop column/s in DataFrame
I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280 Would you like to give it a shot?

On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com wrote:

Hi All: Is there any plan to add drop column/s functionality to the data frame? One can use the select function to do so, but I find that tedious when only one or two columns in a large dataframe are to be dropped. Pandas has this functionality, which I find handy when constructing feature vectors from a DataFrame, allowing me to carry only useful information forward. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html Rakesh
Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream
Cody Koeninger-2 wrote:
What's your schema for the offset table, and what's the definition of writeOffset?

The schema is the same as the one in your post: topic | partition | offset

The writeOffset is nearly identical:

def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
  logWarning(Thread.currentThread().toString + " writeOffset: " + osr)
  if (osr == null) {
    logWarning("no offset provided")
    return
  }
  val updated =
    sql"""update txn_offsets set off = ${osr.untilOffset}
          where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}"""
      .update.apply()
  if (updated != 1) {
    throw new Exception(
      Thread.currentThread().toString +
        s" failed to write offset: ${osr.topic},${osr.partition},${osr.fromOffset}->${osr.untilOffset}")
  } else {
    logWarning(Thread.currentThread().toString + " offsets updated to " + osr.untilOffset)
  }
}

Cody Koeninger-2 wrote:
What key are you reducing on? Maybe I'm misreading the code, but it looks like the per-partition offset is part of the key. If that's true then you could just do your reduction on each partition, rather than after the fact on the whole stream.

Yes, the key is a tuple comprised of a case class called Key and the partition's OffsetRange. We piggybacked the OffsetRange in this way so it would be available within the scope of the partition. I have tried moving the reduceByKey from the end of the .transform block into the partition level (at the end of the mapPartitionsWithIndex block.) This is what you're suggesting, yes? The results didn't correct the offset update behavior; they still get out of sync pretty quickly.

Some details: I'm using the kafka-console-producer.sh tool to drive the process, calling it three or four times in succession and piping in 100-1000 messages in each call. Once all the messages have been processed I wait for the output of the printOffsets method to stop changing and compare it to the txn_offsets table. (When no data is getting processed the printOffsets method yields something like the following:

[ OffsetRange(topic: 'testmulti', partition: 1, range: [23602 -> 23602])
OffsetRange(topic: 'testmulti', partition: 2, range: [32503 -> 32503])
OffsetRange(topic: 'testmulti', partition: 0, range: [26100 -> 26100])
OffsetRange(topic: 'testmulti', partition: 3, range: [20900 -> 20900]) ])

Thanks, Mark
Re: Is SQLContext thread-safe?
Unfortunately, I think the SQLParser is not thread-safe. I would recommend using HiveQL.

On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote:

Actually this is a SQL parse exception, are you sure your SQL is right?

Sent from my iPhone

On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote:

Hi, in a test on SparkSQL 1.3.0, multiple threads are doing selects on the same SQLContext instance, but the exception below is thrown, so it looks like SQLContext is NOT thread safe? I think this is not the desired behavior.

==

java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier select found

select id ,ext.d from UNIT_TEST
^
at scala.sys.package$.error(package.scala:27)
...
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)

-----Original Message-----
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Monday, March 02, 2015 9:05 PM
To: Haopu Wang; user
Subject: RE: Is SQLContext thread-safe?

Yes it is thread safe, at least it's supposed to be.

-----Original Message-----
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Monday, March 2, 2015 4:43 PM
To: user
Subject: Is SQLContext thread-safe?

Hi, is it safe to use the same SQLContext to do Select operations in different threads at the same time? Thank you very much!
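If HiveQL is not an option, a workaround sketch under the assumption that the only non-thread-safe piece is the parser invoked inside sql(): serialize that one call. SynchronizedSql is a hypothetical wrapper, not a Spark API, and whether the rest of the query path in 1.3 is thread-safe is itself an assumption.

import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical wrapper: serialize only the parse step that runs inside
// sqlContext.sql(); the returned DataFrame can then be used from any thread.
class SynchronizedSql(sqlContext: SQLContext) {
  private val parseLock = new Object
  def sql(query: String): DataFrame = parseLock.synchronized {
    sqlContext.sql(query)
  }
}

Sharing one SynchronizedSql instance across the worker threads in the test above should at least avoid the concurrent-parse failure.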
Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream
What's your schema for the offset table, and what's the definition of writeOffset?

What key are you reducing on? Maybe I'm misreading the code, but it looks like the per-partition offset is part of the key. If that's true then you could just do your reduction on each partition, rather than after the fact on the whole stream.

On Thu, Apr 30, 2015 at 12:10 PM, badgerpants mark.stew...@tapjoy.com wrote:

We're a group of experienced backend developers who are fairly new to Spark Streaming (and Scala) and very interested in using the new (in 1.3) DirectKafkaInputDStream impl as part of the metrics reporting service we're building. Our flow involves reading in metric events, lightly modifying some of the data values, and then creating aggregates via reduceByKey. We're following the approach in Cody Koeninger's blog on exactly-once streaming (https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md) in which the Kafka OffsetRanges are grabbed from the RDD and persisted to a tracking table within the same db transaction as the data within said ranges.

Within a short time frame the offsets in the table fall out of sync with the offsets in the stream. It appears that the writeOffset method (see code below) occasionally doesn't get called, which also indicates that some blocks of data aren't being processed either; the aggregate operation makes this difficult to eyeball from the data that's written to the db.

Note that we do understand that the reduce operation alters the size/boundaries of the partitions we end up processing. Indeed, without the reduceByKey operation our code seems to work perfectly. But without the reduceByKey operation the db has to perform *a lot* more updates. It's certainly a significant restriction to place on what is such a promising approach. I'm hoping there's simply something we're missing. Any workarounds or thoughts are welcome.

Here's the code we've got:

def run(stream: DStream[Input], conf: Config, args: List[String]): Unit = {
  ...
  val sumFunc: (BigDecimal, BigDecimal) => BigDecimal = (_ + _)

  val transformStream = stream.transform { rdd =>
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    printOffsets(offsets) // just prints out the offsets for reference
    rdd.mapPartitionsWithIndex { case (i, iter) =>
      iter.flatMap { case (name, msg) => extractMetrics(msg) }
        .map { case (k, v) => ((keyWithFlooredTimestamp(k), offsets(i)), v) }
    }
  }.reduceByKey(sumFunc, 1)

  transformStream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val conn = DriverManager.getConnection(dbUrl, dbUser, dbPass)
      val db = DB(conn)
      db.autoClose(false)
      db.autoCommit { implicit session =>
        var currentOffset: OffsetRange = null
        partition.foreach { case (key, value) =>
          currentOffset = key._2
          writeMetrics(key._1, value, table)
        }
        writeOffset(currentOffset) // updates the offset positions
      }
      db.close()
    }
  }
}

Thanks, Mark
Re: [discuss] DataFrame function namespacing
IMHO I would go with choice #1. Cheers

On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin r...@databricks.com wrote:

We definitely still have the name collision problem in SQL.

On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal punya.bis...@gmail.com wrote:

Do we still have to keep the names of the functions distinct to avoid collisions in SQL? Or is there a plan to allow importing a namespace into SQL somehow? I ask because if we have to keep worrying about name collisions then I'm not sure what the added complexity of #2 and #3 buys us. Punya

On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:

Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python is the main problem ... See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

My feeling is that we should have a handful of namespaces (say 4 or 5). It becomes too cumbersome to import / remember more package names and having everything in one package makes it hard to read scaladoc etc. Thanks Shivaram

On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com wrote:

To add a little bit more context, some pros/cons I can think of are:

Option 1: Very easy for users to find the function, since they are all in org.apache.spark.sql.functions. However, there will be quite a large number of them.

Option 2: I can't tell why we would want this one over Option 3, since it has all the problems of Option 3, and not as nice of a hierarchy.

Option 3: Opposite of Option 1. Each package or static class has a small number of functions that are relevant to each other, but for some functions it is unclear where they should go (e.g. should min go into basic or math?)

On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote:

Before we make DataFrame non-alpha, it would be great to decide how we want to namespace all the functions. There are 3 alternatives:

1. Put all in org.apache.spark.sql.functions. This is how SQL does it, since SQL doesn't have namespaces. I estimate eventually we will have ~ 200 functions.

2. Have explicit namespaces, which is what master branch currently looks like:
- org.apache.spark.sql.functions
- org.apache.spark.sql.mathfunctions
- ...

3. Have explicit namespaces, but restructure them slightly so everything is under functions.

package object functions {
  // all the old functions here -- but deprecated so we keep source compatibility
  def ...
}

package org.apache.spark.sql.functions

object mathFunc {
  ...
}

object basicFuncs {
  ...
}
practical usage of the new exactly-once supporting DirectKafkaInputDStream
We're a group of experienced backend developers who are fairly new to Spark Streaming (and Scala) and very interested in using the new (in 1.3) DirectKafkaInputDStream impl as part of the metrics reporting service we're building. Our flow involves reading in metric events, lightly modifying some of the data values, and then creating aggregates via reduceByKey. We're following the approach in Cody Koeninger's blog on exactly-once streaming (https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md) in which the Kafka OffsetRanges are grabbed from the RDD and persisted to a tracking table within the same db transaction as the data within said ranges.

Within a short time frame the offsets in the table fall out of sync with the offsets in the stream. It appears that the writeOffset method (see code below) occasionally doesn't get called, which also indicates that some blocks of data aren't being processed either; the aggregate operation makes this difficult to eyeball from the data that's written to the db.

Note that we do understand that the reduce operation alters the size/boundaries of the partitions we end up processing. Indeed, without the reduceByKey operation our code seems to work perfectly. But without the reduceByKey operation the db has to perform *a lot* more updates. It's certainly a significant restriction to place on what is such a promising approach. I'm hoping there's simply something we're missing. Any workarounds or thoughts are welcome.

Here's the code we've got:

def run(stream: DStream[Input], conf: Config, args: List[String]): Unit = {
  ...
  val sumFunc: (BigDecimal, BigDecimal) => BigDecimal = (_ + _)

  val transformStream = stream.transform { rdd =>
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    printOffsets(offsets) // just prints out the offsets for reference
    rdd.mapPartitionsWithIndex { case (i, iter) =>
      iter.flatMap { case (name, msg) => extractMetrics(msg) }
        .map { case (k, v) => ((keyWithFlooredTimestamp(k), offsets(i)), v) }
    }
  }.reduceByKey(sumFunc, 1)

  transformStream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val conn = DriverManager.getConnection(dbUrl, dbUser, dbPass)
      val db = DB(conn)
      db.autoClose(false)
      db.autoCommit { implicit session =>
        var currentOffset: OffsetRange = null
        partition.foreach { case (key, value) =>
          currentOffset = key._2
          writeMetrics(key._1, value, table)
        }
        writeOffset(currentOffset) // updates the offset positions
      }
      db.close()
    }
  }
}

Thanks, Mark
Re: Drop column/s in DataFrame
Sure, I will try sending a PR soon.

On Thu, Apr 30, 2015 at 1:42 PM Reynold Xin r...@databricks.com wrote:

I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280 Would you like to give it a shot?

On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com wrote:

Hi All: Is there any plan to add drop column/s functionality to the data frame? One can use the select function to do so, but I find that tedious when only one or two columns in a large dataframe are to be dropped. Pandas has this functionality, which I find handy when constructing feature vectors from a DataFrame, allowing me to carry only useful information forward. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html Rakesh
Re: Regarding KryoSerialization in Spark
Hi Twinkle, Registering the class makes it so that writeClass only writes out a couple of bytes, instead of a full String of the class name. -Sandy

On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva twinkle.sachd...@gmail.com wrote:

Hi, As per the code, KryoSerialization uses the writeClassAndObject method, which internally calls the writeClass method, which will write the class of the object during serialization. As per the documentation on the Spark tuning page, registering the class will avoid that. Am I missing something, or is there an issue with the documentation? Thanks, Twinkle
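A standalone Kryo sketch (outside Spark, with a made-up Point class) that shows the effect Sandy describes: writeClassAndObject still calls writeClass either way, but the registered case writes a small varint id where the unregistered case writes the full class name string.

import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

case class Point(x: Int, y: Int) // hypothetical payload class

def serializedSize(kryo: Kryo, obj: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new Output(bytes)
  kryo.writeClassAndObject(out, obj) // writeClass happens inside this call
  out.close()
  bytes.size
}

val unregistered = new Kryo()
unregistered.setRegistrationRequired(false)

val registered = new Kryo()
registered.register(classOf[Point])

// Expect the first size to exceed the second by roughly the length of the
// fully qualified class name.
println(serializedSize(unregistered, Point(1, 2)))
println(serializedSize(registered, Point(1, 2)))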
Re: [discuss] ending support for Java 6?
i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless your intention is to truly make something academic i dont think that is wise.

On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.)

(To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/)

On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote:

I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick

On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support.

(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.)

Nick

On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
something to keep in mind: we can easily support java 6 for the build environment, particularly if there's a definite EOL.

i'd like to fix our java versioning 'problem', and this could be a big instigator... right now we're hackily setting java_home in test invocation on jenkins, which really isn't the best. if i decide, within jenkins, to reconfigure every build to 'do the right thing' WRT java version, then i will clean up the old mess and pay down on some technical debt. or i can just install java 6 and we use that as JAVA_HOME on a build-by-build basis.

this will be a few days of prep and another morning-long downtime if i do the right thing (within jenkins), and only a couple of hours the hacky way (system level). either way, we can test on java 6. :)

On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:

nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months.

On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue.

On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote:

i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless your intention is to truly make something academic i dont think that is wise.

On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.)

(To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/)

On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote:

I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick

On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream
In fact, you're using the 2-arg form of reduceByKey to shrink it down to 1 partition: reduceByKey(sumFunc, 1). But you started with 4 Kafka partitions? So they're definitely no longer 1:1.

On Thu, Apr 30, 2015 at 1:58 PM, Cody Koeninger c...@koeninger.org wrote:

This is what I'm suggesting, in pseudocode:

rdd.mapPartitionsWithIndex { case (i, iter) =>
  offset = offsets(i)
  result = yourReductionFunction(iter)
  transaction {
    save(result)
    save(offset)
  }
}.foreach { (_: Nothing) => () }

where yourReductionFunction is just normal scala code. The code you posted looks like you're only saving offsets once per partition, but you're doing it after reduceByKey. Reduction steps in spark imply a shuffle. After a shuffle you no longer have a guaranteed 1:1 correspondence between spark partition and kafka partition. If you want to verify that's what the problem is, log the value of currentOffset whenever it changes.

On Thu, Apr 30, 2015 at 1:38 PM, badgerpants mark.stew...@tapjoy.com wrote:

Cody Koeninger-2 wrote:
What's your schema for the offset table, and what's the definition of writeOffset?

The schema is the same as the one in your post: topic | partition | offset

The writeOffset is nearly identical:

def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
  logWarning(Thread.currentThread().toString + " writeOffset: " + osr)
  if (osr == null) {
    logWarning("no offset provided")
    return
  }
  val updated =
    sql"""update txn_offsets set off = ${osr.untilOffset}
          where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}"""
      .update.apply()
  if (updated != 1) {
    throw new Exception(
      Thread.currentThread().toString +
        s" failed to write offset: ${osr.topic},${osr.partition},${osr.fromOffset}->${osr.untilOffset}")
  } else {
    logWarning(Thread.currentThread().toString + " offsets updated to " + osr.untilOffset)
  }
}

Cody Koeninger-2 wrote:
What key are you reducing on? Maybe I'm misreading the code, but it looks like the per-partition offset is part of the key. If that's true then you could just do your reduction on each partition, rather than after the fact on the whole stream.

Yes, the key is a tuple comprised of a case class called Key and the partition's OffsetRange. We piggybacked the OffsetRange in this way so it would be available within the scope of the partition. I have tried moving the reduceByKey from the end of the .transform block into the partition level (at the end of the mapPartitionsWithIndex block.) This is what you're suggesting, yes? The results didn't correct the offset update behavior; they still get out of sync pretty quickly.

Some details: I'm using the kafka-console-producer.sh tool to drive the process, calling it three or four times in succession and piping in 100-1000 messages in each call. Once all the messages have been processed I wait for the output of the printOffsets method to stop changing and compare it to the txn_offsets table. (When no data is getting processed the printOffsets method yields something like the following:

[ OffsetRange(topic: 'testmulti', partition: 1, range: [23602 -> 23602])
OffsetRange(topic: 'testmulti', partition: 2, range: [32503 -> 32503])
OffsetRange(topic: 'testmulti', partition: 0, range: [26100 -> 26100])
OffsetRange(topic: 'testmulti', partition: 3, range: [20900 -> 20900]) ])

Thanks, Mark
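Cody's pseudocode above, fleshed out against the snippets earlier in the thread, might look like the sketch below. This is only a sketch: yourReductionFunction and saveResult stand in for application code, writeOffset is the method quoted above, and DB.localTx assumes a scalikejdbc ConnectionPool has been configured rather than the manual DriverManager connection used earlier.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
import scalikejdbc._

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { case (i, iter) =>
    // No shuffle has happened yet, so partition i is still 1:1 with a Kafka partition.
    val osr: OffsetRange = offsets(i)
    val result = yourReductionFunction(iter) // plain Scala reduction over this partition
    DB.localTx { implicit session =>         // one transaction per partition
      saveResult(result)                     // application-defined write of the aggregates
      writeOffset(osr)                       // the writeOffset quoted above
    }
    Iterator.empty
  }.foreach((_: Nothing) => ())              // forces the job to actually run
}

Because the reduction happens inside the partition, the offset and the data it covers commit or roll back together, which is the exactly-once property the blog post describes.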
Re: Pickling error when attempting to add a method in pyspark
Bumping this. Is anyone here familiar with the py4j interface in pyspark? thanks

2015-04-27 22:09 GMT-07:00 Stephen Boesch java...@gmail.com:

My intention is to add pyspark support for certain mllib spark methods. I have been unable to resolve pickling errors of the form:

Pyspark py4j PickleException: "expected zero arguments for construction of ClassDict"
http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class

These are occurring during python to java conversion of python named tuples. The details are rather hard to provide here so I have created an SOF question http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class

In any case I have included the text here. The SOF is easier to read though ;)

--

This question is directed towards persons familiar with py4j who can help to resolve a pickling error. I am trying to add a method to the pyspark PythonMLLibAPI that accepts an RDD of a namedtuple, does some work, and returns a result in the form of an RDD.

This method is modeled after the PythonMLLibAPI.trainALSModel() method, whose analogous *existing* relevant portions are:

def trainALSModel(
    ratingsJRDD: JavaRDD[Rating],
    ..
)

The *existing* python Rating class used to model the new code is:

class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    def __reduce__(self):
        return Rating, (int(self.user), int(self.product), float(self.rating))

Here is the attempt. So here are the relevant classes:

*New* python class pyspark.mllib.clustering.MatrixEntry:

from collections import namedtuple

class MatrixEntry(namedtuple("MatrixEntry", ["x", "y", "weight"])):
    def __reduce__(self):
        return MatrixEntry, (long(self.x), long(self.y), float(self.weight))

*New* method *foobarRDD* in PythonMLLibAPI:

def foobarRdd(data: JavaRDD[MatrixEntry]): RDD[FooBarResult] = {
  val rdd = data.rdd.map { d => FooBarResult(d.i, d.j, d.value, d.i * 100 + d.j * 10 + d.value) }
  rdd
}

Now let us try it out:

from pyspark.mllib.clustering import MatrixEntry

def convert_to_MatrixEntry(tuple):
    return MatrixEntry(*tuple)

from pyspark.mllib.clustering import *
pic = PowerIterationClusteringModel(2)
tups = [(1,2,3),(4,5,6),(12,13,14),(15,7,8),(16,17,16.5)]
trdd = sc.parallelize(map(convert_to_MatrixEntry,tups))

# print out the RDD on python side just for validation
print "%s" % (repr(trdd.collect()))

from pyspark.mllib.common import callMLlibFunc
pic = callMLlibFunc("foobar", trdd)

Relevant portions of results:

[(1,2)=3.0, (4,5)=6.0, (12,13)=14.0, (15,7)=8.0, (16,17)=16.5]

which shows the input rdd is 'whole'.
However the pickling was unhappy:

15/04/27 21:15:44 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 14)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.mllib.clustering.MatrixEntry)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1167)
at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1166)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
at
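For what it's worth, that PickleException is raised by pyrolite's Unpickler when no JVM-side constructor has been registered for the Python class, so it falls back to a zero-argument ClassDict. A sketch of one possible fix using pyrolite's registration hook (whether this is the hook Spark's SerDe intends you to use is an assumption; JMatrixEntry is a made-up JVM counterpart):

import net.razorvine.pickle.{IObjectConstructor, PickleException, Unpickler}

// JVM-side counterpart of the Python MatrixEntry namedtuple.
case class JMatrixEntry(x: Long, y: Long, weight: Double)

// Build a JMatrixEntry from the (long, long, float) args that the Python
// __reduce__ above produces.
class MatrixEntryConstructor extends IObjectConstructor {
  override def construct(args: Array[Object]): Object = {
    if (args.length != 3) throw new PickleException("expected 3 args, got " + args.length)
    JMatrixEntry(
      args(0).asInstanceOf[Number].longValue(),
      args(1).asInstanceOf[Number].longValue(),
      args(2).asInstanceOf[Number].doubleValue())
  }
}

// Must run on the JVM before any pythonToJava conversion sees MatrixEntry.
Unpickler.registerConstructor(
  "pyspark.mllib.clustering", "MatrixEntry", new MatrixEntryConstructor())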
Re: [discuss] ending support for Java 6?
I'd also support this. In general, I think it's good that we try to have Spark support different versions of things (Hadoop, Hive, etc). But at some point you need to weigh the costs of doing so against the number of users affected.

In the case of Java 6, we are seeing increasing cost from this. Some of the newer unsafe code is not supported in Java 6 (and it's a pretty large internal initiative). And the ability to upgrade dependencies is starting to cause pain for users. Sean and I had to wontfix an important bug fix for users because the library requires JRE 7.

On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:

nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months.

On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue.

On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote:

i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless your intention is to truly make something academic i dont think that is wise.

On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.)

(To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/)

On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote:

I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick

On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
I'm in favor of ending support for Java 6. We should also articulate a policy on how long we want to support current and future versions of Java after Oracle declares them EOL (Java 7 will be in that bucket in a matter of days). Punya

On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote:

something to keep in mind: we can easily support java 6 for the build environment, particularly if there's a definite EOL. i'd like to fix our java versioning 'problem', and this could be a big instigator... right now we're hackily setting java_home in test invocation on jenkins, which really isn't the best. if i decide, within jenkins, to reconfigure every build to 'do the right thing' WRT java version, then i will clean up the old mess and pay down on some technical debt. or i can just install java 6 and we use that as JAVA_HOME on a build-by-build basis. this will be a few days of prep and another morning-long downtime if i do the right thing (within jenkins), and only a couple of hours the hacky way (system level). either way, we can test on java 6. :)

On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:

nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months.

On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue.

On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote:

i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless your intention is to truly make something academic i dont think that is wise.

On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.)

(To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/)

On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote:

I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick

On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that.
https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
Hi Team, Should we take this opportunity to lay out and evangelize a pattern for EOL of dependencies? I propose we follow the official EOL of java, python, scala, etc., and add, say, 6-12-24 months depending on the popularity. For example:

Java 6 official EOL: Feb 2013
Add 6-12 months: Aug 2013 - Feb 2014 is the official end of support for Java 6 in Spark
Announce 3-6 months prior to EOS.

Thanking you. With Regards, Sree

On Thursday, April 30, 2015 1:41 PM, Marcelo Vanzin van...@cloudera.com wrote:

As for the idea, I'm +1. Spark is the only reason I still have jdk6 around - exactly because I don't want to cause the issue that started this discussion (inadvertently using JDK7 APIs). And as has been pointed out, even J7 is about to go EOL real soon. Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is already j7-only. And when Hadoop moves away from something, it's an event worthy of headlines. They're still on Jetty 6! As for pyspark, https://github.com/apache/spark/pull/5580 should get rid of the last incompatibility with large assemblies, by keeping the python files in separate archives. If we remove support for Java 6, then we don't need to worry about the size of the assembly anymore.

On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote:

I'm firmly in favor of this. It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and avoid any more of the long-standing 64K file limit thing that's still a problem for PySpark.

--
Marcelo
Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream
Cody Koeninger-2 wrote:
In fact, you're using the 2-arg form of reduceByKey to shrink it down to 1 partition: reduceByKey(sumFunc, 1). But you started with 4 Kafka partitions? So they're definitely no longer 1:1.

True. I added the second arg because we were seeing multiple threads attempting to update the same offset. Setting it to 1 prevented that but doesn't fix the core issue.

Cody Koeninger-2 wrote:
This is what I'm suggesting, in pseudocode:

rdd.mapPartitionsWithIndex { case (i, iter) =>
  offset = offsets(i)
  result = yourReductionFunction(iter)
  transaction {
    save(result)
    save(offset)
  }
}.foreach { (_: Nothing) => () }

where yourReductionFunction is just normal scala code.

I'll give this a try. Thanks, Cody.
Re: [discuss] ending support for Java 6?
I'm firmly in favor of this. It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and avoid any more of the long-standing 64K file limit thing that's still a problem for PySpark. As a point of reference, CDH5 has never supported Java 6, and it was released over a year ago.

On Thu, Apr 30, 2015 at 8:02 PM, Reynold Xin r...@databricks.com wrote:

This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
As for the idea, I'm +1. Spark is the only reason I still have jdk6 around - exactly because I don't want to cause the issue that started this discussion (inadvertently using JDK7 APIs). And as has been pointed out, even J7 is about to go EOL real soon.

Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is already j7-only. And when Hadoop moves away from something, it's an event worthy of headlines. They're still on Jetty 6!

As for pyspark, https://github.com/apache/spark/pull/5580 should get rid of the last incompatibility with large assemblies, by keeping the python files in separate archives. If we remove support for Java 6, then we don't need to worry about the size of the assembly anymore.

On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote:

I'm firmly in favor of this. It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and avoid any more of the long-standing 64K file limit thing that's still a problem for PySpark.

--
Marcelo
[discuss] ending support for Java 6?
This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months.

On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue.

On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote:

i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless your intention is to truly make something academic i dont think that is wise.

On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.)

(To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/)

On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote:

I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick

On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
+1 on ending support for Java 6. BTW, from https://www.java.com/en/download/faq/java_7.xml : After April 2015, Oracle will no longer post updates of Java SE 7 to its public download sites. On Thu, Apr 30, 2015 at 1:34 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: I'm in favor of ending support for Java 6. We should also articulate a policy on how long we want to support current and future versions of Java after Oracle declares them EOL (Java 7 will be in that bucket in a matter of days). Punya On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote: something to keep in mind: we can easily support java 6 for the build environment, particularly if there's a definite EOL. i'd like to fix our java versioning 'problem', and this could be a big instigator... right now we're hackily setting java_home in test invocation on jenkins, which really isn't the best. if i decide, within jenkins, to reconfigure every build to 'do the right thing' WRT java version, then i will clean up the old mess and pay down some technical debt. or i can just install java 6 and we use that as JAVA_HOME on a build-by-build basis. this will be a few days of prep and another morning-long downtime if i do the right thing (within jenkins), and only a couple of hours the hacky way (system level). either way, we can test on java 6. :) On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote: nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months. On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote: Guys, thanks for chiming in, but please focus on Java here. Python is an entirely separate issue. On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote: i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless your intention is to truly make something academic i don't think that is wise. On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/) On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now that Oracle ended support for Java 6 over a year ago, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: [discuss] ending support for Java 6?
(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/) On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now that Oracle ended support for Java 6 over a year ago, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: Regarding KryoSerialization in Spark
Thanks for the info. On Fri, May 1, 2015 at 12:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Twinkle, Registering the class makes it so that writeClass only writes out a couple of bytes, instead of the full String of the class name. -Sandy On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, As per the code, KryoSerialization uses the writeClassAndObject method, which internally calls writeClass, which writes the class of the object during serialization. As per the documentation in the tuning page of Spark, it says that registering the class will avoid that. Am I missing something, or is there some issue with the documentation? Thanks, Twinkle
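For reference, a minimal sketch of the registration Sandy describes, using SparkConf.registerKryoClasses as shown in the Spark tuning docs; the MyRecord case class and app name are hypothetical, for illustration only:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical user type, used only to have something to register.
case class MyRecord(id: Long, name: String)

object KryoRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written by writeClass as a small integer ID
      // instead of the full class name string.
      .registerKryoClasses(Array(classOf[MyRecord]))

    val sc = new SparkContext(conf)
    val data = sc.parallelize(Seq(MyRecord(1L, "a"), MyRecord(2L, "b")))
    println(data.count())
    sc.stop()
  }
}

So writeClass is still called either way; registration just shrinks what it writes from a String to a varint ID, which is what the tuning page is getting at.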
Uninitialized session in HiveContext?
Hey all, We ran into some test failures in our internal branch (which builds against Hive 1.1), and I narrowed it down to the fix below. I'm not super familiar with the Hive integration code, but does this look like a bug for other versions of Hive too? This caused an error where some internal Hive configuration values that are initialized by the session were not available.

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
index dd06b26..6242745 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
@@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
     if (conf.dialect == "sql") {
       super.sql(substituted)
     } else if (conf.dialect == "hiveql") {
+      // Make sure Hive session state is initialized.
+      if (SessionState.get() != sessionState) {
+        SessionState.start(sessionState)
+      }
       val ddlPlan = ddlParserWithHiveQL.parse(sqlText, exceptionOnError = false)
       DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
     } else {

-- Marcelo
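For context, a small sketch of the thread-local behavior the patch guards against, using the SessionState.get/start calls from the diff above; the HiveSessionGuard object and withSession helper are made-up names, for illustration only:

import org.apache.hadoop.hive.ql.session.SessionState

object HiveSessionGuard {
  // SessionState.get() returns the session bound to the *current thread*
  // (or null if none was ever started on it). Running Hive code on a
  // thread that never called start() leaves Hive internals reading missing
  // or stale configuration, which is the failure described above.
  def withSession[A](sessionState: SessionState)(body: => A): A = {
    if (SessionState.get() != sessionState) {
      SessionState.start(sessionState) // bind our session to this thread
    }
    body
  }
}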
Re: [discuss] ending support for Java 6?
FYI: after enough consideration, we in the Hadoop community dropped support for JDK 6 starting with the Apache Hadoop 2.7.x releases. Thanks +Vinod On Apr 30, 2015, at 12:02 PM, Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now that Oracle ended support for Java 6 over a year ago, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: Uninitialized session in HiveContext?
Hi Michael, It would be great to see changes to make Hive integration less painful, and I can test them in our environment once you have a patch. But I guess my question is a little more geared towards the current code; doesn't the issue I ran into affect 1.4 and potentially earlier versions too? On Thu, Apr 30, 2015 at 5:19 PM, Michael Armbrust mich...@databricks.com wrote: Hey Marcelo, Thanks for the heads up! I'm currently in the process of refactoring all of this (to separate the metadata connection from the execution side) and as part of this I'm making the initialization of the session not lazy. It would be great to hear if this also works for your internal integration tests once the patch is up (hopefully this weekend). Michael On Thu, Apr 30, 2015 at 2:36 PM, Marcelo Vanzin van...@cloudera.com wrote: Hey all, We ran into some test failures in our internal branch (which builds against Hive 1.1), and I narrowed it down to the fix below. I'm not super familiar with the Hive integration code, but does this look like a bug for other versions of Hive too? This caused an error where some internal Hive configuration values that are initialized by the session were not available.

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
index dd06b26..6242745 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
@@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
     if (conf.dialect == "sql") {
       super.sql(substituted)
     } else if (conf.dialect == "hiveql") {
+      // Make sure Hive session state is initialized.
+      if (SessionState.get() != sessionState) {
+        SessionState.start(sessionState)
+      }
       val ddlPlan = ddlParserWithHiveQL.parse(sqlText, exceptionOnError = false)
       DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
     } else {

-- Marcelo
Re: Mima test failure in the master branch?
Any PR open for this?
Re: Mima test failure in the master branch?
Looks like this has been taken care of:

commit beeafcfd6ee1e460c4d564cd1515d8781989b422
Author: Patrick Wendell patr...@databricks.com
Date: Thu Apr 30 20:33:36 2015 -0700

    Revert "[SPARK-5213] [SQL] Pluggable SQL Parser Support"

On Thu, Apr 30, 2015 at 7:58 PM, zhazhan zzh...@hortonworks.com wrote:
[info] spark-sql: found 1 potential binary incompatibilities (filtered 129)
[error] * method sqlParser()org.apache.spark.sql.SparkSQLParser in class org.apache.spark.sql.SQLContext does not have a correspondent in new version
[error] filter with: ProblemFilters.excludeMissingMethodProblem
Mima test failure in the master branch?
[info] spark-sql: found 1 potential binary incompatibilities (filtered 129)
[error] * method sqlParser()org.apache.spark.sql.SparkSQLParser in class org.apache.spark.sql.SQLContext does not have a correspondent in new version
[error] filter with: ProblemFilters.excludeMissingMethodProblem
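For what it's worth, the "filter with" hint refers to adding a MiMa problem filter. A hedged sketch of what that exclusion might look like, in the style of Spark's project/MimaExcludes.scala; the exact filter syntax depends on the sbt-mima-plugin version, and the MyMimaExcludes wrapper is a made-up name:

import com.typesafe.tools.mima.core._

object MyMimaExcludes {
  // SQLContext.sqlParser was removed, so tell MiMa not to report it
  // as a binary incompatibility against the previous release.
  val filters = Seq(
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.sql.SQLContext.sqlParser")
  )
}

In this case the revert below made the filter unnecessary, but this is the mechanism the error message points at.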
Re: Mima test failure in the master branch?
I reverted the patch that I think was causing this: SPARK-5213. Thanks On Thu, Apr 30, 2015 at 7:59 PM, zhazhan zzh...@hortonworks.com wrote: Any PR open for this?
Re: [discuss] ending support for Java 6?
But it is hard to know how long customers stay with their most recent download. Cheers On Thu, Apr 30, 2015 at 2:26 PM, Sree V sree_at_ch...@yahoo.com.invalid wrote: If there is any possibility of getting the download counts, then we can use it as EOS criteria as well. Say, if download counts are lower than 30% (or another number) of the lifetime highest, then it qualifies for EOS. Thanking you. With Regards, Sree On Thursday, April 30, 2015 2:22 PM, Sree V sree_at_ch...@yahoo.com.INVALID wrote: Hi Team, Should we take this opportunity to lay out and evangelize a pattern for EOL of dependencies? I propose we follow the official EOL of Java, Python, Scala, etc., and add, say, 6-12-24 months depending on popularity. Java 6 official EOL: Feb 2013. Add 6-12 months: Aug 2013 - Feb 2014 would be the official End of Support for Java 6 in Spark. Announce 3-6 months prior to EOS. Thanking you. With Regards, Sree On Thursday, April 30, 2015 1:41 PM, Marcelo Vanzin van...@cloudera.com wrote: As for the idea, I'm +1. Spark is the only reason I still have jdk6 around - exactly because I don't want to cause the issue that started this discussion (inadvertently using JDK7 APIs). And as has been pointed out, even J7 is about to go EOL real soon. Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is already j7-only. And when Hadoop moves away from something, it's an event worthy of headlines. They're still on Jetty 6! As for pyspark, https://github.com/apache/spark/pull/5580 should get rid of the last incompatibility with large assemblies, by keeping the python files in separate archives. If we remove support for Java 6, then we don't need to worry about the size of the assembly anymore. On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote: I'm firmly in favor of this. It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and avoid any more of the long-standing 64K file limit thing that's still a problem for PySpark. -- Marcelo
Re: [discuss] ending support for Java 6?
+1 for end of support for Java 6. On Thursday, April 30, 2015 3:08 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: FYI: after enough consideration, we in the Hadoop community dropped support for JDK 6 starting with the Apache Hadoop 2.7.x releases. Thanks +Vinod On Apr 30, 2015, at 12:02 PM, Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now that Oracle ended support for Java 6 over a year ago, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos
I finally isolated the issue to be related to the ActorSystem I reuse from SparkEnv.get.actorSystem. This ActorSystem will contain the configuration defined in my application jar's reference.conf both in the local cluster case and when I use it directly in an extension of BaseRelation's buildScan method. However, if it is used in my RDD, which is returned by buildScan, it loses the configuration. I solve / bypass the problem by checking whether my configuration exists in SparkEnv.get.actorSystem's settings.config. If it does not exist, I create a new ActorSystem, using my class's classloader to force config reading from my application jar:

val classLoader = this.getClass.getClassLoader
val myconfig = ConfigFactory.load(classLoader) // force config reading from my classloader
ActorSystem("somename", myconfig, classLoader)

I wonder if this different behavior of SparkEnv.get.actorSystem is working as designed, or if something is missing in the executor setup for this custom RDD-driven execution case. Thanks. Yang
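A rough sketch of the workaround described above, assuming Akka's ActorSystem(name, config, classLoader) factory and Typesafe Config's ConfigFactory.load(classLoader); the marker path "myapp.some-setting", the fallback system name, and the MyActorSystemProvider object are all hypothetical:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkEnv

object MyActorSystemProvider {
  // "myapp.some-setting" is assumed to exist only in the application
  // jar's reference.conf, so its absence means that file was not read.
  def actorSystemWithMyConfig(): ActorSystem = {
    val existing = SparkEnv.get.actorSystem
    if (existing.settings.config.hasPath("myapp.some-setting")) {
      existing // the reused ActorSystem already saw our reference.conf
    } else {
      // Resolve config through the classloader that loaded our jar,
      // forcing its reference.conf to be read, then build a private system.
      val classLoader = getClass.getClassLoader
      val myConfig = ConfigFactory.load(classLoader)
      ActorSystem("myapp-fallback", myConfig, classLoader)
    }
  }
}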
Re: [discuss] ending support for Java 6?
If there is any possibility of getting the download counts, then we can use it as EOS criteria as well. Say, if download counts are lower than 30% (or another number) of the lifetime highest, then it qualifies for EOS. Thanking you. With Regards, Sree On Thursday, April 30, 2015 2:22 PM, Sree V sree_at_ch...@yahoo.com.INVALID wrote: Hi Team, Should we take this opportunity to lay out and evangelize a pattern for EOL of dependencies? I propose we follow the official EOL of Java, Python, Scala, etc., and add, say, 6-12-24 months depending on popularity. Java 6 official EOL: Feb 2013. Add 6-12 months: Aug 2013 - Feb 2014 would be the official End of Support for Java 6 in Spark. Announce 3-6 months prior to EOS. Thanking you. With Regards, Sree On Thursday, April 30, 2015 1:41 PM, Marcelo Vanzin van...@cloudera.com wrote: As for the idea, I'm +1. Spark is the only reason I still have jdk6 around - exactly because I don't want to cause the issue that started this discussion (inadvertently using JDK7 APIs). And as has been pointed out, even J7 is about to go EOL real soon. Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is already j7-only. And when Hadoop moves away from something, it's an event worthy of headlines. They're still on Jetty 6! As for pyspark, https://github.com/apache/spark/pull/5580 should get rid of the last incompatibility with large assemblies, by keeping the python files in separate archives. If we remove support for Java 6, then we don't need to worry about the size of the assembly anymore. On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote: I'm firmly in favor of this. It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and avoid any more of the long-standing 64K file limit thing that's still a problem for PySpark. -- Marcelo
Re: Uninitialized session in HiveContext?
Hey Marcelo, Thanks for the heads up! I'm currently in the process of refactoring all of this (to separate the metadata connection from the execution side) and as part of this I'm making the initialization of the session not lazy. It would be great to hear if this also works for your internal integration tests once the patch is up (hopefully this weekend). Michael On Thu, Apr 30, 2015 at 2:36 PM, Marcelo Vanzin van...@cloudera.com wrote: Hey all, We ran into some test failures in our internal branch (which builds against Hive 1.1), and I narrowed it down to the fix below. I'm not super familiar with the Hive integration code, but does this look like a bug for other versions of Hive too? This caused an error where some internal Hive configuration values that are initialized by the session were not available.

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
index dd06b26..6242745 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
@@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
     if (conf.dialect == "sql") {
       super.sql(substituted)
     } else if (conf.dialect == "hiveql") {
+      // Make sure Hive session state is initialized.
+      if (SessionState.get() != sessionState) {
+        SessionState.start(sessionState)
+      }
       val ddlPlan = ddlParserWithHiveQL.parse(sqlText, exceptionOnError = false)
       DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
     } else {

-- Marcelo
Custom PersistenceEngine and LeaderAgent implementation in Java
Hi, this follows the feature in [1]. I'm trying to implement a custom persistence engine and a leader agent in the Java environment. Vis-a-vis Scala: when I implement the PersistenceEngine trait in Java, I would have to implement methods such as readPersistedData, removeDriver, etc. together with the read, persist and unpersist methods. But the issue here is that methods such as readPersistedData are 'final def's, hence cannot be overridden in the Java environment. I am new to Scala, but is there any workaround to implement the above traits in Java? I look forward to hearing from you. [1] https://issues.apache.org/jira/browse/SPARK-1830 -- Niranda
Re: Custom PersistenceEngine and LeaderAgent implementation in Java
We should change the trait to an abstract class, and then your problem will go away. Do you want to submit a pull request? On Wed, Apr 29, 2015 at 11:02 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, this follows the feature in [1]. I'm trying to implement a custom persistence engine and a leader agent in the Java environment. Vis-a-vis Scala: when I implement the PersistenceEngine trait in Java, I would have to implement methods such as readPersistedData, removeDriver, etc. together with the read, persist and unpersist methods. But the issue here is that methods such as readPersistedData are 'final def's, hence cannot be overridden in the Java environment. I am new to Scala, but is there any workaround to implement the above traits in Java? I look forward to hearing from you. [1] https://issues.apache.org/jira/browse/SPARK-1830 -- Niranda
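For illustration, a minimal in-memory sketch of the overridable surface, assuming the trait's abstract members are persist, unpersist and read (with the final helpers such as readPersistedData built on top of them); once the trait becomes an abstract class, these same three overrides are what a Java subclass would provide. Not durable, and not a real engine:

import scala.collection.concurrent.TrieMap
import scala.reflect.ClassTag

import org.apache.spark.deploy.master.PersistenceEngine

// Keeps everything in a concurrent map; lost on restart, sketch only.
class InMemoryPersistenceEngine extends PersistenceEngine {
  private val store = new TrieMap[String, Object]()

  override def persist(name: String, obj: Object): Unit = store.put(name, obj)

  override def unpersist(name: String): Unit = store.remove(name)

  override def read[T: ClassTag](prefix: String): Seq[T] =
    store.toSeq.collect { case (k, v) if k.startsWith(prefix) => v.asInstanceOf[T] }
}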