Regarding KryoSerialization in Spark

2015-04-30 Thread twinkle sachdeva
Hi,

As per the code, KryoSerialization uses the writeClassAndObject method, which
internally calls the writeClass method, which writes the class of the
object during serialization.

As per the documentation on the Spark tuning page, registering
the class will avoid that.

Am I missing something, or is there some issue with the documentation?

Thanks,
Twinkle


RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
Hi, in a test on Spark SQL 1.3.0, multiple threads are doing selects on the
same SQLContext instance, but the exception below is thrown, so it looks
like SQLContext is NOT thread-safe? I think this is not the desired
behavior.

==

java.lang.RuntimeException: [1.1] failure: ``insert'' expected but
identifier select found

select id ,ext.d from UNIT_TEST
^
 at scala.sys.package$.error(package.scala:27)
 at
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
SQLParser.scala:40)
 at
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
 at
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
 at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
QLParser$$others$1.apply(SparkSQLParser.scala:96)
 at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
QLParser$$others$1.apply(SparkSQLParser.scala:95)
 at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
s.scala:242)
 at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
s.scala:242)
 at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
apply$2.apply(Parsers.scala:254)
 at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
apply$2.apply(Parsers.scala:254)
 at
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
sers.scala:254)
 at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
sers.scala:254)
 at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
rsers.scala:891)
 at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
rsers.scala:891)
 at
scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 at
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParser
s.scala:110)
 at
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
SQLParser.scala:38)
 at
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
la:134)
 at
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
la:134)
 at scala.Option.getOrElse(Option.scala:120)
 at
org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)
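
(For reference, a minimal sketch of the kind of test described above: one
shared SQLContext queried from several threads. The query and table name are
taken from the stack trace; the thread-pool setup is only illustrative.)

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// sqlContext is the single shared org.apache.spark.sql.SQLContext under test
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

// several threads issue the same select concurrently against one SQLContext
val results = (1 to 8).map { _ =>
  Future {
    sqlContext.sql("select id, ext.d from UNIT_TEST").collect()
  }
}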

-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com] 
Sent: Monday, March 02, 2015 9:05 PM
To: Haopu Wang; user
Subject: RE: Is SQLContext thread-safe?

Yes it is thread safe, at least it's supposed to be.

-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com] 
Sent: Monday, March 2, 2015 4:43 PM
To: user
Subject: Is SQLContext thread-safe?

Hi, is it safe to use the same SQLContext to do Select operations in
different threads at the same time? Thank you very much!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Is SQLContext thread-safe?

2015-04-30 Thread Wangfei (X)
Actually this is a SQL parse exception; are you sure your SQL is right?

Sent from my iPhone

 On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote:
 
 Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a
 same SQLContext instance, but below exception is thrown, so it looks
 like SQLContext is NOT thread safe? I think this is not the desired
 behavior.
 
 ==
 
 java.lang.RuntimeException: [1.1] failure: ``insert'' expected but
 identifier select found
 
 select id ,ext.d from UNIT_TEST
 ^
 at scala.sys.package$.error(package.scala:27)
 at
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
 SQLParser.scala:40)
 at
 org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
 at
 org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
 at
 org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
 QLParser$$others$1.apply(SparkSQLParser.scala:96)
 at
 org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
 QLParser$$others$1.apply(SparkSQLParser.scala:95)
 at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
 s.scala:242)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
 s.scala:242)
 at
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
 apply$2.apply(Parsers.scala:254)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
 apply$2.apply(Parsers.scala:254)
 at
 scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
 sers.scala:254)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
 sers.scala:254)
 at
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
 rsers.scala:891)
 at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
 rsers.scala:891)
 at
 scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 at
 scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParser
 s.scala:110)
 at
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
 SQLParser.scala:38)
 at
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
 la:134)
 at
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
 la:134)
 at scala.Option.getOrElse(Option.scala:120)
 at
 org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)
 
 -Original Message-
 From: Cheng, Hao [mailto:hao.ch...@intel.com] 
 Sent: Monday, March 02, 2015 9:05 PM
 To: Haopu Wang; user
 Subject: RE: Is SQLContext thread-safe?
 
 Yes it is thread safe, at least it's supposed to be.
 
 -Original Message-
 From: Haopu Wang [mailto:hw...@qilinsoft.com] 
 Sent: Monday, March 2, 2015 4:43 PM
 To: user
 Subject: Is SQLContext thread-safe?
 
 Hi, is it safe to use the same SQLContext to do Select operations in
 different threads at the same time? Thank you very much!
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


Re: withColumn is very slow with datasets with large number of columns

2015-04-30 Thread alexandre Clement
I have reported the issue on JIRA:
https://issues.apache.org/jira/browse/SPARK-7276

On Thu, Apr 30, 2015 at 4:36 PM, alexandre Clement a.p.clem...@gmail.com
wrote:

 Hi all,


 I'm experiencing a serious performance problem when using withColumn on a
 dataset with a large number of columns. It is very slow: on a dataset with
 100 columns it takes a few seconds.


 The code snippet demonstrates the problem.


 val custs = Seq(
   Row(1, "Bob", 21, 80.5),
   Row(2, "Bobby", 21, 80.5),
   Row(3, "Jean", 21, 80.5),
   Row(4, "Fatime", 21, 80.5)
 )

 var fields = List(
   StructField("id", IntegerType, true),
   StructField("a", IntegerType, true),
   StructField("b", StringType, true),
   StructField("target", DoubleType, false))
 val schema = StructType(fields)

 var rdd = sc.parallelize(custs)
 var df = sqlContext.createDataFrame(rdd, schema)

 for (i <- 1 to 200) {
   val now = System.currentTimeMillis
   df = df.withColumn("a_new_col_" + i, df("a") + i)
   println(s"$i -> " + (System.currentTimeMillis - now))
 }

 df.show()



withColumn is very slow with datasets with large number of columns

2015-04-30 Thread alexandre Clement
Hi all,


I'm experiencing a serious performance problem when using withColumn on a
dataset with a large number of columns. It is very slow: on a dataset with
100 columns it takes a few seconds.


The code snippet demonstrates the problem.


val custs = Seq(
  Row(1, "Bob", 21, 80.5),
  Row(2, "Bobby", 21, 80.5),
  Row(3, "Jean", 21, 80.5),
  Row(4, "Fatime", 21, 80.5)
)

var fields = List(
  StructField("id", IntegerType, true),
  StructField("a", IntegerType, true),
  StructField("b", StringType, true),
  StructField("target", DoubleType, false))
val schema = StructType(fields)

var rdd = sc.parallelize(custs)
var df = sqlContext.createDataFrame(rdd, schema)

for (i <- 1 to 200) {
  val now = System.currentTimeMillis
  df = df.withColumn("a_new_col_" + i, df("a") + i)
  println(s"$i -> " + (System.currentTimeMillis - now))
}

df.show()


Drop column/s in DataFrame

2015-04-30 Thread rakeshchalasani
Hi All:

Is there any plan to add drop column(s) functionality to DataFrame?
One can use the select function to do so, but I find that tedious when only
one or two columns in a large DataFrame are to be dropped.

Pandas has this functionality, which I find handy when constructing feature
vectors from DataFrame, allowing me to carry only useful information
forward.
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
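
For reference, this is the kind of select-based workaround I mean (a minimal
sketch against the Spark 1.3 era API; df and the column names are just
placeholders):

// drop one or two columns by selecting everything else
val columnsToDrop = Set("colA", "colB")
val kept = df.columns.filterNot(columnsToDrop.contains).map(df.apply)
val dfWithoutCols = df.select(kept: _*)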

Rakesh



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Drop-column-s-in-DataFrame-tp11917.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Drop column/s in DataFrame

2015-04-30 Thread Reynold Xin
I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280

Would you like to give it a shot?


On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com
wrote:

 Hi All:

 Is there any plan to add drop column/s functionality in the data frame?
 One can you select function to do so, but I find that tedious when only
 one or two columns in large dataframe are to be dropped.

 Pandas has this functionality, which I find handy when constructing feature
 vectors from DataFrame, allowing me to carry only useful information
 forward.

 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html

 Rakesh



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Drop-column-s-in-DataFrame-tp11917.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote
 What's your schema for the offset table, and what's the definition of
 writeOffset ?

The schema is the same as the one in your post: topic | partition | offset
The writeOffset is nearly identical:

  def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
    logWarning(Thread.currentThread().toString + " writeOffset: " + osr)
    if (osr == null) {
      logWarning("no offset provided")
      return
    }

    val updated = sql"""
      update txn_offsets set off = ${osr.untilOffset}
        where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
      """.update.apply()
    if (updated != 1) {
      throw new Exception(Thread.currentThread().toString +
        s" failed to write offset: ${osr.topic},${osr.partition},${osr.fromOffset}->${osr.untilOffset}")
    } else {
      logWarning(Thread.currentThread().toString + " offsets updated to " + osr.untilOffset)
    }
  }


Cody Koeninger-2 wrote
 What key are you reducing on?  Maybe I'm misreading the code, but it looks
 like the per-partition offset is part of the key.  If that's true then you
 could just do your reduction on each partition, rather than after the fact
 on the whole stream.

Yes, the key is a tuple of a case class called Key and the
partition's OffsetRange. We piggybacked the OffsetRange in this way so it
would be available within the scope of the partition.

I have tried moving the reduceByKey from the end of the .transform block
into the partition level (at the end of the mapPartitionsWithIndex block.)
This is what you're suggesting, yes? The results didn't correct the offset
update behavior; they still get out of sync pretty quickly.

Some details: I'm using the kafka-console-producer.sh tool to drive the
process, calling it three or four times in succession and piping in 100-1000
messages in each call. Once all the messages have been processed I wait for
the output of the printOffsets method to stop changing and compare it to the
txn_offsets table. (When no data is getting processed the printOffsets
method yields something like the following: [ OffsetRange(topic:
'testmulti', partition: 1, range: [23602 - 23602] OffsetRange(topic:
'testmulti', partition: 2, range: [32503 - 32503] OffsetRange(topic:
'testmulti', partition: 0, range: [26100 - 26100] OffsetRange(topic:
'testmulti', partition: 3, range: [20900 - 20900]])

Thanks,
Mark






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/practical-usage-of-the-new-exactly-once-supporting-DirectKafkaInputDStream-tp11916p11921.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Is SQLContext thread-safe?

2015-04-30 Thread Michael Armbrust
Unfortunately, I think the SQLParser is not threadsafe.  I would recommend
using HiveQL.
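
A minimal sketch of that suggestion (assuming a build with Hive support;
HiveContext routes queries through the HiveQL parser rather than the simple
SQL parser):

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext
val hiveContext = new HiveContext(sc)
val df = hiveContext.sql("select id, ext.d from UNIT_TEST")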

On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote:

 actually this is a sql parse exception, are you sure your sql is right?

 Sent from my iPhone

  On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote:
 
  Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a
  same SQLContext instance, but below exception is thrown, so it looks
  like SQLContext is NOT thread safe? I think this is not the desired
  behavior.
 
  ==
 
  java.lang.RuntimeException: [1.1] failure: ``insert'' expected but
  identifier select found
 
  select id ,ext.d from UNIT_TEST
  ^
  at scala.sys.package$.error(package.scala:27)
  at
  org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
  SQLParser.scala:40)
  at
  org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
  at
  org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
  at
  org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
  QLParser$$others$1.apply(SparkSQLParser.scala:96)
  at
  org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkS
  QLParser$$others$1.apply(SparkSQLParser.scala:95)
  at
  scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
  at
  scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
  s.scala:242)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parser
  s.scala:242)
  at
  scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
  apply$2.apply(Parsers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$
  apply$2.apply(Parsers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
  sers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
  sers.scala:254)
  at
  scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at
  scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
  rsers.scala:891)
  at
  scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Pa
  rsers.scala:891)
  at
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
  at
  scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
  at
  scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParser
  s.scala:110)
  at
  org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSpark
  SQLParser.scala:38)
  at
  org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
  la:134)
  at
  org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.sca
  la:134)
  at scala.Option.getOrElse(Option.scala:120)
  at
  org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)
 
  -Original Message-
  From: Cheng, Hao [mailto:hao.ch...@intel.com]
  Sent: Monday, March 02, 2015 9:05 PM
  To: Haopu Wang; user
  Subject: RE: Is SQLContext thread-safe?
 
  Yes it is thread safe, at least it's supposed to be.
 
  -Original Message-
  From: Haopu Wang [mailto:hw...@qilinsoft.com]
  Sent: Monday, March 2, 2015 4:43 PM
  To: user
  Subject: Is SQLContext thread-safe?
 
  Hi, is it safe to use the same SQLContext to do Select operations in
  different threads at the same time? Thank you very much!
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
  commands, e-mail: user-h...@spark.apache.org
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 



Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread Cody Koeninger
What's your schema for the offset table, and what's the definition of
writeOffset ?

What key are you reducing on?  Maybe I'm misreading the code, but it looks
like the per-partition offset is part of the key.  If that's true then you
could just do your reduction on each partition, rather than after the fact
on the whole stream.
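
A minimal sketch of that idea, in the same scalikejdbc style as the code in
this thread (yourReductionFunction and saveResult are placeholders for your
own aggregation and persistence logic; writeOffset, dbUrl, dbUser and dbPass
are from your code below):

import java.sql.DriverManager
import org.apache.spark.streaming.kafka.HasOffsetRanges
import scalikejdbc._

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { case (i, iter) =>
    // reduce with plain Scala inside the partition, so no shuffle breaks the
    // 1:1 mapping between Spark partitions and Kafka partitions
    val result = yourReductionFunction(iter)
    val db = DB(DriverManager.getConnection(dbUrl, dbUser, dbPass))
    try {
      db.localTx { implicit session =>
        saveResult(result)        // persist the aggregate
        writeOffset(offsets(i))   // persist this partition's offset in the same transaction
      }
    } finally {
      db.close()
    }
    Iterator.empty
  }.foreach((_: Nothing) => ()) // force evaluation of the partitions
}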

On Thu, Apr 30, 2015 at 12:10 PM, badgerpants mark.stew...@tapjoy.com
wrote:

 We're a group of experienced backend developers who are fairly new to Spark
 Streaming (and Scala) and very interested in using the new (in 1.3)
 DirectKafkaInputDStream impl as part of the metrics reporting service we're
 building.

 Our flow involves reading in metric events, lightly modifying some of the
 data values, and then creating aggregates via reduceByKey. We're following
 the approach in Cody Koeninger's blog on exactly-once streaming
 (https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md)
 in
 which the Kakfa OffsetRanges are grabbed from the RDD and persisted to a
 tracking table within the same db transaction as the data within said
 ranges.

 Within a short time frame the offsets in the table fall out of synch with
 the offsets. It appears that the writeOffsets method (see code below)
 occasionally doesn't get called which also indicates that some blocks of
 data aren't being processed either; the aggregate operation makes this
 difficult to eyeball from the data that's written to the db.

 Note that we do understand that the reduce operation alters that
 size/boundaries of the partitions we end up processing. Indeed, without the
 reduceByKey operation our code seems to work perfectly. But without the
 reduceByKey operation the db has to perform *a lot* more updates. It's
 certainly a significant restriction to place on what is such a promising
 approach. I'm hoping there simply something we're missing.

 Any workarounds or thoughts are welcome. Here's the code we've got:

 def run(stream: DStream[Input], conf: Config, args: List[String]): Unit = {
 ...
 val sumFunc: (BigDecimal, BigDecimal) = BigDecimal = (_ + _)

 val transformStream = stream.transform { rdd =
   val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   printOffsets(offsets) // just prints out the offsets for reference
   rdd.mapPartitionsWithIndex { case (i, iter) =
 iter.flatMap { case (name, msg) = extractMetrics(msg) }
   .map { case (k,v) = ( ( keyWithFlooredTimestamp(k), offsets(i)
 ),
 v ) }
   }
 }.reduceByKey(sumFunc, 1)

 transformStream.foreachRDD { rdd =
   rdd.foreachPartition { partition =
 val conn = DriverManager.getConnection(dbUrl, dbUser, dbPass)
 val db = DB(conn)
 db.autoClose(false)

 db.autoCommit { implicit session =
   var currentOffset: OffsetRange = null
   partition.foreach { case (key, value) =
 currentOffset = key._2
 writeMetrics(key._1, value, table)
   }
   writeOffset(currentOffset) // updates the offset positions
 }
 db.close()
   }
 }

 Thanks,
 Mark




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/practical-usage-of-the-new-exactly-once-supporting-DirectKafkaInputDStream-tp11916.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [discuss] DataFrame function namespacing

2015-04-30 Thread Ted Yu
IMHO I would go with choice #1

Cheers

On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin r...@databricks.com wrote:

 We definitely still have the name collision problem in SQL.

 On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal 
 punya.bis...@gmail.com
  wrote:

  Do we still have to keep the names of the functions distinct to avoid
  collisions in SQL? Or is there a plan to allow importing a namespace
 into
  SQL somehow?
 
  I ask because if we have to keep worrying about name collisions then I'm
  not sure what the added complexity of #2 and #3 buys us.
 
  Punya
 
  On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:
 
  Scaladoc isn't much of a problem because scaladocs are grouped.
  Java/Python
  is the main problem ...
 
  See
 
 
 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
 
  On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman 
  shiva...@eecs.berkeley.edu wrote:
 
   My feeling is that we should have a handful of namespaces (say 4 or
 5).
  It
   becomes too cumbersome to import / remember more package names and
  having
   everything in one package makes it hard to read scaladoc etc.
  
   Thanks
   Shivaram
  
   On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com
  wrote:
  
   To add a little bit more context, some pros/cons I can think of are:
  
   Option 1: Very easy for users to find the function, since they are
 all
  in
   org.apache.spark.sql.functions. However, there will be quite a large
   number
   of them.
  
   Option 2: I can't tell why we would want this one over Option 3,
 since
  it
   has all the problems of Option 3, and not as nice of a hierarchy.
  
   Option 3: Opposite of Option 1. Each package or static class has a
  small
   number of functions that are relevant to each other, but for some
   functions
   it is unclear where they should go (e.g. should min go into basic
 or
   math?)
  
  
  
  
   On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com
  wrote:
  
Before we make DataFrame non-alpha, it would be great to decide how
  we
want to namespace all the functions. There are 3 alternatives:
   
1. Put all in org.apache.spark.sql.functions. This is how SQL does
  it,
since SQL doesn't have namespaces. I estimate eventually we will
  have ~
   200
functions.
   
2. Have explicit namespaces, which is what master branch currently
  looks
like:
   
- org.apache.spark.sql.functions
- org.apache.spark.sql.mathfunctions
- ...
   
3. Have explicit namespaces, but restructure them slightly so
  everything
is under functions.
   
package object functions {
   
  // all the old functions here -- but deprecated so we keep source
compatibility
  def ...
}
   
package org.apache.spark.sql.functions
   
object mathFunc {
  ...
}
   
object basicFuncs {
  ...
}
   
   
   
  
  
  
 
 



practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
We're a group of experienced backend developers who are fairly new to Spark
Streaming (and Scala) and very interested in using the new (in 1.3)
DirectKafkaInputDStream impl as part of the metrics reporting service we're
building.

Our flow involves reading in metric events, lightly modifying some of the
data values, and then creating aggregates via reduceByKey. We're following
the approach in Cody Koeninger's blog on exactly-once streaming
(https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md) in
which the Kafka OffsetRanges are grabbed from the RDD and persisted to a
tracking table within the same db transaction as the data within said
ranges. 

Within a short time frame the offsets in the table fall out of sync with
the offsets in the stream. It appears that the writeOffset method (see code
below) occasionally doesn't get called, which also indicates that some blocks
of data aren't being processed; the aggregate operation makes this
difficult to eyeball from the data that's written to the db.

Note that we do understand that the reduce operation alters the
size/boundaries of the partitions we end up processing. Indeed, without the
reduceByKey operation our code seems to work perfectly. But without the
reduceByKey operation the db has to perform *a lot* more updates. It's
certainly a significant restriction to place on what is such a promising
approach. I'm hoping there's simply something we're missing.

Any workarounds or thoughts are welcome. Here's the code we've got:

def run(stream: DStream[Input], conf: Config, args: List[String]): Unit = {
...
val sumFunc: (BigDecimal, BigDecimal) => BigDecimal = (_ + _)

val transformStream = stream.transform { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  printOffsets(offsets) // just prints out the offsets for reference
  rdd.mapPartitionsWithIndex { case (i, iter) =>
    iter.flatMap { case (name, msg) => extractMetrics(msg) }
      .map { case (k, v) => ( ( keyWithFlooredTimestamp(k), offsets(i) ), v ) }
  }
}.reduceByKey(sumFunc, 1)

transformStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val conn = DriverManager.getConnection(dbUrl, dbUser, dbPass)
    val db = DB(conn)
    db.autoClose(false)

    db.autoCommit { implicit session =>
      var currentOffset: OffsetRange = null
      partition.foreach { case (key, value) =>
        currentOffset = key._2
        writeMetrics(key._1, value, table)
      }
      writeOffset(currentOffset) // updates the offset positions
    }
    db.close()
  }
}

Thanks,
Mark




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/practical-usage-of-the-new-exactly-once-supporting-DirectKafkaInputDStream-tp11916.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Drop column/s in DataFrame

2015-04-30 Thread Rakesh Chalasani
Sure, I will try sending a PR soon.

On Thu, Apr 30, 2015 at 1:42 PM Reynold Xin r...@databricks.com wrote:

 I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280

 Would you like to give it a shot?


 On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com
 wrote:

 Hi All:

 Is there any plan to add drop column/s functionality in the data frame?
 One can you select function to do so, but I find that tedious when only
 one or two columns in large dataframe are to be dropped.

 Pandas has this functionality, which I find handy when constructing
 feature
 vectors from DataFrame, allowing me to carry only useful information
 forward.

 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html

 Rakesh



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Drop-column-s-in-DataFrame-tp11917.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: Regarding KryoSerialization in Spark

2015-04-30 Thread Sandy Ryza
Hi Twinkle,

Registering the class makes it so that writeClass only writes out a couple
of bytes, instead of the full String of the class name.
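
For reference, a minimal sketch of registering classes up front (MyClass and
MyOtherClass are placeholders for your own classes):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registered classes are written as a small numeric ID instead of the
  // full class name string
  .registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))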

-Sandy

On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva 
twinkle.sachd...@gmail.com wrote:

 Hi,

 As per the code, KryoSerialization used writeClassAndObject method, which
 internally calls writeClass method, which will write the class of the
 object while serilization.

 As per the documentation in tuning page of spark, it says that registering
 the class will avoid that.

 Am I missing something or there is some issue with the documentation???

 Thanks,
 Twinkle



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Koert Kuipers
i am not sure eol means much if it is still actively used. we have a lot of
clients with centos 5 (for which we still support python 2.4 in some form
or another, fun!). most of them are on centos 6, which means python 2.6. by
cutting out python 2.6 you would cut out the majority of the actual
clusters i am aware of. unless your intention is to truly make something
academic i don't think that is wise.

On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 (On that note, I think Python 2.6 should be next on the chopping block
 sometime later this year, but that’s for another thread.)

 (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of
 2013. https://www.python.org/download/releases/2.6.9/)
 ​

 On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
 nicholas.cham...@gmail.com
 wrote:

  I understand the concern about cutting out users who still use Java 6,
 and
  I don't have numbers about how many people are still using Java 6.
 
  But I want to say at a high level that I support deprecating older
  versions of stuff to reduce our maintenance burden and let us use more
  modern patterns in our code.
 
  Maintenance always costs way more than initial development over the
  lifetime of a project, and for that reason anti-support is just as
  important as support.
 
  (On that note, I think Python 2.6 should be next on the chopping block
  sometime later this year, but that's for another thread.)
 
  Nick
 
 
  On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:
 
  This has been discussed a few times in the past, but now Oracle has
 ended
  support for Java 6 for over a year, I wonder if we should just drop
 Java 6
  support.
 
  There is one outstanding issue Tom has brought to my attention: PySpark
 on
  YARN doesn't work well with Java 7/8, but we have an outstanding pull
  request to fix that.
 
  https://issues.apache.org/jira/browse/SPARK-6869
  https://issues.apache.org/jira/browse/SPARK-1920
 
 



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nicholas Chammas
I understand the concern about cutting out users who still use Java 6, and
I don't have numbers about how many people are still using Java 6.

But I want to say at a high level that I support deprecating older versions
of stuff to reduce our maintenance burden and let us use more modern
patterns in our code.

Maintenance always costs way more than initial development over the
lifetime of a project, and for that reason anti-support is just as
important as support.

(On that note, I think Python 2.6 should be next on the chopping block
sometime later this year, but that's for another thread.)

Nick


On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

 This has been discussed a few times in the past, but now Oracle has ended
 support for Java 6 for over a year, I wonder if we should just drop Java 6
 support.

 There is one outstanding issue Tom has brought to my attention: PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding pull
 request to fix that.

 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920



Re: [discuss] ending support for Java 6?

2015-04-30 Thread shane knapp
something to keep in mind:  we can easily support java 6 for the build
environment, particularly if there's a definite EOL.

i'd like to fix our java versioning 'problem', and this could be a big
instigator...  right now we're hackily setting java_home in test invocation
on jenkins, which really isn't the best.  if i decide, within jenkins, to
reconfigure every build to 'do the right thing' WRT java version, then i
will clean up the old mess and pay down on some technical debt.

or i can just install java 6 and we use that as JAVA_HOME on a
build-by-build basis.

this will be a few days of prep and another morning-long downtime if i do
the right thing (within jenkins), and only a couple of hours the hacky way
(system level).

either way, we can test on java 6.  :)

On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:

 nicholas started it! :)

 for java 6 i would have said the same thing about 1 year ago: it is foolish
 to drop it. but i think the time is right about now.
 about half our clients are on java 7 and the other half have active plans
 to migrate to it within 6 months.

 On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

  Guys thanks for chiming in, but please focus on Java here. Python is an
  entirely separate issue.
 
 
  On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
  i am not sure eol means much if it is still actively used. we have a lot
  of clients with centos 5 (for which we still support python 2.4 in some
  form or another, fun!). most of them are on centos 6, which means python
  2.6. by cutting out python 2.6 you would cut out the majority of the
 actual
  clusters i am aware of. unless you intention is to truly make something
  academic i dont think that is wise.
 
  On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  (On that note, I think Python 2.6 should be next on the chopping block
  sometime later this year, but that’s for another thread.)
 
  (To continue the parenthetical, Python 2.6 was in fact EOL-ed in
 October
  of
  2013. https://www.python.org/download/releases/2.6.9/)
  ​
 
  On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
  nicholas.cham...@gmail.com
  wrote:
 
   I understand the concern about cutting out users who still use Java
 6,
  and
   I don't have numbers about how many people are still using Java 6.
  
   But I want to say at a high level that I support deprecating older
   versions of stuff to reduce our maintenance burden and let us use
 more
   modern patterns in our code.
  
   Maintenance always costs way more than initial development over the
   lifetime of a project, and for that reason anti-support is just as
   important as support.
  
   (On that note, I think Python 2.6 should be next on the chopping
 block
   sometime later this year, but that's for another thread.)
  
   Nick
  
  
   On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
  wrote:
  
   This has been discussed a few times in the past, but now Oracle has
  ended
   support for Java 6 for over a year, I wonder if we should just drop
  Java 6
   support.
  
   There is one outstanding issue Tom has brought to my attention:
  PySpark on
   YARN doesn't work well with Java 7/8, but we have an outstanding
 pull
   request to fix that.
  
   https://issues.apache.org/jira/browse/SPARK-6869
   https://issues.apache.org/jira/browse/SPARK-1920
  
  
 
 
 
 



Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread Cody Koeninger
In fact, you're using the 2-arg form of reduceByKey to shrink it down to
1 partition:

 reduceByKey(sumFunc, 1)

But you started with 4 kafka partitions?  So they're definitely no longer
1:1


On Thu, Apr 30, 2015 at 1:58 PM, Cody Koeninger c...@koeninger.org wrote:

 This is what I'm suggesting, in pseudocode

  rdd.mapPartitionsWithIndex { case (i, iter) =>
     offset = offsets(i)
     result = yourReductionFunction(iter)
     transaction {
        save(result)
        save(offset)
     }
  }.foreach { (_: Nothing) => () }

 where yourReductionFunction is just normal scala code.

 The code you posted looks like you're only saving offsets once per
 partition, but you're doing it after reduceByKey.  Reduction steps in spark
 imply a shuffle.  After a shuffle you no longer have a guaranteed 1:1
 correspondence between Spark partition and Kafka partition.  If you want to
 verify that's what the problem is, log the value of currentOffset whenever
 it changes.



 On Thu, Apr 30, 2015 at 1:38 PM, badgerpants mark.stew...@tapjoy.com
 wrote:

 Cody Koeninger-2 wrote
  What's your schema for the offset table, and what's the definition of
  writeOffset ?

 The schema is the same as the one in your post: topic | partition| offset
 The writeOffset is nearly identical:

   def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
 logWarning(Thread.currentThread().toString + writeOffset:  + osr)
 if(osr==null) {
   logWarning(no offset provided)
   return
 }

 val updated = sql
 update txn_offsets set off = ${osr.untilOffset}
   where topic = ${osr.topic} and part = ${osr.partition} and off =
 ${osr.fromOffset}
 .update.apply()
 if (updated != 1) {
   throw new Exception( Thread.currentThread().toString + sfailed to
 write offset:
 ${osr.topic},${osr.partition},${osr.fromOffset}-${osr.untilOffset})
 } else {
   logWarning(Thread.currentThread().toString + offsets updated to  +
 osr.untilOffset)
 }

   }


 Cody Koeninger-2 wrote
  What key are you reducing on?  Maybe I'm misreading the code, but it
 looks
  like the per-partition offset is part of the key.  If that's true then
 you
  could just do your reduction on each partition, rather than after the
 fact
  on the whole stream.

 Yes, the key is a duple comprised of a case class called Key and the
 partition's OffsetRange. We piggybacked the OffsetRange in this way so it
 would be available within the scope of the partition.

 I have tried moving the reduceByKey from the end of the .transform block
 into the partition level (at the end of the mapPartitionsWithIndex block.)
 This is what you're suggesting, yes? The results didn't correct the offset
 update behavior; they still get out of sync pretty quickly.

 Some details: I'm using the kafka-console-producer.sh tool to drive the
 process, calling it three or four times in succession and piping in
 100-1000
 messages in each call. Once all the messages have been processed I wait
 for
 the output of the printOffsets method to stop changing and compare it to
 the
 txn_offsets table. (When no data is getting processed the printOffsets
 method yields something like the following: [ OffsetRange(topic:
 'testmulti', partition: 1, range: [23602 - 23602] OffsetRange(topic:
 'testmulti', partition: 2, range: [32503 - 32503] OffsetRange(topic:
 'testmulti', partition: 0, range: [26100 - 26100] OffsetRange(topic:
 'testmulti', partition: 3, range: [20900 - 20900]])

 Thanks,
 Mark






 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/practical-usage-of-the-new-exactly-once-supporting-DirectKafkaInputDStream-tp11916p11921.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: Pickling error when attempting to add a method in pyspark

2015-04-30 Thread Stephen Boesch
Bumping this.  Does anyone have some familiarity with the py4j interface in
pyspark?

thanks


2015-04-27 22:09 GMT-07:00 Stephen Boesch java...@gmail.com:


 My intention is to add pyspark support for certain mllib spark methods.  I
 have been unable to resolve pickling errors of the form

Pyspark py4j PickleException: “expected zero arguments for
 construction of ClassDict”
 http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class

 These are occurring during python to java conversion of python named
 tuples.  The details are rather hard to provide here so I have created an
 SOF question


 http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class

 In any case I have included the text here. The SOF is easier to read
 though ;)

 --

 This question is directed towards persons familiar with py4j - and can
 help to resolve a pickling error. I am trying to add a method to the
 pyspark PythonMLLibAPI that accepts an RDD of a namedtuple, does some work,
 and returns a result in the form of an RDD.

 This method is modeled after the PythonMLLibAPI.trainALSModel() method,
 whose analogous *existing* relevant portions are:

   def trainALSModel(
 ratingsJRDD: JavaRDD[Rating],
 .. )

 The *existing* python Rating class used to model the new code is:

 class Rating(namedtuple("Rating", ["user", "product", "rating"])):
     def __reduce__(self):
         return Rating, (int(self.user), int(self.product), float(self.rating))

 Here is the attempt. So here are the relevant classes:

 *New* python class pyspark.mllib.clustering.MatrixEntry:

 from collections import namedtuple

 class MatrixEntry(namedtuple("MatrixEntry", ["x", "y", "weight"])):
     def __reduce__(self):
         return MatrixEntry, (long(self.x), long(self.y), float(self.weight))

 *New* method *foobarRdd* in PythonMLLibAPI:

   def foobarRdd(
       data: JavaRDD[MatrixEntry]): RDD[FooBarResult] = {
     val rdd = data.rdd.map { d =>
       FooBarResult(d.i, d.j, d.value, d.i * 100 + d.j * 10 + d.value) }
     rdd
   }

 Now let us try it out:

 from pyspark.mllib.clustering import MatrixEntry

 def convert_to_MatrixEntry(tuple):
     return MatrixEntry(*tuple)

 from pyspark.mllib.clustering import *
 pic = PowerIterationClusteringModel(2)
 tups = [(1,2,3),(4,5,6),(12,13,14),(15,7,8),(16,17,16.5)]
 trdd = sc.parallelize(map(convert_to_MatrixEntry, tups))

 # print out the RDD on python side just for validation
 print "%s" % (repr(trdd.collect()))

 from pyspark.mllib.common import callMLlibFunc
 pic = callMLlibFunc("foobar", trdd)

 Relevant portions of results:

 [(1,2)=3.0, (4,5)=6.0, (12,13)=14.0, (15,7)=8.0, (16,17)=16.5]

 which shows the input rdd is 'whole'. However the pickling was unhappy:

 5/04/27 21:15:44 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 14)
 net.razorvine.pickle.PickleException: expected zero arguments for 
 construction of ClassDict(for pyspark.mllib.clustering.MatrixEntry)
 at 
 net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
 at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
 at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
 at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
 at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
 at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1167)
 at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1166)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
 at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
 at 

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Patrick Wendell
I'd also support this. In general, I think it's good that we try to
have Spark support different versions of things (Hadoop, Hive, etc).
But at some point you need to weigh the costs of doing so against the
number of users affected.

In the case of Java 6, we are seeing increasing cost from this. Some
of the newer unsafe code is not supported in Java 6 (and it's a pretty
large internal initiative). And the ability to upgrade dependencies is
starting to cause pain for users. Sean and I had to wontfix an
important bug fix for users because the library requires JRE 7.

On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:
 nicholas started it! :)

 for java 6 i would have said the same thing about 1 year ago: it is foolish
 to drop it. but i think the time is right about now.
 about half our clients are on java 7 and the other half have active plans
 to migrate to it within 6 months.

 On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

 Guys thanks for chiming in, but please focus on Java here. Python is an
 entirely separate issue.


 On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote:

 i am not sure eol means much if it is still actively used. we have a lot
 of clients with centos 5 (for which we still support python 2.4 in some
 form or another, fun!). most of them are on centos 6, which means python
 2.6. by cutting out python 2.6 you would cut out the majority of the actual
 clusters i am aware of. unless you intention is to truly make something
 academic i dont think that is wise.

 On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 (On that note, I think Python 2.6 should be next on the chopping block
 sometime later this year, but that's for another thread.)

 (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October
 of
 2013. https://www.python.org/download/releases/2.6.9/)


 On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
 nicholas.cham...@gmail.com
 wrote:

  I understand the concern about cutting out users who still use Java 6,
 and
  I don't have numbers about how many people are still using Java 6.
 
  But I want to say at a high level that I support deprecating older
  versions of stuff to reduce our maintenance burden and let us use more
  modern patterns in our code.
 
  Maintenance always costs way more than initial development over the
  lifetime of a project, and for that reason anti-support is just as
  important as support.
 
  (On that note, I think Python 2.6 should be next on the chopping block
  sometime later this year, but that's for another thread.)
 
  Nick
 
 
  On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
 wrote:
 
  This has been discussed a few times in the past, but now Oracle has
 ended
  support for Java 6 for over a year, I wonder if we should just drop
 Java 6
  support.
 
  There is one outstanding issue Tom has brought to my attention:
 PySpark on
  YARN doesn't work well with Java 7/8, but we have an outstanding pull
  request to fix that.
 
  https://issues.apache.org/jira/browse/SPARK-6869
  https://issues.apache.org/jira/browse/SPARK-1920
 
 





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Punyashloka Biswal
I'm in favor of ending support for Java 6. We should also articulate a
policy on how long we want to support current and future versions of Java
after Oracle declares them EOL (Java 7 will be in that bucket in a matter
of days).

Punya
On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote:

 something to keep in mind:  we can easily support java 6 for the build
 environment, particularly if there's a definite EOL.

 i'd like to fix our java versioning 'problem', and this could be a big
 instigator...  right now we're hackily setting java_home in test invocation
 on jenkins, which really isn't the best.  if i decide, within jenkins, to
 reconfigure every build to 'do the right thing' WRT java version, then i
 will clean up the old mess and pay down on some technical debt.

 or i can just install java 6 and we use that as JAVA_HOME on a
 build-by-build basis.

 this will be a few days of prep and another morning-long downtime if i do
 the right thing (within jenkins), and only a couple of hours the hacky way
 (system level).

 either way, we can test on java 6.  :)

 On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:

  nicholas started it! :)
 
  for java 6 i would have said the same thing about 1 year ago: it is
 foolish
  to drop it. but i think the time is right about now.
  about half our clients are on java 7 and the other half have active plans
  to migrate to it within 6 months.
 
  On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com
 wrote:
 
   Guys thanks for chiming in, but please focus on Java here. Python is an
   entirely separate issue.
  
  
   On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
   i am not sure eol means much if it is still actively used. we have a
 lot
   of clients with centos 5 (for which we still support python 2.4 in
 some
   form or another, fun!). most of them are on centos 6, which means
 python
   2.6. by cutting out python 2.6 you would cut out the majority of the
  actual
   clusters i am aware of. unless you intention is to truly make
 something
   academic i dont think that is wise.
  
   On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
   nicholas.cham...@gmail.com wrote:
  
   (On that note, I think Python 2.6 should be next on the chopping
 block
   sometime later this year, but that’s for another thread.)
  
   (To continue the parenthetical, Python 2.6 was in fact EOL-ed in
  October
   of
   2013. https://www.python.org/download/releases/2.6.9/)
   ​
  
   On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
   nicholas.cham...@gmail.com
   wrote:
  
I understand the concern about cutting out users who still use Java
  6,
   and
I don't have numbers about how many people are still using Java 6.
   
But I want to say at a high level that I support deprecating older
versions of stuff to reduce our maintenance burden and let us use
  more
modern patterns in our code.
   
Maintenance always costs way more than initial development over the
lifetime of a project, and for that reason anti-support is just
 as
important as support.
   
(On that note, I think Python 2.6 should be next on the chopping
  block
sometime later this year, but that's for another thread.)
   
Nick
   
   
On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
   wrote:
   
This has been discussed a few times in the past, but now Oracle
 has
   ended
support for Java 6 for over a year, I wonder if we should just
 drop
   Java 6
support.
   
There is one outstanding issue Tom has brought to my attention:
   PySpark on
YARN doesn't work well with Java 7/8, but we have an outstanding
  pull
request to fix that.
   
https://issues.apache.org/jira/browse/SPARK-6869
https://issues.apache.org/jira/browse/SPARK-1920
   
   
  
  
  
  
 



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Sree V
Hi Team,
Should we take this opportunity to lay out and evangelize a pattern for EOL of
dependencies? I propose we follow the official EOL of Java, Python, Scala,
etc., and add say 6-12-24 months depending on the popularity.

For example: Java 6 official EOL was Feb 2013; add 6-12 months, giving Aug 2013 - Feb 2014
as the official End of Support for Java 6 in Spark; announce 3-6 months prior to EOS.

Thanking you.

With Regards
Sree 


 On Thursday, April 30, 2015 1:41 PM, Marcelo Vanzin van...@cloudera.com 
wrote:
   

 As for the idea, I'm +1. Spark is the only reason I still have jdk6
around - exactly because I don't want to cause the issue that started
this discussion (inadvertently using JDK7 APIs). And as has been
pointed out, even J7 is about to go EOL real soon.

Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is
already j7-only. And when Hadoop moves away from something, it's an
event worthy of headlines. They're still on Jetty 6!

As for pyspark, https://github.com/apache/spark/pull/5580 should get
rid of the last incompatibility with large assemblies, by keeping the
python files in separate archives. If we remove support for Java 6,
then we don't need to worry about the size of the assembly anymore.

On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote:
 I'm firmly in favor of this.

 It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and
 avoid any more of the long-standing 64K file limit thing that's still
 a problem for PySpark.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



   

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote
 In fact, you're using the 2 arg form of reduce by key to shrink it down to
 1 partition
 
  reduceByKey(sumFunc, 1)
 
 But you started with 4 kafka partitions?  So they're definitely no longer
 1:1

True. I added the second arg because we were seeing multiple threads
attempting to update the same offset. Setting it to 1 prevented that but
doesn't fix the core issue.


Cody Koeninger-2 wrote
 This is what I'm suggesting, in pseudocode

  rdd.mapPartitionsWithIndex { case (i, iter) =>
     offset = offsets(i)
     result = yourReductionFunction(iter)
     transaction {
        save(result)
        save(offset)
     }
  }.foreach { (_: Nothing) => () }

 where yourReductionFunction is just normal scala code.


I'll give this a try. Thanks, Cody.





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/practical-usage-of-the-new-exactly-once-supporting-DirectKafkaInputDStream-tp11916p11928.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Sean Owen
I'm firmly in favor of this.

It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and
avoid any more of the long-standing 64K file limit thing that's still
a problem for PySpark.

As a point of reference, CDH5 has never supported Java 6, and it was
released over a year ago.

On Thu, Apr 30, 2015 at 8:02 PM, Reynold Xin r...@databricks.com wrote:
 This has been discussed a few times in the past, but now Oracle has ended
 support for Java 6 for over a year, I wonder if we should just drop Java 6
 support.

 There is one outstanding issue Tom has brought to my attention: PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding pull
 request to fix that.

 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Marcelo Vanzin
As for the idea, I'm +1. Spark is the only reason I still have jdk6
around - exactly because I don't want to cause the issue that started
this discussion (inadvertently using JDK7 APIs). And as has been
pointed out, even J7 is about to go EOL real soon.

Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is
already j7-only. And when Hadoop moves away from something, it's an
event worthy of headlines. They're still on Jetty 6!

As for pyspark, https://github.com/apache/spark/pull/5580 should get
rid of the last incompatibility with large assemblies, by keeping the
python files in separate archives. If we remove support for Java 6,
then we don't need to worry about the size of the assembly anymore.

On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote:
 I'm firmly in favor of this.

 It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and
 avoid any more of the long-standing 64K file limit thing that's still
 a problem for PySpark.

-- 
Marcelo




[discuss] ending support for Java 6?

2015-04-30 Thread Reynold Xin
This has been discussed a few times in the past, but now that Oracle ended
support for Java 6 over a year ago, I wonder if we should just drop Java 6
support.

There is one outstanding issue Tom has brought to my attention: PySpark on
YARN doesn't work well with Java 7/8, but we have an outstanding pull
request to fix that.

https://issues.apache.org/jira/browse/SPARK-6869
https://issues.apache.org/jira/browse/SPARK-1920


Re: [discuss] ending support for Java 6?

2015-04-30 Thread Koert Kuipers
nicholas started it! :)

for java 6 i would have said the same thing about 1 year ago: it is foolish
to drop it. but i think the time is right about now.
about half our clients are on java 7 and the other half have active plans
to migrate to it within 6 months.

On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote:

 Guys thanks for chiming in, but please focus on Java here. Python is an
 entirely separate issue.


 On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote:

 i am not sure eol means much if it is still actively used. we have a lot
 of clients with centos 5 (for which we still support python 2.4 in some
 form or another, fun!). most of them are on centos 6, which means python
 2.6. by cutting out python 2.6 you would cut out the majority of the actual
 clusters i am aware of. unless your intention is to truly make something
 academic i don't think that is wise.

 On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 (On that note, I think Python 2.6 should be next on the chopping block
 sometime later this year, but that’s for another thread.)

 (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October
 of
 2013. https://www.python.org/download/releases/2.6.9/)
 ​

 On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
 nicholas.cham...@gmail.com
 wrote:

  I understand the concern about cutting out users who still use Java 6,
 and
  I don't have numbers about how many people are still using Java 6.
 
  But I want to say at a high level that I support deprecating older
  versions of stuff to reduce our maintenance burden and let us use more
  modern patterns in our code.
 
  Maintenance always costs way more than initial development over the
  lifetime of a project, and for that reason anti-support is just as
  important as support.
 
  (On that note, I think Python 2.6 should be next on the chopping block
  sometime later this year, but that's for another thread.)
 
  Nick
 
 
  On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
 wrote:
 
  This has been discussed a few times in the past, but now Oracle has
 ended
  support for Java 6 for over a year, I wonder if we should just drop
 Java 6
  support.
 
  There is one outstanding issue Tom has brought to my attention:
 PySpark on
  YARN doesn't work well with Java 7/8, but we have an outstanding pull
  request to fix that.
 
  https://issues.apache.org/jira/browse/SPARK-6869
  https://issues.apache.org/jira/browse/SPARK-1920
 
 






Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ted Yu
+1 on ending support for Java 6.

BTW from https://www.java.com/en/download/faq/java_7.xml :
After April 2015, Oracle will no longer post updates of Java SE 7 to its
public download sites.

On Thu, Apr 30, 2015 at 1:34 PM, Punyashloka Biswal punya.bis...@gmail.com
wrote:

 I'm in favor of ending support for Java 6. We should also articulate a
 policy on how long we want to support current and future versions of Java
 after Oracle declares them EOL (Java 7 will be in that bucket in a matter
 of days).

 Punya
 On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote:

  something to keep in mind:  we can easily support java 6 for the build
  environment, particularly if there's a definite EOL.
 
  i'd like to fix our java versioning 'problem', and this could be a big
  instigator...  right now we're hackily setting java_home in test
 invocation
  on jenkins, which really isn't the best.  if i decide, within jenkins, to
  reconfigure every build to 'do the right thing' WRT java version, then i
  will clean up the old mess and pay down on some technical debt.
 
  or i can just install java 6 and we use that as JAVA_HOME on a
  build-by-build basis.
 
  this will be a few days of prep and another morning-long downtime if i do
  the right thing (within jenkins), and only a couple of hours the hacky
 way
  (system level).
 
  either way, we can test on java 6.  :)
 
  On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
   nicholas started it! :)
  
   for java 6 i would have said the same thing about 1 year ago: it is
  foolish
   to drop it. but i think the time is right about now.
   about half our clients are on java 7 and the other half have active
 plans
   to migrate to it within 6 months.
  
   On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com
  wrote:
  
Guys thanks for chiming in, but please focus on Java here. Python is
 an
entirely separate issue.
   
   
On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com
   wrote:
   
i am not sure eol means much if it is still actively used. we have a
  lot
of clients with centos 5 (for which we still support python 2.4 in
  some
form or another, fun!). most of them are on centos 6, which means
  python
2.6. by cutting out python 2.6 you would cut out the majority of the
   actual
clusters i am aware of. unless your intention is to truly make
  something
academic i don't think that is wise.
   
On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:
   
(On that note, I think Python 2.6 should be next on the chopping
  block
sometime later this year, but that’s for another thread.)
   
(To continue the parenthetical, Python 2.6 was in fact EOL-ed in
   October
of
2013. https://www.python.org/download/releases/2.6.9/)
​
   
On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
nicholas.cham...@gmail.com
wrote:
   
 I understand the concern about cutting out users who still use
 Java
   6,
and
 I don't have numbers about how many people are still using Java
 6.

 But I want to say at a high level that I support deprecating
 older
 versions of stuff to reduce our maintenance burden and let us use
   more
 modern patterns in our code.

 Maintenance always costs way more than initial development over
 the
 lifetime of a project, and for that reason anti-support is just
  as
 important as support.

 (On that note, I think Python 2.6 should be next on the chopping
   block
 sometime later this year, but that's for another thread.)

 Nick


 On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
 
wrote:

 This has been discussed a few times in the past, but now Oracle
  has
ended
 support for Java 6 for over a year, I wonder if we should just
  drop
Java 6
 support.

 There is one outstanding issue Tom has brought to my attention:
PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding
   pull
 request to fix that.

 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920


   
   
   
   
  
 



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nicholas Chammas
(On that note, I think Python 2.6 should be next on the chopping block
sometime later this year, but that’s for another thread.)

(To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of
2013. https://www.python.org/download/releases/2.6.9/)
​

On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 I understand the concern about cutting out users who still use Java 6, and
 I don't have numbers about how many people are still using Java 6.

 But I want to say at a high level that I support deprecating older
 versions of stuff to reduce our maintenance burden and let us use more
 modern patterns in our code.

 Maintenance always costs way more than initial development over the
 lifetime of a project, and for that reason anti-support is just as
 important as support.

 (On that note, I think Python 2.6 should be next on the chopping block
 sometime later this year, but that's for another thread.)

 Nick


 On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote:

 This has been discussed a few times in the past, but now Oracle has ended
 support for Java 6 for over a year, I wonder if we should just drop Java 6
 support.

 There is one outstanding issue Tom has brought to my attention: PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding pull
 request to fix that.

 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920




Re: Regarding KryoSerialization in Spark

2015-04-30 Thread twinkle sachdeva
Thanks for the info.


On Fri, May 1, 2015 at 12:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Twinkle,

 Registering the class makes it so that writeClass only writes out a couple
 bytes, instead of a full String of the class name.

 -Sandy

 On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi,

 As per the code, KryoSerialization used writeClassAndObject method, which
 internally calls writeClass method, which will write the class of the
 object while serilization.

 As per the documentation in tuning page of spark, it says that registering
 the class will avoid that.

 Am I missing something or there is some issue with the documentation???

 Thanks,
 Twinkle
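
Concretely, registration is a SparkConf-level setting; a minimal sketch, where
MyRecord and MyOtherRecord stand in for your own classes:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord], classOf[MyOtherRecord]))

    // With the classes registered, writeClass emits a small numeric id for each
    // registered class instead of the fully qualified class name string.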





Uninitialized session in HiveContext?

2015-04-30 Thread Marcelo Vanzin
Hey all,

We ran into some test failures in our internal branch (which builds
against Hive 1.1), and I narrowed it down to the fix below. I'm not
super familiar with the Hive integration code, but does this look like
a bug for other versions of Hive too?

This caused an error where some internal Hive configuration values that are
initialized by the session were not available.

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
index dd06b26..6242745 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
@@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
     if (conf.dialect == "sql") {
       super.sql(substituted)
     } else if (conf.dialect == "hiveql") {
+      // Make sure Hive session state is initialized.
+      if (SessionState.get() != sessionState) {
+        SessionState.start(sessionState)
+      }
       val ddlPlan = ddlParserWithHiveQL.parse(sqlText, exceptionOnError = false)
       DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
     } else {



-- 
Marcelo




Re: [discuss] ending support for Java 6?

2015-04-30 Thread Vinod Kumar Vavilapalli
FYI, after enough consideration, we the Hadoop community dropped support for 
JDK 6 starting release Apache Hadoop 2.7.x.

Thanks
+Vinod

On Apr 30, 2015, at 12:02 PM, Reynold Xin r...@databricks.com wrote:

 This has been discussed a few times in the past, but now Oracle has ended
 support for Java 6 for over a year, I wonder if we should just drop Java 6
 support.
 
 There is one outstanding issue Tom has brought to my attention: PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding pull
 request to fix that.
 
 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920





Re: Uninitialized session in HiveContext?

2015-04-30 Thread Marcelo Vanzin
Hi Michael,

It would be great to see changes to make hive integration less
painful, and I can test them in our environment once you have a patch.

But I guess my question is a little more geared towards the current
code; doesn't the issue I ran into affect 1.4 and potentially earlier
versions too?


On Thu, Apr 30, 2015 at 5:19 PM, Michael Armbrust
mich...@databricks.com wrote:
 Hey Marcelo,

 Thanks for the heads up!  I'm currently in the process of refactoring all of
 this (to separate the metadata connection from the execution side) and as
 part of this I'm making the initialization of the session not lazy.  It
 would be great to hear if this also works for your internal integration
 tests once the patch is up (hopefully this weekend).

 Michael

 On Thu, Apr 30, 2015 at 2:36 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Hey all,

 We ran into some test failures in our internal branch (which builds
 against Hive 1.1), and I narrowed it down to the fix below. I'm not
 super familiar with the Hive integration code, but does this look like
 a bug for other versions of Hive too?

 This caused an error where some internal Hive configuration that is
 initialized by the session were not available.

 diff --git
 a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 index dd06b26..6242745 100644
 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 @@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends
 SQLContext(sc) {
  if (conf.dialect == sql) {
super.sql(substituted)
  } else if (conf.dialect == hiveql) {
 +  // Make sure Hive session state is initialized.
 +  if (SessionState.get() != sessionState) {
 +SessionState.start(sessionState)
 +  }
val ddlPlan = ddlParserWithHiveQL.parse(sqlText,
 exceptionOnError = false)
DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
  }  else {



 --
 Marcelo






-- 
Marcelo




Re: Mima test failure in the master branch?

2015-04-30 Thread zhazhan
Any PR open for this?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Mima-test-failure-in-the-master-branch-tp11949p11950.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Mima test failure in the master branch?

2015-04-30 Thread Ted Yu
Looks like this has been taken care of:

commit beeafcfd6ee1e460c4d564cd1515d8781989b422
Author: Patrick Wendell patr...@databricks.com
Date:   Thu Apr 30 20:33:36 2015 -0700

Revert [SPARK-5213] [SQL] Pluggable SQL Parser Support

On Thu, Apr 30, 2015 at 7:58 PM, zhazhan zzh...@hortonworks.com wrote:

 [info] spark-sql: found 1 potential binary incompatibilities (filtered 129)
 [error] * method sqlParser()org.apache.spark.sql.SparkSQLParser in class
 org.apache.spark.sql.SQLContext does not have a correspondent in new
 version
 [error] filter with: ProblemFilters.excludeMissingMethodProblem



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Mima-test-failure-in-the-master-branch-tp11949.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.





Mima test failure in the master branch?

2015-04-30 Thread zhazhan
[info] spark-sql: found 1 potential binary incompatibilities (filtered 129)
[error] * method sqlParser()org.apache.spark.sql.SparkSQLParser in class
org.apache.spark.sql.SQLContext does not have a correspondent in new version
[error] filter with: ProblemFilters.excludeMissingMethodProblem
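
When a binary-incompatible change like this is intentional, the usual remedy is
an exclusion in project/MimaExcludes.scala rather than a revert; a rough sketch
based on the method named in the error above (exact placement within that file
varies by release):

    import com.typesafe.tools.mima.core.ProblemFilters
    import com.typesafe.tools.mima.core.MissingMethodProblem

    // added to the exclusion list for the version being checked
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.sql.SQLContext.sqlParser")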



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Mima-test-failure-in-the-master-branch-tp11949.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Mima test failure in the master branch?

2015-04-30 Thread Patrick Wendell
I reverted the patch that I think was causing this: SPARK-5213

Thanks

On Thu, Apr 30, 2015 at 7:59 PM, zhazhan zzh...@hortonworks.com wrote:
 Any PR open for this?



 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Mima-test-failure-in-the-master-branch-tp11949p11950.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.






Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ted Yu
But it is hard to know how long customers stay with their most recent
download.

Cheers

On Thu, Apr 30, 2015 at 2:26 PM, Sree V sree_at_ch...@yahoo.com.invalid
wrote:

 If there is any possibility of getting the download counts, then we can use
 it as an EOS criterion as well. Say, if download counts are lower than 30% (or
 another number) of the lifetime highest, then it qualifies for EOS.

 Thanking you.

 With Regards
 Sree


  On Thursday, April 30, 2015 2:22 PM, Sree V
 sree_at_ch...@yahoo.com.INVALID wrote:


  Hi Team,
 Should we take this opportunity to lay out and evangelize a pattern for EOL
 of dependencies? I propose we follow the official EOL of Java, Python,
 Scala, ... and add, say, 6-12-24 months depending on the popularity.
 Java 6 official EOL: Feb 2013. Add 6-12 months: Aug 2013 - Feb 2014 official
 End of Support for Java 6 in Spark. Announce 3-6 months prior to EOS.

 Thanking you.

 With Regards
 Sree


 On Thursday, April 30, 2015 1:41 PM, Marcelo Vanzin 
 van...@cloudera.com wrote:


  As for the idea, I'm +1. Spark is the only reason I still have jdk6
 around - exactly because I don't want to cause the issue that started
 this discussion (inadvertently using JDK7 APIs). And as has been
 pointed out, even J7 is about to go EOL real soon.

 Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is
 already j7-only. And when Hadoop moves away from something, it's an
 event worthy of headlines. They're still on Jetty 6!

 As for pyspark, https://github.com/apache/spark/pull/5580 should get
 rid of the last incompatibility with large assemblies, by keeping the
 python files in separate archives. If we remove support for Java 6,
 then we don't need to worry about the size of the assembly anymore.

 On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote:
  I'm firmly in favor of this.
 
  It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and
  avoid any more of the long-standing 64K file limit thing that's still
  a problem for PySpark.

 --
 Marcelo










Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ram Sriharsha
+1 for end of support for Java 6 


 On Thursday, April 30, 2015 3:08 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:
   

 FYI, after enough consideration, we the Hadoop community dropped support for 
JDK 6 starting release Apache Hadoop 2.7.x.

Thanks
+Vinod

On Apr 30, 2015, at 12:02 PM, Reynold Xin r...@databricks.com wrote:

 This has been discussed a few times in the past, but now Oracle has ended
 support for Java 6 for over a year, I wonder if we should just drop Java 6
 support.
 
 There is one outstanding issue Tom has brought to my attention: PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding pull
 request to fix that.
 
 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920




   

Re: Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos

2015-04-30 Thread Yang Lei
I finally isolated the issue to the ActorSystem I reuse from
SparkEnv.get.actorSystem. This ActorSystem contains the configuration
defined in my application jar's reference.conf both in the local cluster case
and when I use it directly in an extension of BaseRelation's buildScan
method. However, when it is used in my RDD (the one returned by buildScan), it
loses the configuration.

I solve / bypass the problem by checking whether my configuration exists in
SparkEnv.get.actorSystem (via its settings.config). If it does not exist, I
create a new ActorSystem using my class's classLoader to force config
reading from my application jar:

val classLoader = this.getClass.getClassLoader

val myconfig = ConfigFactory.load(classLoader) // force config reading from my class loader

ActorSystem("somename..", myconfig, classLoader)
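
A fuller sketch of that fallback, with "my.app" and "my-app" as placeholder
names for the application's own config path and actor system name:

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory
    import org.apache.spark.SparkEnv

    val classLoader = this.getClass.getClassLoader
    val sparkSystem = SparkEnv.get.actorSystem

    // Reuse Spark's ActorSystem only if it already picked up our reference.conf;
    // otherwise build a private one that loads config through our class loader.
    val system =
      if (sparkSystem.settings.config.hasPath("my.app")) sparkSystem
      else ActorSystem("my-app", ConfigFactory.load(classLoader), classLoader)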


I wonder if this different behavior of SparkEnv.get.actorSystem is
working as designed, or whether something is missing in the executor setup for
this custom-RDD-driven execution case.


Thanks.


Yang


Re: [discuss] ending support for Java 6?

2015-04-30 Thread Sree V
If there is any possibility of getting the download counts, then we can use it
as an EOS criterion as well. Say, if download counts are lower than 30% (or
another number) of the lifetime highest, then it qualifies for EOS.

Thanking you.

With Regards
Sree 


 On Thursday, April 30, 2015 2:22 PM, Sree V 
sree_at_ch...@yahoo.com.INVALID wrote:
   

 Hi Team,
Should we take this opportunity to lay out and evangelize a pattern for EOL of
dependencies? I propose we follow the official EOL of Java, Python, Scala, ...
and add, say, 6-12-24 months depending on the popularity.
Java 6 official EOL: Feb 2013. Add 6-12 months: Aug 2013 - Feb 2014 official
End of Support for Java 6 in Spark. Announce 3-6 months prior to EOS.

Thanking you.

With Regards
Sree 


    On Thursday, April 30, 2015 1:41 PM, Marcelo Vanzin van...@cloudera.com 
wrote:
  

 As for the idea, I'm +1. Spark is the only reason I still have jdk6
around - exactly because I don't want to cause the issue that started
this discussion (inadvertently using JDK7 APIs). And as has been
pointed out, even J7 is about to go EOL real soon.

Even Hadoop is moving away (I think 2.7 will be j7-only). Hive 1.1 is
already j7-only. And when Hadoop moves away from something, it's an
event worthy of headlines. They're still on Jetty 6!

As for pyspark, https://github.com/apache/spark/pull/5580 should get
rid of the last incompatibility with large assemblies, by keeping the
python files in separate archives. If we remove support for Java 6,
then we don't need to worry about the size of the assembly anymore.

On Thu, Apr 30, 2015 at 1:32 PM, Sean Owen so...@cloudera.com wrote:
 I'm firmly in favor of this.

 It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and
 avoid any more of the long-standing 64K file limit thing that's still
 a problem for PySpark.

-- 
Marcelo




  

  

Re: Uninitialized session in HiveContext?

2015-04-30 Thread Michael Armbrust
Hey Marcelo,

Thanks for the heads up!  I'm currently in the process of refactoring all
of this (to separate the metadata connection from the execution side) and
as part of this I'm making the initialization of the session not lazy.  It
would be great to hear if this also works for your internal integration
tests once the patch is up (hopefully this weekend).

Michael

On Thu, Apr 30, 2015 at 2:36 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Hey all,

 We ran into some test failures in our internal branch (which builds
 against Hive 1.1), and I narrowed it down to the fix below. I'm not
 super familiar with the Hive integration code, but does this look like
 a bug for other versions of Hive too?

 This caused an error where some internal Hive configuration that is
 initialized by the session were not available.

 diff --git
 a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 index dd06b26..6242745 100644
 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 @@ -93,6 +93,10 @@ class HiveContext(sc: SparkContext) extends
 SQLContext(sc) {
  if (conf.dialect == sql) {
super.sql(substituted)
  } else if (conf.dialect == hiveql) {
 +  // Make sure Hive session state is initialized.
 +  if (SessionState.get() != sessionState) {
 +SessionState.start(sessionState)
 +  }
val ddlPlan = ddlParserWithHiveQL.parse(sqlText,
 exceptionOnError = false)
DataFrame(this, ddlPlan.getOrElse(HiveQl.parseSql(substituted)))
  }  else {



 --
 Marcelo





Custom PersistanceEngine and LeaderAgent implementation in Java

2015-04-30 Thread Niranda Perera
Hi,

this follows on from the feature described in [1]

I'm trying to implement a custom persistence engine and a leader agent in
the Java environment.

vis-a-vis Scala, when I implement the PersistenceEngine trait in Java, I
would have to implement methods such as readPersistedData, removeDriver,
etc. together with the read, persist and unpersist methods.

But the issue here is that methods such as readPersistedData are 'final
def's, and hence cannot be overridden from the Java environment.

I am new to scala, but is there any workaround to implement the above
traits in java?

Looking forward to hearing from you.

[1] https://issues.apache.org/jira/browse/SPARK-1830

-- 
Niranda


Re: Custom PersistanceEngine and LeaderAgent implementation in Java

2015-04-30 Thread Reynold Xin
We should change the trait to an abstract class, and then your problem will go
away.

Do you want to submit a pull request?
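
A minimal sketch of why the abstract-class change helps (illustrative types
only, not the actual Spark API):

    // A trait's concrete members do not live in the (pre-Java-8) interface it
    // compiles to, so a plain Java implementer cannot inherit their bodies and
    // would have to re-implement them -- which a 'final def' is meant to forbid.
    trait Engine {
      def persist(name: String, obj: AnyRef): Unit   // abstract: implementable from Java
      final def readAll(): Seq[AnyRef] = Seq.empty   // concrete and final: body not inherited from Java
    }

    // As an abstract class, the concrete final method is ordinary inherited
    // bytecode, so a Java subclass only needs to provide the abstract members.
    abstract class EngineBase {
      def persist(name: String, obj: AnyRef): Unit
      final def readAll(): Seq[AnyRef] = Seq.empty
    }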


On Wed, Apr 29, 2015 at 11:02 PM, Niranda Perera niranda.per...@gmail.com
wrote:

 Hi,

 this follows the following feature in this feature [1]

 I'm trying to implement a custom persistence engine and a leader agent in
 the Java environment.

 vis-a-vis scala, when I implement the PersistenceEngine trait in java, I
 would have to implement methods such as readPersistedData, removeDriver,
 etc together with read, persist and unpersist methods.

 but the issue here is, methods such as readPersistedData etc are 'final
 def's, hence can not be overridden in the java environment.

 I am new to scala, but is there any workaround to implement the above
 traits in java?

 look forward to hear from you.

 [1] https://issues.apache.org/jira/browse/SPARK-1830

 --
 Niranda