Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?

2015-06-13 Thread Patrick Wendell
Yeah so Steve, hopefully it's self-evident, but that is a perfect
example of the kind of annoying stuff we don't want to force users to
deal with by forcing an upgrade to 2.X. Consider the pain for Spark
users of trying to reason about what to do (and btw it seems like the
answer is simply that there isn't a good one). That pain will be
experienced by every Spark user who uses AWS and the Spark EC2
scripts, which are extremely popular.

Is this pain, in aggregate, greater than our cost of maintaining a few
patches that use runtime reflection to keep things working with Hadoop
1? My feeling is that it's much more efficient for us as the Spark
maintainers to pay this cost than to force a lot of our users through
painful upgrades.
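
For context, a minimal sketch of the reflection pattern at issue, using
FileSystem.getDefaultBlockSize(Path) (a Hadoop 2 addition over the no-arg
Hadoop 1 method) as the example API gap; this is illustrative only, not
Spark's actual shim code:

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Prefer the Hadoop 2 overload if present; fall back to the Hadoop 1 API.
  def defaultBlockSize(fs: FileSystem, path: Path): Long =
    try {
      val m = fs.getClass.getMethod("getDefaultBlockSize", classOf[Path])
      m.invoke(fs, path).asInstanceOf[Long]
    } catch {
      case _: NoSuchMethodException =>
        fs.getDefaultBlockSize // Hadoop 1 only has the no-arg variant
    }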

On Sat, Jun 13, 2015 at 1:39 AM, Steve Loughran ste...@hortonworks.com wrote:

 On 12 Jun 2015, at 17:12, Patrick Wendell pwend...@gmail.com wrote:

  For instance at Databricks we use
 the FileSystem library for talking to S3... every time we've tried to
 upgrade to Hadoop 2.X there have been significant regressions in
 performance and we've had to downgrade. That's purely anecdotal, but I
 think you have people out there using the Hadoop 1 bindings for whom
 upgrade would be a pain.

 ah s3n. The unloved orphan FS, which has been fairly neglected as being 
 non-strategic to anyone but Amazon, who have a private fork.

 s3n broke in Hadoop 2.4, where the upgraded Jets3t went in with a patch 
 that swallowed exceptions (nobody should ever do that) and as a result would 
 NPE on a seek(0) of a file of length(0). HADOOP-10457. Fixed in Hadoop 2.5.

 Hadoop 2.6 has left s3n in maintenance out of fear of breaking more things; 
 future work is in s3a://, which switched to the Amazon AWS SDK JAR and 
 moved the implementation to the hadoop-aws JAR. S3a promises speed, 
 partitioned upload, and better auth.

 But: it's not ready for serious use in Hadoop 2.6, so don't try. You need the 
 Hadoop 2.7 patches, which are in ASF Hadoop 2.7, will be in HDP2.3, and have 
 been picked up in CDH5.3 (HADOOP-11571). For Spark, the fact that the block 
 size is returned as 0 by getFileStatus() could be the killer.

 Future work will improve performance and scale (HADOOP-11694).

 Now, if Spark is finding problems with s3a performance, tests for this would 
 be great; complaints on JIRAs too. There's not enough functional testing of 
 analytics workloads against the object stores, especially S3 and Swift. If 
 someone volunteers to add an optional test module for object store testing, 
 I'll help review it and suggest some tests to generate stress.

 That can be done without the leap to Hadoop 2, though the proposed 
 HADOOP-9565 work, allowing object stores to declare what they are and publish 
 some of their consistency and atomicity semantics, will be Hadoop 2.8+. If you 
 want your output committers to recognise when the destination is an 
 eventually consistent object store with O(n) directory rename and delete, 
 that's where the code will be.




Re: A confusing ClassNotFoundException error

2015-06-13 Thread StanZhai
I have encountered a similar error on Spark 1.4.0.

The same code runs fine on Spark 1.3.1.

My code is (it can be run in the spark-shell):
===
  import scala.collection.mutable
  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.sql.Row

  // hc is an instance of HiveContext
  val df = hc.sql("select * from test limit 10")
  val sb = new mutable.StringBuilder
  def mapHandle = (row: Row) => {
    val rowData = ArrayBuffer[String]()
    for (i <- 0 until row.size) {
      val d = row.get(i)

      d match {
        case data: ArrayBuffer[Any] =>
          sb.clear()
          sb.append('[')
          for (j <- 0 until data.length) {
            val elm = data(j)
            if (elm != null) {
              sb.append('"')
              sb.append(elm.toString)
              sb.append('"')
            } else {
              sb.append("null")
            }
            sb.append(',')
          }
          if (sb.length > 1) {
            sb.deleteCharAt(sb.length - 1)
          }
          sb.append(']')
          rowData += sb.toString()
        case _ =>
          rowData += (if (d != null) d.toString else null)
      }
    }
    rowData
  }
  df.map(mapHandle).foreach(println)


My submit script is: spark-submit --class cn.zhaishidan.trans.Main --master
local[8] test-spark.jar
===the error
java.lang.ClassNotFoundException: cn.zhaishidan.trans.service.SparkHiveService$$anonfun$mapHandle$1$1$$anonfun$apply$1
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
    at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
    at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
    at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.map(RDD.scala:293)
    at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
    at cn.zhaishidan.trans.service.SparkHiveService.formatDF(SparkHiveService.scala:66)
    at cn.zhaishidan.trans.service.SparkHiveService.query(SparkHiveService.scala:80)
    at cn.zhaishidan.trans.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:39)
    at cn.zhaishidan.trans.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:30)
    at cn.zhaishidan.trans.web.JettyUtils$$anon$1.getOrPost(JettyUtils.scala:56)
    at cn.zhaishidan.trans.web.JettyUtils$$anon$1.doGet(JettyUtils.scala:73)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at 
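
A hedged workaround sketch (not a confirmed fix; names are illustrative, not
from the original project): hoisting the formatting logic into a top-level
serializable object leaves no nested anonymous closures for the
ClosureCleaner to chase, and drops the shared StringBuilder, which is unsafe
to share across tasks anyway:

  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.sql.Row

  // Self-contained formatter: no outer references, no shared mutable state.
  object RowFormatter extends Serializable {
    def format(row: Row): ArrayBuffer[String] = {
      val rowData = ArrayBuffer[String]()
      for (i <- 0 until row.size) {
        row.get(i) match {
          case data: ArrayBuffer[Any] =>
            // build ["a","b",null]-style output per element, no shared builder
            rowData += data.map(e => if (e != null) "\"" + e + "\"" else "null")
                           .mkString("[", ",", "]")
          case d =>
            rowData += (if (d != null) d.toString else null)
        }
      }
      rowData
    }
  }

  // usage: df.map(RowFormatter.format).foreach(println)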

RE: Contribution

2015-06-13 Thread Eron Wright
The deeplearning4j project provides neural net algorithms for Spark ML.   You 
may consider it sample code for extending Spark with new ML algorithms.

http://deeplearning4j.org/sparkml
https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml
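
For a flavor of the extension point involved, here is a minimal, hedged
sketch of a custom Spark ML Transformer (the class name is illustrative and
not from dl4j-spark-ml):

  import org.apache.spark.ml.Transformer
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.ml.util.Identifiable
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.StructType

  // Pass-through Transformer: the smallest possible ML pipeline stage.
  class IdentityTransformer(override val uid: String) extends Transformer {
    def this() = this(Identifiable.randomUID("identity"))
    override def transform(dataset: DataFrame): DataFrame = dataset
    override def transformSchema(schema: StructType): StructType = schema
    override def copy(extra: ParamMap): IdentityTransformer = defaultCopy(extra)
  }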
-Eron
 Date: Fri, 12 Jun 2015 20:16:33 -0700
 From: sreenivas.raghav...@gmail.com
 To: dev@spark.apache.org
 Subject: Contribution
 
 Hi everyone,
 I am interested in contributing new algorithms and optimizing
 existing algorithms in the areas of graph algorithms and machine learning.
 Please give me some ideas on where to start. Is it possible for me to
 introduce the notion of neural networks in Apache Spark?
 
 
 

Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?

2015-06-13 Thread Steve Loughran

 On 12 Jun 2015, at 17:12, Patrick Wendell pwend...@gmail.com wrote:
 
  For instance at Databricks we use
 the FileSystem library for talking to S3... every time we've tried to
 upgrade to Hadoop 2.X there have been significant regressions in
 performance and we've had to downgrade. That's purely anecdotal, but I
 think you have people out there using the Hadoop 1 bindings for whom
 upgrade would be a pain.

ah s3n. The unloved orphan FS, which has been fairly neglected as being 
non-strategic to anyone but Amazon, who have a private fork. 

s3n broke in Hadoop 2.4, where the upgraded Jets3t went in with a patch that 
swallowed exceptions (nobody should ever do that) and as a result would NPE on 
a seek(0) of a file of length(0). HADOOP-10457. Fixed in Hadoop 2.5.

Hadoop 2.6 has left s3n in maintenance out of fear of breaking more things; 
future work is in s3a://, which switched to the Amazon AWS SDK JAR and moved 
the implementation to the hadoop-aws JAR. S3a promises speed, partitioned 
upload, and better auth.

But: it's not ready for serious use in Hadoop 2.6, so don't try. You need the 
Hadoop 2.7 patches, which are in ASF Hadoop 2.7, will be in HDP2.3, and have 
been picked up in CDH5.3 (HADOOP-11571). For Spark, the fact that the block 
size is returned as 0 by getFileStatus() could be the killer.

Future work will improve performance and scale (HADOOP-11694).

Now, if Spark is finding problems with s3a performance, tests for this would be 
great; complaints on JIRAs too. There's not enough functional testing of 
analytics workloads against the object stores, especially S3 and Swift. If 
someone volunteers to add an optional test module for object store testing, 
I'll help review it and suggest some tests to generate stress.
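
As a hedged sketch of the kind of optional stress test meant here (bucket
name, file count, and sizes are placeholders; it needs real credentials to
run):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Write many small files, then list them back, timing both phases.
  val conf = new Configuration()
  val base = new Path("s3a://test-bucket/spark-stress") // placeholder bucket
  val fs = base.getFileSystem(conf)

  val t0 = System.nanoTime()
  (1 to 100).foreach { i =>
    val out = fs.create(new Path(base, s"part-$i"))
    out.write(Array.fill(1024)('x'.toByte))
    out.close()
  }
  println(s"write: ${(System.nanoTime() - t0) / 1e6} ms")

  val t1 = System.nanoTime()
  val listed = fs.listStatus(base).length
  println(s"list $listed files: ${(System.nanoTime() - t1) / 1e6} ms")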

That can be done without the leap to Hadoop 2, though the proposed HADOOP-9565 
work, allowing object stores to declare what they are and publish some of their 
consistency and atomicity semantics, will be Hadoop 2.8+. If you want your 
output committers to recognise when the destination is an eventually consistent 
object store with O(n) directory rename and delete, that's where the code will 
be.

About HostName display in SparkUI

2015-06-13 Thread Sea
In Spark 1.4.0, I find that the Address column shows the IP address (it was the 
hostname in v1.3.0). Why? Which change did this?

Re: Contribution

2015-06-13 Thread Akhil Das
This is a good start, if you haven't seen it already:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Thanks
Best Regards

On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71 
sreenivas.raghav...@gmail.com wrote:

 Hi everyone,
 I am interested in contributing new algorithms and optimizing
 existing algorithms in the areas of graph algorithms and machine learning.
 Please give me some ideas on where to start. Is it possible for me to
 introduce the notion of neural networks in Apache Spark?


