Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
Yeah so Steve, hopefully it's self-evident, but that is a perfect example of the kind of annoying stuff we don't want to force users to deal with by forcing an upgrade to 2.X. Consider the pain for Spark users of trying to reason about what to do (and btw it seems like the answer is simply that there isn't a good answer). That pain will be experienced by every Spark user who uses AWS and the Spark ec2 scripts, which are extremely popular. Is this pain, in aggregate, more than our cost of having a few patches to deal with runtime reflection stuff to make things work with Hadoop 1? My feeling is that it's much more efficient for us as the Spark maintainers to pay this cost than to force a lot of our users to deal with painful upgrades. On Sat, Jun 13, 2015 at 1:39 AM, Steve Loughran ste...@hortonworks.com wrote: On 12 Jun 2015, at 17:12, Patrick Wendell pwend...@gmail.com wrote: For instance at Databricks we use the FileSystem library for talking to S3... every time we've tried to upgrade to Hadoop 2.X there have been significant regressions in performance and we've had to downgrade. That's purely anecdotal, but I think you have people out there using the Hadoop 1 bindings for whom upgrade would be a pain. Ah, s3n. The unloved orphan FS, which has been fairly neglected as being non-strategic to anyone but Amazon, who have a private fork. s3n broke in Hadoop 2.4, where the upgraded Jets3t went in with some patch which swallowed exceptions (nobody should ever do that) and as a result would NPE on a seek(0) of a file of length 0: HADOOP-10457, fixed in Hadoop 2.5. Hadoop 2.6 has left s3n in maintenance out of fear of breaking more things; future work is in s3a://, which switched to the Amazon AWS toolkit JAR and moved the implementation into the hadoop-aws JAR. S3a promises speed, partitioned upload, and better auth. But it's not ready for serious use in Hadoop 2.6, so don't try.
You need the Hadoop 2.7 patches, which are in ASF Hadoop 2.7, will be in HDP 2.3, and have been picked up in CDH 5.3 (HADOOP-11571). For Spark, the fact that the block size is being returned as 0 in getFileStatus() could be the killer. Future work is going to improve performance and scale (HADOOP-11694). Now, if Spark is finding problems with s3a performance, tests for this would be great; complaints on JIRAs too. There's not enough functional testing of analytics workloads against the object stores, especially S3 and Swift. If someone volunteers to add an optional test module for object store testing, I'll help review it and suggest some tests to generate stress. That can be done without the leap to Hadoop 2, though the proposed HADOOP-9565 work allowing object stores to declare that they are object stores, and to publish some of their consistency and atomicity semantics, will be Hadoop 2.8+. If you want your output committers to recognise when the destination is an eventually consistent object store with O(n) directory rename and delete, that's where the code will be. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
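For anyone following the thread, moving a Spark job from s3n:// to s3a:// on a Hadoop 2.7+ build is mostly a configuration and classpath exercise. A minimal sketch, assuming credentials go through spark-defaults.conf (the fs.s3a.* keys are Hadoop's; the spark.hadoop. prefix is Spark's mechanism for forwarding properties into the Hadoop configuration; the credential values are placeholders, and whether fs.s3a.impl is needed depends on your build's core-default.xml):

```
# spark-defaults.conf sketch: forward s3a settings into the Hadoop configuration.
# Requires the hadoop-aws JAR (and its AWS SDK dependency) on the classpath.
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
# Only needed on builds whose core-default.xml does not already map the scheme:
spark.hadoop.fs.s3a.impl         org.apache.hadoop.fs.s3a.S3AFileSystem
```

With that in place, paths are read with the new scheme, e.g. sc.textFile("s3a://bucket/path"), leaving s3n:// untouched for comparison runs.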
Re: A confusing ClassNotFoundException error
I have encountered a similar error on Spark 1.4.0. The same code runs fine on Spark 1.3.1. My code is (it can be run in spark-shell):
===
// hc is an instance of HiveContext
val df = hc.sql("select * from test limit 10")
val sb = new mutable.StringBuilder
def mapHandle = (row: Row) => {
  val rowData = ArrayBuffer[String]()
  for (i <- 0 until row.size) {
    val d = row.get(i)
    d match {
      case data: ArrayBuffer[Any] =>
        sb.clear()
        sb.append('[')
        for (j <- 0 until data.length) {
          val elm = data(j)
          if (elm != null) {
            sb.append('\'')
            sb.append(elm.toString)
            sb.append('\'')
          } else {
            sb.append("null")
          }
          sb.append(',')
        }
        if (sb.length > 1) {
          sb.deleteCharAt(sb.length - 1)
        }
        sb.append(']')
        rowData += sb.toString()
      case _ => rowData += (if (d != null) d.toString else null)
    }
  }
  rowData
}
df.map(mapHandle).foreach(println)
===
My submit script is: spark-submit --class cn.zhaishidan.trans.Main --master local[8] test-spark.jar
=== the error:
java.lang.ClassNotFoundException: cn.zhaishidan.trans.service.SparkHiveService$$anonfun$mapHandle$1$1$$anonfun$apply$1
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:270)
  at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
  at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
  at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
  at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
  at org.apache.spark.rdd.RDD.map(RDD.scala:293)
  at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
  at cn.zhaishidan.trans.service.SparkHiveService.formatDF(SparkHiveService.scala:66)
  at cn.zhaishidan.trans.service.SparkHiveService.query(SparkHiveService.scala:80)
  at cn.zhaishidan.trans.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:39)
  at cn.zhaishidan.trans.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:30)
  at cn.zhaishidan.trans.web.JettyUtils$$anon$1.getOrPost(JettyUtils.scala:56)
  at cn.zhaishidan.trans.web.JettyUtils$$anon$1.doGet(JettyUtils.scala:73)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:370)
  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
  at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
  at
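The stack trace shows Spark's ClosureCleaner failing to load a compiler-generated nested anonfun class (...$$anonfun$mapHandle$1$1$$anonfun$apply$1), which is a known pattern when a closure defined inside a method contains further anonymous functions. A common workaround is to hoist the per-row logic into a top-level object so the cleaner serializes a stable, named class. This is a simplified sketch, not the original poster's code: Row is replaced by Seq[Any] to keep it self-contained, the quote-wrapping of elements is dropped, and RowFormatter is a hypothetical name.

```scala
import scala.collection.mutable

// Hedged sketch of the workaround: keep the row-formatting logic in a
// top-level object instead of a nested anonymous function. Note the
// StringBuilder-style state is local to the method, not captured from an
// enclosing scope (captured outer mutable state, like the shared `sb` in
// the original snippet, is another common source of serialization trouble).
object RowFormatter {
  def formatRow(row: Seq[Any]): Seq[String] =
    row.map {
      case data: mutable.ArrayBuffer[_] =>
        // Render array columns as "[e1,e2,...]", skipping nulls.
        data.filter(_ != null).mkString("[", ",", "]")
      case null => null
      case d    => d.toString
    }
}
```

In the Spark job itself one would then write something like df.map(r => RowFormatter.formatRow(r.toSeq)) (assuming a DataFrame of Rows), so the only class the closure references is the named RowFormatter object.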
RE: Contribution
The deeplearning4j project provides neural net algorithms for Spark ML. You may consider it sample code for extending Spark with new ML algorithms. http://deeplearning4j.org/sparkml https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml -Eron Date: Fri, 12 Jun 2015 20:16:33 -0700 From: sreenivas.raghav...@gmail.com To: dev@spark.apache.org Subject: Contribution Hi everyone, I am interested in contributing new algorithms and optimizing existing algorithms in the area of graph algorithms and machine learning. Please give me some ideas on where to start. Is it possible for me to introduce the notion of neural networks in Apache Spark? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contribution-tp12739.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
On 12 Jun 2015, at 17:12, Patrick Wendell pwend...@gmail.com wrote: For instance at Databricks we use the FileSystem library for talking to S3... every time we've tried to upgrade to Hadoop 2.X there have been significant regressions in performance and we've had to downgrade. That's purely anecdotal, but I think you have people out there using the Hadoop 1 bindings for whom upgrade would be a pain. Ah, s3n. The unloved orphan FS, which has been fairly neglected as being non-strategic to anyone but Amazon, who have a private fork. s3n broke in Hadoop 2.4, where the upgraded Jets3t went in with some patch which swallowed exceptions (nobody should ever do that) and as a result would NPE on a seek(0) of a file of length 0: HADOOP-10457, fixed in Hadoop 2.5. Hadoop 2.6 has left s3n in maintenance out of fear of breaking more things; future work is in s3a://, which switched to the Amazon AWS toolkit JAR and moved the implementation into the hadoop-aws JAR. S3a promises speed, partitioned upload, and better auth. But it's not ready for serious use in Hadoop 2.6, so don't try. You need the Hadoop 2.7 patches, which are in ASF Hadoop 2.7, will be in HDP 2.3, and have been picked up in CDH 5.3 (HADOOP-11571). For Spark, the fact that the block size is being returned as 0 in getFileStatus() could be the killer. Future work is going to improve performance and scale (HADOOP-11694). Now, if Spark is finding problems with s3a performance, tests for this would be great; complaints on JIRAs too. There's not enough functional testing of analytics workloads against the object stores, especially S3 and Swift. If someone volunteers to add an optional test module for object store testing, I'll help review it and suggest some tests to generate stress. That can be done without the leap to Hadoop 2, though the proposed HADOOP-9565 work allowing object stores to declare that they are object stores, and to publish some of their consistency and atomicity semantics, will be Hadoop 2.8+.
If you want your output committers to recognise when the destination is an eventually consistent object store with O(n) directory rename and delete, that's where the code will be.
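On why a block size of 0 from getFileStatus() "could be the killer": Hadoop's FileInputFormat derives its split size from the reported block size, so a zero there degenerates the calculation and the split size collapses to the configured minimum, producing pathological numbers of tiny splits. A sketch of the arithmetic, mirroring the new-API FileInputFormat.computeSplitSize (the object name here is illustrative, not a Hadoop class):

```scala
// splitSize = max(minSize, min(maxSize, blockSize)), as in Hadoop's
// mapreduce FileInputFormat. When a filesystem reports blockSize == 0,
// min(maxSize, 0) is 0 and the result falls back to minSize, so a store
// returning 0 from getFileStatus() yields minimum-sized splits.
object SplitSize {
  def computeSplitSize(minSize: Long, maxSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(maxSize, blockSize))
}
```

With a typical 128 MB block size the formula returns the block size; with a reported block size of 0 it returns minSize, which is why a correct block size from the object store matters to Spark's input-split planning.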
About HostName display in SparkUI
In Spark 1.4.0, I see that the Address column in the UI shows the IP address (it was the hostname in v1.3.0). Why? Which change did this?
Re: Contribution
This is a good start, if you haven't seen it already: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: Hi everyone, I am interested in contributing new algorithms and optimizing existing algorithms in the area of graph algorithms and machine learning. Please give me some ideas on where to start. Is it possible for me to introduce the notion of neural networks in Apache Spark? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contribution-tp12739.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.