Re: spark 1.6.0 read s3 files error.
Solution:

    sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "...")
    sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "...")

Got this solution from a Cloudera engineer, Neerja. Thanks!
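For completeness, a minimal end-to-end sketch of the fix (the bucket and object key are placeholders, and the hadoop-aws/aws-sdk jars are assumed to be on the classpath; note that newer hadoop-aws releases renamed the properties to fs.s3a.access.key / fs.s3a.secret.key):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName('s3a read test')
    sc = SparkContext(conf=conf)

    # Set the credentials on the underlying Hadoop configuration.
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hconf.set("fs.s3a.awsSecretAccessKey", "YOUR_SECRET_KEY")

    # 'my-bucket' and the key path are hypothetical.
    myRdd = sc.textFile("s3a://my-bucket/path/to/data.json.gz")
    print "The count is", myRdd.count()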
Re: spark 1.6.0 read s3 files error.
Anyone, please? I believe many of us are using Spark 1.6 or higher with S3...
Re: spark 1.6.0 read s3 files error.
Tried the following; it still failed the same way. It ran on YARN, CDH 5.8.0.

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName('s3 ---')
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "...")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "...")

    myRdd = sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")
    count = myRdd.count()
    print "The count is", count
Re: spark 1.6.0 read s3 files error.
BTW, I also tried YARN. Same error. When I ran the script, I used the real credentials for S3, which are omitted in this post. Sorry about that.
Re: spark 1.6.0 read s3 files error.
The question is: what is the cause of the problem, and how can it be fixed? Thanks.
spark 1.6.0 read s3 files error.
Environment: CDH 5.7.1, PySpark.

Code:

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName('s3 ---')
    sc = SparkContext(conf=conf)

    myRdd = sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")
    count = myRdd.count()
    print "The count is", count

Standalone mode, command line:

    AWS_ACCESS_KEY_ID=??? AWS_SECRET_ACCESS_KEY=??? ./bin/spark-submit --driver-memory 4G --master spark://master:7077 --conf "spark.default.parallelism=70" /root/workspace/test/s3.py

Error:

    16/07/27 17:27:26 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
    Traceback (most recent call last):
      File "/root/workspace/test/s3.py", line 12, in <module>
        count = myRdd.count()
      File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
      File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
      File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
      File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
      File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
      File "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : java.lang.VerifyError: Bad type on operand stack
    Exception Details:
      Location:
        org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V @155: invokevirtual
      Reason:
        Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is not assignable to 'org/jets3t/service/model/StorageObject'
      Current Frame:
        bci: @155
        flags: { }
        locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object' }
        stack: { 'org/jets3t/service/S3Service', 'java/lang/String', 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object', integer }
      Bytecode:
        000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
        010: 5659 b701 5713 0192 b601 5b2b b601 5b13
        020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
        030: b400 7db6 00e7 b601 5bb6 015e b901 9802
        040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
        050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
        060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
        070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
        080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
        090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
        0a0: 000a 4e2a 2d2b b700 c7b1
      Exception Handler Table:
        bci [0, 116] => handler: 162
        bci [117, 159] => handler: 162
      Stackmap Table:
        same_frame_extended(@65)
        same_frame(@117)
        same_locals_1_stack_item_frame(@162,Object[#139])
        same_frame(@169)

        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
        ...

TIA
Re: build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure
Argh, a typo: it was supposed to be cdh5.7.0. I reran the command with the fix, but still get the same error.

    build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package
build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure
Environment: JDK 1.8.0_77, Scala 2.10.4, Maven 3.3.9.

Slightly changed the pom.xml:

    $ diff pom.xml pom.original
    130c130
    < 2.6.0-cdh5.7.0-SNAPSHOT
    ---
    > 2.2.0
    133c133
    < 1.2.0-cdh5.7.0-SNAPSHOT
    ---
    > 0.98.7-hadoop2

Command:

    build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.6.0 -DskipTests clean package

Error:

    [INFO]
    [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-core_2.10 ---
    [INFO]
    [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ spark-core_2.10 ---
    [INFO] Using 'UTF-8' encoding to copy filtered resources.
    [INFO] Copying 21 resources
    [INFO] Copying 3 resources
    [INFO]
    [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-core_2.10 ---
    [INFO] Using zinc server for incremental compilation
    [info] Compiling 486 Scala sources and 76 Java sources to /home/jfeng/workspace/spark-1.6.0/core/target/scala-2.10/classes...
    [error] /home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:22: object StandardCharsets is not a member of package java.nio.charset
    [error] import java.nio.charset.StandardCharsets
    [error]        ^
    [error] /home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:23: object file is not a member of package java.nio
    [error] import java.nio.file.Paths
    [error]        ^
    [error] /home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:80: not found: value StandardCharsets
    [error]     ByteStreams.copy(new ByteArrayInputStream(v.getBytes(StandardCharsets.UTF_8)), jarStream)
    [error]                                                          ^
    [error] /home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:95: not found: value Paths
    [error]     val jarEntry = new JarEntry(Paths.get(directoryPrefix.getOrElse(""), file.getName).toString)
    [error]                                 ^
    [error] /home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala:43: value getLoopbackAddress is not a member of object java.net.InetAddress
    [error]     val s = new Socket(InetAddress.getLoopbackAddress(), port.get)
    [error]

Thanks in advance.
spark's behavior about failed tasks
Hello there, I have a Spark job running in a 20-node cluster. The job is logically simple: just a mapPartitions and then a sum. The return value of mapPartitions is one integer per partition. The tasks got some random failures (which could be caused by 3rd-party key-value store connections; the cause is irrelevant to my question). In more detail:

Description:
1. Spark 1.1.1.
2. 4096 tasks total.
3. 66 failed tasks.

Issue: Spark seems to be rerunning all 4096 tasks instead of only the 66 failed ones. It is currently at 469/4096 (stage 2). Is this behavior normal? Thanks for your help!
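For reference, a minimal sketch of the job shape described above (count_partition is a hypothetical name, myRdd stands for the input RDD, and the real per-partition work against the external store is stubbed out):

    def count_partition(records):
        # Stand-in for the real work, e.g. calls to a 3rd-party
        # key-value store; returns one integer for the partition.
        n = 0
        for rec in records:
            n += 1
        yield n

    total = myRdd.mapPartitions(count_partition).sum()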
Lots of fetch failures on saveAsNewAPIHadoopDataset PairRDDFunctions
Spark 1.1.1 + HBase (CDH 5.3.1). 20 nodes, each with 4 cores and 32 GB of memory; 3 cores and 16 GB were assigned to Spark on each worker node. Standalone mode. The data set is 3.8 TB. Wondering how to fix this. Thanks!

    org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:935)
    org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:691)
    org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    py4j.Gateway.invoke(Gateway.java:259)
    py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    py4j.commands.CallCommand.execute(CallCommand.java:79)
    py4j.GatewayConnection.run(GatewayConnection.java:207)
    java.lang.Thread.run(Thread.java:745)
correct way to broadcast a variable
Suppose I have an object to broadcast and then use in a mapper function, something like the following (Python code):

    obj2share = sc.broadcast(<some object here>)
    someRdd.map(createMapper(obj2share)).collect()

The createMapper function creates a mapper function using the shared object's value. Another way to do this is:

    someRdd.map(createMapper(obj2share.value)).collect()

Here the createMapper function uses the shared object directly to create the mapper function. Is there a difference on Spark's side between the two methods? If there is no difference at all, I'd prefer the second, because it hides Spark from the createMapper function. Thanks.
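A runnable sketch of the two variants, with hypothetical createMapper factories and a small dict as the shared object; as far as I understand the broadcast mechanics, variant 1 only puts the small Broadcast handle into the task closure and reads .value on the executor, while variant 2 captures the whole object inside the closure itself:

    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-demo")
    obj2share = sc.broadcast({"a": 1, "b": 2})

    def createMapperFromBroadcast(bc):
        def mapper(x):
            return bc.value.get(x, 0)   # .value is read on the executor
        return mapper

    def createMapperFromValue(table):
        def mapper(x):
            return table.get(x, 0)      # table travels inside the closure
        return mapper

    someRdd = sc.parallelize(["a", "b", "c"])
    print someRdd.map(createMapperFromBroadcast(obj2share)).collect()
    print someRdd.map(createMapperFromValue(obj2share.value)).collect()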
What could cause number of tasks to go down from 2k to 1?
Hi, The input data has 2048 partitions. The final step is to load the processed data into HBase through saveAsNewAPIHadoopDataset(). Every step except the last ran in parallel across the cluster, but the last step has only 1 task, which runs on only 1 node using one core. Spark 1.1.1 + CDH 5.3.0. Probably I should set numPartitions in the reduceByKey call to some big number? I did not set this parameter in the current code. This reduceByKey call is the one that runs before the saveAsNewAPIHadoopDataset() call. Any idea? Thanks!
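For the record, a sketch of what setting numPartitions would look like (pairRdd and output_conf are placeholders for the actual key-value RDD and the Hadoop output configuration; 2048 mirrors the input partition count):

    # Keep 2048 partitions through the shuffle instead of collapsing to 1.
    reduced = pairRdd.reduceByKey(lambda a, b: a + b, numPartitions=2048)
    reduced.saveAsNewAPIHadoopDataset(conf=output_conf)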
large data set to get rid of exceeds Integer.MAX_VALUE error
Hi, This seems to be a known issue (see here: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html). The data set is about 1.5 TB. There are 14 region servers. I am not sure how many regions there are for this data set, but very likely each region will hold much more than 2 GB of data. In this case, repartitioning also seems a very expensive operation (I would guess), if it is possible in my cluster at all. Could anyone give some suggestions to get this job done? Thanks! Platform: Spark 1.2.0, CDH 5.3.0. The error is like:

    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 34, node007): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
        at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307)
        at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
        at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
        at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at java.lang.Thread.run(Thread.java:745)
        at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at ...
Re: Testing if an RDD is empty?
I think Sampo's point is to get a function that only tests whether an RDD is empty. He does not want to know the size of the RDD, and getting the size of an RDD is expensive for large data sets. I myself have seen many times that my app threw exceptions because an empty RDD cannot be saved. This is not a big issue, but it is annoying. Having a cheap way to test whether an RDD is empty would be nice, if no such thing is available now.
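A sketch of a cheap test, relying on take(1) semantics (it only evaluates enough of the first partitions to produce one element, rather than counting everything); later Spark releases added rdd.isEmpty() natively:

    def rdd_is_empty(rdd):
        # Cheap emptiness check: materialize at most one element.
        return len(rdd.take(1)) == 0

    # myRdd and outPath are placeholders.
    if not rdd_is_empty(myRdd):
        myRdd.saveAsTextFile(outPath)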
how to run python app in yarn?
A CDH 5.3.0 cluster with Spark is set up; I am just wondering how to run a Python application on it. I used

    spark-submit --master yarn-cluster ./loadsessions.py

but got the error:

    Error: Cluster deploy mode is currently not supported for python applications.
    Run with --help for usage help or --verbose for debug output

Thanks.
Re: how to run python app in yarn?
Got help from Marcelo and Josh. Now it is running smoothly. In case you need this info: just use yarn-client instead of yarn-cluster. Thanks, folks!
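That is, for the command from the original post:

    spark-submit --master yarn-client ./loadsessions.py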
Re: correct/best way to install custom spark1.2 on cdh5.3.0?
I installed the custom build in standalone mode as normal. The master and slaves started successfully. However, I got an error when I ran a job. It seems to me from the error message that some library was compiled against hadoop1, while my Spark was compiled against hadoop2.

    15/01/08 23:27:36 INFO ClientCnxn: Opening socket connection to server master/10.191.41.253:2181. Will not attempt to authenticate using SASL (unknown error)
    15/01/08 23:27:36 INFO ClientCnxn: Socket connection established to master/10.191.41.253:2181, initiating session
    15/01/08 23:27:36 INFO ClientCnxn: Session establishment complete on server master/10.191.41.253:2181, sessionid = 0x14acbdae7e60022, negotiated timeout = 6
    Traceback (most recent call last):
      File "/root/workspace/test/sparkhbase.py", line 23, in <module>
        conf=conf2)
      File "/root/spark/python/pyspark/context.py", line 530, in newAPIHadoopRDD
        jconf, batchSize)
      File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
    : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
        at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:157)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
        at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
        at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:500)
        at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

If I understand correctly, org.apache.hadoop.mapreduce.JobContext is a class in hadoop1 but an interface in hadoop2. My question is which library could cause this problem. Thanks.
Re: correct/best way to install custom spark1.2 on cdh5.3.0?
I ran the released Spark in CDH 5.3.0 but got the same error. Has anyone tried to run Spark in CDH 5.3.0 using its newAPIHadoopRDD?

Command:

    spark-submit --master spark://master:7077 --jars /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar ./sparkhbase.py

Error:

    2015-01-09 00:02:03,344 INFO [Thread-2-SendThread(master:2181)] zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to master/10.191.41.253:2181, initiating session
    2015-01-09 00:02:03,358 INFO [Thread-2-SendThread(master:2181)] zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server master/10.191.41.253:2181, sessionid = 0x14acbdae7e60066, negotiated timeout = 6
    Traceback (most recent call last):
      File "/root/workspace/test/./sparkhbase.py", line 23, in <module>
        conf=conf2)
      File "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/pyspark/context.py", line 530, in newAPIHadoopRDD
        jconf, batchSize)
      File "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
    : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
        at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:158)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
        at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
        at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:500)
        at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
correct/best way to install custom spark1.2 on cdh5.3.0?
Could anyone share your experience on how to do this? I have created a cluster and installed CDH 5.3.0 on it, basically core + HBase, but Cloudera installed and configured its own Spark in the parcels anyway. I'd like to install our custom Spark on this cluster to use the Hadoop and HBase services there. There could potentially be conflicts if this is not done correctly; library conflicts are what I worry about most. I understand this is a special case, but if you know how to do it, please let me know. Thanks.
spark 1.1 got error when working with cdh5.3.0 standalone mode
Hi, I installed the CDH 5.3.0 core + HBase in a new EC2 cluster, then manually installed Spark 1.1 on it. But when I started the slaves, I got an error as follows:

    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    Error: Could not find or load main class s.rolling.maxRetainedFiles=3

The Spark was compiled against hadoop2.5 + hbase 0.98.6, as in CDH 5.3.0. Is the error because of some mysterious conflict somewhere? Or should I use the Spark shipped in CDH 5.3.0 to be safe? Thanks!
executor logging management from python
Hi, wondering if anyone could help with this. We use an EC2 cluster to run Spark apps in standalone mode. The default log output goes to /$spark_folder/work/. This folder is on the 10 GB root fs, so it won't take long to fill up the whole fs. My goals are:

1. Move the logging location to /mnt, where we have 37 GB of space.
2. Make the log files rotate, so that new ones replace the old ones after some threshold.

Could you show how to do this? I was trying to change the spark-env.sh file, but I don't know whether that's the best way to do it. Thanks!
Re: executor logging management from python
cat spark-env.sh:

    #!/usr/bin/env bash
    export SPARK_WORKER_OPTS=-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily -Dspark.executor.logs.rolling.maxRetainedFiles=3
    export SPARK_LOCAL_DIRS=/mnt/spark
    export SPARK_WORKER_DIR=/mnt/spark

But the Spark log still writes to the /$Spark_home/work folder. Why doesn't Spark pick up the changes?
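One observation, offered as a guess: in bash, an unquoted value containing spaces is split at the first space, which would also explain the "Could not find or load main class s.rolling.maxRetainedFiles=3" error in the earlier standalone-mode post. A quoted variant:

    export SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily -Dspark.executor.logs.rolling.maxRetainedFiles=3"

The worker daemons also read spark-env.sh only at startup, so they have to be restarted for SPARK_WORKER_DIR to take effect.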
logging in workers for pyspark
Hi, I am wondering how to write logging info from a worker when running a PySpark app. I saw the thread http://apache-spark-user-list.1001560.n3.nabble.com/logging-in-pyspark-td5458.html but did not see a solution there. Does anybody know of one? Thanks!
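A workaround sketch, relying on the fact that anything a worker-side function writes to stdout/stderr ends up in that executor's logs (process_partition and myRdd are hypothetical names):

    import logging

    def process_partition(records):
        # Configure logging inside the worker process; the output goes
        # to the executor's stderr, viewable in the Spark web UI.
        logging.basicConfig(level=logging.INFO)
        log = logging.getLogger("worker")
        log.info("partition started")
        for rec in records:
            yield rec

    myRdd.mapPartitions(process_partition).count()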
suggest pyspark using 'with' for sparkcontext to be more 'pythonic'
It seems SparkContext is a good fit to be used with 'with' in Python; a context manager would do. Example:

    with SparkContext(conf=conf, batchSize=512) as sc:
        ...

Then sc.stop() would not need to be written any more.
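Until this is supported natively (newer PySpark versions did end up adding exactly this, if I recall correctly), a user-side sketch with contextlib:

    from contextlib import contextmanager
    from pyspark import SparkContext

    @contextmanager
    def spark_context(*args, **kwargs):
        sc = SparkContext(*args, **kwargs)
        try:
            yield sc
        finally:
            sc.stop()            # always stopped, even on error

    with spark_context(appName="demo", batchSize=512) as sc:
        print sc.parallelize(range(10)).sum()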
Re: pyspark get column family and qualifier names from hbase table
Hi, this is my code:

    import org.apache.hadoop.hbase.CellUtil
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.api.python.Converter

    /**
     * JF: convert a Result object into a string with column family and qualifier names, something like
     * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2' etc.
     * k-v pairs are separated by ';'; the different columns of each cell are separated by ':'.
     * Notice that we don't need the row key here, because it has been converted by
     * ImmutableBytesWritableToStringConverter.
     */
    class CustomHBaseResultToStringConverter extends Converter[Any, String] {
      override def convert(obj: Any): String = {
        val result = obj.asInstanceOf[Result]
        result.rawCells().map(cell =>
          List(Bytes.toString(CellUtil.cloneFamily(cell)),
               Bytes.toString(CellUtil.cloneQualifier(cell)),
               Bytes.toString(CellUtil.cloneValue(cell))).mkString(":")).mkString(";")
      }
    }

I recommend you use different delimiters (to replace ':' or ';') if your data can contain them. I am not a seasoned Scala programmer, so there might be a more flexible solution; for example, making the delimiters dynamically assignable. I will probably try to open a PR later today.
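And a sketch of consuming it from PySpark, mirroring the earlier hbase_inputformat.py usage (the converter's package name "yourpackage" is a placeholder, and its jar is assumed to be on the classpath):

    conf2 = {"hbase.zookeeper.quorum": "localhost",
             "hbase.mapreduce.inputtable": "data1"}
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="yourpackage.CustomHBaseResultToStringConverter",
        conf=conf2)

    for row, cells in hbase_rdd.collect():
        for cell in cells.split(";"):
            # maxsplit=2 keeps values that themselves contain ':'.
            family, qualifier, value = cell.split(":", 2)
            print row, family, qualifier, value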
Re: pyspark get column family and qualifier names from hbase table
Hi Nick, I saw the HBase API has gone through lots of changes. If I remember correctly, the default HBase in Spark 1.1.0 is 0.94.6; the one I am using is 0.98.1. To get the column family names and qualifier names, we need to call different methods for these two versions, and I don't know how to do that... sorry...
pyspark get column family and qualifier names from hbase table
Hello there, I am wondering how to get the column family names and column qualifier names when using PySpark to read an HBase table with multiple column families. I have an HBase table as follows:

    hbase(main):007:0> scan 'data1'
    ROW     COLUMN+CELL
    row1    column=f1:, timestamp=1411078148186, value=value1
    row1    column=f2:, timestamp=1415732470877, value=value7
    row2    column=f2:, timestamp=1411078160265, value=value2

When I ran the examples/hbase_inputformat.py code:

    conf2 = {"hbase.zookeeper.quorum": "localhost",
             "hbase.mapreduce.inputtable": "data1"}
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf2)
    output = hbase_rdd.collect()
    for (k, v) in output:
        print (k, v)

I only see:

    (u'row1', u'value1')
    (u'row2', u'value2')

What I really want is (row_id, column family:column qualifier, value) tuples. Any comments? Thanks!
Re: pyspark get column family and qualifier names from hbase table
Checked the source and found the following:

    class HBaseResultToStringConverter extends Converter[Any, String] {
      override def convert(obj: Any): String = {
        val result = obj.asInstanceOf[Result]
        Bytes.toStringBinary(result.value())
      }
    }

I feel using 'result.value()' here is a big limitation, since it yields only a single value. Converting from the 'list()' of the 'Result' would be more general and easier to use.
Re: pyspark get column family and qualifier names from hbase table
Just wrote a custom converter in Scala to replace HBaseResultToStringConverter. Just a couple of lines of code.
Re: How to ship cython library to workers?
Thanks for the solution! I did figure out how to create an .egg file to ship out to the workers. Using IPython seems to be another cool solution.
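For anyone landing here, a sketch of the .egg route (the package and file names are placeholders; the egg would come from something like 'python setup.py bdist_egg'):

    from pyspark import SparkContext

    sc = SparkContext(appName="egg-demo")
    sc.addPyFile("dist/mylib-0.1-py2.7.egg")   # shipped to every worker

    def use_lib(x):
        import mylib                 # importable on workers once shipped
        return mylib.transform(x)    # hypothetical function

    print sc.parallelize(range(4)).map(use_lib).collect()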
Re: akka connection refused bug, fix?
Does anyone have experience or advice on how to fix this problem? Highly appreciated!
IllegalStateException: unread block data
Hello there, I just set up an EC2 cluster with no HDFS, Hadoop, or HBase whatsoever; I installed Spark only, to read/process data from an HBase in a different cluster. Spark was built against the HBase/Hadoop versions in the remote (EC2) HBase cluster, which are 0.98.1 and 2.3.0 respectively. But I got the following error when running a simple test Python script. The command line:

    ./spark-submit --master spark://master:7077 --driver-class-path ./spark-examples-1.1.0-hadoop2.3.0.jar ~/workspace/test/sparkhbase.py

From the worker log, I can see the worker node got the request from the master. Can anyone help with this problem? Tons of thanks!

    java.lang.IllegalStateException: unread block data
        java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2399)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1776)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:679)
stage failure: java.lang.IllegalStateException: unread block data
Hi, I got this error when running Spark 1.1.0 against HBase 0.98.1 through simple Python code in an EC2 cluster. The same program runs correctly in local mode, so this error only happens when running in a real cluster. Here's what I got:

    14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, node001, ANY, 1265 bytes)
    14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor node001: java.lang.IllegalStateException (unread block data) [duplicate 1]
    14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, node001, ANY, 1265 bytes)
    14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on executor node001: java.lang.IllegalStateException (unread block data) [duplicate 2]
    14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, node001, ANY, 1265 bytes)
    14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on executor node001: java.lang.IllegalStateException (unread block data) [duplicate 3]
    14/10/30 17:51:53 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
    14/10/30 17:51:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    14/10/30 17:51:53 INFO TaskSchedulerImpl: Cancelling stage 0
    14/10/30 17:51:53 INFO DAGScheduler: Failed to run first at SerDeUtil.scala:70
    Traceback (most recent call last):
      File "/root/workspace/test/sparkhbase.py", line 22, in <module>
        conf=conf2)
      File "/root/spark-1.1.0/python/pyspark/context.py", line 471, in newAPIHadoopRDD
        jconf, batchSize)
      File "/root/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/root/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, node001): java.lang.IllegalStateException: unread block data
        java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2399)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1776)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:679)
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at ...
Re: stage failure: java.lang.IllegalStateException: unread block data
The worker side has an error message like this:

    14/10/30 18:29:00 INFO Worker: Asked to launch executor app-20141030182900-0006/0 for testspark_v1
    14/10/30 18:29:01 INFO ExecutorRunner: Launch command: java -cp ::/root/spark-1.1.0/conf:/root/spark-1.1.0/assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.3.0.jar -XX:MaxPermSize=128m -Dspark.driver.port=52552 -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@master:52552/user/CoarseGrainedScheduler 0 node001 4 akka.tcp://sparkWorker@node001:60184/user/Worker app-20141030182900-0006
    14/10/30 18:29:03 INFO Worker: Asked to kill executor app-20141030182900-0006/0
    14/10/30 18:29:03 INFO ExecutorRunner: Runner thread for executor app-20141030182900-0006/0 interrupted
    14/10/30 18:29:03 INFO ExecutorRunner: Killing process!
    14/10/30 18:29:03 ERROR FileAppender: Error writing stream to file /root/spark-1.1.0/work/app-20141030182900-0006/0/stderr
    java.io.IOException: Stream Closed
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:214)
        at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
    14/10/30 18:29:04 INFO Worker: Executor app-20141030182900-0006/0 finished with state KILLED exitStatus 143
    14/10/30 18:29:04 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.180.49.228%3A52120-22#1336571562] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
    14/10/30 18:29:04 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@node001:60184] -> [akka.tcp://sparkExecutor@node001:37697]: Error [Association failed with [akka.tcp://sparkExecutor@node001:37697]] [
    akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@node001:37697]
    Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: node001/10.180.49.228:37697
    ]
    (the same AssociationError block repeats two more times)

Thanks!
akka connection refused bug, fix?
Hi, I saw the same issue as this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html. Does anyone have a fix for this bug? Please! The log info on my worker node looks like:

    14/10/30 20:15:18 INFO Worker: Asked to kill executor app-20141030201514-/0
    14/10/30 20:15:18 INFO ExecutorRunner: Runner thread for executor app-20141030201514-/0 interrupted
    14/10/30 20:15:18 INFO ExecutorRunner: Killing process!
    14/10/30 20:15:18 INFO Worker: Executor app-20141030201514-/0 finished with state KILLED exitStatus 1
    14/10/30 20:15:18 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.180.49.228%3A47087-2#-814958390] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
    14/10/30 20:15:18 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@node001:42816] -> [akka.tcp://sparkExecutor@node001:35811]: Error [Association failed with [akka.tcp://sparkExecutor@node001:35811]] [
    akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@node001:35811]
    Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: node001/10.180.49.228:35811
    ]
    (the same AssociationError block repeats two more times)
Re: akka connection refused bug, fix?
I followed this: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602 but the problem was not fixed.
Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
Thanks, Daniil! If I use --spark-git-repo, is there a way to specify the mvn command-line parameters, like the following?

    mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
    mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests clean package
Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
I modified the pom files in my private repo to use those parameters as defaults, to work around the problem. But after the deployment, I found the installed version is not the customized version but an official one. Could anyone please give a hint on how spark-ec2 works with Spark from private repos?
stage failure: Task 0 in stage 0.0 failed 4 times
What could cause this type of 'stage failure'? Thanks! This is a simple PySpark script to list data in HBase. Command line:

    ./spark-submit --driver-class-path ~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py

    14/10/21 17:53:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-***.ec2.internal:35201 (size: 1470.0 B, free: 265.4 MB)
    14/10/21 17:53:50 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
    14/10/21 17:53:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[1] at map at PythonHadoopUtil.scala:185)
    14/10/21 17:53:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:34050/user/Executor#681287499] with ID 0
    14/10/21 17:53:53 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-.internal, ANY, 1264 bytes)
    14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:47483/user/Executor#-936252397] with ID 1
    14/10/21 17:53:53 INFO BlockManagerMasterActor: Registering block manager ip-2.internal:49236 with 3.1 GB RAM
    14/10/21 17:53:54 INFO BlockManagerMasterActor: Registering block manager ip-.ec2.internal:36699 with 3.1 GB RAM
    14/10/21 17:53:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-.ec2.internal): java.lang.IllegalStateException: unread block data
        java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
    14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, ip-.internal, ANY, 1264 bytes)
    14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor ip-.internal: java.lang.IllegalStateException (unread block data) [duplicate 1]
    14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, ip-.internal, ANY, 1264 bytes)
    14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on executor ip-.internal: java.lang.IllegalStateException (unread block data) [duplicate 2]
    14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, ip-.internal, ANY, 1264 bytes)
    14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on executor ip-2.internal: java.lang.IllegalStateException (unread block data) [duplicate 3]
    14/10/21 17:53:54 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
    14/10/21 17:53:54 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    14/10/21 17:53:54 INFO TaskSchedulerImpl: Cancelling stage 0
    14/10/21 17:53:54 INFO DAGScheduler: Failed to run first at SerDeUtil.scala:70
    Traceback (most recent call last):
      File "/root/workspace/test/sparkhbase.py", line 17, in <module>
        conf=conf2)
      File "/root/spark/python/pyspark/context.py", line 471, in newAPIHadoopRDD
        jconf, batchSize)
      File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-internal): java.lang.IllegalStateException: unread block data
        java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        ...
Re: stage failure: Task 0 in stage 0.0 failed 4 times
Maybe set up an HBase jar in the conf?
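To make that suggestion concrete: the 'unread block data' error is typically a driver/executor classpath mismatch, so the jar that is on the driver class path needs to reach the executors as well. A minimal sketch, reusing the jar path from the original post (adjust to your layout):

===
./spark-submit \
  --driver-class-path ~/spark-examples-1.1.0-hadoop2.3.0.jar \
  --jars ~/spark-examples-1.1.0-hadoop2.3.0.jar \
  /root/workspace/test/sparkhbase.py
===

--jars ships the jar to the executors; setting spark.executor.extraClassPath in the app's SparkConf is the other common route.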
Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
Thanks for the help!

Hadoop version: 2.3.0
HBase version: 0.98.1
We use Python to read/write data from/to HBase. The only change over the official Spark 1.1.0 is the pom file under examples.

Compilation:

spark:          mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
spark/examples: mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests clean package

I am wondering how I can deploy this version of Spark to a new EC2 cluster. I tried

./spark-ec2 -k sparkcluster -i ~/sparkcluster.pem -s 1 -v 1.1.0 --hadoop-major-version=2.3.0 --worker-instances=2 -z us-east-1d launch sparktest1

but this version got a type mismatch error when I read HBase data.
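A hedged sketch of one approach: spark-ec2 around 1.1.0 could check out and build from an arbitrary git repository via --spark-git-repo, with a commit hash supplied as the version, so a fork carrying the modified examples pom can be deployed directly. The repository URL below is a placeholder, and flag support should be confirmed against ./spark-ec2 --help for your checkout:

===
./spark-ec2 -k sparkcluster -i ~/sparkcluster.pem -s 1 \
  --spark-git-repo=https://github.com/yourname/spark \
  -v <commit-hash> \
  --hadoop-major-version=2.3.0 --worker-instances=2 \
  -z us-east-1d launch sparktest1
===

The type mismatch on reading HBase data suggests the cluster was running a stock build rather than the patched one, which deploying from the fork would rule out.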
example.jar caused exception when running pi.py, spark 1.1
I created an EC2 cluster using the spark-ec2 command. If I run the pi.py example on the cluster without using the example.jar, it works. But if I add the example.jar as the driver class path (something like the following), it fails with an exception. Could anyone help with this? What is the cause of the problem? The compilation looked fine to me, but since I don't have much Java experience, I cannot figure out why adding the driver class path would cause a conflict. Thanks!

./spark-submit --driver-class-path /root/workspace/test/spark-examples-1.1.0-SNAPSHOT-hadoop2.3.0.jar /root/workspace/test/pi.py

14/10/20 20:37:28 INFO spark.HttpServer: Starting HTTP Server
14/10/20 20:37:28 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/10/20 20:37:28 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:46426
14/10/20 20:37:28 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/10/20 20:37:28 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/10/20 20:37:28 INFO ui.SparkUI: Started SparkUI at http://:4040

Traceback (most recent call last):
  File "/root/workspace/test/pi.py", line 7, in <module>
    sc = SparkContext(conf=conf)
  File "/root/spark/python/pyspark/context.py", line 134, in __init__
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/root/spark/python/pyspark/context.py", line 178, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
  File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:55)
        at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:182)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:235)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:249)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
        at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
        at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
        ... 13 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
        ... 20 more
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
        at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method)
        at org.apache.hadoop.security.JniBasedUnixGroupsMapping.<clinit>(JniBasedUnixGroupsMapping.java:49)
        at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.<init>(JniBasedUnixGroupsMappingWithFallback.java:38)
        ... 25 more
Re: example.jar caused exception when running pi.py, spark 1.1
Fixed by recompiling. Thanks.
EC2 cluster set up and access to HBase in a different cluster
The plan is to create an EC2 cluster and run (Py)Spark on it. Input data comes from S3; output data goes to an HBase in a persistent cluster (also on EC2). My questions are:

1. I need to install some software packages on all the workers (sudo apt-get install ...). Is there a better way to do this than going to every node and installing them manually?
2. I assume Spark can access an HBase that is in a different cluster. Am I correct? If yes, how?

Thanks!
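On question 2: assuming the Spark workers can reach the HBase cluster's ZooKeeper and region servers (security groups open for port 2181 and the region server ports), reading from the other cluster should mostly come down to the quorum setting. A sketch along the lines of the hbase_inputformat example, with a placeholder hostname and table name:

===
conf2 = {
    # placeholder: the ZooKeeper host(s) of the *other* (HBase) cluster
    "hbase.zookeeper.quorum": "hbase-zk.ec2.internal",
    "hbase.mapreduce.inputtable": "mytable",
}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf2)
===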
Re: EC2 cluster set up and access to HBase in a different cluster
Maybe I should create a private AMI for my question no. 1? Assuming I use the default instance type as the base image. Has anyone tried this?
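Short of a private AMI, one stopgap for question 1 is to loop over the slaves file that spark-ec2 leaves on the master. A sketch, assuming the conventional /root/spark-ec2/slaves location; note the stock spark-ec2 AMI is Amazon Linux, so it would be yum rather than apt-get:

===
for host in $(cat /root/spark-ec2/slaves); do
  # install in parallel on every worker; host key checking off for first contact
  ssh -o StrictHostKeyChecking=no "$host" 'sudo yum install -y <package>' &
done
wait
===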
performance comparison: join vs cogroup?
For two large key-value data sets with the same set of keys, what is the fastest way to join them into one? Suppose all keys are unique within each data set, and we only care about keys that appear in both. Input data: (k, v1) and (k, v2). Desired output: (k, (v1, v2)). I don't mind using cogroup if it's faster, because only minor work is needed to convert its result into the format I need. Join is more straightforward, but I think join assumes keys are not unique, so there could be some performance loss there (I might be wrong). Thanks.
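For unique keys the two are interchangeable after a small mapValues; a quick sketch of both forms (an inner join already keeps only the keys present on both sides):

===
pairs1 = sc.parallelize([("a", 1), ("b", 2)])
pairs2 = sc.parallelize([("a", 10), ("c", 30)])

# join: yields (k, (v1, v2)) directly, only for keys in both RDDs
joined = pairs1.join(pairs2)          # [("a", (1, 10))]

# cogroup: yields (k, (values1, values2)) for the union of keys,
# so filter out one-sided keys and unpack the singleton value lists
cogrouped = (pairs1.cogroup(pairs2)
             .mapValues(lambda vs: (list(vs[0]), list(vs[1])))
             .filter(lambda kv: kv[1][0] and kv[1][1])
             .mapValues(lambda vs: (vs[0][0], vs[1][0])))
===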
Re: Spark 1.1.0 hbase_inputformat.py does not work
I don't know if it's relevant, but I had to compile Spark against my specific HBase and Hadoop versions to make hbase_inputformat.py work.
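Concretely, the version pinning I mean is along these lines (a sketch; whether the examples pom in your branch exposes an hbase.version property, and which HBase version string applies, should be checked first):

===
mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 \
    -Dhadoop.version=2.3.0 -Dhbase.version=0.98.1-cdh5.1.0 \
    -DskipTests clean package
===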
spark 1.1 examples build failure on cdh 5.1
This is a Maven build.

[ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.1.0: Could not find artifact org.apache.hbase:hbase:jar:0.98.1 in central (https://repo1.maven.org/maven2) -> [Help 1]

It seems Maven cannot find the jar in the central repo. Any idea how to solve this? Thanks!
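Two things may be going on here, as far as I can tell. HBase 0.98 was split into modules (hbase-client, hbase-server, ...), so a monolithic org.apache.hbase:hbase *jar* for 0.98.1 does not exist in Central at all; the examples pom would need to depend on the module artifacts instead. And if what you actually want is a CDH build (e.g. 0.98.1-cdh5.1.0), that lives in Cloudera's repository rather than Central. A hedged pom sketch for the latter (id and URL written from memory, so verify):

===
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
===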
request to merge the pull request #1893 to master
We are working on a project that needs Python + Spark to work on HDFS and HBase data. We would like to use a not-too-old version of HBase, such as HBase 0.98.x. We have tried many different ways (and platforms) to compile and test the Spark 1.1 official release, but hit all sorts of issues. The only version that compiled successfully and passed our tests is this pull request. Is anyone still working on this pull request? Could it be merged and make it into an official release some time soon? Thanks!
How to ship cython library to workers?
I have a library written in Cython and C. I'm wondering whether it can be shipped to workers that don't have Cython installed. Maybe create an egg package from the library? How?
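An egg should work, with the caveat that a compiled Cython/C extension is platform- and Python-version-specific, so the workers need to match whatever machine built it (Cython itself is only needed at build time). A minimal sketch with made-up names (mylib, fast.c):

===
# setup.py for a hypothetical package "mylib" with a compiled extension
from setuptools import setup, Extension

setup(
    name="mylib",
    version="0.1",
    packages=["mylib"],
    # fast.c is the C file pre-generated from the .pyx by cython
    ext_modules=[Extension("mylib.fast", ["mylib/fast.c"])],
)

# build:  python setup.py bdist_egg
# ship:   sc.addPyFile("dist/mylib-0.1-py2.7-linux-x86_64.egg")
#         (or pass it via the pyFiles argument of SparkContext)
===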
spark 1.1 failure. class conflict?
I'm a newbie to Java, so please be specific on how to resolve this. The command I was running is:

$ ./spark-submit --driver-class-path /home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/lib/spark-examples-1.1.0-hadoop2.3.0.jar /home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/hbase_inputformat.py quickstart.cloudera data1

14/09/12 14:12:07 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to ':/usr/lib/hbase/hbase-protocol-0.98.1-cdh5.1.0.jar:/usr/lib/hbase/hbase-protocol-0.98.1-cdh5.1.0.jar' as a work-around.

Traceback (most recent call last):
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/hbase_inputformat.py", line 61, in <module>
    sc = SparkContext(appName="HBaseInputFormat")
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py", line 155, in _do_init
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/java_gateway.py", line 701, in __call__
    self._fqn)
  File "/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:300)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:298)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:298)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:286)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:286)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:158)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
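The exception spells out the conflict: SPARK_CLASSPATH (likely set by the CDH environment) and --driver-class-path are both present, and Spark 1.1 refuses the combination. One workaround is clearing the variable for the single invocation; this only helps if it comes from your shell environment, so if it is set in conf/spark-env.sh, remove it there instead:

===
unset SPARK_CLASSPATH
./spark-submit --driver-class-path /usr/lib/hbase/hbase-protocol-0.98.1-cdh5.1.0.jar \
  /home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/hbase_inputformat.py \
  quickstart.cloudera data1
===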
Re: spark 1.1 failure. class conflict?
The same command passed in another quick-start VM (v4.7), which has HBase 0.96 installed. Maybe there is some conflict between the newer HBase version and Spark 1.1.0? Just my guess. Thanks.
Re: how to run python examples in spark 1.1?
Just want to provide more information on how I ran the examples.

Environment: Cloudera quickstart VM 5.1.0 (HBase 0.98.1 installed). I created a table called 'data1' and put two records in it. I can see the table and data are fine in the hbase shell. I cloned the Spark repo, checked out the 1.1 branch, and built it by running:

sbt/sbt assembly/assembly
sbt/sbt examples/assembly

The script is basically:

===
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName('testspark_similar_users_v2')
    sc = SparkContext(conf=conf, batchSize=512)
    conf2 = {"hbase.zookeeper.quorum": "localhost",
             "hbase.mapreduce.inputtable": "data1"}
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf2)
    output = hbase_rdd.collect()
    for (k, v) in output:
        print (k, v)
    sc.stop()
===

The error message is:

14/09/10 09:41:52 INFO ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=18 watcher=hconnection
14/09/10 09:41:52 INFO RecoverableZooKeeper: The identifier of this process is 25963@quickstart.cloudera
14/09/10 09:41:52 INFO ClientCnxn: Opening socket connection to server quickstart.cloudera/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
14/09/10 09:41:52 INFO ClientCnxn: Socket connection established to quickstart.cloudera/127.0.0.1:2181, initiating session
14/09/10 09:41:52 INFO ClientCnxn: Session establishment complete on server quickstart.cloudera/127.0.0.1:2181, sessionid = 0x1485b365c450016, negotiated timeout = 4
14/09/10 09:52:32 ERROR TableInputFormat: org.apache.hadoop.hbase.client.NoServerForRegionException: Unable to find region for data1,,99 after 10 tries.
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:980)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:885)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:987)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:889)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
        at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:133)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:96)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1091)
        at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:70)
        at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:441)
        at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

Traceback (most recent call last):
  File "/home/cloudera/workspace/recom_env_testing/bin/sparkhbasecheck.py", line 71, in <module>
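One guess worth checking, not a confirmed fix: NoServerForRegionException can mean the client reached ZooKeeper but could not resolve the region servers registered there, e.g. if HBase registered itself under quickstart.cloudera while the job asks for localhost. Matching the quorum string to the real hostname would test that:

===
conf2 = {
    # guess: use the hostname HBase actually registered under, not localhost
    "hbase.zookeeper.quorum": "quickstart.cloudera",
    "hbase.mapreduce.inputtable": "data1",
}
===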
how to run python examples in spark 1.1?
I'm mostly interested in the HBase examples in the repo. I saw the two examples hbase_inputformat.py and hbase_outputformat.py in the 1.1 branch. Can you show me how to run them? The compile step is done; I tried to run the examples, but failed.
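For later readers, the invocation that appears elsewhere in these threads looks like the following; the jar path is a placeholder for wherever your build put the examples assembly, and the two trailing arguments are the HBase host and table:

===
./bin/spark-submit \
  --driver-class-path /path/to/spark-examples-1.1.0-hadoop2.3.0.jar \
  examples/src/main/python/hbase_inputformat.py quickstart.cloudera data1
===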