Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread freedafeng
Solution:
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "...") 
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "...") 

Got this solution from a Cloudera lady. Thanks, Neerja.
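For reference, a minimal end-to-end sketch of this approach (the bucket and object path are placeholders; the credential property names are the ones reported here, and recent Hadoop S3A documentation also lists fs.s3a.access.key and fs.s3a.secret.key for the same purpose):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('s3a read test')
sc = SparkContext(conf=conf)

# Credential property names as reported in this thread; newer Hadoop releases
# also accept fs.s3a.access.key / fs.s3a.secret.key.
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "...")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "...")

# Placeholder bucket and key; the s3a:// scheme needs the hadoop-aws jar
# (and its AWS SDK dependency) on the classpath.
myRdd = sc.textFile("s3a://my-bucket/path/to/file.json.gz")
print "The count is", myRdd.count()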






Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread freedafeng
Anyone, please? I believe many of us are using Spark 1.6 or higher with
S3...






Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
Tried the following; it still failed the same way. It ran in YARN, CDH 5.8.0.

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "...")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "...")

myRdd = sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")

count = myRdd.count()
print "The count is", count






Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
BTW, I also tried YARN. Same error.

When I ran the script, I used the real credentials for S3, which are omitted
in this post. Sorry about that.






Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
The question is: what is the cause of the problem, and how do I fix it? Thanks.






spark 1.6.0 read s3 files error.

2016-07-27 Thread freedafeng
cdh 5.7.1. pyspark. 

codes: ===
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)

myRdd = sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")

count = myRdd.count()
print "The count is", count

===
standalone mode: command line:

AWS_ACCESS_KEY_ID=??? AWS_SECRET_ACCESS_KEY=??? ./bin/spark-submit
--driver-memory 4G  --master  spark://master:7077 --conf
"spark.default.parallelism=70"  /root/workspace/test/s3.py

Error:
16/07/27 17:27:26 INFO spark.SparkContext: Created broadcast 0 from textFile
at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "/root/workspace/test/s3.py", line 12, in 
count = myRdd.count()
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 1004, in count
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 995, in sum
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 869, in fold
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 771, in collect
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 813, in __call__
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py",
line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
   
org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
@155: invokevirtual
  Reason:
Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is
not assignable to 'org/jets3t/service/model/StorageObject'
  Current Frame:
bci: @155
flags: { }
locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore',
'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object'
}
stack: { 'org/jets3t/service/S3Service', 'java/lang/String',
'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object',
integer }
  Bytecode:
000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
010: 5659 b701 5713 0192 b601 5b2b b601 5b13
020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
030: b400 7db6 00e7 b601 5bb6 015e b901 9802
040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
0a0: 000a 4e2a 2d2b b700 c7b1   
  Exception Handler Table:
bci [0, 116] => handler: 162
bci [117, 159] => handler: 162
  Stackmap Table:
same_frame_extended(@65)
same_frame(@117)
same_locals_1_stack_item_frame(@162,Object[#139])
same_frame(@169)

at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)

.

TIA







Re: build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
Argh, typo: it was supposed to be cdh5.7.0. I reran the command with the fix, but
still get the same error.

build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests
clean package







build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
jdk: 1.8.0_77
scala: 2.10.4
mvn: 3.3.9.

Slightly changed the pom.xml:
$ diff pom.xml pom.original 
130c130
< <hadoop.version>2.6.0-cdh5.7.0-SNAPSHOT</hadoop.version>
---
> <hadoop.version>2.2.0</hadoop.version>
133c133
< <hbase.version>1.2.0-cdh5.7.0-SNAPSHOT</hbase.version>
---
> <hbase.version>0.98.7-hadoop2</hbase.version>


command: build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.6.0
-DskipTests clean package

error: 
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @
spark-core_2.10 ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
spark-core_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 21 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-core_2.10 ---
[INFO] Using zinc server for incremental compilation
[info] Compiling 486 Scala sources and 76 Java sources to
/home/jfeng/workspace/spark-1.6.0/core/target/scala-2.10/classes...
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:22:
object StandardCharsets is not a member of package java.nio.charset
[error] import java.nio.charset.StandardCharsets
[error]^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:23:
object file is not a member of package java.nio
[error] import java.nio.file.Paths
[error] ^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:80:
not found: value StandardCharsets
[error]   ByteStreams.copy(new
ByteArrayInputStream(v.getBytes(StandardCharsets.UTF_8)), jarStream)
[error]^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/TestUtils.scala:95:
not found: value Paths
[error]   val jarEntry = new
JarEntry(Paths.get(directoryPrefix.getOrElse(""), file.getName).toString)
[error]   ^
[error]
/home/jfeng/workspace/spark-1.6.0/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala:43:
value getLoopbackAddress is not a member of object java.net.InetAddress
[error]   val s = new Socket(InetAddress.getLoopbackAddress(), port.get)
[error] 

Thanks in advance.






spark's behavior about failed tasks

2015-08-12 Thread freedafeng
Hello there,

I have a Spark job running in a 20-node cluster. The job is logically simple,
just a mapPartitions and then a sum. The return value of the mapPartitions is
an integer for each partition. The tasks got some random failures (which
could be caused by 3rd-party key-value store connections; the cause is
irrelevant to my question). In more detail:

Description:
1. spark 1.1.1. 
2. 4096 tasks total.
3. 66 failed tasks.

Issue:
Spark seems to be rerunning all 4096 tasks instead of just the 66 failed tasks. It
is currently at 469/4096 (stage 2).

Is this behavior normal? 

Thanks for your help!









Lots of fetch failures on saveAsNewAPIHadoopDataset PairRDDFunctions

2015-03-13 Thread freedafeng
spark 1.1.1 + HBase (CDH 5.3.1). 20 nodes, each with 4 cores and 32G memory; 3
cores and 16G memory were assigned to Spark on each worker node. Standalone
mode. The data set is 3.8 TB. Wondering how to fix this. Thanks!

org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:935)
org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:691)
org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
py4j.Gateway.invoke(Gateway.java:259)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:207)
java.lang.Thread.run(Thread.java:745)






correct way to broadcast a variable

2015-02-12 Thread freedafeng
Suppose I have an object to broadcast and then use it in a mapper function,
something like the following (Python code):

obj2share = sc.broadcast(Some object here)

someRdd.map(createMapper(obj2share)).collect()

The createMapper function will create a mapper function using the shared
object's value. Another way to do this is

someRdd.map(createMapper(obj2share.value)).collect()

Here the createMapper function directly uses the shared object's value to create
the mapper function. Is there a difference, from Spark's side, between the two
methods? If there is no difference at all, I'd prefer the second, because it
hides Spark from the createMapper function.
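For illustration, a sketch of the first pattern (createMapper and do_something
are made-up placeholders). The practical difference between the two calls is
that passing the Broadcast handle and reading .value inside the mapper lets the
object travel via the broadcast mechanism, whereas passing obj2share.value at
the driver serializes the object into every task closure, which defeats the
purpose of broadcasting.

def createMapper(shared):
    # 'shared' is the Broadcast handle; .value is resolved on the worker,
    # so the object is distributed via the broadcast mechanism rather than
    # being shipped inside each serialized task.
    def mapper(record):
        return do_something(record, shared.value)  # do_something is hypothetical
    return mapper

someRdd.map(createMapper(obj2share)).collect()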

Thanks.






What could cause number of tasks to go down from 2k to 1?

2015-01-29 Thread freedafeng
Hi, 

The input data has 2048 partitions. The final step is to load the processed
data into hbase through saveAsNewAPIHadoopDataset(). Every step except the
last one ran in parallel in the cluster. But the last step only has 1 task
which runs on only 1 node using one core. 

Spark 1.1.1. + CDH5.3.0. 

Probably I should set numPartitions in the reduceByKey call to some big
number? I did not set this parameter in the current code. This reduceByKey
call is the one that runs right before the saveAsNewAPIHadoopDataset() call.
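If it helps, a minimal sketch of the change being considered (the RDD, the
reduce function, and the output conf are made-up placeholders):

from operator import add

# Passing an explicit numPartitions keeps the shuffle output spread over many
# partitions instead of collapsing to a single task before the HBase write.
reduced = keyed_rdd.reduceByKey(add, numPartitions=2048)
reduced.saveAsNewAPIHadoopDataset(conf=output_conf)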

Any idea? Thanks!






large data set to get rid of exceeds Integer.MAX_VALUE error

2015-01-26 Thread freedafeng
Hi,

This seems to be a known issue (see here:
http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html)

The data set is about 1.5 TB. There are 14 region servers. I am not
sure how many regions there are for this data set, but very likely each
region will have much more than 2 GB of data. In this case, repartition also
seems like a very expensive action (I would guess), if it is possible in my
cluster at all.
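For reference, the repartition being weighed here would look something like the
sketch below. The partition count is a placeholder, picked so that roughly
1.5 TB / 2000 = ~0.75 GB lands in each partition, comfortably below the 2 GB
block limit; the full shuffle it triggers is exactly the cost worried about above.

# hbase_rdd is the RDD read from HBase; 2000 is a placeholder partition count.
resized = hbase_rdd.repartition(2000)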

Could anyone give some suggestions to get this job done? Thanks!

platform: spark 1.2.0, cdh5.3.0. 

The error is like,
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
(TID 34, node007): java.lang.RuntimeException:
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
at
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307)
at
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
at
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124)
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)

at
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at

Re: Testing if an RDD is empty?

2015-01-15 Thread freedafeng
I think Sampo's thought is to get a function that only tests whether an RDD is
empty. He does not want to know the size of the RDD, and getting the size of
an RDD is expensive for large data sets.

I myself have seen many times that my app threw exceptions because an empty
RDD cannot be saved. This is not a big issue, but it is annoying. Having a cheap
way of testing whether an RDD is empty would be nice if no such thing is
available now.
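In the meantime, a cheap user-side check is possible with take(1), which only
evaluates as many partitions as needed to find a single element; later Spark
releases also added an isEmpty() method on RDDs. A minimal sketch (the RDD and
output path are placeholders):

def rdd_is_empty(rdd):
    # take(1) stops as soon as one element is found, so it avoids a full count().
    return len(rdd.take(1)) == 0

if not rdd_is_empty(my_rdd):
    my_rdd.saveAsTextFile(output_path)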






how to run python app in yarn?

2015-01-14 Thread freedafeng
A CDH 5.3.0 cluster with Spark is set up. Just wondering how to run a Python
application on it.

I used 'spark-submit --master yarn-cluster ./loadsessions.py' but got the
error,

Error: Cluster deploy mode is currently not supported for python
applications.
Run with --help for usage help or --verbose for debug output

Thanks.






Re: how to run python app in yarn?

2015-01-14 Thread freedafeng
Got help from Marcelo and Josh. Now it is running smoothly. In case you need
this info: just use yarn-client instead of yarn-cluster.
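For completeness, the working invocation would then look roughly like this
(script name taken from the original question):

spark-submit --master yarn-client ./loadsessions.py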

Thanks folks!






Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I installed the custom Spark in standalone mode as normal. The master and slaves
started successfully.
However, I got an error when I ran a job. It seems to me from the error message
that some library was compiled against hadoop1, but my Spark was compiled
against hadoop2.

15/01/08 23:27:36 INFO ClientCnxn: Opening socket connection to server
master/10.191.41.253:2181. Will not attempt to authenticate using SASL
(unknown error)
15/01/08 23:27:36 INFO ClientCnxn: Socket connection established to
master/10.191.41.253:2181, initiating session
15/01/08 23:27:36 INFO ClientCnxn: Session establishment complete on server
master/10.191.41.253:2181, sessionid = 0x14acbdae7e60022, negotiated timeout
= 6
Traceback (most recent call last):
  File "/root/workspace/test/sparkhbase.py", line 23, in <module>
    conf=conf2)
  File "/root/spark/python/pyspark/context.py", line 530, in newAPIHadoopRDD
    jconf, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:157)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
at
org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
at
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:500)
at 
org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

If I understand correctly, org.apache.hadoop.mapreduce.JobContext is a class in
hadoop1 but an interface in hadoop2. My question is which library could cause
this problem.

Thanks.







Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I ran the release Spark that ships in cdh5.3.0 but got the same error. Has anyone
tried to run Spark in cdh5.3.0 using its newAPIHadoopRDD?

command: 
spark-submit --master spark://master:7077 --jars
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar
./sparkhbase.py

Error.

2015-01-09 00:02:03,344 INFO  [Thread-2-SendThread(master:2181)]
zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket
connection established to master/10.191.41.253:2181, initiating session
2015-01-09 00:02:03,358 INFO  [Thread-2-SendThread(master:2181)]
zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session
establishment complete on server master/10.191.41.253:2181, sessionid =
0x14acbdae7e60066, negotiated timeout = 6
Traceback (most recent call last):
  File "/root/workspace/test/./sparkhbase.py", line 23, in <module>
    conf=conf2)
  File "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/pyspark/context.py", line 530, in newAPIHadoopRDD
    jconf, batchSize)
  File "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:158)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
at
org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
at
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:500)
at 
org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)







correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
Could anyone share your experience on how to do this?

I have created a cluster and installed cdh5.3.0 on it with basically core +
HBase, but Cloudera installed and configured the Spark in its parcels
anyway. I'd like to install our custom Spark on this cluster to use the
Hadoop and HBase services there. There could potentially be conflicts if this
is not done correctly. Library conflicts are what I worry about most.

I understand this is a special case, but if you know how to do it, please
let me know. Thanks.






spark 1.1 got error when working with cdh5.3.0 standalone mode

2015-01-07 Thread freedafeng
Hi,

I installed the cdh5.3.0 core + HBase in a new EC2 cluster. Then I manually
installed Spark 1.1 on it, but when I started the slaves, I got an error as
follows:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
Error: Could not find or load main class s.rolling.maxRetainedFiles=3

The Spark was compiled against hadoop 2.5 + hbase 0.98.6, as in cdh5.3.0.
Is the error because of some mysterious conflict somewhere? Or should I use
the Spark in cdh5.3.0 to be safe?

Thanks!







executor logging management from python

2014-12-02 Thread freedafeng
Hi, wondering if anyone could help with this. We use an EC2 cluster to run
Spark apps in standalone mode. The default log output goes to
/$spark_folder/work/. This folder is on the 10G root fs, so it won't take
long to fill up the whole fs.

My goals are:
1. Move the logging location to /mnt, where we have 37G of space.
2. Make the log files rotate, meaning the new ones will replace the old
ones after some threshold.

Could you show how to do this? I was trying to change the spark-env.sh file
-- I don't know whether that's the best way to do it.

Thanks!






Re: executor logging management from python

2014-12-02 Thread freedafeng
cat spark-env.sh
--
#!/usr/bin/env bash

export SPARK_WORKER_OPTS=-Dspark.executor.logs.rolling.strategy=time
-Dspark.executor.logs.rolling.time.interval=daily
-Dspark.executor.logs.rolling.maxRetainedFiles=3
export SPARK_LOCAL_DIRS=/mnt/spark
export SPARK_WORKER_DIR=/mnt/spark
--

But the Spark logs still go to the /$Spark_home/work folder. Why doesn't Spark
pick up the changes?
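One possible cause, offered here only as an assumption rather than something
confirmed in this thread, is that the unquoted, line-wrapped export above only
keeps the first -D flag (and the workers need a restart to see any change). A
quoted version would look like:

#!/usr/bin/env bash

# Quote the whole value and escape the line breaks so that all three -D flags
# actually end up in SPARK_WORKER_OPTS, then restart the workers.
export SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time \
  -Dspark.executor.logs.rolling.time.interval=daily \
  -Dspark.executor.logs.rolling.maxRetainedFiles=3"
export SPARK_LOCAL_DIRS=/mnt/spark
export SPARK_WORKER_DIR=/mnt/spark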







logging in workers for pyspark

2014-11-20 Thread freedafeng
Hi, 

I am wondering how to write logging info in a worker when running a pyspark
app. I saw the thread

http://apache-spark-user-list.1001560.n3.nabble.com/logging-in-pyspark-td5458.html

but did not see a solution. Does anybody know of one? Thanks!
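One common workaround, offered here as a sketch rather than something taken from
the linked thread, is to log from inside the function that runs on the executors;
whatever goes to stderr ends up in that executor's stderr file under the work
directory. The RDD name below is a placeholder.

import logging

def process_partition(records):
    # Each executor runs its own Python worker process, so configure logging
    # inside the function; basicConfig writes to stderr by default.
    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("worker")
    n = 0
    for r in records:
        n += 1
    log.warning("processed %d records in this partition", n)
    yield n

counts = some_rdd.mapPartitions(process_partition).collect()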






suggest pyspark using 'with' for sparkcontext to be more 'pythonic'

2014-11-13 Thread freedafeng
It seems SparkContext is a good fit for use with 'with' in Python. A context
manager will do.

example:

with SparkContext(conf=conf, batchSize=512) as sc:



Then it is no longer necessary to write sc.stop().
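A user-side helper along these lines can already be written with contextlib; a
minimal sketch (this is not an existing PySpark API):

from contextlib import contextmanager
from pyspark import SparkContext

@contextmanager
def spark_context(*args, **kwargs):
    # Guarantees sc.stop() runs even if the body of the 'with' block raises.
    sc = SparkContext(*args, **kwargs)
    try:
        yield sc
    finally:
        sc.stop()

# usage:
# with spark_context(conf=conf, batchSize=512) as sc:
#     ...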






Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi, 

This is my code,

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

/**
 * JF: convert a Result object into a string with column family and
 * qualifier names, something like
 * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2'
 * etc.
 * k-v pairs are separated by ';'; the different columns of each cell are
 * separated by ':'.
 * Notice that we don't need the row key here, because it has been converted
 * by ImmutableBytesWritableToStringConverter.
 */
class CustomHBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]

    result.rawCells().map(cell =>
      List(Bytes.toString(CellUtil.cloneFamily(cell)),
           Bytes.toString(CellUtil.cloneQualifier(cell)),
           Bytes.toString(CellUtil.cloneValue(cell))).mkString(":")).mkString(";")
  }
}

I recommend using different delimiters (instead of ':' or ';') if your data
contains those characters. I am not a seasoned Scala programmer, so there might
be a more flexible solution; for example, making the delimiters dynamically
assignable.
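For context, this is roughly how such a converter is wired in from the PySpark
side once it is compiled into a jar and shipped with the job (for example via
spark-submit --jars); the package name for the custom class is an assumption,
and conf2 is the configuration dict from the original question in this thread.

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.CustomHBaseResultToStringConverter",
    conf=conf2)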

I will try to open a PR probably later today.






Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi Nick,

I saw that the HBase API has gone through lots of changes. If I remember
correctly, the default HBase in Spark 1.1.0 is 0.94.6; the one I am using is
0.98.1. To get the column family names and qualifier names, we need to call
different methods for these two versions. I don't know how to do
that... sorry...






pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Hello there,

I am wondering how to get the column family names and column qualifier names
when using pyspark to read an hbase table with multiple column families.

I have a hbase table as follows,
hbase(main):007:0> scan 'data1'
ROW   COLUMN+CELL   
 row1 column=f1:, timestamp=1411078148186, value=value1 
 row1 column=f2:, timestamp=1415732470877, value=value7 
 row2 column=f2:, timestamp=1411078160265, value=value2 

when I ran the examples/hbase_inputformat.py code: 
conf2 = {"hbase.zookeeper.quorum": "localhost",
         "hbase.mapreduce.inputtable": "data1"}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf2)
output = hbase_rdd.collect()
for (k, v) in output:
    print (k, v)
I only see 
(u'row1', u'value1')
(u'row2', u'value2')

What I really want is (row_id, column family:column qualifier, value)
tuples. Any comments? Thanks!






Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Checked the source and found the following:

class HBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    Bytes.toStringBinary(result.value())
  }
}

I feel that using 'result.value()' here is a big limitation. Converting from the
Result's 'list()' is more general and easier to use.






Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Just wrote a custom converter in Scala to replace HBaseResultToStringConverter.
Just a couple of lines of code.






Re: How to ship cython library to workers?

2014-11-04 Thread freedafeng
Thanks for the solution! I did figure out how to create an .egg file to ship
out to the workers. Using ipython seems to be another cool solution.
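A minimal sketch of the egg route, with made-up file and module names (the egg
itself would be built with something like 'python setup.py bdist_egg'):

sc.addPyFile("dist/mylib-0.1-py2.7.egg")    # ships the egg to every worker

def process(record):
    import mylib                            # resolved on the worker from the shipped egg
    return mylib.transform(record)

result = some_rdd.map(process).collect()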






Re: akka connection refused bug, fix?

2014-11-03 Thread freedafeng
Does anyone have experience or advice on how to fix this problem? Highly appreciated!






IllegalStateException: unread block data

2014-11-03 Thread freedafeng
Hello there,

Just set up an EC2 cluster with no HDFS, Hadoop, or HBase whatsoever. I just
installed Spark to read/process data from an HBase in a different cluster.
The Spark was built against the HBase/Hadoop versions in the remote (EC2)
HBase cluster, which are 0.98.1 and 2.3.0 respectively.

But I got the following error when running a simple test Python script. The
command line:
./spark-submit --master  spark://master:7077 --driver-class-path
./spark-examples-1.1.0-hadoop2.3.0.jar ~/workspace/test/sparkhbase.py

From the worker log, I can see the worker node got the request from the
master.

Can anyone help with this problem? Tons of thanks!


java.lang.IllegalStateException: unread block data
   
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2399)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
   
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
   
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1776)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
   
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:679)






stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
Hi, 

Got this error when running Spark 1.1.0 to read HBase 0.98.1 through simple
Python code in an EC2 cluster. The same program runs correctly in local mode,
so this error only happens when running in a real cluster.

Here's what I got:

14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1, node001, ANY, 1265 bytes)
14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on
executor node001: java.lang.IllegalStateException (unread block data)
[duplicate 1]
14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID
2, node001, ANY, 1265 bytes)
14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on
executor node001: java.lang.IllegalStateException (unread block data)
[duplicate 2]
14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID
3, node001, ANY, 1265 bytes)
14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on
executor node001: java.lang.IllegalStateException (unread block data)
[duplicate 3]
14/10/30 17:51:53 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
14/10/30 17:51:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool 
14/10/30 17:51:53 INFO TaskSchedulerImpl: Cancelling stage 0
14/10/30 17:51:53 INFO DAGScheduler: Failed to run first at
SerDeUtil.scala:70
Traceback (most recent call last):
  File "/root/workspace/test/sparkhbase.py", line 22, in <module>
    conf=conf2)
  File "/root/spark-1.1.0/python/pyspark/context.py", line 471, in newAPIHadoopRDD
    jconf, batchSize)
  File "/root/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, node001): java.lang.IllegalStateException: unread block data
   
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2399)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
   
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
   
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1776)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
   
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:679)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at

Re: stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
The worker side has an error message like this:

14/10/30 18:29:00 INFO Worker: Asked to launch executor
app-20141030182900-0006/0 for testspark_v1
14/10/30 18:29:01 INFO ExecutorRunner: Launch command: java -cp
::/root/spark-1.1.0/conf:/root/spark-1.1.0/assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.3.0.jar
-XX:MaxPermSize=128m -Dspark.driver.port=52552 -Xms512M -Xmx512M
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@master:52552/user/CoarseGrainedScheduler 0
node001 4 akka.tcp://sparkWorker@node001:60184/user/Worker
app-20141030182900-0006
14/10/30 18:29:03 INFO Worker: Asked to kill executor
app-20141030182900-0006/0
14/10/30 18:29:03 INFO ExecutorRunner: Runner thread for executor
app-20141030182900-0006/0 interrupted
14/10/30 18:29:03 INFO ExecutorRunner: Killing process!
14/10/30 18:29:03 ERROR FileAppender: Error writing stream to file
/root/spark-1.1.0/work/app-20141030182900-0006/0/stderr
java.io.IOException: Stream Closed
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:214)
at
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
14/10/30 18:29:04 INFO Worker: Executor app-20141030182900-0006/0 finished
with state KILLED exitStatus 143
14/10/30 18:29:04 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.180.49.228%3A52120-22#1336571562]
was not delivered. [6] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/10/30 18:29:04 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:60184] -
[akka.tcp://sparkExecutor@node001:37697]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:37697]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:37697]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:37697
]
14/10/30 18:29:04 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:60184] -
[akka.tcp://sparkExecutor@node001:37697]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:37697]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:37697]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:37697
]
14/10/30 18:29:04 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:60184] -
[akka.tcp://sparkExecutor@node001:37697]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:37697]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:37697]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:37697
]

Thanks!








akka connection refused bug, fix?

2014-10-30 Thread freedafeng
Hi, I saw the same issue as in this thread:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html

Does anyone have a fix for this bug? Please?!

The log info on my worker node looks like this:

14/10/30 20:15:18 INFO Worker: Asked to kill executor
app-20141030201514-/0
14/10/30 20:15:18 INFO ExecutorRunner: Runner thread for executor
app-20141030201514-/0 interrupted
14/10/30 20:15:18 INFO ExecutorRunner: Killing process!
14/10/30 20:15:18 INFO Worker: Executor app-20141030201514-/0 finished
with state KILLED exitStatus 1
14/10/30 20:15:18 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.180.49.228%3A47087-2#-814958390]
was not delivered. [1] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/10/30 20:15:18 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:42816] -
[akka.tcp://sparkExecutor@node001:35811]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:35811]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:35811]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:35811
]
14/10/30 20:15:18 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:42816] -
[akka.tcp://sparkExecutor@node001:35811]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:35811]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:35811]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:35811
]
14/10/30 20:15:18 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:42816] -
[akka.tcp://sparkExecutor@node001:35811]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:35811]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:35811]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:35811







Re: akka connection refused bug, fix?

2014-10-30 Thread freedafeng
Followed this:

http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602

but the problem was not fixed.






Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
Thanks, Daniil! If I use --spark-git-repo, is there a way to specify the mvn
command-line parameters? Like the following:
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests
clean package






Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
I modified the pom files in my private repo to use those parameters as
defaults to solve the problem. But after the deployment, I found the
installed version is not the customized version but an official one. Could anyone
please give a hint on how spark-ec2 works with Spark from private repos?






stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
What could cause this type of 'stage failure'? Thanks!

This is a simple PySpark script to list data in HBase.
command line: ./spark-submit --driver-class-path
~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py 

14/10/21 17:53:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
on ip-***.ec2.internal:35201 (size: 1470.0 B, free: 265.4 MB)
14/10/21 17:53:50 INFO BlockManagerMaster: Updated info of block
broadcast_2_piece0
14/10/21 17:53:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0
(MappedRDD[1] at map at PythonHadoopUtil.scala:185)
14/10/21 17:53:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:34050/user/Executor#681287499]
with ID 0
14/10/21 17:53:53 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:47483/user/Executor#-936252397]
with ID 1
14/10/21 17:53:53 INFO BlockManagerMasterActor: Registering block manager
ip-2.internal:49236 with 3.1 GB RAM
14/10/21 17:53:54 INFO BlockManagerMasterActor: Registering block manager
ip-.ec2.internal:36699 with 3.1 GB RAM
14/10/21 17:53:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
ip-.ec2.internal): java.lang.IllegalStateException: unread block data
   
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
   
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on
executor ip-.internal: java.lang.IllegalStateException (unread block data)
[duplicate 1]
14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID
2, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on
executor ip-.internal: java.lang.IllegalStateException (unread block data)
[duplicate 2]
14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID
3, ip-.internal, ANY, 1264 bytes)
14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on
executor ip-2.internal: java.lang.IllegalStateException (unread block data)
[duplicate 3]
14/10/21 17:53:54 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
14/10/21 17:53:54 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool 
14/10/21 17:53:54 INFO TaskSchedulerImpl: Cancelling stage 0
14/10/21 17:53:54 INFO DAGScheduler: Failed to run first at
SerDeUtil.scala:70
Traceback (most recent call last):
  File "/root/workspace/test/sparkhbase.py", line 17, in <module>
    conf=conf2)
  File "/root/spark/python/pyspark/context.py", line 471, in newAPIHadoopRDD
    jconf, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, ip-internal): java.lang.IllegalStateException: unread block data
   
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
   
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)

Re: stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
Maybe the HBase/examples jar needs to be set up in the conf, so the executors see it too?
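
A minimal sketch of what I mean, assuming the spark-examples jar (which carries
the HBase classes and converters) sits at the same path on every node; the path
and app name below are placeholders:

from pyspark import SparkConf, SparkContext

# Put the jar on the executors' classpath as well, not just the driver's.
# Placeholder path: point it at the jar that actually exists on the nodes.
conf = (SparkConf()
        .setAppName("sparkhbase")
        .set("spark.executor.extraClassPath",
             "/root/spark-examples-1.1.0-hadoop2.3.0.jar"))
sc = SparkContext(conf=conf)

I think the 'unread block data' error often means the classes or versions the
driver serialized with are not on the executors' classpath, so
--driver-class-path on its own may not be enough.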



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-Task-0-in-stage-0-0-failed-4-times-tp16928p16929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-21 Thread freedafeng
Thanks for the help!

Hadoop version: 2.3.0
Hbase version: 0.98.1

We use Python to read/write data from/to HBase.

The only change over the official Spark 1.1.0 is the pom file under examples.
Compilation: 
spark:mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean
package
spark/examples:mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2
-Dhadoop.version=2.3.0 -DskipTests clean package

I am wondering how I can deploy this version of Spark to a new EC2 cluster.
I tried 
./spark-ec2 -k sparkcluster -i ~/sparkcluster.pem -s 1 -v 1.1.0
--hadoop-major-version=2.3.0 --worker-instances=2  -z us-east-1d launch
sparktest1

but the deployed version gave a type mismatch error when I read HBase data.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
I created an EC2 cluster using the spark-ec2 command. If I run the pi.py
example in the cluster without using the examples jar, it works. But if I add
the examples jar as the driver classpath (something like the following), it
fails with an exception. Could anyone help with this? What is the cause of the
problem? The compilation looked fine to me, but since I don't have much Java
experience, I cannot figure out why adding the driver classpath can cause any
conflict. Thanks!

./spark-submit --driver-class-path
/root/workspace/test/spark-examples-1.1.0-SNAPSHOT-hadoop2.3.0.jar
/root/workspace/test/pi.py 

14/10/20 20:37:28 INFO spark.HttpServer: Starting HTTP Server
14/10/20 20:37:28 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/10/20 20:37:28 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:46426
14/10/20 20:37:28 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/10/20 20:37:28 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/10/20 20:37:28 INFO ui.SparkUI: Started SparkUI at
http://:4040
Traceback (most recent call last):
  File "/root/workspace/test/pi.py", line 7, in <module>
    sc = SparkContext(conf=conf)
  File "/root/spark/python/pyspark/context.py", line 134, in __init__
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/root/spark/python/pyspark/context.py", line 178, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
  File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
at org.apache.spark.SparkContext.init(SparkContext.scala:228)
at
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException:
java.lang.reflect.InvocationTargetException
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.hadoop.security.Groups.init(Groups.java:55)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:182)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:235)
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:249)
at 
org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
at
org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
at 
org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
... 13 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
... 20 more
Caused by: java.lang.UnsatisfiedLinkError:
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
at 
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native
Method)
at
org.apache.hadoop.security.JniBasedUnixGroupsMapping.clinit(JniBasedUnixGroupsMapping.java:49)
at
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.init(JniBasedUnixGroupsMappingWithFallback.java:38)
... 25 more





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/example-jar-caused-exception-when-running-pi-py-spark-1-1-tp16849.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
Fixed by recompiling. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/example-jar-caused-exception-when-running-pi-py-spark-1-1-tp16849p16862.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
The plan is to create an EC2 cluster and run (py)spark on it. Input data comes
from S3, and output data goes to an HBase in a persistent cluster (also on
EC2). My questions are:

1. I need to install some software packages on all the workers (sudo apt-get
install ...). Is there a better way to do this than going to every node and
installing them manually?

2. I assume Spark can access the HBase that lives in a different cluster. Am I
correct? If yes, how? A rough sketch of what I have in mind is below.

Thanks!
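
For question 2, here is that sketch, assuming the other cluster's ZooKeeper
quorum is reachable from the Spark workers (the hostname and table name are
placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("read_remote_hbase")
sc = SparkContext(conf=conf)

# Point the TableInputFormat at the *other* cluster's ZooKeeper quorum.
hbase_conf = {"hbase.zookeeper.quorum": "zk1.hbase-cluster.internal",
              "hbase.mapreduce.inputtable": "mytable"}

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=hbase_conf)

print "rows:", hbase_rdd.count()
sc.stop()

The security groups of both clusters would also need to allow the ZooKeeper and
region server ports between them.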



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EC2-cluster-set-up-and-access-to-HBase-in-a-different-cluster-tp16622.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
Maybe I should create a private AMI to handle my question No. 1, assuming I
use the default instance type as the base image. Has anyone tried this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EC2-cluster-set-up-and-access-to-HBase-in-a-different-cluster-tp16622p16628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



performance comparison: join vs cogroup?

2014-10-06 Thread freedafeng
For two large key-value data sets, if they have the same set of keys, what is
the fastest way to join them into one?  Suppose all keys are unique in each
data set, and we only care about those keys that appear in both data sets. 

input data I have: (k, v1) and (k, v2)

data I want to get from the input: (k, (v1, v2)). 

I don't mind using cogroup if it's faster, because only minor work is needed
to convert its output into the format I need. Join is more straightforward,
but join is built to handle non-unique keys, so there could be some
performance loss there (I might be wrong here).

Thanks.
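
For what it's worth, a small sketch of the two options on made-up toy data,
just to show the output shapes:

from pyspark import SparkContext

sc = SparkContext(appName="join_vs_cogroup")

rdd1 = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])   # (k, v1), keys unique
rdd2 = sc.parallelize([("a", 10), ("b", 20)])           # (k, v2), keys unique

# join gives (k, (v1, v2)) directly, only for keys present in both RDDs.
joined = rdd1.join(rdd2)        # [('a', (1, 10)), ('b', (2, 20))]

# cogroup gives (k, (values_from_rdd1, values_from_rdd2)); keys missing on one
# side come back with an empty group, so they have to be filtered out before
# unpacking the single values.
cogrouped = (rdd1.cogroup(rdd2)
                 .mapValues(lambda vs: (list(vs[0]), list(vs[1])))
                 .filter(lambda kv: kv[1][0] and kv[1][1])
                 .mapValues(lambda vs: (vs[0][0], vs[1][0])))

print sorted(joined.collect())
print sorted(cogrouped.collect())
sc.stop()

I believe join is implemented on top of the same grouping machinery, so I would
not expect a large difference; join is simply the more direct way to get
(k, (v1, v2)).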



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/performance-comparison-join-vs-cogroup-tp15823.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1.0 hbase_inputformat.py not work

2014-09-23 Thread freedafeng
I don't know if it's relevant, but I had to compile spark for my specific
hbase and hadoop version to make that hbase_inputformat.py work.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-hbase-inputformat-py-not-work-tp14905p14912.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark 1.1 examples build failure on cdh 5.1

2014-09-18 Thread freedafeng
This is a mvn build.

[ERROR] Failed to execute goal on project spark-examples_2.10: Could not
resolve dependencies for project
org.apache.spark:spark-examples_2.10:jar:1.1.0: Could not find artifact
org.apache.hbase:hbase:jar:0.98.1 in central
(https://repo1.maven.org/maven2) - [Help 1]
[ERROR] 

It seems maven cannot find the jar in the central repo. Any idea how to
solve this? Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-examples-build-failure-on-cdh-5-1-tp14608.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



request to merge the pull request #1893 to master

2014-09-18 Thread freedafeng
We are working on a project that needs Python + Spark to work on HDFS and
HBase data. We would like to use a not-too-old version of HBase, such as HBase
0.98.x. We have tried many different ways (and platforms) to compile and test
the official Spark 1.1 release, but ran into all sorts of issues. The only
version that compiled successfully and passed our tests is this pull request.

Is anyone still working on this pull request? Could it be merged and make it
into an official release sometime soon?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/request-to-merge-the-pull-request-1893-to-master-tp14633.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to ship cython library to workers?

2014-09-17 Thread freedafeng
I have a library written in Cython and C. I am wondering whether it can be
shipped to the workers, which don't have Cython installed. Maybe create an egg
package from this library? How?
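
A rough sketch of the egg route (the names and paths are made up, and the egg
would have to be built on a machine matching the workers' platform and Python,
since it contains compiled C code):

# Build the egg, e.g.:   python setup.py bdist_egg
# which might produce:   dist/mylib-0.1-py2.7-linux-x86_64.egg
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("cython_lib_test")
sc = SparkContext(conf=conf)

# Ship the egg to every executor and add it to their sys.path.
sc.addPyFile("/path/to/mylib-0.1-py2.7-linux-x86_64.egg")

def use_lib(x):
    import mylib              # resolved on the worker from the shipped egg
    return mylib.some_function(x)

print sc.parallelize(range(10)).map(use_lib).collect()
sc.stop()

One caveat I am not sure about: compiled .so files generally cannot be imported
straight out of a zip, so if the egg is not unpacked on the workers this may
still fail, and installing the library on each node could be the fallback. The
same egg can also be passed with spark-submit's --py-files option.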



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-ship-cython-library-to-workers-tp14467.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
I am a newbie with Java, so please be specific about how to resolve this.

The command I was running is:

$ ./spark-submit --driver-class-path
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/lib/spark-examples-1.1.0-hadoop2.3.0.jar
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/hbase_inputformat.py
 
quickstart.cloudera data1

14/09/12 14:12:07 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
':/usr/lib/hbase/hbase-protocol-0.98.1-cdh5.1.0.jar:/usr/lib/hbase/hbase-protocol-0.98.1-cdh5.1.0.jar'
as a work-around.
Traceback (most recent call last):
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/hbase_inputformat.py", line 61, in <module>
    sc = SparkContext(appName="HBaseInputFormat")
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py", line 155, in _do_init
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/java_gateway.py", line 701, in __call__
    self._fqn)
  File "/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Found both spark.driver.extraClassPath
and SPARK_CLASSPATH. Use only the former.
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:300)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:298)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:298)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:286)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:286)
at org.apache.spark.SparkContext.init(SparkContext.scala:158)
at
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-failure-class-conflict-tp14127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
The same command passed in another quickstart VM (v4.7), which has HBase 0.96
installed. Maybe there is some conflict between the newer HBase version and
Spark 1.1.0? Just my guess.
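
For what it's worth, the exception message itself points at the collision:
--driver-class-path becomes spark.driver.extraClassPath, and something in the
environment (CDH's spark-env.sh, I suspect) still exports the deprecated
SPARK_CLASSPATH, and Spark 1.1 refuses to honour both. A quick diagnostic
sketch, nothing more:

import os

# If this prints a non-empty path, SPARK_CLASSPATH is being inherited by the
# job and will clash with --driver-class-path on Spark 1.1.
print "SPARK_CLASSPATH =", os.environ.get("SPARK_CLASSPATH")

If it does print a path, dropping either that export or the --driver-class-path
flag should make the error go away.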

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-failure-class-conflict-tp14127p14131.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to run python examples in spark 1.1?

2014-09-10 Thread freedafeng
Just want to provide more information on how I ran the examples.

Environment: Cloudera quickstart VM 5.1.0 (HBase 0.98.1 installed). I created
a table called 'data1' and put two records in it. I can see that the table and
data are fine in the HBase shell.

I cloned the Spark repo, checked out the 1.1 branch, and built it by running:
sbt/sbt assembly/assembly
sbt/sbt examples/assembly

The script is basically:

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":

    conf = SparkConf().setAppName('testspark_similar_users_v2')

    sc = SparkContext(conf=conf, batchSize=512)

    conf2 = {"hbase.zookeeper.quorum": "localhost",
             "hbase.mapreduce.inputtable": 'data1'}
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf2)
    output = hbase_rdd.collect()
    for (k, v) in output:
        print (k, v)

    sc.stop()

The error message is,

14/09/10 09:41:52 INFO ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=18 watcher=hconnection
14/09/10 09:41:52 INFO RecoverableZooKeeper: The identifier of this process
is 25963@quickstart.cloudera
14/09/10 09:41:52 INFO ClientCnxn: Opening socket connection to server
quickstart.cloudera/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
14/09/10 09:41:52 INFO ClientCnxn: Socket connection established to
quickstart.cloudera/127.0.0.1:2181, initiating session
14/09/10 09:41:52 INFO ClientCnxn: Session establishment complete on server
quickstart.cloudera/127.0.0.1:2181, sessionid = 0x1485b365c450016,
negotiated timeout = 4
14/09/10 09:52:32 ERROR TableInputFormat:
org.apache.hadoop.hbase.client.NoServerForRegionException: Unable to find
region for data1,,99 after 10 tries.
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:980)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:885)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:987)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:889)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:174)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:133)
at
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:96)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
at org.apache.spark.rdd.RDD.first(RDD.scala:1091)
at
org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:70)
at
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:441)
at 
org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

Traceback (most recent call last):
  File "/home/cloudera/workspace/recom_env_testing/bin/sparkhbasecheck.py", line 71, in <module>

how to run python examples in spark 1.1?

2014-09-09 Thread freedafeng
I'm mostly interested in the HBase examples in the repo. I saw two examples,
hbase_inputformat.py and hbase_outputformat.py, in the 1.1 branch. Can you show
me how to run them?

The compile step is done. I tried to run the examples but failed.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-run-python-examples-in-spark-1-1-tp13841.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org