jdk: 1.8.0_77
scala: 2.10.4
mvn: 3.3.9.
Slightly changed the pom.xml:
$ diff pom.xml pom.original
130c130
< 2.6.0-cdh5.7.0-SNAPSHOT
---
> 2.2.0
133c133
< 1.2.0-cdh5.7.0-SNAPSHOT
---
> 0.98.7-hadoop2
command: build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.6.0
-DskipTests clean package
Argh, typo: it was supposed to be cdh5.7.0. I reran the command with the fix,
but still got the same error.
build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests
clean package
--
CDH 5.7.1, PySpark.
Code:
===
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)
myRdd =
sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")
count = myRdd.count()
The question is: what is the cause of the problem, and how can it be fixed? Thanks.
--
---
BTW, I also tried yarn. Same error.
When I ran the script, I used the real credentials for S3, which are omitted
in this post. Sorry about that.
--
I tried the following. It still failed the same way. It ran on YARN, CDH 5.8.0.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "...")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "...")
Anyone, please? I believe many of us are using Spark 1.6 or higher with
S3...
--
Solution:
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "...")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "...")
Got this solution from Neerja at Cloudera. Thanks, Neerja!
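In case it helps, here is a minimal sketch of how the fix slots into the original
script. The bucket path is a placeholder, and the property names are the ones from
the post above; newer hadoop-aws builds typically expect fs.s3a.access.key and
fs.s3a.secret.key instead.

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('s3a read test')
sc = SparkContext(conf=conf)

# Property names as given in the post; adjust for your hadoop-aws version.
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "...")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "...")

# Note the s3a:// scheme; bucket and path are hypothetical.
myRdd = sc.textFile("s3a://my-bucket/y=2016/m=5/d=26/h=20/part.json.gz")
print(myRdd.count())

sc.stop()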
--
Hello there,
I have a Spark job running in a 20-node cluster. The job is logically simple:
just a mapPartitions and then a sum. The return value of the mapPartitions is
an integer for each partition. The tasks got some random failures (which
could be caused by a 3rd-party key-value store's connections). The
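For reference, here is a minimal sketch of that job shape; the per-partition work
is a hypothetical stand-in for the real key-value store calls.

from pyspark import SparkContext

sc = SparkContext(appName="mapPartitions-then-sum")

data = sc.parallelize(range(100000), 2048)

def process_partition(records):
    # Stand-in for the real work that talks to the external key-value store;
    # each partition is reduced to a single integer.
    count = 0
    for _ in records:
        count += 1
    yield count

total = data.mapPartitions(process_partition).sum()
print(total)
sc.stop()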
Hi,
This seems to be a known issue (see here:
http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html)
The data set is about 1.5 TB. There are 14 region servers. I am not
sure how many regions there are for this data set, but very likely ea
Hi,
The input data has 2048 partitions. The final step is to load the processed
data into HBase through saveAsNewAPIHadoopDataset(). Every step except the
last one ran in parallel in the cluster. But the last step has only 1 task,
which runs on only 1 node using one core.
Spark 1.1.1 + CDH 5.3.0
Suppose I have an object to broadcast and then use it in a mapper function,
something like the following (Python code):
obj2share = sc.broadcast("Some object here")
someRdd.map(createMapper(obj2share)).collect()
The createMapper function will create a mapper function using the shared
object's value. Anothe
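For illustration, a minimal sketch of what createMapper could look like; the lookup
logic is hypothetical, the point is that the mapper closes over the broadcast handle
and reads .value inside the task.

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-mapper")

obj2share = sc.broadcast({"a": 1, "b": 2})

def createMapper(shared):
    # The returned function is what gets shipped to the workers;
    # shared.value is resolved on the executor side.
    def mapper(x):
        return shared.value.get(x, 0)
    return mapper

someRdd = sc.parallelize(["a", "b", "c"])
print(someRdd.map(createMapper(obj2share)).collect())
sc.stop()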
Spark 1.1.1 + HBase (CDH 5.3.1). 20 nodes, each with 4 cores and 32 GB memory; 3
cores and 16 GB memory were assigned to Spark on each worker node. Standalone
mode. The data set is 3.8 TB. Wondering how to fix this. Thanks!
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala
I'm mostly interested in the HBase examples in the repo. I saw two examples,
hbase_inputformat.py and hbase_outputformat.py, in the 1.1 branch. Can you
show me how to run them?
The compile step is done. I tried to run the examples, but failed.
--
Just want to provide more information on how I ran the examples.
Environment: Cloudera QuickStart VM 5.1.0 (HBase 0.98.1 installed). I
created a table called 'data1' and 'put' two records in it. I can see the
table and data are fine in the HBase shell.
I cloned the Spark repo and checked out the 1.1 br
I'm a newbie with Java, so please be specific on how to resolve this.
The command I was running is
$ ./spark-submit --driver-class-path
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/lib/spark-examples-1.1.0-hadoop2.3.0.jar
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/
The same command passed in another QuickStart VM (v4.7) which has HBase 0.96
installed. Maybe there are some conflicts between the newer HBase version and
Spark 1.1.0? Just my guess.
Thanks.
--
I have a library written in Cython and C. I'm wondering if it can be shipped to
the workers, which don't have Cython installed. Maybe create an egg package
from this library? How?
--
This is a Maven build.
[ERROR] Failed to execute goal on project spark-examples_2.10: Could not
resolve dependencies for project
org.apache.spark:spark-examples_2.10:jar:1.1.0: Could not find artifact
org.apache.hbase:hbase:jar:0.98.1 in central
(https://repo1.maven.org/maven2) -> [Help 1]
[ERROR]
We are working on a project that needs Python + Spark to work on HDFS and
HBase data. We would like to use a not-too-old version of HBase, such as HBase
0.98.x. We have tried many different ways (and platforms) to compile and
test the official Spark 1.1 release, but got all sorts of issues. The only
version t
I don't know if it's relevant, but I had to compile Spark for my specific
HBase and Hadoop versions to make that hbase_inputformat.py work.
--
For two large key-value data sets, if they have the same set of keys, what is
the fastest way to join them into one? Suppose all keys are unique in each
data set, and we only care about those keys that appear in both data sets.
Input data I have: (k, v1) and (k, v2)
Data I want to get from the
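For illustration, a minimal sketch using the built-in join, assuming pair RDDs of
(k, v1) and (k, v2); whether this is actually the fastest depends on how the data
is partitioned.

from pyspark import SparkContext

sc = SparkContext(appName="join-two-keyed-rdds")

rdd1 = sc.parallelize([("k1", "v1a"), ("k2", "v1b")])
rdd2 = sc.parallelize([("k1", "v2a"), ("k2", "v2b")])

# Inner join keeps only keys present in both data sets; result is (k, (v1, v2)).
# Partitioning both RDDs the same way beforehand avoids one shuffle.
joined = rdd1.join(rdd2)
print(joined.collect())
sc.stop()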
The plan is to create an EC2 cluster and run (Py)Spark on it. Input data
is from S3, output data goes to an HBase in a persistent cluster (also EC2).
My questions are:
1. I need to install some software packages on all the workers (sudo apt-get
install ...). Is there a better way to do this t
Boto?
This is not a Spark question, but a Python question.
--
Maybe I should create a private AMI to use for my question No. 1, assuming I
use the default instance type as the base image... Has anyone tried this?
--
I created an EC2 cluster using the spark-ec2 command. If I run the pi.py example
in the cluster without using the example.jar, it works. But if I add the
example.jar as the driver class path (something like the following), it fails
with an exception. Could anyone help with this? What is the cause of the problem?
Fixed by recompiling. Thanks.
--
What could cause this type of 'stage failure'? Thanks!
This is a simple PySpark script to list data in HBase.
Command line: ./spark-submit --driver-class-path
~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py
14/10/21 17:53:50 INFO BlockManagerInfo: Added broadcast_2_pi
Maybe set up an hbase.jar in the conf?
--
Thanks for the help!
Hadoop version: 2.3.0
HBase version: 0.98.1
Using Python to read/write data from/to HBase.
The only change over the official Spark 1.1.0 is the pom file under examples.
Compilation:
spark: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean
package
spark/examples: mv
Thanks, Daniil! If I use --spark-git-repo, is there a way to specify the mvn
command-line parameters? Like the following:
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests
clean package
--
I modified the pom files in my private repo to use those parameters as
defaults to solve the problem. But after the deployment, I found that the
installed version was not the customized version, but an official one. Could
anyone please give a hint on how spark-ec2 works with Spark from private repos?
--
I was working on a 'hack':
I modified spark-ec2.py (to point to my own spark-ec2 repo at GitHub), and
then built a customized spark-ec2 repo (copied from GitHub), then modified
the init.sh under the spark folder so that the master node downloads the
customized Spark from S3. The Spark was successfully ins
I have a Python program that got an exception when using --jars, but it works
fine when using --driver-class-path (with that examples.jar). What's the
difference between these two args? Thanks!
--
Hi,
I got this error when running Spark 1.1.0 to read HBase 0.98.1 through simple
Python code in an EC2 cluster. The same program runs correctly in local mode,
so this error only happens when running in a real cluster.
Here's what I got:
14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in
The worker side has an error message like this:
14/10/30 18:29:00 INFO Worker: Asked to launch executor
app-20141030182900-0006/0 for testspark_v1
14/10/30 18:29:01 INFO ExecutorRunner: Launch command: "java" "-cp"
"::/root/spark-1.1.0/conf:/root/spark-1.1.0/assembly/target/scala-2.10/spark-assembly-1.
Hi, I saw the same issue as this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html
Does anyone have a fix for this bug? Please?!
The log info on my worker node is like:
14/10/30 20:15:18 INFO Worker: Asked to kill executor
app-20141030201514-0
I followed this:
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602
but the problem was not fixed.
--
Does anyone have experience or advice to fix this problem? Highly appreciated!
--
Hello there,
I just set up an EC2 cluster with no HDFS, Hadoop, or HBase whatsoever. I just
installed Spark to read/process data from an HBase in a different cluster.
The Spark was built against the HBase/Hadoop versions in the remote (EC2)
HBase cluster, which are 0.98.1 and 2.3.0 respectively.
But I got
The problem is solved. I basically built a fat Spark jar that includes all the
HBase stuff, and sent the examples.jar over to the slaves too.
--
Thanks for the solution! I did figure out how to create an .egg file to ship
out to the workers. Using IPython seems to be another cool solution.
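For anyone searching later, here is a rough sketch of the approach, assuming the
egg is already built and importable on the workers' platform. The module and egg
names are hypothetical; passing --py-files to spark-submit achieves the same thing.

from pyspark import SparkContext

sc = SparkContext(appName="ship-egg")

# Ship the prebuilt egg to every worker; tasks can then import it as usual.
sc.addPyFile("/path/to/mycythonlib-0.1-py2.7.egg")

def use_lib(x):
    import mycythonlib  # resolved from the shipped egg on the worker
    return mycythonlib.fast_func(x)  # hypothetical function in the library

print(sc.parallelize([1, 2, 3]).map(use_lib).collect())
sc.stop()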
--
Hello there,
I am wondering how to get the column family names and column qualifier names
when using PySpark to read an HBase table with multiple column families.
I have an HBase table as follows:
hbase(main):007:0> scan 'data1'
ROW COLUMN+CELL
I checked the source and found the following:
class HBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    Bytes.toStringBinary(result.value())
  }
}
I feel using 'result.value()' here is a big limitation
I just wrote a custom converter in Scala to replace HBaseResultToStringConverter.
Just a couple of lines of code.
--
Hi,
This is my code:
import org.apache.hadoop.hbase.CellUtil
/**
 * JF: convert a Result object into a string with column family and
 * qualifier names. Something like
 * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2'
 * etc. k-v pairs are separated by ';'. Different colum
Hi Nick,
I saw that the HBase API has gone through lots of changes. If I remember
correctly, the default HBase in Spark 1.1.0 is 0.94.6. The one I am using is
0.98.1. To get the column family names and qualifier names, we need to call
different methods for these two different versions. I don't know how
It seems SparkContext is a good fit to be used with 'with' in Python. A context
manager will do.
Example:
with SparkContext(conf=conf, batchSize=512) as sc:
Then it is no longer necessary to write sc.stop().
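A complete, minimal example of that pattern, assuming a Spark version whose
SparkContext implements the context-manager protocol:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("with-statement demo")

with SparkContext(conf=conf, batchSize=512) as sc:
    print(sc.parallelize(range(10)).sum())
# sc.stop() is called automatically when the block exits.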
--
Hi,
I am wondering how to write logging info from a worker when running a PySpark
app. I saw the thread
http://apache-spark-user-list.1001560.n3.nabble.com/logging-in-pyspark-td5458.html
but did not see a solution. Does anybody know a solution? Thanks!
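One workaround I can sketch, under the assumption that writing to the executor's
stderr is acceptable: configure Python's logging module inside the function that
runs on the workers, and look for the messages in each executor's stderr file
under the worker's work directory.

import logging
from pyspark import SparkContext

sc = SparkContext(appName="worker-logging")

def process_partition(records):
    # Configure logging inside the task so it happens on the executor,
    # not on the driver; messages end up in the executor's stderr log.
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("worker")
    n = 0
    for r in records:
        n += 1
    log.info("processed %d records in this partition", n)
    yield n

print(sc.parallelize(range(1000), 8).mapPartitions(process_partition).sum())
sc.stop()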
--
Hi, I'm wondering if anyone could help with this. We use an EC2 cluster to run
Spark apps in standalone mode. The default log info goes to /$spark_folder/work/.
This folder is on the 10 GB root fs, so it won't take long to fill up the
whole fs.
My goals are:
1. Move the logging location to /mnt, where we hav
cat spark-env.sh
--
#!/usr/bin/env bash
export SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time
-Dspark.executor.logs.rolling.time.interval=daily
-Dspark.executor.logs.rolling.maxRetainedFiles=3"
export SPARK_LOCAL_DIRS=/mnt/spark
export SPARK_WORKER_DIR=/mnt/spark
--
Bu
Hi,
I installed the CDH 5.3.0 core + HBase in a new EC2 cluster. Then I manually
installed Spark 1.1 on it. But when I started the slaves, I got an error as
follows:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
Error: Could not find or load main class s.rolling.maxReta
Could anyone share your experience on how to do this?
I have created a cluster and installed CDH 5.3.0 on it with basically core +
HBase, but Cloudera installed and configured Spark in its parcels
anyway. I'd like to install our custom Spark on this cluster to use the
Hadoop and HBase s
I installed the custom Spark in standalone mode as normal. The master and slaves
started successfully.
However, I got an error when I ran a job. It seems to me from the error message
that some library was compiled against Hadoop 1, but my Spark was compiled
against Hadoop 2.
15/01/08 23:27:36 INFO ClientC
I ran the release Spark in CDH 5.3.0 but got the same error. Has anyone tried to
run Spark in CDH 5.3.0 using its newAPIHadoopRDD?
Command:
spark-submit --master spark://master:7077 --jars
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar
./sp
A CDH 5.3.0 cluster with Spark is set up. I'm just wondering how to run a Python
application on it.
I used 'spark-submit --master yarn-cluster ./loadsessions.py' but got the
error:
Error: Cluster deploy mode is currently not supported for python
applications.
Run with --help for usage help or --verbose for d
Got help from Marcelo and Josh. Now it is running smoothly. In case you need
this info: just use "yarn-client" instead of "yarn-cluster".
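For reference, the working invocation then looks something like this (same script
as above):
spark-submit --master yarn-client ./loadsessions.py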
Thanks, folks!
--
I think Sampo's thought is to get a function that only tests whether an RDD is
empty. He does not want to know the size of the RDD, and getting the size of
an RDD is expensive for large data sets.
I myself have seen many times that my app threw exceptions because an empty
RDD cannot be saved. This is not
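A minimal sketch of the cheap check: take(1) only scans as many partitions as
needed, unlike count(); newer Spark releases also have RDD.isEmpty(). The output
path is hypothetical.

from pyspark import SparkContext

sc = SparkContext(appName="empty-rdd-check")

# A pipeline that may produce nothing (this filter deliberately drops everything).
rdd = sc.parallelize(range(100)).filter(lambda x: x > 1000)

# take(1) stays cheap even for large data sets.
if len(rdd.take(1)) == 0:
    print("RDD is empty, skipping the save")
else:
    rdd.saveAsTextFile("/tmp/output")  # hypothetical output path

sc.stop()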