Hi Tathagata,
Thanks very much for this tip and for the quick reply. It absolutely did the
trick!
One small note for others: make sure the command-line invocation of the
spark-submit script also sets the "--master" argument to "local[n]" with
n > 1.
Thanks again,
Rishi
On Aug 28, 2014, a
I set partitions to 64:
//
kInMsg.repartition(64)
val outdata = kInMsg.map(x=>normalizeLog(x._2,configMap))
//
Still see all activity only on the two nodes that seem to be receiving
from Kafka.
On Thu, Aug 28, 2014 at 5:47 PM, Tim Smith wrote:
> TD - Apologies, didn't realize I was replying t
I upped the ulimit to 128k files on all nodes. The job crashed again with
"DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275".
I couldn't get the logs because I killed the job, and it looks like YARN
wipes the container logs (not sure why it wipes the logs under
/var/log/hadoop-yarn/container).
OK, but what if I have a long list? Do I need to hard-code every element of my
list like this, or is there a function that translates a list into a tuple?
On Fri, Aug 29, 2014 at 3:24 AM, Michael Armbrust
wrote:
> You don't need the Seq, as in is a variadic function.
>
> personTable.where('name i
Hey Michael, Cheng,
Thanks for the replies. Sadly I can't remember the specific error so I'm
going to chalk it up to user error, especially since others on the list
have not had a problem.
@michael By the way, was at the Spark 1.1 meetup yesterday. Great event,
very informative, cheers and keep d
hi, dear
Thanks for the response. Some comments below. And yes, I am using Spark on
YARN.
1. The Spark release docs say that multiple jobs can be submitted in one application
if the jobs (actions) are submitted by different threads. I wrote some Java thread
code in the driver, one action in each thread, a
Hi ,
I am working with pyspark and doing simple aggregation
def doSplit(x):
    y = x.split(',')
    if (len(y) == 3):
        return y[0], (y[1], y[2])

counts = lines.map(doSplit).groupByKey()
output = counts.collect()
Iterating over the output, I get the data in this format: u'1385501280'
Filed SPARK-3295.
On Mon, Aug 25, 2014 at 12:49 PM, Michael Armbrust
wrote:
>> SO I tried the above (why doesn't union or ++ have the same behavior
>> btw?)
>
>
> I don't think there is a good reason for this. I'd open a JIRA.
>
>>
>> and it works, but is slow because the original Rdds are not
>
Change your conf/spark-env.sh:
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export YARN_CONF_DIR="/etc/hadoop/conf"
Date: Thu, 28 Aug 2014 16:19:05 -0700
From: ml-node+s1001560n13074...@n3.nabble.com
To: linkpatrick...@live.com
Subject: problem connection to hdfs on localhost from spark-shell
Hi,
Please see the answers following each question. If there's any mistake, please
let me know. Thanks!
I am not sure which mode you are running in, so I will assume you are using the
spark-submit script to submit Spark applications to a Spark cluster
(Spark standalone or YARN).
1. how to start 2 or more
Hi,
You can set the settings in conf/spark-env.sh like this:
export SPARK_LIBRARY_PATH=/usr/lib/hadoop/lib/native/
SPARK_JAVA_OPTS+="-Djava.library.path=$SPARK_LIBRARY_PATH "
SPARK_JAVA_OPTS+="-Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec "
SPARK_JAVA_OPTS+="-Dio.compress
Click the Storage tab. You have some (tiny) RDD persisted in memory.
On Fri, Aug 29, 2014 at 5:58 AM, SK wrote:
> Hi,
> Thanks for the responses. I understand that the second values in the Memory
> Used column for the executors add up to 95.5 GB and the first values add up
> to 17.3 KB. If 95.5
It did. It failed and was respawned 4 times.
In this case, the "too many open files" error is a sign that you need to
increase the system-wide limit on open files.
Try adding ulimit -n 16000 to your conf/spark-env.sh.
TD
On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith wrote:
> Appeared after running for a whi
Hi,
Thanks for the responses. I understand that the second values in the Memory
Used column for the executors add up to 95.5 GB and the first values add up
to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is
17.3 KB? Is that the memory used for shuffling operations? For non
(Please ignore if duplicated)
Hi,
I use Spark 1.0.2 with Hive 0.13.1
I have already set the Hive MySQL database to latin1;
mysql:
alter database hive character set latin1;
Spark:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.hql("create table tes
Hello
I'm new to Spark and playing around, but saw the following error. Could
anyone help with it?
Thanks
Gary
scala> c
res15: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[7] at flatMap at
<console>:23
scala> group
res16: org.apache.spark.rdd.RDD[(String, Iterable[String])] =
MappedValuesRDD[5] a
I’m currently using the Spark 1.1 branch and have been able to get the Thrift
service up and running. The quick question was whether I should be able to use
the Thrift service to connect to SparkSQL-generated tables and/or Hive tables.
As well, by any chance do we have any documents that point
Hi,
By default, Spark uses approximately 60% of the executor heap memory to store
RDDs. That's why you have 8.6 GB instead of 16 GB. 95.5 GB is therefore the sum
of all the executors' 8.6 GB plus the driver memory.
Best,
Burak
- Original Message -
From: "SK"
To: u...@spark.incubator.ap
Each executor reserves some memory for storing RDDs in memory, and
some for executor operations like shuffling. The number you see is
memory reserved for storing RDDs, and defaults to about 0.6 of the
total (spark.storage.memoryFraction).
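As a hedged rule of thumb (the exact accounting varies a bit across Spark versions
and with how the JVM reports its heap), roughly 0.6 of a 16 GB executor heap is the
~8.6 GB per executor shown in the UI in this thread. The fraction can be tuned
through SparkConf; a minimal sketch with placeholder app name and master:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MemoryFractionExample")           // placeholder app name
  .setMaster("local[2]")                         // placeholder master
  .set("spark.storage.memoryFraction", "0.6")    // raise to cache more RDDs, lower to leave room for execution
val sc = new SparkContext(conf)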
On Fri, Aug 29, 2014 at 2:32 AM, SK wrote:
> Hi,
>
> I am
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that
method, just turn the map into an RDD:
`sc.parallelize(x.toSeq).saveAsTextFile(...)`
Reading through the api-docs will present you many more alternate solutions!
Best,
Burak
- Original Message -
From: "SK"
Hi list,
Any change on this one? I think I have seen a lot of work being done on this
lately but I am unable to forge a working solution from jira tickets. Any
example would be highly appreciated.
Tomas
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Sorti
Hey Sparkies...
I have an odd "bug".
I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in REPL)
After a bunch of processing, I tell spark to save my rdd to S3 using:
rdd.saveAsSequenceFile(uri,codec)
That line of code hangs. By hang I mean
(a) Spark stages UI shows no update on
Anyone? No customers using streaming at scale?
From: Steve Nunez
Date: Wednesday, August 27, 2014 at 9:08
To: "user@spark.apache.org"
Subject: Reference Accounts & Large Node Deployments
> All,
>
> Does anyone have specific references to customers, use cases and large-scale
> deployments
Hi,
While using HiveContext:
If the Hive table is created as test_datatypes(testbigint bigint, ss bigint), the
select below works fine.
For "create table test_datatypes(testbigint bigint, testdec decimal(5,2))":
scala> val dataTypes=hiveContext.hql("select * from test_datatypes")
14/08/28 21:18:44 INFO p
Hi,
I fixed the issue by copying libsnappy.so into the Java JRE.
Regards
Arthur
On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com
wrote:
> Hi,
>
> If change my etc/hadoop/core-site.xml
>
> from
>
> io.compression.codecs
>
> org.apache.hadoop.io.compress.SnappyCodec,
>
hi, guys
I am trying to understand how Spark's concurrency model works. I read the
following from https://spark.apache.org/docs/1.0.2/job-scheduling.html
quote
" Inside a given Spark application (SparkContext instance), multiple parallel
jobs can run simultaneously if they were submitted from sep
Hi,
I am using a cluster where each node has 16GB (this is the executor memory).
After I complete an MLlib job, the executor tab shows the following:
Memory: 142.6 KB Used (95.5 GB Total)
and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB
(this is different for differe
You don't need the Seq, as in is a variadic function.
personTable.where('name in ("foo", "bar"))
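If you have a whole Seq of values rather than literals, Scala's varargs expansion
(": _*") is the usual trick for passing a collection to a variadic function;
whether the DSL's implicit conversions accept it directly is untested on my end.
A plain Scala sketch of the mechanism, with a made-up function:
def takeAll(values: String*): Int = values.length   // made-up variadic function
val names = Seq("foo", "bar", "baz")
takeAll(names: _*)                                   // expands the Seq, same as takeAll("foo", "bar", "baz")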
On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa
wrote:
> Hi all,
>
> What is the expression that I should use with spark sql DSL if I need to
> retreive
> data with a field in a given set.
> Fo
The comma is just the way the default toString works for Row objects.
Since SchemaRDDs are also RDDs, you can do arbitrary transformations on
the Row objects that are returned.
For example, if you'd rather the delimiter was '|':
sql("SELECT * FROM src").map(_.mkString("|")).collect()
On Thu, A
TD - Apologies, didn't realize I was replying to you instead of the list.
What does "numPartitions" refer to when calling createStream? I read an
earlier thread that seemed to suggest that numPartitions translates to
partitions created on the Spark side?
http://mail-archives.apache.org/mod_mbox/in
Appeared after running for a while. I re-ran the job and this time, it
crashed with:
14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for
stream 0: Error in block pushing thread - java.net.SocketException: Too
many open files
Shouldn't the failed receiver get re-spawned on a diff
Hi,
If I change my etc/hadoop/core-site.xml
from
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec,
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
to
great question, wei. this is very important to understand from a
performance perspective. and this extends beyond kinesis - it applies to any
streaming source that supports shards/partitions.
i need to do a little research into the internals to confirm my theory.
lemme get back to you!
-chris
If you are repartitioning to 8 partitions, and your nodes happen to have at
least 4 cores each, it's possible that all 8 partitions are assigned to only
2 nodes. Try increasing the number of partitions. Also make sure you have
executors (allocated by YARN) running on more than two nodes if you want t
I have HDFS servers running locally, and "hadoop dfs -ls /" runs fine.
From spark-shell I do this:
val lines = sc.textFile("hdfs:///test")
and I get this error message.
java.io.IOException: Failed on local exception: java.io.EOFException; Host
Details : local host is: "localhost.locald
Do you see this error right at the beginning or after running for some time?
The root cause seems to be that somehow your Spark executors got killed,
which killed receivers and caused further errors. Please try to take a look
at the executor logs of the lost executor to find what is the root cause
val map = Map("foo" -> 1, "bar" -> 2, "baz" -> 3)
val rdd = sc.parallelize(map.toSeq)
rdd is an RDD[(String, Int)] and you can do what you like from there.
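For example, to answer the original saveAsTextFile question, one option (the
output path below is a placeholder) is to format each pair as a line before
writing:
rdd.map { case (k, v) => k + "," + v }.saveAsTextFile("hdfs:///tmp/map_output")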
On Thu, Aug 28, 2014 at 11:56 PM, SK wrote:
> Hi,
>
> How do I convert a Map object to an RDD so that I can use the
> saveAsTextFile() oper
Hi,
How do I convert a Map object to an RDD so that I can use the
saveAsTextFile() operation to output the Map object?
thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/transforming-a-Map-object-to-RDD-tp13071.html
Sent from the Apache Spark User Li
Hi,
Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with:
14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at
ReceiverTracker.scala:275
Exception in thread "Thread-59" 14/08/28 22:28:15 INFO
YarnClientClusterScheduler: Cancelling stage 2
14/08/28 22:28:15 INFO DAGSch
Hi,
In my streaming app, I receive from kafka where I have tried setting the
partitions when calling "createStream" or later, by calling repartition -
in both cases, the number of nodes running the tasks seems to be stubbornly
stuck at 2. Since I have 11 nodes in my cluster, I was hoping to use mo
Hello Daniel, if you're not using Hadoop, then why do you want to grab the Hadoop
package? CDH5 will download all the Hadoop packages and Cloudera Manager too.
Just curious: what happens if you start Spark on an EC2 cluster? What does it
choose as the default data store?
-Sanjeev
From: Daniel Siegmann [
To emulate a Mapper, flatMap() is exactly what you want. Since it
flattens, it means you return an Iterable of values instead of 1
value. That can be a Collection containing many values, or 1, or 0.
For a reducer, to really reproduce what a Reducer does in Java, I
think you will need groupByKey()
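A minimal sketch of both halves, assuming a word-count-style job (the input path
and tokenization are placeholders):
val lines = sc.textFile("hdfs:///tmp/input")             // assumed input path
val mapped = lines.flatMap { line =>
  line.split("\\s+").map(word => (word, 1))              // "Mapper": 0..n pairs per record
}
val reduced = mapped.reduceByKey(_ + _)                  // combining "Reducer" (associative)
val grouped = mapped.groupByKey()                        // full "Reducer": all values per key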
I assume your on-demand calculations are a streaming flow? If the data you
aggregated from the batch job isn't too large, maybe you should just save it to
disk; when your streaming flow starts you can read the aggregations back
from disk and perhaps just broadcast them. Though I guess you'd have to
restart yo
If you aren't using Hadoop, I don't think it matters which you download.
I'd probably just grab the Hadoop 2 package.
Out of curiosity, what are you using as your data store? I get the
impression most Spark users are using HDFS or something built on top.
On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev
On Thu, Aug 28, 2014 at 7:00 AM, Rok Roskar wrote:
> I've got an RDD where each element is a long string (a whole document). I'm
> using pyspark so some of the handy partition-handling functions aren't
> available, and I count the number of elements in each partition with:
>
> def count_partitio
Hi,
I'm building a system for near real-time data analytics. My plan is to have
an ETL batch job which calculates aggregations running periodically. User
queries are then parsed for on-demand calculations, also in Spark. Where are
the pre-calculated results supposed to be saved? I mean, after fini
Try using "local[n]" with n > 1, instead of local. Since receivers take up
1 slot, and "local" is basically 1 slot, there is no slot left to process
the data. That's why nothing gets printed.
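For reference, a minimal sketch of setting this in code (app name and batch
interval are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")  // at least 2 slots: receiver + processing
val ssc = new StreamingContext(conf, Seconds(10))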
TD
On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J) <
rishi.ve...@jpl.nasa.gov> wrote:
> Hi Folks,
Hello there,
I have a basic question on the download: which option do I need to
download for a standalone cluster?
I have a private cluster of three machines on CentOS. When I click on
download it shows me the following:
Download Spark
The latest release is Spark 1.0.2, released August 5, 2014 (re
But then if you want to generate ids that are unique across ALL the records
that you are going to see in a stream (which can be potentially infinite),
then you definitely need a number space larger than long :)
TD
On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar
wrote:
> Yes, that is an option
Yes, that is an option.
I started with a function of batch time and index to generate the id as a Long. This
may be faster than generating a UUID, with the added benefit of sorting based on time.
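A minimal sketch of that idea (the names and the bit split are illustrative
assumptions, not my exact code):
def makeId(batchTimeMs: Long, indexInBatch: Long): Long =
  (batchTimeMs << 20) | (indexInBatch & 0xFFFFFL)   // sorts by time; assumes fewer than ~1M elements per batch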
- Original Message -
From: "Tathagata Das"
To: "Soumitra Kumar"
Cc: "Xiangrui Meng" , user@spark.apac
Thanks for the reply.
I was looking for something for the case when it's running outside of the
Spark framework. If I declare a SparkContext or an RDD, can it print
some messages to the log?
The problem I have is that if I print something from the Scala object that runs
the Spark app, it does
Hi,
Thanks for the response. I tried to use countByKey. But I am not able to
write the output to console or to a file. Neither collect() nor
saveAsTextFile() work for the Map object that is generated after
countByKey().
val x = sc.textFile(baseFile).map { line =>
  val field
I was able to recently solve this problem for standalone mode. For this mode,
I did not use a history server. Instead, I set spark.eventLog.dir (in
conf/spark-defaults.conf) to a directory in hdfs (basically this directory
should be in a place that is writable by the master and accessible globally
In many cases when I work with MapReduce, my mapper or my reducer might
take a single value and map it to multiple keys.
The reducer might also take a single key and emit multiple values.
I don't think that functions like flatMap and reduceByKey will work, or are
there tricks I am not aware of?
I'm not sure if this is the case, but basic monitoring is described here:
https://spark.apache.org/docs/latest/monitoring.html
If it comes to something more sophisticated I was for example able to save
some messages into local logs and view them in YARN UI via http by editing
spark source code (use
hey guys,
i'm still trying to get used to compiling and running the example code.
why does the run_example script submit the class with an
org.apache.spark.examples prefix in front of the class name itself?
probably a stupid question, but i would be glad if someone explained it.
by the way.. how was the "spark...example..
Hi,
my check native result:
hadoop checknative
14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize
native-bzip2 library system-native, will use pure-Java version
14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
Native library checki
Thank you Debasish.
I am fine with either Scala or Java. I would like to get a quick
evaluation on the performance gain, e.g., ALS on GPU. I would like to try
whichever library does the business :)
Best regards,
Wei
-
Wei Tan, PhD
Research Staff Member
IBM T. J.
Hi,
I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and
HBase.
With default setting in Spark 1.0.2, when trying to load a file I got "Class
org.apache.hadoop.io.compress.SnappyCodec not found"
Can you please advise how to enable snappy in Spark?
Regards
Arthur
scala> i
Hi Ted,
I downloaded pom.xml to examples directory.
It works, thanks!!
Regards
Arthur
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [2.119s]
[INFO] Spark Project
@bharat-
overall, i've noticed a lot of confusion about how Spark Streaming scales -
as well as how it handles failover and checkpointing, but we can discuss
that separately.
there's actually 2 dimensions to scaling here: receiving and processing.
*Receiving*
receiving can be scaled out by subm
Sorry for the extremely late reply. It turns out that the same error occurred
when running on yarn. However, I recently updated my project to depend on
cdh5 and the issue I was having disappeared and I am no longer setting the
userClassPathFirst to true.
--
View this message in context:
http:/
bq. Spark 1.0.2
For the above release, you can download pom.xml attached to the JIRA and
place it in examples directory
I verified that the build against 0.98.4 worked using this command:
mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1
-DskipTests clean package
Patch v5 is
Hi Yana,
The fact is that the DB writing happens at the node level and not at the Spark
level. One of the benefits of Spark's distributed nature is that it enables IO
distribution as well. For example, it is much faster to have the nodes write to
Cassandra instead of having them all collected
Hi All,
@Andrew
Thanks for the tips. I just built the master branch of Spark last
night, but am still having problems viewing history through the
standalone UI. I dug into the Spark job events directories as you
suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and
'EVENT_LOG_1'; for appli
Hi Folks,
I’d like to find out tips on how to convert the RDDs inside a Spark Streaming
DStream to a set of SchemaRDDs.
My DStream contains JSON data pushed over from Kafka, and I’d like to use
SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as
a table, and perform
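What I have in mind is roughly the following untested sketch (the DStream
variable and table name are placeholders):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
jsonDStream.foreachRDD { rdd =>                        // jsonDStream: DStream[String] from Kafka
  val schemaRdd = sqlContext.jsonRDD(rdd)              // infer the schema from the JSON strings
  schemaRdd.registerTempTable("events")                // placeholder table name
  sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
}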
Hi
I'd like to announce a couple of updates to the SparkR project. In order to
facilitate better collaboration for new features and development we have a
new mailing list, issue tracker for SparkR.
- The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and
we have migrated all ex
Hi,
patch -p1 -i spark-1297-v5.txt
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git docs/building-with-maven.md docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- docs/bui
I attached patch v5 which corresponds to the pull request.
Please try again.
On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:
> Hi,
>
> I have just tried to apply the patch of SPARK-1297:
> https://issues.apache.org/jira/browse/SPARK-1297
>
> There ar
Hi,
I have just tried to apply the patch of SPARK-1297:
https://issues.apache.org/jira/browse/SPARK-1297
There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt
respectively.
When applying the 2nd one, I got "Hunk #1 FAILED at 45"
Can you please advise how to fix it in order
Breeze author David also has a GitHub project on CUDA bindings in
Scala. Do you prefer using Java or Scala?
On Aug 27, 2014 2:05 PM, "Frank van Lankvelt"
wrote:
> you could try looking at ScalaCL[1], it's targeting OpenCL rather than
> CUDA, but that might be close enough?
>
> cheers, Frank
>
Hi all,
Just wondering if there is a way to use logging to print some additional info
to the Spark logs (similar to debug in Scalding).
Thanks,
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html
Sent from the Apache Spark User List
Hi All,
Is there any way to change the delimiter from being a comma ?
Some of the strings in my data contain commas as well, making it very
difficult to parse the results.
Yadid
The following error is given when I try to add this line:
[info] Set current project to Simple Project (in build
file:/C:/Users/D062844/Desktop/Hand
sOnSpark/Install/spark-1.0.2-bin-hadoop2/)
[info] Compiling 1 Scala source to
C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0
.2-bin-hadoop
You should set this as early as possible in your program, before other
code runs.
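For example, a minimal sketch (the object name and master are placeholders):
object SimpleApp {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "d:\\winutil\\")  // set before the SparkContext is created
    val sc = new org.apache.spark.SparkContext("local", "SimpleApp")
    // ... rest of the program
  }
}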
On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet
wrote:
> Thank you Sean and Guru for giving the information. Btw I have to put this
> line, but where should I add it? In my scala file below other ‘import …’
> lin
Thank you Sean and Guru for the information. By the way, I have to put this
line, but where should I add it? In my Scala file, below where the other
'import ...' lines are written?
System.setProperty("hadoop.home.dir", "d:\\winutil\\")
Thank you
Vineet
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Donner
Thanks Sean.
Looks like there is a workaround as per the JIRA
https://issues.apache.org/jira/browse/SPARK-2356 .
http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7.
Maybe that's worth a shot?
On Aug 28, 2014, at 8:15 AM, Sean Owen wrote:
> Yes, but I think at the moment t
It's a bit like the analogy between a singly linked list and a doubly linked list:
it might introduce overhead in terms of memory usage, but you could use two
directed edges to substitute for each undirected edge.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-undirected-graph-
I've got an RDD where each element is a long string (a whole document). I'm
using pyspark so some of the handy partition-handling functions aren't
available, and I count the number of elements in each partition with:
def count_partitions(id, iterator):
    c = sum(1 for _ in iterator)
    yiel
I didn't see that problem.
Did you run this command ?
mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests
clean package
Here is what I got:
TYus-MacBook-Pro:spark-1.0.2 tyu$ sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to
/Users/tyu/spark-1.0.2/sbi
Can you clarify the scenario:
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(checkpointDirectory)
val stream = KafkaUtils.createStream(...)
val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L))
val wordDstream= wordCounts.updateStateByKey[Int](updateFunc)
wo
Hi there,
I'm trying to run the JavaSparkPi example on YARN with master = yarn-client, but
I have a problem.
Submitting the application goes smoothly, and the first container for the
Application Master works too.
When the job is starting and there are some tasks to do, I'm getting this warning
on the console (I'm us
Yes, but I think at the moment there is still a dependency on Hadoop even
when not using it. See https://issues.apache.org/jira/browse/SPARK-2356
On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani wrote:
> Can you copy the exact spark-submit command that you are running?
>
> You should be able to r
Can you copy the exact spark-submit command that you are running?
You should be able to run it locally without installing hadoop.
Here is an example of how to run the job locally.
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master
How can I set HADOOP_HOME if I am running Spark on my local machine without
anything else? Do I have to install some other pre-built file? I am on Windows
7, and Spark's official site says that it is available on Windows; I added the
Java path to the PATH variable.
Vineet
From: Sean Owen [mailt
If you mean your values are all a Seq or similar already, then you just
take the top 1 ordered by the size of the value:
rdd.top(1)(Ordering.by(_._2.size))
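For instance, with made-up data:
val rdd = sc.parallelize(Seq("a" -> Seq(1, 2, 3), "b" -> Seq(4)))
rdd.top(1)(Ordering.by(_._2.size)).head   // ("a", Seq(1, 2, 3)): the key with the largest value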
On Thu, Aug 28, 2014 at 9:34 AM, Deep Pradhan
wrote:
> Hi,
> I have a RDD of key-value pairs. Now I want to find the "key" for which
> the
For your reference:
val d1 = textFile.map(line => {
  val fields = line.split(",")
  ((fields(0), fields(1)), fields(2).toDouble)
})
val d2 = d1.reduceByKey(_ + _)
d2.foreach(println)
2014-08-28 20:04 GMT+08:00 MEETHU MATHEW :
> Hi all,
>
> I have an RDD which has values in t
"0.98.2" is not an HBase version, but "0.98.2-hadoop2" is:
http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.hbase%22%20AND%20a%3A%22hbase%22
On Thu, Aug 28, 2014 at 2:54 AM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:
> Hi,
>
> I need to use Spark with HBase 0.98 an
It sounds like you are adding the same key to every element, and joining,
in order to accomplish a full cartesian join? I can imagine doing it that
way would blow up somewhere. There is a cartesian() method to do this maybe
more efficiently.
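As a small illustration (made-up RDDs):
val rdd1 = sc.parallelize(Seq(1, 2, 3))
val rdd2 = sc.parallelize(Seq("a", "b"))
rdd1.cartesian(rdd2)   // RDD[(Int, String)] with 3 * 2 = 6 pairs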
However if your data set is large, this sort of algorith
Hi all,
I have an RDD which has values in the format "id,date,cost".
I want to group the elements based on the id and date columns and get the sum
of the cost for each group.
Can somebody tell me how to do this?
Thanks & Regards,
Meethu M
On 08/28/2014 07:20 AM, marylucy wrote:
fileA=1 2 3 4 one number a line,save in /sparktest/1/
fileB=3 4 5 6 one number a line,save in /sparktest/2/
I want to get 3 and 4
var a = sc.textFile("/sparktest/1/").map((_,1))
var b = sc.textFile("/sparktest/2/").map((_,1))
a.filter(param=>{b.lookup(p
You need to set HADOOP_HOME. Is Spark officially supposed to work on
Windows or not at this stage? I know the build doesn't quite yet.
On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet <
vineet.hingor...@sap.com> wrote:
> The file is compiling properly but when I try to run the jar file using
fileA=1 2 3 4 one number a line,save in /sparktest/1/
fileB=3 4 5 6 one number a line,save in /sparktest/2/
I want to get 3 and 4
var a = sc.textFile("/sparktest/1/").map((_,1))
var b = sc.textFile("/sparktest/2/").map((_,1))
a.filter(param=>{b.lookup(param._1).length>0}).map(_._1).foreach(prin
Hi,
I tried to start Spark but failed:
$ ./sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to
/mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out
failed to launch org.apache.spark.deploy.master.Master:
Failed to find Spa
Not sure if this makes sense, but maybe it would be nice to have a kind of "flag"
available within the code that tells me whether I'm running in a "normal"
situation or during a recovery.
To better explain this, let's consider the following scenario:
I am processing data, let's say from a Kafka stream, a
The file compiles properly, but when I try to run the jar file using
spark-submit, it gives some errors. I am running Spark locally and have
downloaded a pre-built version of Spark named "For Hadoop 2 (HDP2, CDH5)". I
don't know if it is a dependency problem, but I don't want to have Hado
I see 0.98.5 in dep.txt
You should be good to go.
On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:
> Hi,
>
> tried
> mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests
> dependency:tree > dep.txt
>
> Attached the dep. txt for your
Hi,
tried
mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt
Attached the dep.txt for your information.
[WARNING]
[WARNING] Some problems were encountered while building the effective settings
[WARNING] Unrecognised tag: 'mirrors' (position: START_TAG s
Maybe you can refer to the sliding method on RDDs, but right now it's an
MLlib-private method.
Look at org.apache.spark.mllib.rdd.RDDFunctions.
2014-08-26 12:59 GMT+08:00 Vida Ha :
> Can you paste the code? It's unclear to me how/when the out of memory is
> occurring without seeing the code.
>
>
>
>
> On
Thanks for the reply. Sorry I could not follow up on this earlier.
Trying to use a Parquet file is not working at all.
case class Rec(name:String,pv:Int)
val sqlContext=new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val d1=sc.parallelize(Array(("a",10),("b",3))).map(e=>Rec(e._1