This is where you can get started
https://spark.apache.org/docs/latest/sql-programming-guide.html
Thanks
Best Regards
On Mon, Jul 13, 2015 at 3:54 PM, vinod kumar vinodsachin...@gmail.com
wrote:
Hi Everyone,
I am developing an application which handles a bulk of data, around
millions (this may
1. Yes, open up the web UI running on 8080 to see the memory/cores allocated
to your workers, and open up the UI running on 4040 and click on the
Executor tab to see the memory allocated for the executor.
2. The MLlib code can be found here:
https://github.com/apache/spark/tree/master/mllib and
Why not add a trigger to your database table and, whenever it's updated, push
the changes to Kafka etc. and use normal Spark Streaming? You can also write
a receiver-based architecture
https://spark.apache.org/docs/latest/streaming-custom-receivers.html for
this, but that will be a bit time consuming.
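If you go the trigger-to-Kafka route, a minimal sketch of consuming such a topic with the direct Kafka stream could look like this (broker and topic names are made up):
```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("db-change-stream")
val ssc = new StreamingContext(conf, Seconds(60))

// hypothetical broker/topic written to by the database trigger
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val changes = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("db_changes"))

changes.map(_._2).foreachRDD { rdd =>
  // process the change records here
  rdd.take(10).foreach(println)
}

ssc.start()
ssc.awaitTermination()
```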
I am also facing the same issue; has anyone figured it out? Please help.
Hi, Hari,
I don't think job-server can work with SparkR (or pySpark). It seems it would
be technically possible, but it needs support from job-server and SparkR (and
pySpark), which doesn't exist yet.
But there may be some indirect ways of sharing RDDs between SparkR and an
application. For
Hi,
I am new to Spark and need some guidance on the points mentioned below:
1) I am using Spark 1.2; is it possible to see how much memory is being
allocated to an executor from the web UI? If not, how can we figure that out?
2) I am interested in the source code of MLlib; is it possible to get access to
genericRecordsAndKeys.persist(StorageLevel.MEMORY_AND_DISK) with 17 as the
repartitioning argument is throwing this exception:
7/13 23:26:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
org.apache.spark.SparkException: Job aborted due to
FYI, another benchmark:
http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html
quote: I have observed a lot of fetch failures while running Spark, which
results in many restarted tasks and, therefore, takes the longest time. I
suspect that executors are incapable of
Hi Akhil
Is my choice to switch to Spark a good one? I don't have enough
information regarding the limitations and working environment of Spark.
I tried Spark SQL but it seems to return data slower than
MS SQL. (I have tested with data which has 4 records.)
On Tue, Jul 14, 2015 at
Hello community,
I want to run my Spark app on a cluster (Cloudera 5.4.4) with 3 nodes (one PC
has an i7 8-core with 16 GB RAM). Now I want to submit my Spark job on YARN (20 GB
RAM).
My script to submit the job is at the moment the following:
export HADOOP_CONF_DIR=/etc/hadoop/conf/
Look in the worker logs and see what's going on.
Thanks
Best Regards
On Tue, Jul 14, 2015 at 4:02 PM, Arthur Chan arthur.hk.c...@gmail.com
wrote:
Hi,
I use Spark 1.4. When saving the model to HDFS, I got an error.
Please help!
Regards
My Scala command:
Hi
At this moment we have the same requirement. Unfortunately, the database
owners will not be able to push to a message queue, but they have enabled Oracle
CDC, which synchronously updates a replica of the production DB. Our task will be
to query the replica and create message streams to Kinesis. There is already an
OK, thanks. It seems that --jars is not behaving as expected - getting class
not found for even the simplest object from my lib. But anyway, I have
to do at least a filter transformation before collecting the HBaseRDD into
R, so I will have to go the route of using the Scala spark shell to transform
Hi Everyone,
As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs.
I have built the UDFs in the Hive metastore. They work perfectly over a Hive
connection, but they do not work in Spark (java.lang.RuntimeException:
Couldn't find function DATE_FORMAT).
Could you please help how to
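One common way to make a Hive UDF visible to Spark SQL is to register it explicitly through a HiveContext; a rough sketch, with a made-up UDF class name and table, assuming the UDF jar is on the classpath (e.g. via --jars):
```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// 'com.example.hive.udf.DateFormat' is hypothetical; use the class the UDF was actually built with
hiveContext.sql("CREATE TEMPORARY FUNCTION DATE_FORMAT AS 'com.example.hive.udf.DateFormat'")

// hypothetical table/column, just to show the call
hiveContext.sql("SELECT DATE_FORMAT(order_date, 'yyyy-MM-dd') FROM orders").show()
```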
It might take some time to understand the ecosystem. I'm not sure about
what kind of environment you have (like #cores, memory etc.). To
start with, you can basically use a JDBC connector or dump your data as CSV
and load it into Spark and query it. You get the advantage of caching if
you
Hi,
I use Spark 1.4. When saving the model to HDFS, I got an error.
Please help!
Regards
My Scala command:
sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("/tmp/tweets/model")
The error log:
15/07/14 18:27:40 INFO SequenceFileRDDFunctions: Saving as sequence file of
type
I have just restarted the job and it doesn't seem that the shutdown hook is
executed. I have attached to this email the log from the driver. It seems
that the slaves are not accepting the tasks... but we haven't changed
anything on our Mesos cluster, we have only upgraded one job to Spark 1.4;
is
I'm still waiting for an answer. I want to know how to properly
close everything related to Spark in a Java standalone app.
Could you give more details about the misbehavior of --jars for SparkR? Maybe
it's a bug.
From: Michal Haris [michal.ha...@visualdna.com]
Sent: Tuesday, July 14, 2015 5:31 PM
To: Sun, Rui
Cc: Michal Haris; user@spark.apache.org
Subject: Re: Including additional
Hi all:
I have a question about why Spark on YARN needs extra memory.
I applied for 10 executors with executor memory 6g, and I find that it allocates 1g
more per executor, 7g in total per executor.
I tried to set spark.yarn.executor.memoryOverhead, but it did not help.
1g per executor is too
Hi, below is the log from the worker.
15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file
/spark/app-20150714171703-0004/5/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
at
Hi Shahid,
To be honest I think this question is better suited for Stack Overflow than
for a PhD thesis.
On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote:
Hi,
I have a 10-node cluster. I loaded the data onto HDFS, so the number of
partitions I get is 9. I am running a Spark
Cool thanks. Will take a look...
Sent from my iPhone
On Jul 13, 2015, at 6:40 PM, Michael Armbrust mich...@databricks.com wrote:
I'd look at the JDBC server (a long-running YARN job you can submit queries
to).
Hi
In our case, we have some data stored in an Oracle database table, and new
records will be added into this table. We need to analyse the new records to
calculate some values continuously, so we wrote a program to monitor the table
every minute. Because every record has an increasing unique ID
What do you need in SparkR that mllib / ml don't have? Most of the basic
analysis that you need on a stream can be done through mllib components...
On Jul 13, 2015 2:35 PM, Feynman Liang fli...@databricks.com wrote:
Sorry; I think I may have used poor wording. SparkR will let you use R to
Your question is very interesting. What I suggest is that you copy your output
to some text file, read the text file in your code and apply RDD operations. Just
consider the word count example from Spark. I love this example with the Java
client. Well, Spark is an analytical engine and its slogan is to analyze big, big data
Someone else also reported this error with spark 1.4.0
Thanks
Best Regards
On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan arthur.hk.c...@gmail.com
wrote:
Hi, below is the log from the worker.
15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file
A small correction to what I typed: it is not RDDBackend, it is RBackend. Sorry.
executor.memory only sets the maximum heap size of the executor; the JVM needs
non-heap memory to store class metadata, interned strings and other native
overheads coming from networking libraries, off-heap storage levels, etc. These
are (of course) legitimate uses of resources and you'll have
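For reference, the overhead can be tuned alongside the heap; a small sketch of the relevant settings (values are only illustrative; the overhead is specified in MB):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "6g")                 // executor heap
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra off-heap allowance YARN reserves per executor
```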
There was a fix for `--jars` that went into 1.4.1
https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6
Shivaram
On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui rui@intel.com wrote:
Could you give more details about the mis-behavior of --jars for SparkR?
maybe it's a
Hi,
Thanks for all the help.
I'm still missing something very basic.
If I won't use SparkR, which doesn't support streaming (I will use MLlib
instead, as Debasish suggested), and I have my Scala receiver working, how
should the receiver save the data in memory? I do see the store method, so
if I use
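For what it's worth, a minimal sketch of a custom receiver that hands data to Spark via store() (the fetch logic is just a placeholder):
```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // do the receiving on a separate thread so onStart() returns quickly
    new Thread("My Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val record = fetchNextRecord() // placeholder for however the data is pulled
          store(record)                  // store() hands the record to Spark for buffering/replication
        }
      }
    }.start()
  }

  def onStop(): Unit = {}

  private def fetchNextRecord(): String = ??? // hypothetical

}

// usage: val lines = ssc.receiverStream(new MyReceiver())
```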
Haven't you thought about Spark Streaming? There is a thread that could help:
https://www.mail-archive.com/user%40spark.apache.org/msg30105.html
On 14 July 2015 at 18:20, Hafsa Asif hafsa.a...@matchinguu.com wrote:
Your question is very interesting. What I suggest is that you copy your output
to
I use Spark Streaming where messages read from Kafka topics are stored into a
JavaDStream<String>; this DStream contains the actual data. Now after going through
the documentation and other help I have found that we traverse a JavaDStream using
foreachRDD:
javaDStreamRdd.foreachRDD(new Function<JavaRDD<String>, Void>() {
I have almost the same case. I will tell you what I am actually doing; if it
fits your requirement, then I will be glad to help you.
1. My database is Aerospike. I get data from it.
2. I have written a standalone Spark app (it does not run in standalone mode, but
with a simple java command or Maven
You are mixing the 1.0.0 Spark SQL jar with Spark 1.4.0 jars in your build file
Sent from my rotary phone.
On Jul 14, 2015, at 7:57 AM, ashwang168 ashw...@mit.edu wrote:
Hello!
I am currently using Spark 1.4.0, scala 2.10.4, and sbt 0.13.8 to try and
create a jar file from a scala file
Hi,
I have a 10-node cluster. I loaded the data onto HDFS, so the number of
partitions I get is 9. I am running a Spark application; it gets stuck on
one of the tasks. Looking at the UI, it seems the application is not using all nodes
to do calculations. Attached is the screenshot of the tasks; it seems tasks
I appreciate your reply.
Yes, you are right about putting the data in Parquet etc. and reading it from another app. I
would rather use spark-jobserver or the IBM kernel to achieve the same, if it is
not SparkR, as it gives more flexibility/scalability.
Anyway, I have found a way to run R for my POC from my existing app.
Try adding it to SPARK_CLASSPATH inside the conf/spark-env.sh file.
Thanks
Best Regards
On Tue, Jul 14, 2015 at 7:05 AM, Jerrick Hoang jerrickho...@gmail.com
wrote:
Hi all,
I have conf/hive-site.xml pointing to my Hive metastore but the spark-sql
CLI doesn't pick it up (copying the same
Can you paste your conf/spark-env.sh file? Put SPARK_MASTER_IP as the
master machine's host name in the spark-env.sh file. Also add your slaves'
hostnames to the conf/slaves file and do a sbin/start-all.sh.
Thanks
Best Regards
On Tue, Jul 14, 2015 at 1:26 PM, sivarani whitefeathers...@gmail.com
wrote:
Have you seen this SO thread:
http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin
This seems to be more related to the plugin than Spark, looking at the
stack trace
On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:
I m still looking
Hi
As you can see, Spark has taken data locality into consideration and thus
scheduled all tasks as node local. It is because spark could run task on a
node where data is present, so spark went ahead and scheduled the tasks. It
is actually good for reading. If you really want to fan out
Hello!
I am currently using Spark 1.4.0, Scala 2.10.4, and sbt 0.13.8 to try and
create a jar file from a Scala file (attached above) and run it using
spark-submit. I am also using Hive and Hadoop 2.6.0-cdh5.4.0, which has the
files that I'm trying to read in.
Currently I am very confused about how
How do you manage the spark context elastically when your load grows from
1000 users to 1 users ?
On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:
I have almost the same case. I will tell you what I am actually doing, if
it
is according to your requirement,
A more particular example:
I run pi.py Spark Python example in *yarn-cluster* mode (--master) through
SparkLauncher in Java.
While the program is running, these are the stats of how much memory each
process takes:
SparkSubmit process : 11.266 *gigabyte* Virtual Memory
ApplicationMaster process:
If your rows may have NAs in them, I would process each column individually
by first projecting the column ( map(x => x.nameOfColumn) ), filtering out
the NAs, then running a summarizer over each column.
Even if you have many rows, after summarizing you will only have a vector
of length #columns.
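A rough sketch of that per-column approach using the MLlib summarizer (the row type and column name are made up):
```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

case class MyRow(nameOfColumn: Double) // hypothetical row type

def summarizeColumn(rows: RDD[MyRow]): (Double, Double) = {
  val cleaned = rows.map(_.nameOfColumn).filter(!_.isNaN)          // project the column, drop NAs
  val summary = Statistics.colStats(cleaned.map(v => Vectors.dense(v)))
  (summary.mean(0), summary.variance(0))                           // one value per summarized column
}
```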
I am running a Spark application on a YARN-managed cluster.
When I specify --executor-cores 4 it fails to start the application.
I am starting the app as
spark-submit --class classname --num-executors 10 --executor-cores
5 --master masteradd jarname
Exception in thread main
Hi all,
If you want to launch a Spark job from Java in a programmatic way, then you
need to use SparkLauncher.
SparkLauncher uses ProcessBuilder to create the new process - Java seems to
handle process creation in an inefficient way.
When you execute a process, you must first fork() and then exec().
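A minimal SparkLauncher sketch (Spark home, jar path, class and master are placeholders):
```scala
import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setSparkHome("/opt/spark")               // hypothetical
  .setAppResource("/path/to/my-app.jar")    // hypothetical
  .setMainClass("com.example.MyApp")        // hypothetical
  .setMaster("yarn-cluster")
  .launch()

process.waitFor() // blocks until the forked spark-submit process exits
```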
On Tue, Jul 14, 2015 at 9:57 AM, Shushant Arora shushantaror...@gmail.com
wrote:
When I specify --executor-cores 4 it fails to start the application.
When I give --executor-cores as 4, it works fine.
Do you have any NM (NodeManager) that advertises more than 4 available cores?
Also, it's always worth it
OK, thanks a lot!
A few more doubts:
What happens in a streaming application, say with
spark-submit --class classname --num-executors 10 --executor-cores 4
--master masteradd jarname
Will it allocate 10 containers throughout the life of the streaming application
on the same nodes until any node failure
On Tue, Jul 14, 2015 at 11:13 AM, Shushant Arora shushantaror...@gmail.com
wrote:
spark-submit --class classname --num-executors 10 --executor-cores 4
--master masteradd jarname
Will it allocate 10 containers throughout the life of streaming
application on same nodes until any node failure
Hi,
I'm wondering how to access elements of a linalg.Vector, e.g:
sparseVector: Seq[org.apache.spark.mllib.linalg.Vector] =
List((3,[1,2],[1.0,2.0]), (3,[0,1,2],[3.0,4.0,5.0]))
scala> sparseVector(1)
res16: org.apache.spark.mllib.linalg.Vector = (3,[0,1,2],[3.0,4.0,5.0])
How to get the
On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote:
Just to be clear, you mean the Spark Standalone cluster manager's master
and not the application's driver, right?
Sorry, by now I have understood that I would not necessarily put the driver app
on the master node and that
You have access to the offset ranges for a given rdd in the stream by
typecasting to HasOffsetRanges. You can then store the offsets wherever
you need to.
On Tue, Jul 14, 2015 at 5:00 PM, Chen Song chen.song...@gmail.com wrote:
A follow up question.
When using createDirectStream approach,
Does DataFrame support nested JSON for dumping directly to a database?
For simple JSON it is working fine:
{"id":2,"name":"Gerald","email":"gbarn...@zimbio.com","city":"Štoky","country":"Czech Republic","ip":"92.158.154.75"},
But for nested JSON it failed to load:
root
 |-- rows: array (nullable = true)
 |    |-- element:
Hi Roberto,
I have written PySpark code that reads from private S3 buckets, it should
be similar for public S3 buckets as well. You need to set the AWS access
and secret keys into the SparkContext, then you can access the S3 folders
and files with their s3n:// paths. Something like this:
sc =
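The reply above is about PySpark; a rough Scala equivalent would be the following (key values are placeholders, and truly public buckets may not need them at all):
```scala
// set the S3 credentials on the SparkContext's Hadoop configuration
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// then read with an s3n:// path (bucket/prefix are hypothetical)
val lines = sc.textFile("s3n://some-public-bucket/some/prefix/*")
```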
Yep :)
On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 algermissen1...@icloud.com
wrote:
On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote:
Just to be clear, you mean the Spark Standalone cluster manager's
master and not the applications driver, right.
Sorry, by now I
On Tue, Jul 14, 2015 at 12:03 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Can a container have multiple JVMs running in YARN?
Yes and no. A container runs a single command, but that process can start
other processes, and those also count towards the resource usage of the
container
Any solutions for this exception?
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
output location for shuffle 1
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389)
at
Hi,
I am working on finding the root cause of a bug where rows in DataFrames
seem to have misaligned data. My dataframes have two types of columns:
columns from data and columns from UDFs. I seem to be having trouble where
for a given row, the row data doesn't match the data used to compute the
I've opened a PR to fix this; please take a look:
https://github.com/apache/spark/pull/7405
On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote:
it works for scala 2.10, but for 2.11 i get:
[ERROR]
A follow-up question.
When using the createDirectStream approach, the offsets are checkpointed to
HDFS and are understandable by the Spark Streaming job. Is there a way to
expose the offsets via a REST API to end users? Or alternatively, is there
a way to have offsets committed to the Kafka Offset Manager
Yes, it works! Thanks a lot Burak!
Cheers,
Dan
2015-07-14 14:34 GMT-05:00 Burak Yavuz brk...@gmail.com:
Hi Dan,
You could zip the indices with the values if you like.
```
val sVec = sparseVector(1).asInstanceOf[
org.apache.spark.mllib.linalg.SparseVector]
val map =
Just to be clear, you mean the Spark Standalone cluster manager's master
and not the application's driver, right?
In that case, the earlier responses are correct.
TD
On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller moham...@glassbeam.com
wrote:
The master node does not have to be similar to
Hi Dan,
You could zip the indices with the values if you like.
```
val sVec = sparseVector(1).asInstanceOf[
org.apache.spark.mllib.linalg.SparseVector]
val map = sVec.indices.zip(sVec.values).toMap
```
Best,
Burak
On Tue, Jul 14, 2015 at 12:23 PM, Dan Dong dongda...@gmail.com wrote:
Hi,
Hi!
I am seeing some unexpected behavior with regards to cache() in DataFrames.
Here goes:
In my Scala application, I have created a DataFrame that I run multiple
operations on. It is expensive to recompute the DataFrame, so I have called
cache() after it gets created.
I notice that the
Hi All,
To start a new project in Spark, which technology is good: Java 8 or Scala?
I am a Java developer. Can I start with Java 8 or do I need to learn Scala?
Which one is the better technology to quickly start any POC project?
Thanks
- su
See previous thread:
http://search-hadoop.com/m/q3RTtaXamv1nFTGR
On Tue, Jul 14, 2015 at 1:30 PM, spark user spark_u...@yahoo.com.invalid
wrote:
Hi All
To Start new project in Spark , which technology is good .Java8 OR Scala .
I am Java developer , Can i start with Java 8 or I Need to
Good question. Like you, many are in the same boat (coming from a Java
background). Looking forward to a response from the community.
Regards
Vineel
On Tue, Jul 14, 2015 at 2:30 PM, spark user spark_u...@yahoo.com.invalid
wrote:
Hi All
To Start new project in Spark , which technology is good
Hi Sujit,
I just wanted to access public datasets on Amazon. Do I still need to provide
the keys?
Thank you,
From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Tuesday, July 14, 2015 3:14 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: Spark on EMR with S3 example (Python)
[Apologies for repost, for those who have seen this response already in the
dev mailing list]
1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
periodically saves the state RDD (which is a snapshot of all the state
data) to HDFS using RDD checkpointing. In fact, a streaming app
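As a small illustration of the state checkpointing being described (the checkpoint directory and key types are made up):
```scala
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/my-app") // hypothetical checkpoint directory

val pairs: DStream[(String, Long)] = ??? // some keyed stream built earlier in the job

// the state RDD produced by updateStateByKey is what gets checkpointed periodically
val counts = pairs.updateStateByKey[Long] { (values: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + values.sum)
}
```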
I found the reason, it is about sc. Thanks
On Tue, Jul 14, 2015 at 9:45 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Someone else also reported this error with spark 1.4.0
Thanks
Best Regards
On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan arthur.hk.c...@gmail.com
wrote:
Hi, Below is
Thanks TD, that is very useful.
On Tue, Jul 14, 2015 at 10:19 PM, Tathagata Das t...@databricks.com wrote:
You can do this.
// global variable to keep track of latest stuff
var latestTime: Time = _
var latestRDD: RDD[..] = _
dstream.foreachRDD((rdd: RDD[..], time: Time) => {
latestTime = time
Of course, exactly-once receiving is not the same as exactly-once. In the case of
the direct Kafka stream, the data may actually be pulled multiple times. But
even if the data of a batch is pulled twice because of some failure, the
final result (that is, transformed data accessed through foreachRDD) will
Can you describe how you cached the tables? In another HiveContext? AFAIK,
cached tables are only visible within the same HiveContext; you probably need to
execute a SQL query like
"CACHE TABLE mytable AS SELECT xxx" in the JDBC connection as well.
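A sketch of what that looks like; the same statement can be typed directly in the beeline/JDBC session so that it runs inside the Thrift server's own HiveContext (table names are made up):
```scala
// assuming hiveContext is the HiveContext the Thrift server is using
hiveContext.sql("CACHE TABLE mytable AS SELECT * FROM some_source_table")
```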
Cheng Hao
From: Brandon White
So you're using different HiveContext instances for the caching. We are not
expected to see tables cached with the other HiveContext instance.
From: Brandon White [mailto:bwwintheho...@gmail.com]
Sent: Wednesday, July 15, 2015 8:48 AM
To: Cheng, Hao
Cc: user
Subject: Re: How do you
Thanks TD and Cody. I saw that.
1. By doing that (foreachRDD), does KafkaDStream checkpoint its offsets to
HDFS at the end of each batch interval?
2. In the code, if I first apply transformations and actions on the
directKafkaStream and then use foreachRDD on the original KafkaDStream to
commit
Thanks TD.
As for 1), if the timing is not guaranteed, how are exactly-once semantics
supported? It feels like exactly-once receiving is not necessarily exactly-once
processing.
Chen
On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das t...@databricks.com wrote:
On Tue, Jul 14, 2015 at 6:42 PM,
This is a known race condition - root cause of SPARK-5681
https://issues.apache.org/jira/browse/SPARK-5681
On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi,
I have noticed that when StreamingContext.stop is called when no receiver
has
Hi all,
I'm upgrading from Spark 1.3 to Spark 1.4 and when trying to run the spark-sql
CLI, it gave a ```java.lang.UnsupportedOperationException: Not implemented
by the TFS FileSystem implementation``` exception. I did not get this error
with 1.3 and I don't use any TFS FileSystem. The full stack trace is
I have been doing a POC of adding a REST service in a Spark Streaming job. Say I
create a stateful DStream X by using updateStateByKey, and each time there
is an HTTP request, I want to apply some transformations/actions on the
latest RDD of X and collect the results immediately, not scheduled by the
streaming
Thanks, Marcelo.
That article confused me; thanks for correcting it and for the helpful tips.
I looked into virtual memory usage: (jmap + jvisualvm) does not show that 11.5
GB virtual memory usage - it is much less. I get the 11.5 GB virtual memory usage
using the top -p pid command for the SparkSubmit process.
The
Relevant documentation -
https://spark.apache.org/docs/latest/streaming-kafka-integration.html,
towards the end.
directKafkaStream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// offsetRanges.length = # of Kafka partitions being consumed
...
}
On Tue,
I've set up my cluster with a pre-calculated value for spark.executor.instances
in spark-defaults.conf such that I can run a job and have it maximize the
utilization of the cluster resources by default. However, if I want to run a
job with dynamicAllocation (by passing -c
Hello,
I am using Spark SQL along with the Thrift server so that we can access it using Hive
queries.
With Spark 1.3.1, I can register a UDF function, but with Spark 1.4.0 that doesn't work.
The jar of the UDF is the same.
Below are the logs:
I appreciate any advice.
== With Spark 1.4
Beeline version 1.4.0 by
Please take a look at SPARK-2365 which is in progress.
On Tue, Jul 14, 2015 at 5:18 PM, swetha swethakasire...@gmail.com wrote:
Hi,
Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark
Streaming to do lookups/updates/deletes in RDDs using keys by storing them
as
bq. that is, key-value stores
Please consider HBase for this purpose :-)
On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das t...@databricks.com wrote:
I do not recommend using IndexRDD for state management in Spark Streaming.
What it does not solve out-of-the-box is checkpointing of indexRDDs,
Why is .remember not ideal?
On Sun, Jul 12, 2015 at 7:22 PM, Brandon White bwwintheho...@gmail.com
wrote:
Hi Yin,
Yes there were no new rows. I fixed it by doing a .remember on the
context. Obviously, this is not ideal.
On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai yh...@databricks.com wrote:
On Tue, Jul 14, 2015 at 3:42 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
I looked into Virtual memory usage (jmap+jvisualvm) does not show that
11.5 g Virtual Memory usage - it is much less. I get 11.5 g Virtual memory
usage using top -p pid command for SparkSubmit process.
If you're
Hi,
Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark
Streaming to do lookups/updates/deletes in RDDs using keys by storing them
as key/value pairs.
Thanks,
Swetha
--
View this message in context:
I do not recommend using IndexedRDD for state management in Spark Streaming.
What it does not solve out-of-the-box is checkpointing of IndexedRDDs, which
is important because long-running streaming jobs can lead to an infinite chain of
RDDs. Spark Streaming solves it for the updateStateByKey operation, which
Hi All,
I've met an issue with MLlib when I use LogisticRegressionWithLBFGS.
My sample data:
*0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1
5689:1 18493:1 44187:1 5694:1 27799:1 12010:1*
*0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1
5689:1 18493:1
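The sample lines above are in LIBSVM format; a minimal sketch of loading such data and training with LogisticRegressionWithLBFGS (the HDFS path is made up):
```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/samples.libsvm") // hypothetical path
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(data)
```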
I was able to work around this by converting the DataFrame to an RDD and then
back to a DataFrame. This seems very weird to me, so any insight would be much
appreciated!
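A rough sketch of the round trip being described (names are hypothetical; the poster's actual code follows in the P.S. below):
```scala
// rebuild the DataFrame from its RDD and schema (the workaround described above)
val materialized = sqlContext.createDataFrame(df.rdd, df.schema)
materialized.cache()
```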
Thanks,
Nick
P.S. Here's the updated code with the workaround:
```
// Example UDFs that println when called
val twice
On Tue, Jul 14, 2015 at 6:42 PM, Chen Song chen.song...@gmail.com wrote:
Thanks TD and Cody. I saw that.
1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets
on HDFS at the end of each batch interval?
The timing is not guaranteed.
2. In the code, if I first apply
Hello there,
I have a JDBC connection set up to my Spark cluster but I cannot see the
tables that I cache in memory. The only tables I can see are those that are
in my Hive instance. I use a HiveContext to register a table and cache it
in memory. How can I enable my JDBC connection to query this
Hi Ankur,
Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark
Streaming to do lookups/updates/deletes in RDDs using keys by storing them
as key/value pairs.
Thanks,
Swetha
--
View this message in context:
Hello, I am wondering what will happen if I use a reference for transforming
an RDD, for example:
def func1(rdd: RDD[Int]): RDD[Int] = {
rdd.map(x => x * 2) // example transformation, but I am using a more
complex function
}
def main() {
.
val myrdd = sc.parallelize(1 to 100)
Thank you Hafsa
On Tue, Jul 14, 2015 at 11:09 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:
Hi,
I was also in the same situation as we were using MySQL. Let me give some
clarifications:
1. Spark provides a great methodology for big data analysis. So, if you
want to make your system more
Dear all,
I am trying to join two RDDs, named rdd1 and rdd2.
rdd1 is loaded from a text file with about 33000 records.
rdd2 is loaded from a table in Cassandra which has about 3 billion records.
I tried the following code:
```scala
val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...)
import
Dear all,
I have found a post discussing the same thing:
https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J
The solution is using joinWithCassandraTable and the documentation
is here:
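A rough sketch of what that looks like with the DataStax connector (keyspace, table and key extraction are made up):
```scala
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// the key RDD must match the Cassandra table's partition key; names are hypothetical
val keys: RDD[Tuple1[String]] = sc.textFile("...").map(line => Tuple1(line.split(",")(0)))
val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")
```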
I don't understand.
By the way, the `joinWithCassandraTable` does improve my query time
from 40 mins to 3 mins.
2015-07-15 13:19 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
I have explored Spark joins for the last few months (you can search my posts)
and it's frustratingly useless.
On Tue, Jul