Can you explain the issue? Further, in which language do you want to code?
There are a number of blogs on creating a simple Maven project in Eclipse, and
they are pretty simple and straightforward.
Yes, that is correct, sorry for confusing you. But I guess this is what you
are looking for, let me know if that doesn't help:
val filtered_statuses = stream.transform(rdd => {
  // Instead of hardcoding, you can fetch these from MySQL or a file or whatever
  val sampleHashTags =
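A minimal sketch of how that snippet might be completed, assuming a twitter4j Status DStream and a hard-coded hashtag set (both assumptions, not from the original mail):

import org.apache.spark.streaming.dstream.DStream
import twitter4j.Status

// Assumed: stream is a DStream[Status], e.g. from TwitterUtils.createStream(...)
def filterByHashTags(stream: DStream[Status]): DStream[Status] = {
  // Instead of hardcoding, these could be fetched from MySQL, a file, etc.
  val sampleHashTags = Set("#spark", "#scala")
  stream.transform { rdd =>
    rdd.filter(status => sampleHashTags.exists(tag => status.getText.toLowerCase.contains(tag)))
  }
}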
Hello,
I managed to read all my data back by skipping the offset that contains a
corrupt message. I have one more question regarding messageHandler method
vs dstream.foreachRDD.map vs dstream.map.foreachRDD best practices. I'm
using a function to read the serialized message from Kafka and convert it
Could you please provide the full stack trace of the exception? And
what's the Git commit hash of the version you were using?
Cheng
On 7/24/15 6:37 AM, Jerry Lam wrote:
Hi Cheng,
I ran into issues related to ENUM when I tried to use filter push-down.
I'm using Spark 1.5.0 (which contains
printSchema calls StructField.buildFormattedString() to output schema
information. buildFormattedString() uses DataType.typeName as the string
representation of the data type.
LongType.typeName = long
LongType.simpleString = bigint
I am not sure about the difference between these two type names.
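A quick way to see the two names side by side in a Spark shell (behaviour exactly as described above, nothing else assumed):

import org.apache.spark.sql.types.LongType

// typeName is derived from the class name; simpleString is the SQL/Hive-style name
println(LongType.typeName)     // long
println(LongType.simpleString) // bigint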
1. Yes you can, have a look at the EsOutputFormat
https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/mr/EsOutputFormat.java
2. I'm not quite sure about that; you could ask the ES people about it.
Thanks
Best Regards
On Thu, Jul 23, 2015 at 11:58
Hi there,
Details below.
Organisation: Woodside
URL: http://www.woodside.com.au/
Components: Spark Core 1.3.1/1.4.1 and Spark SQL
Use Case: Spark is being used for near real-time predictive analysis over
millions of equipment sensor readings and within our Data Integration processes
for data
I guess it would wait for some time and then throw something like this:
Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient memory
Thanks
Best Regards
On Thu, Jul 23, 2015 at 7:53 AM, bit1...@163.com bit1...@163.com wrote:
I want to program in Scala for Spark.
Try using insert instead of merge. Typically we use insert append (a
direct-path insert) to do bulk inserts into Oracle.
On Thu, Jul 23, 2015 at 1:12 AM, diplomatic Guru diplomaticg...@gmail.com
wrote:
Thanks Robin for your reply.
I'm pretty sure that writing to Oracle is what's taking longer, as writing to
HDFS is
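For reference, a rough sketch of what a direct-path ("insert append") write from a Spark job over JDBC could look like; the table, columns, record type and connection details are all made up:

import java.sql.DriverManager

// Assumed: rdd is an RDD[(String, Double)]; jdbcUrl/user/password are placeholders.
rdd.foreachPartition { rows =>
  val conn = DriverManager.getConnection(jdbcUrl, user, password)
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement(
    "INSERT /*+ APPEND_VALUES */ INTO readings (sensor_id, reading) VALUES (?, ?)")
  try {
    rows.foreach { case (sensorId, reading) =>
      stmt.setString(1, sensorId)
      stmt.setDouble(2, reading)
      stmt.addBatch()
    }
    stmt.executeBatch()   // bulk insert instead of a MERGE
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}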
I am currently working on the latest version of Apache Spark (1.4.1),
pre-built package for Hadoop 2.6+.
Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory data (similar
to Altibase's HDB:
http://altibase.com/in-memory-database-computing-solutions/security/
I don't think this is a bug either. For an empty JSON array [],
there's simply no way to infer its actual data type, and in this case
Spark SQL just tries to fill in the "safest" type, which is
StringType, because basically you can cast any data type to StringType.
In general, schema
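A small illustration of that fallback in a 1.4-era shell (the column name is made up):

// An empty JSON array gives no element values to look at, so Spark SQL falls
// back to StringType for the array elements.
val df = sqlContext.read.json(sc.parallelize(Seq("""{"tags": []}""")))
df.printSchema()
// tags comes back as array<string>: StringType is the "safest" fill-in.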
Hi Akhil,
the namenode is definitely configured correctly, otherwise the job would
not start at all. It registers with YARN and starts up, but once the nodes
try to communicate with each other it fails. Note that a Hadoop MR job using
the identical configuration executes without any problems. The
Your guess is partly right. Firstly, Spark SQL doesn’t have an
equivalent data type to Parquet BINARY (ENUM), and always falls back to
normal StringType. So in your case, Spark SQL just sees a StringType,
which maps to Parquet BINARY (UTF8), but the underlying data type is
BINARY (ENUM).
I am trying to use the spark job server to run a query against a DSE
Cassandra table. Can anyone please help me with instructions? I am *NOT* a
Java person.
What I have done so far:
1. I have a single node DataStax Cassandra ver 4.6 running on a Centos 6.6
VM
2. Tested dse spark to query tables
Hi,
I am trying one transformation by calling a Scala method.
This Scala method returns MutableList[AvroObject]:
def processRecords(id: String, list1: Iterable[(String, GenericRecord)]):
scala.collection.mutable.MutableList[AvroObject]
Hence, the output of the transformation is
Hi,
I am getting this error for some of my Spark applications. I have multiple Spark
applications running in parallel. Is there a limit on the number of Spark
applications that I can run in parallel?
ERROR SparkUI: Failed to bind SparkUI
java.net.BindException: Address already in use: Service
Hi Chintan,
This is more of an Oracle VirtualBox virtualization issue than a Spark issue.
VT-x is hardware-assisted virtualization and it is required by Oracle
VirtualBox for all (64-bit) guests. The error message indicates that
either your processor does not support VT-x (but your VM is
Hi,
Spark's configuration file (useful to retrieve metrics), namely
conf/metrics.properties, states the following:
Within an instance, a source specifies a particular set of grouped
metrics. There are two kinds of sources:
1. Spark internal sources, like MasterSource, WorkerSource, etc.,
Hi Joji,
To my knowledge, Spark does not offer any such function.
I agree, defining a function to find an open (random) port would be a good
option. However, in order to invoke the corresponding SparkUI one needs
to know this port number.
Thanks,
Ajay
On Fri, Jul 24, 2015 at 10:19 AM, Joji
Hello Apache Kafka community,
Say there is only one topic with a single partition and a single message on
it.
Calling poll with the new consumer will return a ConsumerRecord for that
message, and it will have an offset of 0.
After processing the message, the current KafkaConsumer implementation expects
Thanks for the additional info. I tried to follow that and went ahead and
directly added netlib to my application POM/JAR - that should be sufficient
to make it work? And that is at least definitely on the executor class
path? Still got the same warning, so not sure where else to take it.
Thanks
Thanks Ajay.
The way we wrote our Spark application is that we have generic Python code,
multiple instances of which can be called using different parameters. Does
Spark offer any function to bind it to an available port?
I guess the other option is to define a function to find an open port.
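One possible shape for that helper, sketched in Scala (the Python version would be the analogous socket dance); nothing here comes from a Spark API:

import java.net.ServerSocket

// Bind to port 0 so the OS picks a free port, note it, release it, then hand
// the number to spark.ui.port. There is a small race window before Spark rebinds.
def findOpenPort(): Int = {
  val socket = new ServerSocket(0)
  try socket.getLocalPort
  finally socket.close()
}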
Well... there are only 2 hard problems in computer science: naming things,
cache invalidation, and off-by-one errors.
The direct stream implementation isn't asking you to commit anything.
It's asking you to provide a starting point for the stream on startup.
Because offset ranges are inclusive
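A hedged sketch of what "providing a starting point" looks like with the direct stream: pass explicit per-partition offsets when creating it (the topic name and offset are made up; ssc and kafkaParams are assumed to exist):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Start topic "events", partition 0, at offset 42 (inclusive); in practice the
// offsets would be read back from wherever you chose to store them.
val fromOffsets = Map(TopicAndPartition("events", 0) -> 42L)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))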
Hi,
We are converting some CSV log files to Parquet, but the job is getting
progressively slower the more files we add to the Parquet folder.
The Parquet files are being written to S3; we are using a Spark
standalone cluster running on EC2 and the Spark version is 1.4.1. The
Parquet files are
That is pretty odd -- toMap not being there would be from Scala... but what
is even weirder is that toMap is positively executed on the driver machine,
which is the same when you do spark-shell and spark-submit...does it work
if you run with --master local[*]?
Also, you can try to put a set -x in
Hi Akhil,
That's exactly what I needed. You saved my day :)
Thanks a lot,
Best,
Zoran
On Fri, Jul 24, 2015 at 12:28 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yes, that is correct, sorry for confusing you. But I guess this is what
you are looking for, let me know if that doesn't help:
Thanks. Sorry, the last section was supposed to be:
streams.par.foreach { nameAndStream =>
  nameAndStream._2.foreachRDD { rdd =>
    val df = sqlContext.jsonRDD(rdd)
    df.insertInto(nameAndStream._1)
  }
}
ssc.start()
On Fri, Jul 24, 2015 at 10:39 AM, Dean Wampler deanwamp...@gmail.com
wrote:
You don't
Hi,
We have a requirement to maintain user session state and to
maintain/update metrics at minute, hour and day granularities for a
user session in our Streaming job. How do we maintain the metrics over a window
while also maintaining the session state using updateStateByKey in Spark
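One possible shape for this, as a sketch only: the input DStream, the SessionState class and the per-minute metric are all assumptions, not from the original question:

import org.apache.spark.streaming.Minutes

// Assumed input: events is a DStream[(String, Long)] of (userId, eventTime).
case class SessionState(eventCount: Long, lastSeen: Long)

// Per-user session state kept across batches with updateStateByKey
// (this requires ssc.checkpoint(...) to be set).
val sessions = events.updateStateByKey[SessionState] { (newEvents: Seq[Long], state: Option[SessionState]) =>
  val prev = state.getOrElse(SessionState(0L, 0L))
  if (newEvents.isEmpty) Some(prev)
  else Some(SessionState(prev.eventCount + newEvents.size, newEvents.max))
}

// A minute-granularity metric (event count per user) maintained alongside the state.
val perMinuteCounts = events
  .map { case (user, _) => (user, 1L) }
  .reduceByKeyAndWindow(_ + _, Minutes(1), Minutes(1))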
Sorry, wrong ML.
On Fri, Jul 24, 2015 at 7:07 PM, Cody Koeninger c...@koeninger.org wrote:
Are you intending to be mailing the spark list or the kafka list?
On Fri, Jul 24, 2015 at 11:56 AM, Stevo Slavić ssla...@gmail.com wrote:
Hello Cody,
I'm not sure we're talking about same thing.
You don't need the par (parallel) versions of the Scala collections,
actually. Recall that you are building a pipeline in the driver, but it
doesn't start running cluster tasks until ssc.start() is called, at which
point Spark will figure out the task parallelism. In fact, you might as
well do the
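Following that advice, the earlier snippet could be simplified along these lines (streams, sqlContext and ssc are assumed from the earlier mail):

// Plain foreach is enough: this loop only wires up the pipeline on the driver;
// the actual parallelism comes from Spark once ssc.start() runs.
streams.foreach { case (table, dstream) =>
  dstream.foreachRDD { rdd =>
    sqlContext.jsonRDD(rdd).insertInto(table)
  }
}
ssc.start()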
Hello Cody,
I'm not sure we're talking about same thing.
Since you're mentioning streams, I guess you were referring to the current high
level consumer, while I'm talking about the new, yet-unreleased high level
consumer.
Poll I was referring to is
Are you intending to be mailing the spark list or the kafka list?
On Fri, Jul 24, 2015 at 11:56 AM, Stevo Slavić ssla...@gmail.com wrote:
Hello Cody,
I'm not sure we're talking about same thing.
Since you're mentioning streams I guess you were referring to current high
level consumer,
Sorry, I didn't mention I'm using the Python API, which doesn't have the
saveAsObjectFiles method.
Is there any alternative from Python?
Also, I want to write the raw bytes of my objects into files on disk,
rather than using some serialization format to be read back into Spark.
Is it possible?
Any
Thanks, but something is not clear...
I have a Mesos cluster.
- I want to submit my application and schedule it with Chronos.
- For cluster mode I need a dispatcher; is this another container (a machine
in the real world)? What does it do? Is it needed when I'm using Chronos?
- How can I access my
When running Spark in Mesos cluster mode, the driver program runs in one of
the cluster nodes, like the other Spark processes that are spawned. You
won't need a special node for this purpose. I'm not very familiar with
Chronos, but its UI or the regular Mesos UI should show you where the
driver is
Hi Jodi,
I guess there is no hard limit on the number of Spark applications running in
parallel. However, you need to ensure that you do not use the same (e.g.,
default) port numbers for each application.
In your specific case, for example, if you try using the default SparkUI port
4040 for more than
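For reference, a hedged example of giving each application its own UI port (the values are made up); Spark also probes successive ports (4040, 4041, ...) up to spark.port.maxRetries (16 by default) before failing:

import org.apache.spark.SparkConf

// Either pin a distinct UI port per application instance, or widen the retry window.
val conf = new SparkConf()
  .setAppName("app-instance-1")
  .set("spark.ui.port", "4050")        // fixed, per-instance UI port
  .set("spark.port.maxRetries", "32")  // or let Spark probe more ports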
Hi,
When I try to broadcast a HashMap, it runs much slower than the same data
broadcast as an array.
It hangs at "SparkContext: Created broadcast 0" for a few seconds (30s), while
an array does not.
The broadcast dataset is about 1G.
best!
huanglr
Hi,
When running two experiments with the same application, we see a 50%
performance difference between using HDFS and files on disk, both using the
textFile/saveAsTextFile call. Almost all performance loss is in Stage 1.
Input (in Stage 0):
The file is read in using val input =
Hi
I am running a Spark Streaming app on YARN and using Apache HttpAsyncClient 4.1.
This client jar internally has a dependency on httpcore-4.4.1.jar.
An old version of this jar (httpcore-4.2.5.jar) is also
present in the classpath and has higher priority (coming earlier
Hi guys,
I am new to Apache Spark, and I wanted to start contributing to this project.
But before that I need to understand the basic coding flow here. I read
"How to contribute to Apache Spark" but I couldn't find any way to start
reading the code and understanding the code flow. Can anyone tell me
Hey folks,
I am wanting to set up a single machine, or a small cluster, to
run our Spark-based exploration lab.
Does anyone have suggestions or metrics on the feasibility of running
Spark standalone on a machine with plenty of RAM (64GB) and SSDs, without a
resource manager?
I expect one or two users
Please check out the Spark source from GitHub, and look here:
https://github.com/apache/spark/tree/master/examples/src/main
On Fri, Jul 24, 2015 at 8:43 PM, Chintan Bhatt
chintanbhatt...@charusat.ac.in wrote:
Hi.
Can I know how to get such a folder/code for the Spark implementation?
On Sat,
It's really a question of whether you need access to the
MessageAndMetadata, or just the key / value from the message.
If you just need the key/value, dstream map is fine.
In your case, since you need to be able to control a possible failure when
deserializing the message from the
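To make the distinction concrete, a sketch (the record type and deserializer are hypothetical, not from this thread): with the messageHandler variant the decode happens where the partition/offset is still visible, so a corrupt message can be turned into a None instead of failing the batch.

import kafka.message.MessageAndMetadata
import scala.util.Try

// Hypothetical record type and deserializer.
case class MyRecord(id: String)
def deserialize(bytes: Array[Byte]): MyRecord = ???

// messageHandler: runs per message with the full MessageAndMetadata available.
val handler = (mmd: MessageAndMetadata[String, Array[Byte]]) =>
  Try(deserialize(mmd.message)).toOption   // None marks an undecodable message

// If only the key/value were needed, a plain dstream.map over (key, value) is fine.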
Hi all,
I have a standalone Spark cluster set up on EC2 machines. I did the setup
manually, without the ec2 scripts. I have two questions about Spark/GraphX
performance:
1) When I run the PageRank example, the storage tab does not show that all
RDDs are cached. Only one RDD is 100% cached, but the
While running the command below, I am getting this error in the YARN log. Has anybody hit this issue?
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster lib/spark-examples-1.4.1-hadoop2.6.0.jar 10
2015-07-24 12:06:10,846 ERROR [RMCommunicator Allocator]
You can certainly start jobs without Chronos, but to automatically restart
finished jobs or to run jobs at specific times or periods, you'll want
something like Chronos.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
I'd recommend starting with a few of the code examples to get a sense of
Spark usage (in the examples/ folder when you check out the code). Then,
you can work through the Spark methods they call, tracing as deep as needed
to understand the component you are interested in.
You can also find an