If something is persisted you can easily see it under the Storage tab in
the web UI.
Thanks
Best Regards
On Tue, Nov 18, 2014 at 7:26 PM, Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
I am trying to figure out if sorting is persisted after applying Pair RDD
transformations and I am
Hello,
I'm writing a process that ingests json files and saves them as parquet
files.
The process is as such:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")
val parquetRequests = sqlContext.parquetFile("/requests_parquet")
You can also insert into existing tables via .insertInto(tableName, overwrite).
You just have to import sqlContext._
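For reference, a hedged sketch of that flow in the spark-shell (Spark 1.1-era
SchemaRDD API; the paths and table name are illustrative, and sc is the shell's
SparkContext):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val jsonRequests = sqlContext.jsonFile("/requests")
jsonRequests.saveAsParquetFile("/requests_parquet")          // initial write

val parquetRequests = sqlContext.parquetFile("/requests_parquet")
parquetRequests.registerTempTable("requests_parquet")

// later batches can be appended into the existing table:
sqlContext.jsonFile("/requests_new").insertInto("requests_parquet")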
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello,
I'm writing a process that ingests json files and saves them as parquet files.
The process is as such:
Very cool thank you!
On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier mps@gmail.com wrote:
You can also insert into existing tables via .insertInto(tableName,
overwrite). You just have to import sqlContext._
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello,
Yes, you can serve queries over your RDD data and return results to the
user/client as long as your driver is alive.
For example, I have built a Play! application that acts as a driver
(creating a SparkContext), loads up data from my database, organizes it, and
subsequently receives and processes
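Roughly, that pattern looks like the sketch below (the object name, data path
and query are illustrative, not the actual application): the driver holds a
long-lived SparkContext, caches an RDD, and answers each incoming request
against it.

import org.apache.spark.{SparkConf, SparkContext}

object QueryServer {
  val sc = new SparkContext(new SparkConf().setAppName("query-server"))
  val data = sc.textFile("hdfs:///events").cache()   // loaded once, kept in memory

  // called by the web framework (e.g. a Play controller) for each request
  def countMatching(keyword: String): Long =
    data.filter(_.contains(keyword)).count()
}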
Hi,
When reading through ALS code, I find that:
class ALS private (
private var numUserBlocks: Int,
private var numProductBlocks: Int,
private var rank: Int,
private var iterations: Int,
private var lambda: Double,
private var implicitPrefs: Boolean,
private var
Does anyone have an idea about this?
Thx
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-spark-operation-pipeline-and-block-storage-tp18201p19263.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I'm trying to set up a PySpark ETL job that takes in JSON log files and
spits out fact table files for upload to Redshift. Is there an efficient
way to send different event types to different outputs without having to
just read the same cached RDD twice? I have my first RDD which is just a
json
Akhil, I think Aniket uses the word persisted in a different way than
what you mean. I.e. not in the RDD.persist() way. Aniket asks if running
combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting
is preserved.)
I think the answer is no. combineByKey uses AppendOnlyMap,
Thanks Daniel. I can understand that the keys will not be in sorted order,
but what I am trying to understand is whether the functions are passed
values in sorted order within a given partition.
For example:
sc.parallelize(1 to 8).map(i => (1, i)).sortBy(t => t._2).foldByKey(0)((a,
b) => b).collect
Ah, so I misunderstood you too :).
My reading of org/apache/spark/Aggregator.scala is that your function will
always see the items in the order that they are in the input RDD. An RDD
partition is always accessed as an iterator, so it will not be read out of
order.
On Wed, Nov 19, 2014 at 2:28
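If you want to see for yourself what order the values arrive in, a quick
sketch in the spark-shell is to materialize each partition as a list:

sc.parallelize(1 to 8)
  .map(i => (1, i))
  .sortBy(_._2)
  .mapPartitions(it => Iterator(it.toList))   // one List per partition, in iteration order
  .collect()
  .foreach(println)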
Hello experts,
Is there an easy way to debug a Spark Java application?
I'm putting debug logs in the map function but there aren't any logs on
the console.
Also, can I include my custom jars while launching spark-shell and do my POC
there?
This might be a naive question but any help here is
We keep running into https://issues.apache.org/jira/browse/SPARK-2823 when
trying to use GraphX. The cost of repartitioning the data is really high
for us (lots of network traffic) which is killing the job performance.
I understand the bug was reverted to stabilize unit tests, but frankly it
Hi,
I wrote the simple Spark code below and hit a runtime issue where the system
can't seem to find some methods of the scala reflect library.
package org.apache.spark.examples
import scala.io.Source
import scala.reflect._
import scala.reflect.api.JavaUniverse
import scala.reflect.runtime.universe
I joined two datasets together, and my resulting logs look like this:
(975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith)))
(253142991,((30058,20141112T171246,web,ATLANTA16,GA,US,Southeast),(Male,Bob,Jones)))
When a field of an object is enclosed in a closure, the object itself is
also enclosed automatically, thus the object needs to be serializable.
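A minimal sketch of the usual workaround (class and field names are
illustrative): copy the field into a local val so the closure captures only
the value, not the enclosing object.

import org.apache.spark.rdd.RDD

class Scaler(val factor: Int) {          // deliberately not Serializable
  def scale(rdd: RDD[Int]): RDD[Int] = {
    val localFactor = factor             // local copy: the closure captures an Int, not `this`
    rdd.map(_ * localFactor)
  }
}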
On 11/19/14 6:39 PM, Hao Ren wrote:
Hi,
When reading through ALS code, I find that:
class ALS private (
private var numUserBlocks: Int,
Can you please post the full source of your code and some sample data to
run it on?
2014-11-19 16:23 GMT+01:00 YaoPau jonrgr...@gmail.com:
I joined two datasets together, and my resulting logs look like this:
(975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith)))
Thanks Daniel :-). It seems to make sense and something I was hoping for. I
will proceed with this assumption and report back if I see any anomalies.
On Wed Nov 19 2014 at 19:30:02 Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Ah, so I misunderstood you too :).
My reading of org/
I have for now submitted a JIRA ticket @
https://issues.apache.org/jira/browse/SPARK-4473. I will collate all my
experiences (and hacks) and submit them as a feature request for a public API.
On Tue Nov 18 2014 at 20:35:00 andy petrella andy.petre...@gmail.com
wrote:
yep, we should also propose to
Thanks for pointing this out Mark. Had totally missed the existing JIRA
items
On Wed Nov 19 2014 at 21:42:19 Mark Hamstra m...@clearstorydata.com wrote:
This is already being covered by SPARK-2321 and SPARK-4145. There are
pull requests that are already merged or already very far along --
This is already being covered by SPARK-2321 and SPARK-4145. There are pull
requests that are already merged or already very far along -- e.g.,
https://github.com/apache/spark/pull/3009
If there is anything that needs to be added, please add it to those issues
or PRs.
On Wed, Nov 19, 2014 at
Hi,
I'm starting with Spark and I'm just trying to understand: if I want to
use Spark Streaming, should I feed it with Flume or Kafka? I think
there's no official sink from Flume to Spark Streaming, and it seems
that Kafka fits better since it gives you reliability.
Could someone give a good
Hi all!
Thanks for answering!
@Sean, I tried to run with 30 executor cores, and 1 machine is still not
processing anything.
@Vanzin, I checked the RM's web UI, and all nodes were detected and RUNNING.
The interesting fact is that the available
memory and available cores of 1 node were different from the other 2, with
Hello!
I'm working on a POC with Spark SQL, where I’m trying to get data from
Cassandra into Tableau using Spark SQL.
Here is the stack:
- cassandra (v2.1)
- Spark SQL (pre-built v1.1, Hadoop v2.4)
- cassandra / spark sql connector
(https://github.com/datastax/spark-cassandra-connector)
- hive
-
Hi all, I am running HiveThriftServer2 and noticed that the process stays
up even though there is no driver connected to the Spark master.
I started the server via sbin/start-thriftserver and my namenodes are
currently not operational. I can see from the log that there was an error
on startup:
For debugging you can refer to these two threads:
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-hit-breakpoints-using-IntelliJ-In-functions-used-by-an-RDD-td12754.html
This is not by design. Can you please file a JIRA?
On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi all, I am running HiveThriftServer2 and noticed that the process stays
up even though there is no driver connected to the Spark master.
I started the server
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf nightwolf...@gmail.com wrote:
Is there a better way to mock this out and test Hive/metastore with
SparkSQL?
I would use TestHive which creates a fresh metastore each time it is
invoked.
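A small sketch of that, assuming the spark-hive artifact (which contains
TestHive) is on the test classpath; TestHive points at a fresh, local
metastore/warehouse so tests don't depend on an external Hive installation:

import org.apache.spark.sql.hive.test.TestHive

TestHive.sql("CREATE TABLE IF NOT EXISTS my_test_table (key INT, value STRING)")
TestHive.sql("SHOW TABLES").collect().foreach(println)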
Has anyone else received this type of error? We are not sure what the
issue is nor how to correct it to get our job to complete...
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv danielru...@gmail.com
wrote:
Another problem I have is that I get a lot of small json files, and as a
result a lot of small parquet files. I'd like to merge the json files into
a few parquet files; how do I do that?
You can use `coalesce` on any
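A hedged sketch of the coalesce approach (the paths and partition count are
illustrative, sqlContext is the usual SQLContext): read the many small JSON
files, shrink to a few partitions, and write those out as Parquet.

val small = sqlContext.jsonFile("/requests")            // many small input files
val merged = small.coalesce(4)                          // 4 output partitions, pick to taste
sqlContext.applySchema(merged, small.schema)            // re-attach the schema in case coalesce returns a plain RDD[Row]
  .saveAsParquetFile("/requests_parquet_merged")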
I am not very familiar with the JSONSerDe for Hive, but in general you
should not need to manually create a schema for data that is loaded from
hive. You should just be able to call saveAsParquetFile on any SchemaRDD
that is returned from hctx.sql(...).
I'd also suggest you check out the
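In other words, something along these lines (the table name and output path
are assumptions):

val hctx = new org.apache.spark.sql.hive.HiveContext(sc)
hctx.sql("SELECT * FROM my_json_table").saveAsParquetFile("/tmp/my_json_table_parquet")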
The whole stacktrace/exception would be helpful. Hive is an optional
dependency of Spark SQL, but you will need to include it if you are
planning to use the thrift server to connect to Tableau. You can enable it
by adding -Phive when you build Spark.
You might also try asking on the cassandra
That error can mean a whole bunch of things (and we've been working
recently to make it more descriptive). Often the actual cause is in the
executor logs.
On Wed, Nov 19, 2014 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote:
Has anyone else received this type of error? We are not sure
You can override the schema inference by passing a schema as the second
argument to jsonRDD; however, that's not a super elegant solution. We are
considering one option to make this easier here:
https://issues.apache.org/jira/browse/SPARK-4476
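A hedged sketch of passing an explicit schema (field names and types are
illustrative; on Spark 1.1/1.2 the type classes live under
org.apache.spark.sql):

import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("ts", LongType, nullable = true)))

val events = sqlContext.jsonRDD(jsonLines, schema)   // jsonLines: RDD[String] of raw JSON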
On Tue, Nov 18, 2014 at 11:06 PM, Akhil Das
https://issues.apache.org/jira/browse/SPARK-4497
On Wed, Nov 19, 2014 at 1:48 PM, Michael Armbrust mich...@databricks.com
wrote:
This is not by design. Can you please file a JIRA?
On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi all, I am running
Thank you Michael
I will try it out tomorrow
Daniel
On 19 Nov 2014, at 21:07, Michael Armbrust mich...@databricks.com wrote:
You can override the schema inference by passing a schema as the second
argument to jsonRDD; however, that's not a super elegant solution. We are
considering one
I am hitting the same issue, i.e., after running for some time, if the Spark
Streaming job loses or times out the Kafka connection, it will just start to
return empty RDDs.
Is there a timeline for when this issue will be fixed so that I can plan
accordingly?
Thanks.
Tian
Hi - I was curious if anyone is using the Spark SQL Thrift JDBC server with
Cassandra. It would be great if you could share how you got it working. For
example, what config changes have to be made in hive-site.xml, what additional
jars are required, etc.?
I have a Spark app that can
I'm trying to run Spark on Yarn on a hortonworks 2.1.5 cluster. I'm getting
this error:
14/11/19 13:46:34 INFO cluster.YarnClientSchedulerBackend: Registered
executor: Actor[akka.tcp://sparkExecutor@#/user/Executor#-
2027837001] with ID 42
14/11/19 13:46:34 WARN
Oh, actually, we do not support MapType provided by the schema given to
jsonRDD at the moment (my bad..). Daniel, you need to wait for the patch of
4476 (I should have one soon).
Thanks,
Yin
On Wed, Nov 19, 2014 at 2:32 PM, Daniel Haviv danielru...@gmail.com wrote:
Thank you Michael
I will
Your Hadoop configuration is set to look for this file to determine racks. Is
the file present on cluster nodes? If not, look at your hdfs-site.xml and
remove the setting for a rack topology script there (or it might be in
core-site.xml).
Matei
On Nov 19, 2014, at 12:13 PM, Arun Luthra
As of now, you can feed Spark Streaming from both Kafka and Flume.
Currently though there is no API to write data back to either of the two
directly.
I sent a PR which should eventually add something like this:
Btw, if you want to write to Spark Streaming from Flume -- there is a sink
(it is a part of Spark, not Flume). See Approach 2 here:
http://spark.apache.org/docs/latest/streaming-flume-integration.html
On Wed, Nov 19, 2014 at 12:41 PM, Hari Shreedharan
hshreedha...@cloudera.com wrote:
As of
I'm running into similar problems with accumulators failing to serialize
properly. Are there any examples of accumulators being used in more
complex environments than simply initializing them in the same class and
then using them in a .foreach() on an RDD referenced a few lines below?
From the
I don't have a solution for you, but it sounds like you might want to
follow this issue:
SPARK-3533 https://issues.apache.org/jira/browse/SPARK-3533 - Add
saveAsTextFileByKey() method to RDDs
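Until something like that lands, one hedged workaround is to parse and cache
the RDD once, then filter and save per event type (extractEventType, the
paths, and the type list below are all hypothetical):

import org.apache.spark.SparkContext._               // PairRDDFunctions on pre-1.3 versions

def extractEventType(line: String): String = line.split(",")(0)   // stand-in parser

val parsed = sc.textFile("/logs/*.json")
  .map(line => (extractEventType(line), line))
  .cache()

Seq("click", "view", "purchase").foreach { t =>
  parsed.filter(_._1 == t).values.saveAsTextFile("/output/events_" + t)
}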
On Wed Nov 19 2014 at 6:41:11 AM Tom Seddon mr.tom.sed...@gmail.com wrote:
I'm trying to set up a
Question: when you say different versions, do you mean different versions of
dependency files? What are the dependency files for Spark?
On Tue Nov 18 2014 at 5:27:18 PM Anson Abraham anson.abra...@gmail.com
wrote:
when the CDH cluster was running, I did not set up the Spark role. When I did for
the first
Hi,
I'm running out of memory when I run a GraphX program on a dataset of more
than 10 GB. It was handled pretty well in the case of normal Spark operations
when I used StorageLevel.MEMORY_AND_DISK.
In the case of GraphX I found it only allows storing in memory, and that is
because in the Graph constructor, this
Hi,
I have a graph where more than one edge between two vertices is possible.
Now I need to find the top vertex pairs between which the most calls happen.
The output should look like (V1, V2, no. of edges).
So I need to know how to find out the total no. of edges between just those two
I have been using Spark SQL to read in JSON data, like so:
val myJsonFile = sqc.jsonFile(args(myLocation))
myJsonFile.registerTempTable(myTable)
sqc.sql(mySQLQuery).map { row =>
  myFunction(row)
}
And then in myFunction(row) I can read the various columns with the
Row.getX methods. However, this
Hi sparkers,
I'm trying to use spark.sql.thriftserver.scheduler.pool for the first time
(earlier I was stuck because of
https://issues.apache.org/jira/browse/SPARK-4037)
I have two pools setup:
[image: Inline image 1]
and would like to issue a query against the low priority pool.
I am doing
You can extract the nested fields in sql: SELECT field.nestedField ...
If you don't do that then nested fields are represented as rows within rows
and can be retrieved as follows:
t.getAs[Row](0).getInt(0)
Also, I would write t.getAs[Buffer[CharSequence]](12) as
t.getAs[Seq[String]](12) since
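A small sketch of both approaches side by side (the table and column names
are made up):

import org.apache.spark.sql._

// flatten in SQL and use the typed getters:
sqlContext.sql("SELECT address.city, tags FROM people")
  .map(r => (r.getString(0), r.getAs[Seq[String]](1)))

// or keep the nested struct and drill into the inner Row by hand:
sqlContext.sql("SELECT address FROM people")
  .map(r => r.getAs[Row](0).getInt(0))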
Hi Anson,
We've seen this error when incompatible classes are used in the driver
and executors (e.g., same class name, but the classes are different
and thus the serialized data is different). This can happen for
example if you're including some 3rd party libraries in your app's
jar, or changing
Hi Landon,
I tried this but it didn't work for me. I get Task not serializable
exception:
Caused by: java.io.NotSerializableException:
org.apache.hadoop.conf.Configuration
How do you make org.apache.hadoop.conf.Configuration hadoopConfiguration
available to tasks?
Thank you for your answer; I don't know if I typed the question
correctly, but your answer helps me.
I'm going to ask the question again to check whether you understood me.
I have this topology:
DataSource1, ..., DataSourceN -- Kafka -- SparkStreaming -- HDFS
DataSource1, ..., DataSourceN --
Yeah, but in this case I'm not building any files, just deployed the config
files in CDH 5.2 and initiated a spark-shell to just read and output a file.
On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com wrote:
Hi Anson,
We've seen this error when incompatible classes are used
I think your config may be the issue then. It sounds like 1 server is
configured in a different YARN group that states it has way fewer
resources than it actually does.
On Wed, Nov 19, 2014 at 5:27 PM, Alan Prando a...@scanboo.com.br wrote:
Hi all!
Thanks for answering!
@Sean, I tried to run with 30
On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com wrote:
yeah but in this case i'm not building any files. just deployed out config
files in CDH5.2 and initiated a spark-shell to just read and output a file.
In that case it is a little bit weird. Just to be sure, you are
This works great, thank you!
Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini
On Wed, Nov 19, 2014 at 3:40 PM, Michael Armbrust mich...@databricks.com
wrote:
You can extract the nested fields in sql: SELECT field.nestedField ...
If you don't do that then nested fields are
yeah CDH distribution (1.1).
On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com wrote:
On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com
wrote:
yeah but in this case i'm not building any files. just deployed out
config
files in CDH5.2 and initiated a
Hi,
I am running some Spark code on my cluster in standalone mode. However, I
have noticed that the most powerful machines (32 cores, 192 GB mem) hardly
get any tasks, whereas my small machines (8 cores, 128 GB mem) all get
plenty of tasks. The resources are all displayed correctly in the WebUI
I would use just textFile unless you actually need a guarantee that you
will be seeing a whole file at a time (textFile splits on new lines).
RDDs are immutable, so you cannot add data to them. You can however union
two RDDs, returning a new RDD that contains all the data.
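A minimal sketch of the union point (paths are illustrative):

val day1 = sc.textFile("/logs/2014-11-18")
val day2 = sc.textFile("/logs/2014-11-19")
val both = day1.union(day2)   // a new RDD containing all lines from both inputs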
On Wed, Nov 19, 2014 at
Sorry meant cdh 5.2 w/ spark 1.1.
On Wed, Nov 19, 2014, 17:41 Anson Abraham anson.abra...@gmail.com wrote:
yeah CDH distribution (1.1).
On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com
wrote:
On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com
wrote:
I have a class which is a subclass of Tuple2, and I want to use it with
PairRDDFunctions. However, I seem to be limited by the invariance of T in
RDD[T] (see SPARK-1296 https://issues.apache.org/jira/browse/SPARK-1296).
My Scala-fu is weak: the only way I could think to make this work would be
to
I think you should also be able to get away with casting it back and forth
in this case using .asInstanceOf.
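A hedged sketch of that cast-based workaround (MyPair is a hypothetical Tuple2
subclass; the implicit for PairRDDFunctions comes from SparkContext._ on
pre-1.3 Spark):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

class MyPair(k: String, v: Int) extends Tuple2[String, Int](k, v)

def sumByKey(pairs: RDD[MyPair]): RDD[MyPair] =
  pairs.asInstanceOf[RDD[(String, Int)]]       // safe at runtime: MyPair IS-A (String, Int)
    .reduceByKey(_ + _)
    .map { case (k, v) => new MyPair(k, v) }   // wrap the plain tuples back into MyPair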
On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I have a class which is a subclass of Tuple2, and I want to use it with
PairRDDFunctions. However, I
As Marcelo mentioned, the issue occurs mostly when incompatible classes are
used by executors or drivers. Check whether the output comes out in
spark-shell. If yes, then most probably in your case there is some
issue with your configuration files. It will be helpful if you can paste
the
Here is my attempt:
val sparkConf = new SparkConf().setAppName("LogCounter")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val sc = new SparkContext()
val geoData = sc.textFile("data/geoRegion.csv")
  .map(_.split(','))
  .map(line => (line(0),
Casting to Tuple2 is easy, but the output of reduceByKey is presumably a
new Tuple2 instance so I'll need to map those to new instances of my class.
Not sure how much overhead will be added by the creation of those new
instances.
If I do that everywhere in my code though, it will make the code
Hi Sean,
Thank you for your reply. I was wondering whether there is a method of reusing
locally-built components without installing them? That is, if I have
successfully built the spark project as a whole, how should I configure it so
that I can incrementally build (only) the spark-examples
Hi,
it looks like what you are trying to use as a Tuple cannot be inferred to be
a Tuple by the compiler. Try to add type declarations and maybe you will
see where things fail.
Tobias
Hi,
How can I view logs in yarn-client mode?
When I insert the following line in the mapToPair function, for example:
System.out.println("TEST TEST");
In local mode, it is displayed on the console.
But in yarn-client mode, it does not appear anywhere.
When I use the YARN resource manager web UI, the size
Just figured it out: using the Graph constructor you can pass the storage level
for both edges and vertices:
Graph.fromEdges(edges, defaultValue = ("", ""),
StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
Thanks to this post: https://issues.apache.org/jira/browse/SPARK-1991
-
--Harihar
Hello,
I'm loading and saving json files into an existing directory with parquet files
using the insertIntoTable method.
If the method fails for some reason (differences in the schema in my case), the
_metadata file of the parquet dir is automatically deleted, rendering the
existing parquet
Thanks for replying. I was unable to figure out how, after I use
jsonFile/jsonRDD, to load the data into a Hive table. Also, I was able to
save the SchemaRDD I got via hiveContext.sql(...).saveAsParquetFile(Path),
i.e. save the SchemaRDD as a parquet file, but when I tried to fetch data from
the parquet file
I created a simple Spark Streaming program - it received numbers and
computed averages and sent the results to Kafka.
It worked perfectly in local mode as well as standalone master/slave mode
across a two-node cluster.
It did not work however in yarn-client or yarn-cluster mode.
The job was
Hi, all
Suppose I have an RDD of (K, V) tuples and I do groupBy on it with the key K.
My question is how to make each groupBy result, which is (K, Iterable[V]), an RDD.
BTW, can we transform it into a DStream in which each groupBy result is an RDD?
Best Regards,
Kevin.
I have been trying the Naive Bayes implementation in Spark's MLlib. During
the testing phase, I wish to eliminate data with low prediction confidence.
My data set primarily consists of form based documents like reports and
application forms. They contain key-value pair type text and hence I assume
While the app is running, you can find logs from the YARN web UI by
navigating to containers through the Nodes link.
After the app has completed, you can use the YARN logs command:
yarn logs -applicationId <your app ID>
-Sandy
On Wed, Nov 19, 2014 at 6:01 PM, innowireless TaeYun Kim
Sorry about the confusion I created. I just started learning this week.
Silly me, I was actually writing the schema to a txt file and expecting
records. This is what I was supposed to do. Also, if you could let me know
about adding the data from the jsonFile/jsonRDD methods of hiveContext to Hive
You can save the results as a parquet file or as a text file and create a Hive
table based on those files.
Daniel
On 20 Nov 2014, at 08:01, akshayhazari akshayhaz...@gmail.com wrote:
Sorry about the confusion I created . I just have started learning this week.
Silly me, I was actually
Make sure the executor cores are set to a value which is >= 2 while
submitting the job (a minimal sketch follows after the quoted message below).
Thanks
Best Regards
On Thu, Nov 20, 2014 at 10:36 AM, kam lee cloudher...@gmail.com wrote:
I created a simple Spark Streaming program - it received numbers and
computed averages and sent the results to
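The sketch below shows the idea (the app name is illustrative): a receiver
occupies one core, so the app needs at least two, e.g. local[2] when running
locally, or two or more executor cores on a cluster.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("AveragesToKafka").setMaster("local[2]")   // >= 2 cores
val ssc = new StreamingContext(conf, Seconds(2))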
What's your use case? You would not generally want to make so many small
RDDs.
On Nov 20, 2014 6:19 AM, Dai, Kevin yun...@ebay.com wrote:
Hi, all
Suppose I have a RDD of (K, V) tuple and I do groupBy on it with the key K.
My question is how to make each groupBy result which is (K,
You can also look at Amazon's Kinesis if you don't want to handle the
pain of maintaining Kafka/Flume infra.
Thanks
Best Regards
On Thu, Nov 20, 2014 at 3:32 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:
Thank you for your answer, I don't know if I typed the question
correctly. But your
1. You don't have to create another SparkContext; you can simply call
*ssc.sparkContext*.
2. Maybe after the transformation on geoData you could do a persist, so
next time it will be read from memory. (A sketch combining both points
follows below the quoted message.)
Thanks
Best Regards
On Thu, Nov 20, 2014 at 6:43 AM, YaoPau jonrgr...@gmail.com wrote:
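Putting Akhil's two points together, a hedged rewrite of the earlier snippet
might look like this (the column indices are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("LogCounter")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val sc = ssc.sparkContext                       // reuse the streaming context's SparkContext

val geoData = sc.textFile("data/geoRegion.csv")
  .map(_.split(','))
  .map(fields => (fields(0), fields(1)))
  .persist()                                    // read once, served from memory afterwards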
Thank you, but I'm only considering free options.
2014-11-20 7:53 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com:
You can also look at the Amazon's kinesis if you don't want to handle the
pain of maintaining kafka/flume infra.
Thanks
Best Regards
On Thu, Nov 20, 2014 at 3:32 AM,
Why not install them? It doesn't take any work and is the only correct way
to do it. mvn install is all you need.
On Nov 20, 2014 2:35 AM, Yiming (John) Zhang sdi...@gmail.com wrote:
Hi Sean,
Thank you for your reply. I was wondering whether there is a method of
reusing locally-built