You will have to order the columns properly before writing, or you can
change the column order in the actual table according to your job.
Thanks
Best Regards
On Tue, Dec 15, 2015 at 1:47 AM, Bob Corsaro wrote:
> Is there anyway to map pyspark.sql.Row columns to JDBC table
Send a mail to user-unsubscr...@spark.apache.org. You can read more here:
http://spark.apache.org/community.html
Thanks
Best Regards
On Tue, Dec 15, 2015 at 3:39 AM, Mithila Joshi
wrote:
> unsubscribe
>
> On Mon, Dec 14, 2015 at 4:49 PM, Tim Barthram
hi,
I think people have reported the same issue elsewhere, and this should
be registered as a bug in Spark:
https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html
Regards,
Gourav
On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta wrote:
> Hi
You can broadcast your json data and then do a map side join. This article
is a good start http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
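The broadcast/map-side join idea can be illustrated without Spark at all: the small table is shipped to every task, and each partition joins locally with no shuffle. A toy Python sketch with made-up data (plain lists stand in for RDD partitions):

```python
# Hypothetical small lookup table; in Spark this is the broadcast variable.
small_table = {"u1": "gold", "u2": "silver"}

def join_partition(rows, lookup):
    # Each partition joins locally against the broadcast lookup.
    # Rows with no match are dropped (inner-join semantics).
    return [(k, v, lookup[k]) for k, v in rows if k in lookup]

big_partition = [("u1", 100), ("u2", 200), ("u3", 300)]
joined = join_partition(big_partition, small_table)
# "u3" has no entry in the lookup, so it is dropped
```

Because the lookup is local to each task, no data from the big side ever crosses the network.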
Thanks
Best Regards
On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov
wrote:
> I have big folder having ORC files. Files
Something like this? This one uses ZLIB compression; you can replace
the decompression logic with a GZip one in your case.
compressedStream.map(x => {
  val inflater = new Inflater()
  inflater.setInput(x.getPayload)
  val decompressedData = new Array[Byte](x.getPayload.size * 2)
  val count = inflater.inflate(decompressedData)
  inflater.end()
  decompressedData.take(count)
})
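For reference, the same inflate round-trip in plain Python using the zlib module (illustrative payload, not the original data):

```python
import zlib

# Round-trip: deflate a payload, then inflate it back, mirroring the
# Inflater-based logic above. zlib.decompress sizes its output buffer
# automatically, so no fixed-size array is needed.
payload = zlib.compress(b"some record bytes")
decompressed = zlib.decompress(payload)
```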
Hi,
I have a table which is directly from S3 location and even a self join on
that cached table is causing the data to be read from S3 again.
The query plan is mentioned below:
== Parsed Logical Plan ==
Aggregate [count(1) AS count#1804L]
Project [user#0,programme_key#515]
Join Inner,
Thank you, Luciano, Shixiong.
I thought the "_2.11" part referred to the Kafka version - an
unfortunate coincidence.
Indeed
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.5.2.jar
my_kafka_streaming_wordcount.py
OR
spark-submit --packages
Is there any difference between the following snippets:
val df = hiveContext.createDataFrame(rows, schema)
df.registerTempTable("myTable")
df.cache()
and
val df = hiveContext.createDataFrame(rows, schema)
df.registerTempTable("myTable")
hiveContext.cacheTable("myTable")
-Sahil
I have never tried this, but there are YARN client APIs that you can use in
your Spark program to get the application id.
Here is the link to the YarnClient javadoc:
http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html
getApplications() is the method for your
Which version of spark are you using? You can test this by opening up a
spark-shell, firing a simple job (sc.parallelize(1 to 100).collect()) and
then accessing the
http://sigmoid-driver:4040/api/v1/applications/Spark%20shell/jobs
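A quick way to hit that endpoint from Python (the host name here is hypothetical; only the URL construction is shown, and the actual fetch is left commented out):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def jobs_url(host, app_name, port=4040):
    # Build the REST endpoint; the app name must be URL-encoded
    # ("Spark shell" -> "Spark%20shell").
    return f"http://{host}:{port}/api/v1/applications/{quote(app_name)}/jobs"

url = jobs_url("sigmoid-driver", "Spark shell")
# jobs = json.load(urlopen(url))  # returns the list of job objects as JSON
```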
Thanks
Best Regards
On Tue, Dec 15, 2015
*First you create the HBase configuration:*
val hbaseTableName = "paid_daylevel"
val hbaseColumnName = "paid_impression"
val hconf = HBaseConfiguration.create()
hconf.set("hbase.zookeeper.quorum", "sigmoid-dev-master")
hconf.set("hbase.zookeeper.property.clientPort", "2181") // 2181 is the ZooKeeper default; adjust for your cluster
If the port 7077 is open for public on your cluster, that's all you need to
take over the cluster. You can read a bit about it here
https://www.sigmoid.com/securing-apache-spark-cluster/
You can also look at this small exploit I wrote
https://www.exploit-db.com/exploits/36562/
Thanks
Best
I am trying to use updateStateByKey but am receiving the following error
(Spark version 1.4.0).
Can someone please point out what might be the possible reason for this
error?
*The method
updateStateByKey(Function2)
in the type JavaPairDStream is
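For context, updateStateByKey folds each batch's new values into a per-key state using a user-supplied function. A minimal Python model of those semantics (not Spark code, just the idea):

```python
def update_state_by_key(batches, update_fn):
    # Model of updateStateByKey: for each micro-batch, group new values by
    # key, then call update_fn(new_values, previous_state) per key.
    state = {}
    for batch in batches:
        grouped = {}
        for k, v in batch:
            grouped.setdefault(k, []).append(v)
        for k, vs in grouped.items():
            state[k] = update_fn(vs, state.get(k))
    return state

# Running count per key across two micro-batches.
counts = update_state_by_key(
    [[("a", 1), ("a", 1)], [("a", 1), ("b", 1)]],
    lambda new, prev: (prev or 0) + sum(new),
)
# counts -> {"a": 3, "b": 1}
```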
Awesome, thanks for the PR Koert!
/Anders
On Thu, Dec 17, 2015 at 10:22 PM Prasad Ravilla wrote:
> Thanks, Koert.
>
> Regards,
> Prasad.
>
> From: Koert Kuipers
> Date: Thursday, December 17, 2015 at 1:06 PM
> To: Prasad Ravilla
> Cc: Anders Arpteg, user
>
> Subject: Re:
Did you happen to have a look at this?
https://issues.apache.org/jira/browse/SPARK-9629
Thanks
Best Regards
On Thu, Dec 17, 2015 at 12:02 PM, yaoxiaohua wrote:
> Hi guys,
>
> I have two nodes used as spark master, spark1,spark2
>
> Spark1.4.0
>
> Jdk
Hi All,
Imagine I have production Spark Streaming Kafka (direct connection)
subscriber and publisher jobs running, which publish and subscribe (receive)
data from a Kafka topic, and I save one day's worth of data using
dstream.slice to a Cassandra daily table (so I create the daily table before
Hello there
I have the same requirement.
I submit a streaming job with yarn-cluster mode.
If I want to shut down this endless YARN application, I have to find out the
application id myself and use "yarn application -kill " to kill
the application.
Therefore, if I can get returned application
Spark 1.5.2
dfOld.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
//
// do something
//
dfNew.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
Now when I use the "oldTableName" table I do get the latest contents
from dfNew but do the
CacheManager#cacheQuery() is called where:
* Caches the data produced by the logical representation of the given
[[Queryable]].
...
val planToCache = query.queryExecution.analyzed
if (lookupCachedData(planToCache).nonEmpty) {
Is the schema for dfNew different from that of dfOld ?
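The lookup in CacheManager is essentially a linear scan for a plan that sameResult-matches an already-cached one. A toy Python sketch of that idea (plain equality stands in for sameResult; the plan strings are made up):

```python
class CachedData:
    def __init__(self, plan, data):
        self.plan = plan
        self.data = data

cached_data = []

def lookup_cached_data(plan):
    # Linear scan, like cachedData.find(cd => plan.sameResult(cd.plan));
    # here plain equality stands in for sameResult.
    return next((cd for cd in cached_data if cd.plan == plan), None)

def cache_query(plan, data):
    # Mirrors CacheManager#cacheQuery: skip if an equivalent plan is cached.
    if lookup_cached_data(plan) is None:
        cached_data.append(CachedData(plan, data))

cache_query("scan(t1)", [1, 2, 3])
cache_query("scan(t1)", [4, 5, 6])  # ignored: same plan already cached
```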
Hi,
the attached DAG shows that, for the same table (self join), Spark is
unnecessarily getting data from S3 for one side of the join, whereas it is
able to use the cache for the other side.
Regards,
Gourav
On Fri, Dec 18, 2015 at 10:29 AM, Gourav Sengupta wrote:
> Hi,
>
The picture is a bit hard to read.
I did a brief search but haven't found JIRA for this issue.
Consider logging a SPARK JIRA.
Cheers
On Fri, Dec 18, 2015 at 4:37 AM, Gourav Sengupta
wrote:
> Hi,
>
> the attached DAG shows that for the same table (self join) SPARK
If you're really doing a daily batch job, have you considered just using
KafkaUtils.createRDD rather than a streaming job?
On Fri, Dec 18, 2015 at 5:04 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Hi All,
>
> Imagine I have a Production spark streaming kafka (direct connection)
You'll need to keep track of the offsets.
On Fri, Dec 18, 2015 at 9:51 AM, sri hari kali charan Tummala <
kali.tumm...@gmail.com> wrote:
> Hi Cody,
>
> KafkaUtils.createRDD totally make sense now I can run my spark job once in
> 15 minutes extract data out of kafka and stop ..., I rely on kafka
Hello everyone,
I am testing some parallel program submissions to a standalone cluster.
Everything works alright; the problem is that, for some reason, I can't submit
more than 3 programs to the cluster.
The fourth one, whether legacy or REST, simply hangs until one of the first
three completes.
I
See this thread:
http://search-hadoop.com/m/q3RTtEor1vYWbsW
which mentioned:
SPARK-11105 Disitribute the log4j.properties files from the client to the
executors
FYI
On Fri, Dec 18, 2015 at 7:23 AM, Kalpesh Jadhav <
kalpesh.jad...@citiustech.com> wrote:
> Hi all,
>
>
>
> I am new to spark, I am
Hi Cody,
KafkaUtils.createRDD totally makes sense now; I can run my Spark job once
every 15 minutes, extract data out of Kafka, and stop. I rely on the Kafka
offset for incremental data, am I right? So no duplicate data will be returned.
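The offset bookkeeping being discussed can be sketched in plain Python (no Kafka client; a list stands in for the topic, and in practice the committed offset would live in durable storage):

```python
# Each batch run reads from the last committed offset and commits the new
# high-water mark afterwards, so re-runs never return duplicate records.
def run_batch(log, committed_offset):
    batch = log[committed_offset:]        # records since the last run
    new_offset = committed_offset + len(batch)
    return batch, new_offset              # persist new_offset durably

log = ["a", "b", "c", "d"]
batch1, off = run_batch(log, 0)    # first run drains the whole topic
log += ["e"]                       # new data arrives
batch2, off = run_batch(log, off)  # second run sees only the new record
```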
Thanks
Sri
On Fri, Dec 18, 2015 at 2:41 PM, Cody Koeninger
Hi all,
I am new to Spark, and I am trying to use log4j for logging my application,
but the logs are not getting written to the specified file.
I created the application using Maven and kept the log.properties file in
the resources folder.
The application is written in Scala.
If there is any
Hi,
I am running into a performance issue when joining data frames created from
Avro files using the spark-avro library.
The data frames are created from 120K Avro files, and the total size is
around 1.5 TB.
The two data frames are very large, with billions of records.
The join for these two
Hi,
I am trying to configure log4j on an AWS EMR 4.2 Spark cluster for a
streaming job running in client mode.
I changed
/etc/spark/conf/log4j.properties
to use a FileAppender. However the INFO logging still goes to console.
Thanks for any suggestions,
--
Nick
From the console:
This method in CacheManager:
private[sql] def lookupCachedData(plan: LogicalPlan): Option[CachedData]
= readLock {
cachedData.find(cd => plan.sameResult(cd.plan))
led me to the following in
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
:
def
That was exactly the problem, Michael; more details in this post:
http://stackoverflow.com/questions/34184079/cannot-run-queries-in-sqlcontext-from-apache-spark-sql-1-5-2-getting-java-lang
*Matheus*
On Wed, Dec 9, 2015 at 4:43 PM, Michael Armbrust
wrote:
>
Hi,
My Spark batch job sometimes seems to hang for a long time before it
starts the next stage/exits. Basically it happens when a stage has a
mapPartitions/foreachPartition. Any idea as to why this is
happening?
Thanks,
Swetha
Need some clarification about the documentation. According to Spark doc
"the default interval is a multiple of the batch interval that is at least 10
seconds. It can be set by using dstream.checkpoint(checkpointInterval).
Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream
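The quoted rule (the smallest multiple of the batch interval that is at least 10 seconds) can be written as a tiny helper. This is an illustration of the documented rule, not Spark's actual code:

```python
import math

def default_checkpoint_interval(batch_interval_s, floor_s=10):
    # Smallest multiple of the batch interval that is >= 10 seconds,
    # matching the documented default checkpoint interval.
    return math.ceil(floor_s / batch_interval_s) * batch_interval_s

# A 3-second batch interval checkpoints every 12 s; a 5-second one every 10 s.
```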
Thanks Ted!
Yes, The schema might be different or the same.
What would be the answer for each situation?
On Fri, Dec 18, 2015 at 6:02 PM, Ted Yu wrote:
> CacheManager#cacheQuery() is called where:
> * Caches the data produced by the logical representation of the given
>
So I looked at the function; my only worry is that the cache should be
cleared if I'm overwriting the cache with the same table name. I did this
experiment and the cache shows the table as not cached, but I want to confirm
this. In addition to not using the old table values, is it actually
Can you try the latest 1.6.0 RC which includes SPARK-1 ?
Cheers
On Fri, Dec 18, 2015 at 7:38 AM, Prasad Ravilla wrote:
> Hi,
>
> I am running into performance issue when joining data frames created from
> avro files using spark-avro library.
>
> The data frames are
When a second attempt is made to cache df3, which has the same schema as the
first DataFrame, you would see the warning below:
scala> sqlContext.cacheTable("t1")
scala> sqlContext.isCached("t1")
res5: Boolean = true
scala> sqlContext.sql("select * from t1").show
+---+---+
| a| b|
+---+---+
| 1| 1|
During spark-submit when running hive on spark I get:
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated
Caused by: java.lang.IllegalAccessError: tried to access method
From the UI I see two rows for this on a streaming application:

RDD Name: In-memory table myColorsTable
Storage Level: Memory Deserialized 1x Replicated
Cached Partitions: 2
Fraction Cached: 100%
Size in Memory: 728.2 KB
Size in ExternalBlockStore: 0.0 B
Size on Disk: 0.0 B

RDD Name: In-memory table myColorsTable
Storage Level: Memory
Looks like you have a reference to some Akka class. Could you post your code?
Best Regards,
Shixiong Zhu
2015-12-17 23:43 GMT-08:00 Pankaj Narang :
> I am encountering below error. Can somebody guide ?
>
> Something similar is one this link
>
I created the following data, data.file
1 1
1 2
1 3
2 4
3 5
4 6
5 7
6 1
7 2
8 8
The following code:
def parse_line(line):
    tokens = line.split(' ')
    return (int(tokens[0]), int(tokens[1])), 1.0
lines = sc.textFile('./data.file')
linesTest = sc.textFile('./data.file')
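Checked outside Spark, parse_line maps each whitespace-separated pair in the file to a ((src, dst), 1.0) tuple (redefined locally here so it runs without a SparkContext):

```python
def parse_line(line):
    tokens = line.split(' ')
    return (int(tokens[0]), int(tokens[1])), 1.0

# First few lines of data.file, parsed the way sc.textFile(...).map would.
pairs = [parse_line(l) for l in ["1 1", "1 2", "8 8"]]
# -> [((1, 1), 1.0), ((1, 2), 1.0), ((8, 8), 1.0)]
```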
Changing the equality check from “<=>” to “===” solved the problem.
Performance skyrocketed.
I am wondering why “<=>” causes a performance degradation.
val dates = new RetailDates()
val dataStructures = new DataStructures()
// Reading CSV Format input files -- retailDates
// This DF has 75 records
val
See https://issues.apache.org/jira/browse/SPARK-7301
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Ambiguous-references-to-a-field-set-in-a-partitioned-table-AND-the-data-tp22325p25740.html
Sent from the Apache Spark User List mailing list archive at
This is fixed in Spark 1.6.
On Fri, Dec 18, 2015 at 3:06 PM, Prasad Ravilla wrote:
> Changing equality check from “<=>”to “===“ solved the problem.
> Performance skyrocketed.
>
> I am wondering why “<=>” cause a performance degrade?
>
> val dates = new RetailDates()
> val
Hi,
How do we run multiple Spark jobs that take Spark Streaming data as
input, as a workflow in Oozie? We have to run our streaming job first and
then have a workflow of Spark batch jobs process the data.
Any suggestions on this would be of great help.
Thanks!
Hi Roy,
I believe Spark just gets its application ID from YARN, so you can just do
`sc.applicationId`.
-Andrew
2015-12-18 0:14 GMT-08:00 Deepak Sharma :
> I have never tried this but there is yarn client api's that you can use in
> your spark program to get the
Hi Rastan,
Unless you're using off-heap memory or starting multiple executors per
machine, I would recommend the r3.2xlarge option, since you don't actually
want gigantic heaps (100GB is more than enough). I've personally run Spark
on a very large scale with r3.8xlarge instances, but I've been
Found the issue, a conflict between setting Java options in both
spark-defaults.conf and in the spark-submit.
--
Nick
From: Afshartous, Nick
Sent: Friday, December 18, 2015 11:46 AM
To: user@spark.apache.org
Subject: Configuring
You are right. "checkpointInterval" is only for data checkpointing.
"metadata checkpoint" is done for each batch. Feel free to send a PR to add
the missing doc.
Best Regards,
Shixiong Zhu
2015-12-18 8:26 GMT-08:00 Lan Jiang :
> Need some clarification about the documentation.
I am going to say no, but I have not actually tested this. Just going on
this line in the docs:
http://spark.apache.org/docs/latest/configuration.html
spark.driver.extraClassPath (none): Extra classpath entries to
prepend to the classpath of the driver.
Note: In client mode, this config
Hi Saif, have you verified that the cluster has enough resources for all 4
programs?
-Andrew
2015-12-18 5:52 GMT-08:00 :
> Hello everyone,
>
> I am testing some parallel program submission to a stand alone cluster.
> Everything works alright, the problem is, for
Hi Antony,
The configuration to enable dynamic allocation is per-application.
If you only wish to enable this for one of your applications, just set
`spark.dynamicAllocation.enabled` to true for that application only. The
way it works under the hood is that application will start sending
You need to use window functions to get this kind of behavior. Or use max
and a struct (
http://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record)
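The max-over-struct trick relies on lexicographic ordering: taking the max of (timestamp, value) pairs per key yields the latest record. A plain Python sketch of that idea with made-up rows:

```python
# Hypothetical (key, timestamp, value) rows; we want the latest value per key.
rows = [
    ("a", 1, "old"), ("a", 3, "new"),
    ("b", 2, "only"),
]

def latest_per_key(rows):
    best = {}
    for key, ts, val in rows:
        # Tuples compare lexicographically, so taking the max of (ts, val)
        # mirrors max(struct(ts, val)) in Spark SQL.
        if key not in best or (ts, val) > best[key]:
            best[key] = (ts, val)
    return {k: v for k, (_, v) in best.items()}
```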
On Thu, Dec 17, 2015 at 11:55 PM, Timothée Carayol <
timothee.cara...@gmail.com> wrote:
> Hi all,
>
> I tried to do something
Andrew, it's going to be 4 executor JVMs on each r3.8xlarge.
Rastan, you can run a quick test using an EMR Spark cluster on spot instances
and see which configuration works better. Without tests it is all
speculation.
On Dec 18, 2015 1:53 PM, "Andrew Or" wrote:
> Hi Rastan,
Hi - I'm running Spark Streaming using PySpark 1.3 in yarn-client mode on CDH
5.4.4. The job sometimes runs a full 24hrs, but more often it fails
sometime during the day.
I'm getting several vague errors that I don't see much about when searching
online:
- py4j.Py4JException: Error while
Hello experts, I am new to Spark. Can anyone please explain how to fetch
data from an HBase table in Spark with Java?
Thanks in advance.