Hi,
I observed that we have installed only one cluster, and when submitting a job
as yarn-cluster I get the error below. Is this caused by the installation
being only one cluster?
Please correct me; if this is not the cause, then why am I not able to run in
cluster mode?
The spark-submit command is:
spark-submit
Check your spam folder or any mail filters.
On Wed, Apr 8, 2015 at 2:17 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Hi All, how can I subscribe myself to this group so that every mail sent to
this group comes to me as well?
I already sent a request to user-subscr...@spark.apache.org, still I am not
Hi folks,
I am writing to ask how to filter and partition a set of files through Spark.
The situation is that I have N big files (which cannot fit into a single
machine), and each line of the files starts with a category (say Sport, Food,
etc.), while there are actually fewer than 100 categories. I need a
I have a Spark stage that has 8 tasks; 7/8 have completed. However, 1 task
is failing with "Cannot find address".
Aggregated Metrics by Executor: Executor ID | Address | Task Time |
Total Tasks | Failed Tasks | Succeeded Tasks | Shuffle Read Size / Records |
Shuffle Write Size / Records | Shuffle Spill (Memory) | Shuffle Spill
Hi All, how can I subscribe myself to this group so that every mail sent to
this group comes to me as well?
I already sent a request to user-subscr...@spark.apache.org, still I am not
getting mail sent to this group by other people.
Regards
Jeetendra
Hi Cheng,
I tried both these patches, but it seems they still do not resolve my issue.
I found that most of the time is spent on this line in newParquet.scala:
ParquetFileReader.readAllFootersInParallel(
sparkContext.hadoopConfiguration, seqAsJavaList(leaves), taskSideMetaData)
which needs to read all the files.
I use EMR 3.3.1 which comes with Java 7. Do you think that this may cause
the issue? Did you test it with Java 8?
Thanks
From: Nick Pentreath [mailto:nick.pentre...@gmail.com]
Sent: Tuesday, April 07, 2015 5:52 PM
To: Puneet Kumar Ojha
Cc: user@spark.apache.org
Subject: Re: Difference between textFile Vs hadoopFile (textInputFormat) on
HDFS data
There is no difference - textFile calls hadoopFile with a
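For reference, in the Spark 1.x sources SparkContext.textFile is a thin wrapper over hadoopFile; a sketch of the equivalence (assumes an existing SparkContext `sc` and a `path` variable):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// sc.textFile(path) is essentially equivalent to:
val lines = sc.hadoopFile(path, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map(pair => pair._2.toString)  // drop the byte offset key, keep the line
```

So the two paths read the same bytes through the same InputFormat; textFile just adds the value-to-String conversion for you.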
Spark Version 1.3
Command:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
Hi Michael,
In fact, I find that all workers are hanging when SQL/DF join is running.
So I picked the master and one of the workers. jstack is the following:
Master
This means the Spark workers exited with code 15; probably nothing
YARN-related itself (unless there are classpath-related problems).
Have a look at the logs of the app/container via the resource manager. You can
also increase the time that logs are kept on the nodes themselves to something
I will look into this today.
On Wed, Apr 8, 2015 at 7:35 AM, Stefano Parmesan parme...@spaziodati.eu wrote:
Did anybody by any chance have a look at this bug? It keeps happening to
me, and it's quite blocking. I would like to understand if there's something
wrong in what I'm doing, or
Hi All,
In some cases, I get the exception below when I run Spark in local mode (I
haven't seen this in a cluster). This is weird, but it also affects my local
unit tests (it does not always happen, but usually about once per 4-5 runs).
From the stack, it looks like the error happens when creating the context,
Hi folks, I am noticing a pesky and persistent warning in my logs (this is
from Spark 1.2.1):
15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception
trying to get groups for user anonymous
org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
at
+1
Interestingly, I ran into exactly the same issue yesterday. I couldn’t
find any documentation about which project to include as a dependency in
build.sbt to use HiveThriftServer2. Would appreciate help.
Mohammed
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Wednesday, April 8,
Did anybody by any chance have a look at this bug? It keeps happening to
me, and it's quite blocking. I would like to understand if there's something
wrong in what I'm doing, or whether there's a workaround or not.
Thank you all,
--
Dott. Stefano Parmesan
Backend Web Developer and Data Lover
You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
Cheers
On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom
eric.eijkelenb...@gmail.com wrote:
Hi guys
I’ve got:
- 180 days of log data in Parquet.
- Each day is stored in a separate folder in S3.
- Each day
Yes, should be fine since you are running on YARN. This is probably more
appropriate for the cdh-user list.
On Apr 8, 2015 9:35 AM, roy rp...@njit.edu wrote:
Hi,
We have a cluster running on CDH 5.3.2 and Spark 1.2 (which is the current
version in CDH 5.3.2), but we want to try Spark 1.3 without
spark.eventLog.dir should contain the full HDFS URL. In general,
this should be sufficient:
spark.eventLog.dir=hdfs:/user/spark/applicationHistory
On Wed, Apr 8, 2015 at 6:45 AM, Vijayasarathy Kannan kvi...@vt.edu wrote:
I am trying to run a Spark application using spark-submit on a cluster
There are a couple of options. Increase timeout (see Spark configuration).
Also see past mails in the mailing list.
Another option you may try (I have a gut feeling that it may work, but I am
not sure) is calling GC on the driver periodically. The cleaning up of stuff
is tied to the GCing of RDD objects.
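A rough sketch of periodically triggering GC on the driver (the interval is arbitrary; this is only a workaround idea along the lines suggested above, not a recommendation):

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Periodically suggest a GC on the driver so that out-of-scope RDD objects
// get finalized and their cleanup hooks have a chance to run.
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(new Runnable {
  def run(): Unit = System.gc()
}, 10, 10, TimeUnit.MINUTES)
```

Remember to shut the executor down (gcScheduler.shutdown()) when the application stops.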
Hi,
Does SparkContext's textFile() method handle files with Unicode characters?
How about files in UTF-8 format?
Going further, is it possible to specify an encoding to the method? If not,
what should one do if the files to be read are in some other encoding?
Thanks,
arun
How do I build Spark SQL Avro Library for Spark 1.2 ?
I was following this https://github.com/databricks/spark-avro and was able
to build spark-avro_2.10-1.0.0.jar by simply running sbt/sbt package from
the project root.
but we are on Spark 1.2 and need compatible spark-avro jar.
Any idea how
Hi,
We have a cluster running on CDH 5.3.2 and Spark 1.2 (which is the current
version in CDH 5.3.2), but we want to try Spark 1.3 without breaking the
existing setup. So, is it possible to have Spark 1.3 on the existing setup?
Thanks
--
View this message in context:
If you are using the Spark Standalone deployment, make sure you set
WORKER_MEMORY over 20G, and that you do have 20G of physical memory.
Yong
Date: Tue, 7 Apr 2015 20:58:42 -0700
From: li...@adobe.com
To: user@spark.apache.org
Subject: EC2 spark-submit --executor-memory
Dear Spark team,
I'm
Please email user-subscr...@spark.apache.org
On Apr 8, 2015, at 6:28 AM, Idris Ali psychid...@gmail.com wrote:
Hi guys
I’ve got:
180 days of log data in Parquet.
Each day is stored in a separate folder in S3.
Each day consists of 20-30 Parquet files of 256 MB each.
Spark 1.3 on Amazon EMR
This makes approximately 5000 Parquet files with a total size of 1.5 TB.
My code:
val in =
To use the HiveThriftServer2.startWithContext, I thought one would use the
following artifact in the build:
"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"
But I am unable to resolve the artifact. I do not see it in maven central
or any other repo. Do I need to build Spark and
I am trying to run a Spark application using spark-submit on a cluster
using Cloudera manager. I get the error
Exception in thread main java.io.IOException: Error in creating log
directory: file:/user/spark/applicationHistory//app-20150408094126-0008
Adding the below lines in
It should be noted I'm a newbie to Spark, so please have patience ...
I'm trying to convert an existing application over to Spark and am running
into some high-level questions that I can't seem to resolve, possibly
because what I'm trying to do is not supported.
In a nutshell, as I process
Hi all,
I am using Spark Streaming to monitor an S3 bucket for objects that contain
JSON. I want
to import that JSON into Spark SQL DataFrame.
Here's my current code:
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import json
from pyspark.sql
Thanks for the report. We improved the speed here in 1.3.1 so would be
interesting to know if this helps. You should also try disabling schema
merging if you do not need that feature (i.e. all of your files are the
same schema).
sqlContext.load(path, "parquet", Map("mergeSchema" -> "false"))
On Wed,
I think your thread dump for the master is actually just a thread dump for
SBT that is waiting on a forked driver program.
...
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x7fed624ff528 (a java.lang.UNIXProcess)
at
Hi Muhammad,
There are lots of ways to do it. My company actually develops a text
mining solution which embeds a very fast Approximate Neighbours solution
(a demo with real time queries on the wikipedia dataset can be seen at
wikinsights.org). For the record, we now prepare a dataset of 4.5
Sorry guys. I didn't realize that
https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
You can publish locally in the meantime (sbt/sbt publishLocal).
On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller moham...@glassbeam.com
wrote:
+1
Interestingly, I ran into exactly
Back to the user list so everyone can see the result of the discussion...
Ah. It all makes sense now. The issue is that when I created the parquet
files, I included an unnecessary directory name (data.parquet) below the
partition directories. It’s just a leftover from when I started with
I am trying to start the worker by:
sbin/start-slave.sh spark://ip-10-241-251-232:7077
In the logs it's complaining about:
Master must be a URL of the form spark://hostname:port
I also have this in spark-defaults.conf
spark.master spark://ip-10-241-251-232:7077
Did I miss
some additional context:
Since I am using features of Spark 1.3.0, I have downloaded Spark 1.3.0 and
used spark-submit from there.
The cluster is still on Spark 1.2.0.
So it looks to me like, at runtime, the executors could not find some
libraries of Spark 1.3.0, even though I ran
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop almost
exclusively supports Linux, UTF-8 is the only encoding supported, as it is
the default one on Linux.
If you have data in another encoding, you may want to vote for this
JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-232
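A common workaround (a sketch, assuming an existing SparkContext `sc`, a `path` variable, and an example charset of ISO-8859-1) is to read the raw Hadoop Text records yourself and decode their bytes with an explicit charset, instead of relying on Text.toString, which assumes UTF-8:

```scala
import java.nio.charset.Charset

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Decode each line's raw bytes with an explicit (non-UTF-8) charset.
val lines = sc.hadoopFile(path, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map { case (_, text) =>
    new String(text.getBytes, 0, text.getLength, Charset.forName("ISO-8859-1"))
  }
```

Note that text.getBytes returns the Text object's backing array, so only the first getLength bytes are valid; the String constructor above accounts for that.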
Michael,
Thank you!
Looks like the sbt build is broken for 1.3. I downloaded the source code for
1.3, but I get the following error a few minutes after I run “sbt/sbt
publishLocal”
[error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency:
I am seeing the following, is this because of my maven version?
15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
ip-10-241-251-232.us-west-2.compute.internal):
java.io.InvalidClassException: org.apache.spark.Aggregator; local class
incompatible: stream classdesc
I am trying to unit test some code which takes an existing HiveContext and
uses it to execute a CREATE TABLE query (among other things). Unfortunately
I've run into some hurdles trying to unit test this, and I'm wondering if
anyone has a good approach.
The metastore DB is automatically created in
When I call transform or foreachRDD on a DStream, I keep getting an
error that I have an empty RDD, which makes sense since my batch interval
may be smaller than the rate at which new data comes in. How do I guard
against it?
Thanks,
Vadim
Since we are running in local mode, won't all the executors be in the same
JVM as the driver?
Thanks
NB
On Wed, Apr 8, 2015 at 1:29 PM, Tathagata Das t...@databricks.com wrote:
It does take effect on the executors, not on the driver. Which is okay
because executors have all the data and
I am loading some avro data into spark using the following code:
sqlContext.sql("CREATE TEMPORARY TABLE foo USING com.databricks.spark.avro
OPTIONS (path 'hdfs://*.avro')")
The avro data contains some binary fields that get translated to the
BinaryType data type. I am struggling with how to use
A more generic version of the question below:
Is it possible to append a column to an existing DataFrame at all? I
understand that this is not an easy task in the Spark environment, but is
there any workaround?
Hi all,
I figured it out! The DataFrames and SQL example in Spark Streaming docs
were useful.
Best,
Vadim
On Wed, Apr 8, 2015 at 2:38 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com
wrote:
Hi all,
I am using Spark Streaming to monitor an S3 bucket for objects that
contain JSON. I want
Hi,
If I perform a sortByKey(true, 2).saveAsTextFile(filename) on a cluster,
will the data be sorted per partition, or in total. (And is this
guaranteed?)
Example:
Input 4,2,3,6,5,7
Sorted per partition:
part-0: 2,3,7
part-1: 4,5,6
Sorted in total:
part-0: 2,3,4
part-1: 5,6,7
You could convert the DF to an RDD, then add the new column in a map phase or
in a join, and then convert back to a DF. I know this is not an elegant
solution, and maybe it is not a solution at all. :) But this is the first
thing that popped into my mind.
I am also new to the DF API.
Best
Bojan
On Apr 9, 2015 00:37,
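A sketch of that RDD round-trip suggestion (the column names and the doubling logic are made up for illustration; API as in Spark 1.3, assuming an existing SQLContext `sqlContext` and a DataFrame `df` with an Int column named "value"):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Append a derived value to every Row, then rebuild the DataFrame with a
// schema extended by the matching new field.
val withExtra = df.rdd.map { row =>
  Row.fromSeq(row.toSeq :+ (row.getAs[Int]("value") * 2))
}
val newSchema = StructType(df.schema.fields :+
  StructField("value_doubled", IntegerType, nullable = false))
val df2 = sqlContext.createDataFrame(withExtra, newSchema)
```

The main cost of this approach is losing the DataFrame optimizer for the map step; the schema and the Row contents must be kept in sync by hand.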
Thanks TD. I believe that might have been the issue. Will try for a few
days after passing in the GC option on the java command line when we start
the process.
Thanks for your timely help.
NB
On Wed, Apr 8, 2015 at 6:08 PM, Tathagata Das t...@databricks.com wrote:
Yes, in local mode the
Yes, in local mode the driver and the executor will be the same process. And
in that case the Java options in the SparkConf configuration will not
work.
On Wed, Apr 8, 2015 at 1:44 PM, N B nb.nos...@gmail.com wrote:
Since we are running in local mode, won't all the executors be in the same
JVM
See the scaladoc from OrderedRDDFunctions.scala :
* Sort the RDD by key, so that each partition contains a sorted range of
the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an
ordered list of records
* (in the `save` case, they will be written to
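A minimal sketch of what that scaladoc means for the earlier 4,2,3,6,5,7 example (the output path is arbitrary; exact partition boundaries depend on the RangePartitioner's sampling):

```scala
// sortByKey uses a RangePartitioner, so keys are globally ordered across the
// output partitions: every key in part-00000 is <= every key in part-00001.
val rdd = sc.parallelize(Seq(4, 2, 3, 6, 5, 7).map(x => (x, ())))
val sorted = rdd.sortByKey(ascending = true, numPartitions = 2)
// Concatenating the saved part files in partition order yields 2,3,4,5,6,7.
sorted.saveAsTextFile("sorted-out")
```

So the answer to the original question is "sorted in total": reading the part files in order gives one globally sorted sequence, not just independently sorted partitions.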
bq. one is Oracle and the other is OpenJDK
I don't have experience with mixed JDK's.
Can you try with using single JDK ?
Cheers
On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
For the build I am using java version 1.7.0_65 which seems to be the
same as the one on
Hi Eric - Would you mind trying either to disable schema merging as
Michael suggested, or to disable the new Parquet data source via
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
Cheng
On 4/9/15 2:43 AM, Michael Armbrust wrote:
Thanks for the report. We improved the speed
Hey Patrick, Michael and Todd,
Thank you for your help!
As you guys recommended, I did a local install and got my code to compile.
As an FYI, on my local machine the sbt build fails even if I add -DskipTests,
so I used mvn.
Mohammed
From: Patrick Wendell [mailto:patr...@databricks.com]
Sent:
On 4/9/15 3:09 AM, Michael Armbrust wrote:
Back to the user list so everyone can see the result of the discussion...
Ah. It all makes sense now. The issue is that when I created the
parquet files, I included an unnecessary directory name
(data.parquet) below the partition
Aah yes. The jsonRDD method needs to walk through the whole RDD to
understand the schema, and does not work if there is no data in it. Making
sure there is data in it using take(1) should work.
TD
It's because your tests are running in parallel and you can only have one
context running at a time.
I wanted to run groupBy(partition) but this is not working.
Here the first part of pairvendorData will be repeated with multiple second
parts. Both are objects; do I need to override equals and hashCode?
Is groupBy fast enough?
JavaPairRDD<VendorRecord, VendorRecord> pairvendorData
Thanks!
arun
On Wed, Apr 8, 2015 at 10:51 AM, java8964 java8...@hotmail.com wrote:
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop almost
exclusively supports Linux, UTF-8 is the only encoding supported, as it is
the default one on Linux.
If you have other encoding data,
Thanks TD!
On Apr 8, 2015, at 9:36 PM, Tathagata Das t...@databricks.com wrote:
Aah yes. The jsonRDD method needs to walk through the whole RDD to understand
the schema, and does not work if there is no data in it. Making sure there is
data in it using take(1) should work.
TD
We noticed similar perf degradation using Parquet (outside of Spark) and it
happened due to merging of multiple schemas. Would be good to know if
disabling merge of schema (if the schema is same) as Michael suggested
helps in your case.
On Wed, Apr 8, 2015 at 11:43 AM, Michael Armbrust
Please take a look at zipWithIndex() of RDD.
Cheers
On Wed, Apr 8, 2015 at 3:40 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Hi All, I have an RDD<SomeObject> and I want to convert it to
RDD<(sequenceNumber, SomeObject)>, where the sequence number is 1 for the
first SomeObject, 2 for the second SomeObject
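A sketch using zipWithIndex for this (the sample data is made up; zipWithIndex assigns 0-based Long indices, so shift by one for 1-based numbering):

```scala
val rdd = sc.parallelize(Seq("first", "second", "third"))
// zipWithIndex assigns each element its 0-based position in the RDD.
val numbered = rdd.zipWithIndex().map { case (obj, i) => (i + 1, obj) }
// numbered contains (1,"first"), (2,"second"), (3,"third")
```

Note that zipWithIndex triggers a job to count the elements per partition when the RDD has more than one partition.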
The Thrift server doesn't support authentication or Hadoop doAs yet, so
you can simply ignore this warning.
To avoid it, when connecting via JDBC you may specify the user to be the
same user who starts the Thrift server process. For Beeline, use -n
user.
On 4/8/15 11:49 PM, Yana Kadiyska
What is the computation you are doing in the foreachRDD, that is throwing
the exception?
One way to guard against it is to do a take(1) to see if you get back any
data. If there is none, then don't do anything with the RDD.
TD
On Wed, Apr 8, 2015 at 1:08 PM, Vadim Bichutskiy
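A minimal sketch of that take(1) guard (the processing inside the if is hypothetical; assumes an existing SQLContext `sqlContext` and a DStream `dstream` of JSON strings):

```scala
// Skip empty batches before doing per-batch work such as schema inference.
dstream.foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {
    val df = sqlContext.jsonRDD(rdd)  // rdd assumed to be an RDD[String] of JSON
    df.registerTempTable("events")
  }
}
```

take(1) only pulls a single element, so the emptiness check is cheap compared to count() on a large batch.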
Hi All, I have an RDD<SomeObject> and I want to convert it to
RDD<(sequenceNumber, SomeObject)>, where the sequence number is 1 for the
first SomeObject, 2 for the second SomeObject.
Regards
jeet
For the build I am using Java version 1.7.0_65, which seems to be the same
as the one on the Spark host. However, one is Oracle and the other is
OpenJDK. Does that make any difference?
On Wed, Apr 8, 2015 at 1:24 PM, Ted Yu yuzhih...@gmail.com wrote:
What version of Java do you use to build ?
What version of Java do you use to build ?
Cheers
On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am seeing the following, is this because of my maven version?
15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
Which version of Joda are you using ?
Here is snippet of dependency:tree out w.r.t. Joda :
[INFO] +- org.apache.flume:flume-ng-core:jar:1.4.0:compile
...
[INFO] | +- joda-time:joda-time:jar:2.1:compile
FYI
On Wed, Apr 8, 2015 at 12:53 PM, Patrick Grandjean p.r.grandj...@gmail.com
wrote:
Hi,
Please take a look at
sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala :
protected def configure(): Unit = {
  warehousePath.delete()
  metastorePath.delete()
  setConf("javax.jdo.option.ConnectionURL",
    s"jdbc:derby:;databaseName=$metastorePath;create=true")
Hi Mohammed,
I think you just need to add -DskipTests to your build. Here is how I built
it:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
-DskipTests clean package install
build/sbt does, however, fail even when only doing package, which should skip
tests.
I am able to