I can do a
wget http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
and get the file successfully on a shell.
On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas vcsub...@gmail.com wrote:
At least a part of it is due to connection refused, can you check if curl
can reach the
On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda
ar...@sigmoidanalytics.com wrote:
Does the error change when you build with and without the build options?
What do you mean by build options? I'm just doing ./sbt/sbt assembly from
$SPARK_HOME
Did you try using Maven and doing the proxy settings
At least a part of it is due to connection refused, can you check if curl
can reach the URL with proxies -
[FATAL] Non-resolvable parent POM: Could not transfer artifact
org.apache:apache:pom:14 from/to central (
http://repo.maven.apache.org/maven2): Error transferring file: Connection
refused
Install VirtualBox, which runs Linux? That does not help us. We have business
reasons to run it on the Windows operating system, e.g. Windows 2008 R2.
If anybody has done that, please give some advice on what version of Spark and
which version of Hadoop you built Spark against, etc…. Note that
fs.s3a.server-side-encryption-algorithm is honored by s3a support in
Hadoop 2.6.0+ as well.
Cheers
On Thu, Jan 29, 2015 at 6:51 AM, Danny kont...@dannylinden.de wrote:
On Spark 1.2.0 you have the s3a library to work with S3. And there is a
config param named
Spark 1.2 on Hadoop 2.3
Read one big csv file, create a schemaRDD on it and saveAsParquetFile.
It creates a large number of small (~1MB) Parquet part-* files.
Any way to control this so that a smaller number of larger files are created?
Thanks,
Charles,
Thank you very much for another suggestion. Unfortunately I couldn't make
it work that way either. So I downgraded my SolrJ library from 4.10.3 to
4.0.0 [1].
Maybe using the Relocating Classes [2] feature of the Maven Shade plugin could
handle this issue, but I did not want to complicate my pom.xml further,
Hello fellow Sparkians,
Is there some preferred way to have
*some given set-up task run on all workers?* The task at hand isn't a
computational task, then, but rather some initial setup that I want to run for
its *side-effects*. This could be to set up some custom logging settings,
or metrics.
Hello,
SQLContext and HiveContext have a jsonRDD method which accepts an
RDD[String], where the string is a JSON string, and returns a SchemaRDD; it
extends RDD[Row], which is the type you want.
Afterwards you should be able to do a join to keep your tuple.
Best,
Ayoub.
2015-01-29 10:12 GMT+01:00
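For reference, a minimal sketch of that suggestion against the Spark 1.2-era API (run in the spark-shell, so sc already exists; the sample JSON is made up). Keeping the timestamp still needs a join key, which is what the id-embedding hack later in this digest addresses.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Hypothetical input: (timestamp, jsonString) pairs.
val data: RDD[(Long, String)] = sc.parallelize(Seq(
  (1422500000L, """{"msg": "a", "level": 1}"""),
  (1422500060L, """{"msg": "b", "level": 2}""")
))

val sqlContext = new SQLContext(sc)
// Infer a schema from the JSON strings; the result is a SchemaRDD, i.e. an RDD[Row].
val parsed = sqlContext.jsonRDD(data.map(_._2))
parsed.registerTempTable("events")
val selected = sqlContext.sql("SELECT msg FROM events WHERE level > 1")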
Hi Ayoub,
thanks for your mail!
On Thu, Jan 29, 2015 at 6:23 PM, Ayoub benali.ayoub.i...@gmail.com wrote:
SQLContext and HiveContext have a jsonRDD method which accepts an
RDD[String], where the string is a JSON string, and returns a SchemaRDD; it
extends RDD[Row], which is the type you want.
After
Hi Sarwar,
For a quick fix you can exclude the dependencies for YARN (you won't be needing
them if you are running locally).
libraryDependencies +=
  "log4j" % "log4j" % "1.2.15" exclude("javax.jms", "jms")
You can also analyze your dependencies using this plugin
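For illustration, a hedged build.sbt sketch of excluding the transitive YARN artifacts from a Spark dependency; the exact coordinates to exclude should come from your own dependency tree.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" excludeAll (
  ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-common"),
  ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-api")
)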
Hi,
I have data as RDD[(Long, String)], where the Long is a timestamp and the
String is a JSON-encoded string. I want to infer the schema of the JSON and
then do a SQL statement on the data (no aggregates, just column selection
and UDF application), but still have the timestamp associated with
(By the way, you can use wordRDD.countByValue instead of the map and
reduceByKey. It won't make a difference to your issue but is more
compact.)
As you say, the problem is the very limited range of keys (word
lengths). I wonder if you can use sortBy instead of map and sortByKey,
and instead
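For reference, a small sketch of both suggestions (made-up word data; countByValue returns a local Map, so it only fits when the number of distinct values is small).

val words = sc.parallelize(Seq("a", "bb", "bb", "ccc"))

// countByValue replaces map(w => (w, 1)).reduceByKey(_ + _) and returns a local Map.
val counts: scala.collection.Map[String, Long] = words.countByValue()

// sortBy replaces map(w => (w.length, w)).sortByKey(), keeping the original elements.
val byLength = words.sortBy(_.length)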
Hi,
I submitted a job using spark-submit and got the following exception.
Anybody knows how to fix this? Thanks.
Ey-Chih Chow
15/01/29 08:53:10 INFO storage.BlockManagerMasterActor: Registering block
manager
Hi
There are 2 ways to resolve the issue.
1. Increasing the heap size, via -Xmx1024m (or more), or
2. Disabling the error check altogether, via -XX:-UseGCOverheadLimit.
as per
http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded
you can pass the java
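For reference, a hedged sketch of passing those JVM options to Spark executors via SparkConf (property names from the Spark 1.x configuration docs; the executor heap is set through spark.executor.memory rather than -Xmx).

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("gc-tuning-example")
  .set("spark.executor.memory", "2g")                                // heap size instead of -Xmx
  .set("spark.executor.extraJavaOptions", "-XX:-UseGCOverheadLimit") // disable the GC overhead check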
Thanks Arush. I did look into the dependency tree but couldn't figure which
dependency was bringing the wrong Hadoop-yarn-common in.
I'll try the quick fix first.
Sarwar
On Thu, 29 Jan 2015 at 09:33 Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
Hi Sarwar,
For a quick fix you can exclude
Hi Mohit,
You can set the master instance type with -m.
To setup a cluster you need to use the ec2/spark-ec2 script.
You need to create an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your
AWS web console under Security Credentials, and pass them on to the script above.
Once you do that you should
Thanks a lot.
After reading Mesos-1688, I still don't understand how/why a job will hoard
and hold on to so many resources even in the presence of that bug.
Looking at the release notes, I think this ticket could be relevant to
preventing the behavior we're seeing:
[MESOS-186] - Resource offers
Hi,
I am facing the following issue when I am connecting from spark-shell. Please
tell me how to avoid it.
15/01/29 17:21:27 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the
Hadoop
Hello,
I would appreciate insights on the following questions:
1) Using Spark Streaming, I would like to keep windowed statistics
for the past 30, 60 and 120 minutes.
Is there an integrated/better way of doing this than creating three
separate windows and pointing them to the same DStream?
2)
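For reference, a minimal sketch of question 1) with three window() calls over one parent DStream; I am not aware of a built-in multi-window operator, and the batch interval, source and window lengths below are purely illustrative.

import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
val events = ssc.socketTextStream("localhost", 9999) // hypothetical source

// Three windows over the same parent DStream, each sliding every batch.
val last30  = events.window(Minutes(30), Minutes(1)).count()
val last60  = events.window(Minutes(60), Minutes(1)).count()
val last120 = events.window(Minutes(120), Minutes(1)).count()
last30.print(); last60.print(); last120.print()

ssc.start()
ssc.awaitTermination()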
Thanks. I actually looked up foreachPartition() in this context yesterday,
and couldn't find where it's documented in the Javadocs or elsewhere... probably
for some silly reason. Can you please point me in the right direction?
Many thanks!
By the way, I realize the solution should rather be to
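For reference, a minimal sketch (not from the thread) of the foreachPartition approach to side-effecting setup. initLogging() is a hypothetical stand-in, and this only approximates "once per executor", since several partitions can land on the same JVM.

// Hypothetical per-JVM setup, guarded so repeated calls are cheap.
object WorkerSetup {
  @volatile private var done = false
  def initLogging(): Unit = synchronized {
    if (!done) { /* configure log4j, metrics, ... */ done = true }
  }
}

// A dummy RDD sized to the cluster's parallelism, just to reach every worker.
val tasks = sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
tasks.foreachPartition { _ => WorkerSetup.initLogging() }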
Ok, Cheng.
Thank you!
Regards,
Jorge López-Malla Matute
Big Data Developer
Vía de las Dos Castillas, 33. Ática 4. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: 91 828 64 73 // @stratiobd
2015-01-28 19:44 GMT+01:00 Cheng Lian lian.cs@gmail.com:
Hey Jorge,
This is expected.
Just curious, is this set to be merged at some point?
On Thu Jan 22 2015 at 4:34:46 PM Ankur Dave ankurd...@gmail.com wrote:
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote:
I try to execute a simple program that runs the ShortestPaths algorithm
Hi,
I have the following use case: assume that I have my data in e.g. HDFS, a
single sequence file containing rows of CSV entries that I can split to
build an RDD of arrays of (smaller) strings.
What I want to do is to build two RDDs where the first RDD contains a subset of
columns and
Thanks for the clarification on the partitioning.
I did what you suggested and tried reading in individual part-* files --
some of them are ~1.7Gb in size and that's where it's failing. When I
increase the number of partitions before writing to disk, it seems to work.
Would be nice if this was
I am also looking for a connector for CouchDB in Spark. Did you find anything?
I am looking for the Spark connector for CouchDB, please help me.
Hi,
Using Spark under Windows is a really bad idea, because even if you solve the
problems with Hadoop, you will probably hit
java.net.SocketException: connection reset by peer. It is caused by the
fact that we request socket ports too frequently under Windows. To my knowledge,
it is really
You need to set your HADOOP_HOME in the environment.
Here :
Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
null is supposed to be your HADOOP_HOME.
On 29 Jan 2015 15:57, Naveen Kumar Pokala npok...@spcapitaliq.com wrote:
Hi,
I am facing the following issue when I
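For reference, a hedged alternative to the environment variable: Hadoop's Shell utility also checks the hadoop.home.dir system property, so you can set it from code before creating the context. The path below is hypothetical and must contain bin\winutils.exe.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local path; its bin folder must contain winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new SparkContext(new SparkConf().setAppName("windows-test").setMaster("local[*]"))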
I solved this problem following this article
http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7
1) download compiled winutils.exe from
On Spark 1.2.0 you have the s3a library to work with S3. And there is a
config param named fs.s3a.server-side-encryption-algorithm:
https://github.com/Aloisius/hadoop-s3a
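For reference, a hedged sketch of setting that property through the Hadoop configuration used by Spark; the bucket and the AES256 value (S3's SSE-S3 option) are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3a-sse-example"))
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")

// Hypothetical bucket and path, only to illustrate the s3a:// scheme.
val logs = sc.textFile("s3a://my-bucket/logs/*.gz")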
Hi,
I tried to use Spark under Windows once. However, the only solution that I
found was to install VirtualBox.
Hope this can help you.
Best
Gen
On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I deployed spark-1.1.0 on Windows 7 and was able to
Eventually it would be nice for us to have some sort of function to do the
conversion you are talking about on a single column, but for now I usually
hack it as you suggested:
val withId = origRDD.map { case (id, str) => s"""{"id": $id, ${str.trim.drop(1)}""" }
val table = sqlContext.jsonRDD(withId)
On
Does the error change when you build with and without the build options?
Did you try using Maven and doing the proxy settings there?
On Thu, Jan 29, 2015 at 9:17 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using
./sbt/sbt assembly
I would characterize the difference as follows:
Spark SQL http://spark.apache.org/docs/latest/sql-programming-guide.html
is the native engine for processing structured data using Spark. In
contrast to Shark or Hive on Spark, it has its own optimizer that was
designed for the RDD model. It is
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using
./sbt/sbt assembly and I get the following error. I've set the http and
https proxy as well as the JAVA_OPTS. Any idea what am I missing ?
[warn] one warning found
org.apache.maven.model.building.ModelBuildingException: 1 problem
Francois,
RDD.aggregate() does not support aggregation by key. But, indeed, that is the
kind of implementation I am looking for, one that does not allocate
intermediate space for storing (K,V) pairs. When working with large datasets
this type of intermediate memory allocation wreaks havoc with
You can use coalesce or repartition to control the number of file output by
any Spark operation.
On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2 on Hadoop 2.3
Read one big csv file, create a schemaRDD on it and saveAsParquetFile.
It creates a large
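For reference, a minimal sketch of that suggestion; the partition count of 16 is arbitrary, so tune it to the data size, or use repartition if you also want a shuffle to rebalance.

// schemaRDD is the SchemaRDD built from the CSV file, as in the question above.
schemaRDD.coalesce(16).saveAsParquetFile("/path/to/output.parquet")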
Oh, I’m sorry, I meant `aggregateByKey`.
https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
—
FG
On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi mohitja...@gmail.com wrote:
Francois,
RDD.aggregate() does not support aggregation by key. But, indeed, that is
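For reference, a small sketch of aggregateByKey, e.g. computing a per-key sum and count without materialising intermediate collections (the data is made up).

val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))

val sumAndCount = pairs.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into the per-partition accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // merge accumulators across partitions
)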
Sorry, I answered too fast. Please disregard my last message: I did mean
aggregate.
You say: RDD.aggregate() does not support aggregation by key.
What would you need aggregation by key for, if you do not, at the beginning,
have an RDD of key-value pairs, and do not want to build one ?
Thanks for the reminder. I just created a PR:
https://github.com/apache/spark/pull/4273
Ankur
On Thu, Jan 29, 2015 at 7:25 AM, Jay Hutfles jayhutf...@gmail.com wrote:
Just curious, is this set to be merged at some point?
Here is a spark challenge for you!
I have a data set where each entry has a date. I would like to identify
gaps in the dates larger than a given length. For example, if the data
were log entries, then the gaps would tell me when I was missing log data
for long periods of time. What is the
Looks like the application is using a lot more memory than available. Could be
a bug somewhere in the code or just an underpowered machine. Hard to say without
looking at the code.
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Mohammed
-Original Message-
From:
I have the code set up for Cassandra:
SparkConf conf = new SparkConf(true);
conf.setAppName("Java cassandra RD");
conf.set("spark.cassandra.connection.host", "10.34.224.249");
but I got a log showing it trying to connect to a different host.
15/01/29 16:16:42 INFO NettyBlockTransferService: Server created on
Another solution would be to use the reduce action.
Mohammed
From: Ganelin, Ilya [mailto:ilya.gane...@capitalone.com]
Sent: Thursday, January 29, 2015 1:32 PM
To: 'derrickburns'; 'user@spark.apache.org'
Subject: RE: spark challenge: zip with next???
Make a copy of your RDD with an extra entry
Make a copy of your RDD with an extra entry in the beginning to offset. Then you
can zip the two RDDs and run a map to generate an RDD of differences.
Sent with Good (www.good.com)
-Original Message-
From: derrickburns [derrickrbu...@gmail.commailto:derrickrbu...@gmail.com]
Sent:
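For reference, a sketch of a join-based variant of that idea (RDD.zip needs identical partition sizes, which is what makes the offset-copy approach tricky); the timestamps and gap threshold are illustrative.

import org.apache.spark.rdd.RDD

def gaps(dates: RDD[Long], minGap: Long): RDD[(Long, Long)] = {
  val indexed = dates.sortBy(identity).zipWithIndex().map(_.swap)  // (index, date)
  val shifted = indexed.map { case (i, d) => (i - 1, d) }          // key each date to the previous index
  indexed.join(shifted)                                            // pairs of consecutive dates
    .values
    .filter { case (d, next) => next - d > minGap }
}

val found = gaps(sc.parallelize(Seq(100L, 160L, 700L, 760L)), 300L) // yields (160, 700)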
Hello everyone. I am having what I am sure is a configuration error. I am
trying to use my spark cluster in cluster mode with out success. So far
search results have not yielded any clues. If I use the same submit
command but with client mode specified everything works fine. I have tried
Yes please, but I am new to Spark and CouchDB.
How much memory are you assigning to the Spark executor on the worker node?
Mohammed
From: ey-chih chow [mailto:eyc...@hotmail.com]
Sent: Thursday, January 29, 2015 3:35 PM
To: Mohammed Guller; user@spark.apache.org
Subject: RE: unknown issue in submitting a spark job
The worker node has 15G
Hi,
Whenever I enable DEBUG level logs for my Spark cluster, on running a job
all the executors die with the below exception. On disabling the DEBUG logs
my jobs move to the next step.
I am on spark-1.1.0
Is this a known issue with spark?
Thanks
Ankur
2015-01-29 22:27:42,467 [main] INFO
I can also switch to MongoDB if Spark has support for it.
Hi,
I am no expert but have a small application working with Spark and
Cassandra.
I faced these issues when we were deploying our cluster on EC2 instances
with some machines on public network and some on private.
This seems to be a similar issue as you are trying to connect to
10.34.224.249
Hi All,
I noticed in pom.xml that there is no entry for Hadoop 2.5. Has anyone tried
Spark with 2.5.0-cdh5.2.1? Will replicating the 2.4 entry be sufficient to make
this work?
Mohit.
Mohit,
I'm using spark modules provided by Cloudera repos, it works fine.
Please add Cloudera maven repo, and specify dependencies with CDH
version, like spark-core_2.10-1.1.0-cdh5.2.1.
To add Cloudera maven repo, see:
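For reference, a hedged build.sbt sketch; the resolver URL and artifact coordinates below are the commonly used ones, but double-check them against Cloudera's own documentation.

resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0-cdh5.2.1"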
The worker node has 15G memory, 1x32 GB SSD, and 2 cores. The data file is from
S3. If I don't set mapred.max.split.size, it is fine with only one partition.
Otherwise, it will generate OOME.
Ey-Chih Chow
From: moham...@glassbeam.com To: eyc...@hotmail.com; user@spark.apache.org
Subject: RE:
Do set executor memory as well. You have RAM in each node and storage. Set it
to 6 GB or more, and if required change driver memory from 10 GB to more.
--Harihar
Hi,
The input data has 2048 partitions. The final step is to load the processed
data into hbase through saveAsNewAPIHadoopDataset(). Every step except the
last one ran in parallel in the cluster. But the last step only has 1 task
which runs on only 1 node using one core.
Spark 1.1.1. +
Hi,
On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Make a copy of your RDD with an extra entry in the beginning to offset.
Then you can zip the two RDDs and run a map to generate an RDD of
differences.
Does that work? I recently tried something to compute
Trying to cluster small text msgs, using HashingTF and IDF with L2
Normalization. Data looks like this
id, msg
1, some text1
2, some more text2
3, sample text 3
Input data file size is 1.7 MB with 10 K rows. It runs (very slowly, took 3
hrs) for up to 20 clusters, but when I ask for 200 clusters
http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
you can use the MLLib
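For reference, a hedged sketch of that pipeline with the MLlib RDD API (Spark ~1.2); the feature size, k and iteration count are illustrative, and the input path and "id, msg" format are assumed from the question.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}

// Assumed "id, msg" CSV; keep only the message text (header handling omitted).
val msgs = sc.textFile("msgs.csv").map(_.split(",", 2)(1))

val tf = new HashingTF(1 << 18).transform(msgs.map(_.split("\\s+").toSeq))
tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)
val normalized = new Normalizer().transform(tfidf) // L2 by default

val model = KMeans.train(normalized, 20, 10)       // k = 20 clusters, 10 iterations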
Dear all,
I have no idea why it raises an error when I run the following code.
def getRow(data):
    return data.msg

first_sql = "select * from logs.event where dt = '20150120' and et = 'ppc' LIMIT 10"  # error
# first_sql = "select * from hivecrawler.vip_crawler where src='xx' and dt='" +
What version of Spark and Hive are you using? Spark 1.1.0 and prior
versions only support Hive 0.12.0. Spark 1.2.0 supports Hive 0.12.0
or 0.13.1.
Cheng
On 1/29/15 6:36 PM, QiuxuanZhu wrote:
Dear all,
I have no idea why it raises an error when I run the following code.
def
Hi, I am trying saveAsTable on a SchemaRDD created from HiveContext and it
fails. This is on Spark 1.2.0. Following are the details of the code, command and
exceptions:
http://stackoverflow.com/questions/28222496/how-to-enable-sql-on-schemardd-via-the-jdbc-interface-is-it-even-possible
You are running yarn-client mode. How about increasing the --driver-memory and
give it a try?
Thanks.
Zhan Zhang
On Jan 29, 2015, at 6:36 PM, QiuxuanZhu
ilsh1...@gmail.commailto:ilsh1...@gmail.com wrote:
Dear all,
I have no idea why it raises an error when I run the following code.
def
On Thu, Jan 29, 2015 at 6:36 PM, QiuxuanZhu ilsh1...@gmail.com wrote:
Dear all,
I have no idea why it raises an error when I run the following code.
def getRow(data):
    return data.msg

first_sql = "select * from logs.event where dt = '20150120' and et = 'ppc' LIMIT 10"  # error
I'm creating a real-time visualization of counts of ads shown on my website,
using that data pushed through by Spark Streaming.
To avoid clutter, it only looks good to show 4 or 5 lines on my
visualization at once (corresponding to 4 or 5 different ads), but there are
50+ different ads that show
Try rdd.coalesce(1).saveAsParquetFile(...)
http://spark.apache.org/docs/1.2.0/programming-guide.html#transformations
--- Original Message ---
From: Manoj Samel manojsamelt...@gmail.com
Sent: January 29, 2015 9:28 AM
To: user@spark.apache.org
Subject: schemaRDD.saveAsParquetFile creates large
My answer was based off the specs that Antony mentioned: different amounts
of memory, but 10 cores on all the boxes. In that case, a single Spark
application's homogeneously sized executors won't be able to take advantage
of the extra memory on the bigger boxes.
Cloudera Manager can certainly
No, I changed it to MongoDB, but you can write your own custom code to connect
to CouchDB directly; there is no such connector available on the market.
By extending a few classes you can read CouchDB. I can help you
with that; let me know if you are really interested.
On 30 January 2015 at 06:46,
I use the default value, which I think is 512MB. If I change it to 1024MB,
spark-submit will fail due to not enough memory for the RDD.
Ey-Chih Chow
From: moham...@glassbeam.com
To: eyc...@hotmail.com; user@spark.apache.org
Subject: RE: unknown issue in submitting a spark job
Date: Fri, 30 Jan 2015
I think it is expected. Refer to the comments in saveAsTable: “Note that this
currently only works with SchemaRDDs that are created from a HiveContext”. If I
understand correctly, here the SchemaRDD means those generated by
HiveContext.sql, instead of applySchema.
Thanks.
Zhan Zhang
On Jan
Hello,
I had the same issue then I found this JIRA ticket
https://issues.apache.org/jira/browse/SPARK-4825
So I switched to Spark 1.2.1-snapshot which solved the problem.
2015-01-30 8:40 GMT+01:00 Zhan Zhang zzh...@hortonworks.com:
I think it is expected. Refer to the comments in
@Sandy,
There are two issues.
The spark context (executor) and then the cluster under YARN.
If you have a box where each yarn job needs 3GB, and your machine has 36GB
dedicated as a YARN resource, you can run 12 executors on the single node.
If you have a box that has 72GB dedicated to