Most likely you are closing the connection with HDFS. Can you paste the
piece of code that you are executing?
We were having a similar problem when we closed the FileSystem object in our
code.
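For illustration, here is a rough Scala sketch of the pattern that typically causes this (the path is just a placeholder, not from our code): FileSystem.get() hands back a cached, shared instance, so closing it also closes the handle Spark itself uses for later HDFS writes.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val found = fs.exists(new Path("/user/spark/output"))   // placeholder path
// fs.close()   // don't do this: it invalidates the shared FileSystem for the whole JVM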
Thanks
Best Regards
On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:
I ported the same code to scala. No problems. But in pyspark, this fails
consistently:
ctx = SQLContext(sc)
pf = ctx.parquetFile(...)
rdd = pf.map(lambda x: x)
crdd = ctx.inferSchema(rdd)
crdd.saveAsParquetFile(...)
If I do
rdd = sc.parallelize(["hello", "world"])
rdd.saveAsTextFile(...)
It works.
Cool, I'll take a look and give it a try!
Thanks,
Ron
Sent from my iPad
On Jul 24, 2014, at 10:35 PM, Andrew Ash and...@andrewash.com wrote:
Hi Ron,
I think you're encountering the issue where caching data from Hadoop ends up
with many duplicate values instead of what you expect. Try
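For reference, a rough Scala sketch of the usual workaround for that Hadoop record-reuse behaviour (the path and record types here are placeholders, not from the original thread): the same Writable instance is reused for every record, so copy values into plain JVM objects before caching.

import org.apache.hadoop.io.{LongWritable, Text}

val raw = sc.sequenceFile("hdfs:///some/input", classOf[LongWritable], classOf[Text])
val copied = raw.map { case (k, v) => (k.get, v.toString) }  // real copies, not the reused Writables
copied.cache()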
Any idea about the probable dates for this implementation. I believe it would
be a wonderful (and essential) functionality to gain more acceptance in the
community.
Try without the *
val avroRdd = sc.newAPIHadoopFile("hdfs://url:8020/my dir/",
classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord],NullWritable]],
classOf[AvroKey[GenericRecord]], classOf[NullWritable])
avroRdd.collect()
Thanks
Best Regards
On Fri, Jul 25, 2014 at 7:22 PM, Sparky
Thanks for the suggestion. I can confirm that my problem is I have files
with zero bytes. It's a known bug and is marked as a high priority:
https://issues.apache.org/jira/browse/SPARK-1960
Well, anyone can open an account on the Apache JIRA and post a new
ticket/enhancement/issue/bug...
Bertrand Dechoux
On Fri, Jul 25, 2014 at 4:07 PM, Sparky gullo_tho...@bah.com wrote:
Thanks for the suggestion. I can confirm that my problem is I have files
with zero bytes. It's a known bug and
Hi,
We're trying to use Docker containerization within Mesos via Deimos. We're
submitting Spark jobs from localhost to our cluster. We've managed to get it
to work (with a fixed Deimos configuration), but we have issues with passing
some options (like a job-dependent container image) in TaskInfo to Mesos
Hi,
Is there an implementation for Nonnegative Matrix Factorization in Spark? I
understand that MLlib comes with matrix factorization, but it does not seem
to cover the nonnegative case.
I'm pretty sure this was already fixed last week in SPARK-2414:
https://github.com/apache/spark/commit/7c23c0dc3ed721c95690fc49f435d9de6952523c
On Fri, Jul 25, 2014 at 1:34 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
Hi,
I'm using Spark 1.0.0.
On filter() - map() -
The map and flatMap methods have a similar purpose, but map is 1-to-1,
while flatMap is 1-to-0..N (outputting 0 elements is similar to a filter, except
of course it could be outputting a different type).
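A small illustrative sketch of that distinction (the sample data is made up):

val nums = sc.parallelize(Seq(1, 2, 3))
val doubled = nums.map(_ * 2)                          // map: exactly one output per input -> 2, 4, 6
val expanded = nums.flatMap(n => Seq.fill(n)(n))       // flatMap: 0..N outputs per input -> 1, 2, 2, 3, 3, 3
val evensOnly = nums.flatMap(n => if (n % 2 == 0) Seq(n) else Seq.empty)  // emitting 0 elements acts like a filter -> 2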
On Thu, Jul 24, 2014 at 6:41 PM, abhiguruvayya sharath.abhis...@gmail.com
wrote:
Can anyone help me
Hi all,
I am using Spark 1.0.0 with CDH 5.1.0.
I want to aggregate the data in a raw table using a simple query like below:
SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year, month, day
FROM raw_data_table GROUP BY year, month, day
MIN, MAX and AVG functions work fine for
Hi all,
Amazon Linux, AWS, Spark 1.0.1 reading a file.
The UI shows there are workers and shows this app context with the 2
tasks waiting. All the hostnames resolve properly, so I am guessing
the message is correct and that the workers won't accept the job for
memory reasons.
What params do I
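For context, a rough sketch of the settings usually involved when workers won't accept a job for memory reasons; the values and app name below are placeholders, not recommendations.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example-app")                 // placeholder name
  .set("spark.executor.memory", "512m")      // must fit within what each worker advertises
  .set("spark.cores.max", "6")               // cap on the total cores the app requests
val sc = new SparkContext(conf)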
No idea. Right now implementing this is up for grabs by the community.
On Fri, Jul 25, 2014 at 5:40 AM, Shubhabrata mail2shu...@gmail.com wrote:
Any idea about the probable dates for this implementation. I believe it
would
be a wonderful (and essential) functionality to gain more acceptance
Any suggestions to work around this issue? The pre-built Spark binaries
don't appear to work against CDH as documented, unless there's a build
issue, which seems unlikely.
On 25-Jul-2014 3:42 pm, Bharath Ravi Kumar reachb...@gmail.com wrote:
I'm encountering a hadoop client protocol mismatch
It is ALS with setNonnegative. -Xiangrui
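For illustration, a rough sketch of that in Scala (the ratings, rank, and iteration count are placeholders, and it assumes an MLlib version that includes setNonnegative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 3.0), Rating(2, 1, 4.0)))
val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setNonnegative(true)   // constrains the learned user/product factors to be >= 0
  .run(ratings)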
On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia buendia...@gmail.com wrote:
Hi,
Is there an implementation for Nonnegative Matrix Factorization in Spark? I
understand that MLlib comes with matrix factorization, but it does not seem
to cover the
Can anyone help? I'm using Spark 1.0.1.
I'm confused: if the block is found, why are no non-empty blocks obtained,
and why does the process keep going forever?
Thanks!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-got-stuck-with-a-loop-tp10590p10663.html
Is it possible now to share a Spark context among machines (through
serialization or some other way)? I am looking for possible ways to make
Spark job submission HA (highly available). For example, if a
job submitted to machine A fails in the middle (because machine A crashes), I
want this
This indicates your app is not actually using the version of the HDFS
client you think. You built Spark from source with the right deps it
seems, but are you sure you linked to your build in your app?
On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar reachb...@gmail.com wrote:
Any suggestions
How many partitions did you use and how many CPU cores in total? The
former shouldn't be much larger than the latter. Could you also check
the shuffle size from the WebUI? -Xiangrui
On Fri, Jul 25, 2014 at 4:10 AM, Charles Li littlee1...@gmail.com wrote:
Hi Xiangrui,
Thanks for your
Hi Xiangrui,
I have 16 * 40 CPU cores in total, but I am only using 200 partitions on the
200 executors. I use coalesce without shuffle to reduce the default number of
partitions of the RDD.
The shuffle size from the WebUI is nearly 100 MB.
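For reference, a minimal sketch of that repartitioning (the input path is a placeholder): coalesce with shuffle = false narrows the partitioning down to 200 without triggering a shuffle.

val input = sc.textFile("hdfs:///some/training/data")   // may start with many more partitions
val narrowed = input.coalesce(200, shuffle = false)     // roughly one partition per executor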
On Jul 25, 2014, at 23:51, Xiangrui Meng men...@gmail.com wrote:
Folks,
I've been able to submit simple jobs to YARN thus far. However, when I did
something more complicated that added 194 dependency jars using --addJars, the
job failed in YARN with no logs. What ends up happening is that no container
logs get created (app master or executor). If I add just
thx for the reply,
the UI says my application has cores and mem:
ID: app-20140725164107-0001  Name: SectionsAndSeamsPipeline  Cores: 6
Memory per Node: 512.0 MB  Submitted Time: 2014/07/25 16:41:07
User: tercel  State: RUNNING  Duration: 21 s
Thanks for responding. I used the pre-built Spark binaries meant for
hadoop1/cdh3u5. I do not intend to build Spark against a specific
distribution. Irrespective of whether I build my app with the explicit CDH
hadoop-client dependency, I get the same error message. I also verified
that my app's
If you link against the pre-built binary, that's for Hadoop 1.0.4. Can
you show your deps to clarify what you are depending on? Building
custom Spark and depending on it is a different thing from depending
on plain Spark and changing its deps. I think you want the latter.
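For illustration, a rough sbt sketch of that approach (the CDH version string and repository URL are placeholders; use whatever client artifact matches your cluster): exclude the hadoop-client that spark-core pulls in and declare your distribution's client explicitly.

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.0.1")
    .exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.2.0"  // placeholder CDH version
)
resolvers += "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"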
On Fri, Jul 25, 2014 at
Hi all,
Currently we have Kafka 0.7.2 running in production and can't upgrade for
external reasons; however, Spark Streaming (1.0.1) was built against Kafka 0.8.0.
What is the best way to use Spark Streaming with older versions of Kafka?
Currently I'm investigating trying to build Spark Streaming
Folks,
I had some pyspark code which used to hang with no useful debug logs. It
got fixed when I changed my code to keep the SparkContext alive forever instead
of stopping it and then creating another one later. Is this a bug or
expected behavior?
Mohit.
It may be confusing at first, but there is also an important difference
between the reduce and reduceByKey operations.
reduce is an action on an RDD. Hence, it will trigger evaluation of the
transformations that produced the RDD.
In contrast, reduceByKey is a transformation on PairRDDs, not an
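A small illustrative sketch of the difference (the sample data is made up):

val nums = sc.parallelize(Seq(1, 2, 3, 4))
val total = nums.reduce(_ + _)                 // action: triggers evaluation now, returns 10 to the driver

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val perKey = pairs.reduceByKey(_ + _)          // transformation: nothing runs yet, just a new RDD
perKey.collect()                               // an action finally computes ("a", 3) and ("b", 3)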
I figured out the issue. In fact, I did not realize before that when
loaded into memory, the data is deserialized. As a result, what seems to be
a 21 GB dataset occupies 77 GB in memory.
Details about this are clearly explained in the guide on serialization and
memory tuning
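For reference, a minimal sketch of the serialized caching option that guide describes (the dataset path is a placeholder): storing partitions as serialized bytes trades some CPU for a much smaller footprint than deserialized Java objects.

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///some/large/dataset")
data.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory, much more compact than MEMORY_ONLY
data.count()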
Yes, the output is continuous, so I used a threshold to get binary labels:
if prediction < threshold, then the class is 0, else 1. I use this binary label
to then compute the accuracy. Even with this binary transformation, the
accuracy with the decision tree model is low compared to LR or SVM (for the
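For illustration, a rough sketch of that thresholding in Scala (the sample predictions, the 0.5 threshold, and the comparison direction are all placeholder assumptions):

val threshold = 0.5
val predictionsAndLabels = sc.parallelize(Seq((0.2, 0.0), (0.7, 1.0), (0.4, 1.0)))  // (raw prediction, true label)
val binarized = predictionsAndLabels.map { case (raw, label) =>
  (if (raw < threshold) 0.0 else 1.0, label)
}
val accuracy = binarized.filter { case (p, l) => p == l }.count().toDouble / binarized.count()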
Hi Michael,
I asked a similar question
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-issue-with-msg-RDD-block-could-not-be-dropped-from-memory-as-it-does-not-exist-td10248.html#a10677
before. My problem was that my data was too large to be cached in memory
because of
Can you share the dataset via a gist or something and we can take a look at
what's going on?
On Fri, Jul 25, 2014 at 10:51 AM, SK skrishna...@gmail.com wrote:
yes, the output is continuous. So I used a threshold to get binary labels.
If prediction < threshold, then class is 0 else 1. I use
That's right, I'm looking to depend on Spark in general and change only the
hadoop-client deps. The Spark master and slaves use the
spark-1.0.1-bin-hadoop1 binaries from the downloads page. The relevant
snippet from the app's Maven pom is as follows:
dependency
Hi All, I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:
val trainingDataTable = sql("SELECT prod.prod_num, demographics.gender, demographics.birth_year, demographics.income_group FROM prod p JOIN demographics d ON d.user_id = p.user_id")
Hi Jerry, thanks for your reply. I was following the steps in this programming
guide. It does not mention anything about creating a HiveContext or using HQL
explicitly.
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Users(userId INT, name String, email
Thanks, Jerry.
Date: Fri, 25 Jul 2014 17:48:27 -0400
Subject: Re: Spark SQL and Hive tables
From: chiling...@gmail.com
To: user@spark.apache.org
Hi Sameer,
The blog post you referred to is about Spark SQL. I don't think the article is
meant to guide you on how to read data from Hive
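For reference, a rough sketch of reading a Hive table via HiveContext instead of the plain SQLContext (the query is adapted from the earlier message, and this assumes a Spark build with Hive support):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val trainingDataTable = hiveContext.hql("SELECT p.prod_num, d.gender, d.birth_year, d.income_group FROM prod p JOIN demographics d ON d.user_id = p.user_id")
trainingDataTable.take(5).foreach(println)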
Hi Jianshi,
Could you tell us which HBase version you're using?
By the way, a quick sanity check: can the Workers access HBase?
Were you able to manually write one record to HBase with the serialize
function? Hardcode it and test it?
From: jianshi.hu...@gmail.com
Date: Fri, 25 Jul 2014
Solution: opened all ports on the EC2 machine that the driver was running on.
Need to narrow down what ports Akka wants... but the issue is solved.
The stack trace was from running the Actor count sample directly, without a
spark cluster, so I guess the logs would be from both? I enabled more logging
and got this stack trace
14/07/25 17:55:26 [INFO] SecurityManager: Changing view acls to: alan
14/07/25 17:55:26 [INFO] SecurityManager:
Maybe this is me misunderstanding the Spark system property behavior, but
I'm not clear why the class being loaded ends up having '/' rather than '.'
in its fully qualified name. When I tested this out locally, the '/' were
preventing the class from being loaded.
On Fri, Jul 25, 2014 at 2:27