I'm trying to read input files from S3. The files are compressed using LZO.
i.e. from the spark-shell:
sc.textFile("s3n://path/xx.lzo").first returns 'String = �LZO?'
Spark does not decompress the data from the file. I am using Cloudera
Manager 5, with CDH 5.0.2. I've already installed 'GPLEXTRAS'.
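A minimal sketch of reading LZO through the Hadoop input format, assuming the hadoop-lzo jar and native libraries that ship with GPLEXTRAS are on Spark's classpath (sc.textFile alone won't decompress .lzo):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// LzoTextInputFormat handles the decompression; the values are the text lines
val lines = sc.newAPIHadoopFile("s3n://path/xx.lzo",
  classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
  .map(_._2.toString)
lines.first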
Hi Thunder,
Please understand that both MLlib and breeze are in active
development. Before v1.0, we used jblas but in the public APIs we only
exposed Array[Double]. In v1.0, we introduced Vector that supports
both dense and sparse data and switched the backend to
breeze/netlib-java (except ALS).
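For illustration, a small sketch of the v1.0 Vector API in question:

import org.apache.spark.mllib.linalg.Vectors

val dense = Vectors.dense(1.0, 0.0, 3.0)
// sparse: size 3, with nonzeros at indices 0 and 2
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))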
Hi Dmitriy,
It is sweet to have the bindings, but it is very easy to downgrade the
performance with them. The BLAS/LAPACK APIs have been there for more
than 20 years and they are still the top choice for high-performance
linear algebra. I'm thinking about whether it is possible to make the
That's good to know. I will try it out.
Thanks Romain
On Friday, June 27, 2014, Romain Rigaux romain.rig...@gmail.com wrote:
So far Spark Job Server does not work with Spark 1.0:
https://github.com/ooyala/spark-jobserver
So this works only with Spark 0.9 currently:
It sounds really strange...
I guess it is a bug, a critical bug that must be fixed... at least some flag
must be added (unable.hadoop).
I found the following workaround:
1) download compiled winutils.exe from
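(The usual continuation of this workaround, sketched with a hypothetical path: put winutils.exe under some directory's bin folder and point hadoop.home.dir at it.)

// hypothetical location; winutils.exe must sit in C:\hadoop\bin
System.setProperty("hadoop.home.dir", "C:\\hadoop")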
Do you have an sbt directory inside your Spark directory?
Thanks
Best Regards
On Wed, Jul 2, 2014 at 10:17 PM, Imran Akbar im...@infoscoutinc.com wrote:
Hi,
I'm trying to install Spark 1.0 on my Hadoop cluster running on EMR. I
didn't have any problem installing the previous versions, but
Hi,
Can someone provide me with pointers on this issue?
Thanks
Subacini
On Wed, Jul 2, 2014 at 3:34 PM, Subacini B subac...@gmail.com wrote:
Hi,
The code below throws a compilation error: not found: *value Sum*. Can
someone help me with this? Do I need to add any jars or imports? The same
happens even for Count.
If you have downloaded the pre-compiled binary, it will not have sbt
directory inside it.
Thanks
Best Regards
On Thu, Jul 3, 2014 at 12:35 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Do you have an sbt directory inside your Spark directory?
Thanks
Best Regards
On Wed, Jul 2, 2014 at
Hi Sameer,
If you set those two IDs to be a Tuple2 in the key of the RDD, then you can
join on that tuple.
Example:
val rdd1: RDD[Tuple3[Int, Int, String]] = ...
val rdd2: RDD[Tuple3[Int, Int, String]] = ...
val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join(
  rdd2.map(k => ((k._1, k._2), k._3)))
Hi,
You need to import Sum and Count like:
import org.apache.spark.sql.catalyst.expressions.{Sum, Count} // or with wildcard _
or, if you use a current master-branch build, you can use sum('colB)
instead of Sum('colB).
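A hedged sketch of how those imports are typically used with the 1.0 SchemaRDD DSL (table and column names are made up; the symbol-to-attribute implicits come from the SQLContext):

import org.apache.spark.sql.catalyst.expressions.{Sum, Count}
import sqlContext._  // symbol-to-attribute implicits

// hypothetical table with columns 'colA and 'colB
val aggregated = schemaRDD.groupBy('colA)(Sum('colB), Count('colB))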
Thanks.
2014-07-03 16:09 GMT+09:00 Subacini B subac...@gmail.com:
Hi,
Can
Add MASTER=yarn-client and the JDBC / Thrift server will run on YARN.
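(Presumably something like the following; the script name is an assumption based on the JDBC/Thrift branch:)

# assumption: the start script shipped with the JDBC/Thrift build
MASTER=yarn-client ./sbin/start-thriftserver.sh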
2014-07-02 16:57 GMT-07:00 田毅 tia...@asiainfo.com:
hi, Matei
Do you know how to run the JDBC / Thrift server on Yarn?
I did not find any suggestion in docs.
2014-07-02 16:06 GMT-07:00 Matei Zaharia
Hi,
I'm trying to convert a Scala Spark job into Java.
In Scala, I typically use a 'case class' to apply a schema to an RDD.
It can be converted into a POJO class in Java, but what I really want to do is
dynamically create POJO classes the way the Scala REPL does.
For this reason, I import javassist to
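For reference, the Scala pattern being converted is roughly this (names are illustrative, Spark 1.0 style):

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Product] => SchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")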
Hi Andrew,
This does not work (the application failed); I got the following error when I
put three slashes in the hdfs scheme:
(...)
Caused by: java.lang.IllegalArgumentException: Pathname
With spark-1.0.0 this is the cmdline from /proc/#pid: (with the export line
export _JAVA_OPTIONS=...)
Hi ,
Can anyone please help me understand which version of Hive is supported by
Spark and Shark?
--
Regards,
RAVI PRASAD. T
I found a web page for hint.
http://ardoris.wordpress.com/2014/03/30/how-spark-does-class-loading/
I learned that SparkIMain has an internal HTTP server to publish class objects,
but I can't figure out how to use it from Java.
Any ideas?
Thanks,
Kevin
Can I enable Spark to use the dfs.client.read.shortcircuit property to improve
performance and read natively on local nodes instead of going through the HDFS API?
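Short-circuit reads are a client-side HDFS setting, so one hedged sketch is to pass them through Spark's Hadoop configuration; the socket path here is an assumption and must match the DataNodes' dfs.domain.socket.path:

// assumption: DataNodes are configured with the same domain socket path
sc.hadoopConfiguration.setBoolean("dfs.client.read.shortcircuit", true)
sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn")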
I've had some odd behavior with jobs showing up in the history server in
1.0.0. Failed jobs do show up, but it seems they can show up minutes or
hours later. I see messages in the history server logs about bad task IDs.
But then eventually the jobs show up.
This might be your situation.
I have given this a try in a spark-shell and I still get many Allocation
Failures
On Thursday, July 3, 2014 9:51 AM, Xiangrui Meng men...@gmail.com wrote:
SparkKMeans is just example code showing a barebones
implementation of k-means. To run k-means on big datasets, please use
the
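(Presumably this refers to the MLlib implementation; a minimal sketch, with an illustrative path:)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// one whitespace-separated point per line
val data = sc.textFile("hdfs:///data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(data, 10, 20)  // k = 10, maxIterations = 20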
Hi:
I am working on a project where a few thousand text files (~20M in size) will
be dropped into an HDFS directory every 15 minutes. Data from the files will be
used to update counters in Cassandra (a non-idempotent operation). I was wondering
what the best way to deal with this is:
* Use text
Hi All,
We are using ALS.train to generate a model for predictions. We are using
DStream[] to collect the predicted output and are then trying to dump it to a
text file using these two approaches: dstream.saveAsTextFiles() and
dstream.foreachRDD(rdd => rdd.saveAsTextFile(path)). But both these approaches are
Hi Singh!
For this use case it's better to have a StreamingContext listening to the
directory in HDFS where the files are being dropped. You can set the
streaming interval to 15 minutes and let the driver program run
continuously, so as soon as new files arrive they are picked up for
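A sketch of that setup (paths and app name are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("FileCounter")
val ssc = new StreamingContext(conf, Minutes(15))

// picks up files as they appear in the directory, one batch per interval
val lines = ssc.textFileStream("hdfs:///incoming/")
lines.foreachRDD { rdd => /* update Cassandra counters here */ }

ssc.start()
ssc.awaitTermination()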
On Wed, July 2, 2014 2:00 am, Mayur Rustagi wrote:
Two job contexts cannot share data; are you collecting the data to the
master and then sending it to the other context?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On
Printing the model shows the intercept is always 0 :(
Should I open a bug for that?
2014-07-02 16:11 GMT+02:00 Eustache DIEMERT eusta...@diemert.fr:
Hi list,
I'm benchmarking MLlib for a regression task [1] and get strange results.
Namely, using RidgeRegressionWithSGD it seems the
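(For what it's worth: the SGD regressors don't fit an intercept by default; a hedged sketch of turning it on via the algorithm object, assuming trainingData is an RDD[LabeledPoint]:)

import org.apache.spark.mllib.regression.RidgeRegressionWithSGD

val algo = new RidgeRegressionWithSGD()
algo.setIntercept(true)  // inherited from GeneralizedLinearAlgorithm
val model = algo.run(trainingData)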
Hello,
I am running the BroadcastTest example in a standalone cluster using
spark-submit. I have 8 host machines and made Host1 the master. Host2 to
Host8 act as 7 workers to connect to the master. The connection was fine as
I could see all 7 hosts on the master web ui. The BroadcastTest example
Hi Konstantin,
Could you please create a jira item at:
https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked?
Thanks,
Denny
On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev
(kudryavtsev.konstan...@gmail.com) wrote:
It sounds really strange...
I guess it is a bug,
That’s an obvious workaround, yes, thank you Tobias.
However, I’m prototyping a replacement for a real batch process, where I’d have
to create six streams (and possibly more). It could get a bit messy.
On the other hand, under the hood the KafkaInputDStream which is created with this
KafkaUtils call,
Hi All,
I was able to resolve this matter with a simple fix. It seems that in order
to process the reduceByKey and flatMap operations at the same time, the
only way to resolve it was to increase the number of threads to more than 1.
Since I'm developing on my personal machine for speed, I simply updated
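(The usual form of this fix, sketched: give the local master more than one thread, e.g. local[2], so the receiver and the processing don't starve each other.)

import org.apache.spark.SparkContext

// "local[2]" = two worker threads on the local machine
val sc = new SparkContext("local[2]", "MyApp")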
Hi Christophe, another Andrew speaking.
Your configuration looks fine to me. From the stack trace it seems that we
are in fact closing the file system prematurely elsewhere in the system,
such that when it tries to write the APPLICATION_COMPLETE file it throws
the exception you see. This does
Hi:
Is there a way to find out when Spark has finished processing a text file (both
for streaming and non-streaming cases)?
Also, after processing, can Spark copy the file to another directory?
Thanks
Just to provide more information on this issue. It seems that SPARK_HOME
environment variable is causing the issue. If I unset the variable in
spark-class script and run in the local mode my code runs fine without
the exception. But if I run with SPARK_HOME, I get the exception
mentioned below. I
On Wed, Jul 2, 2014 at 11:40 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Dmitriy,
It is sweet to have the bindings, but it is very easy to downgrade the
performance with them. The BLAS/LAPACK APIs have been there for more
than 20 years and they are still the top choice for high-performance
Hi Denny,
just created https://issues.apache.org/jira/browse/SPARK-2356
On Jul 3, 2014, at 7:06 PM, Denny Lee denny.g@gmail.com wrote:
Hi Konstantin,
Could you please create a jira item at:
https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked?
Thanks,
Denny
Thanks! will take a look at this later today. HTH!
On Jul 3, 2014, at 11:09 AM, Kostiantyn Kudriavtsev
kudryavtsev.konstan...@gmail.com wrote:
Hi Denny,
just created https://issues.apache.org/jira/browse/SPARK-2356
On Jul 3, 2014, at 7:06 PM, Denny Lee denny.g@gmail.com wrote:
Hi All,
I'm a dev at Continuum and we are developing a fair amount of tooling around
Spark. A few days ago someone expressed interest in numpy+pyspark, and
Anaconda came up as a reasonable solution.
I spent a number of hours yesterday trying to rework the base Spark AMI on
EC2 but sadly was
I think I have found my answers but if anyone has thoughts please share.
After testing for a while I think the error doesn't have any effect on the
process.
I think it is the case that there must be elements left in the window from
the last run; otherwise my system is completely whack.
Please let me
Hi all,
Could you please share your best practices for writing logs in Spark? I’m
running it on YARN, so when I check the logs I’m a bit confused…
Currently, I’m using System.err.println to put a message in the log and access it
via the YARN history server. But I don’t like this way… I’d like to use
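A common alternative, sketched with log4j (which Spark 1.x ships with); messages end up in the executor/driver logs that YARN aggregates:

import org.apache.log4j.Logger

val log = Logger.getLogger(getClass.getName)
log.info("batch finished")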
Hi Ben,
Has the PYSPARK_PYTHON environment variable been set in
spark/conf/spark-env.sh to the path of the new python binary?
FYI, there's a /root/copy-dirs script that can be handy when updating
files on an already-running cluster. You'll want to restart the spark
cluster for the changes to
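(For example, a line like the following in spark/conf/spark-env.sh; the path is an assumption to be adjusted to the actual Anaconda install:)

export PYSPARK_PYTHON=/opt/anaconda/bin/python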
Doing an offset is actually pretty expensive in a distributed query engine,
so in many cases it probably makes sense to just collect and then perform
the offset as you are doing now. This is unless the offset is very large.
Another limitation here is that HiveQL does not support OFFSET. That
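A sketch of the collect-then-offset approach (table name, offset, and pageSize are hypothetical):

// LIMIT is supported even though OFFSET is not; drop the prefix locally.
// Only reasonable while offset + pageSize stays small.
val rows = sqlContext.sql("SELECT * FROM logs LIMIT " + (offset + pageSize))
  .collect()
  .drop(offset)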
Spark SQL is based on Hive 0.12.0.
On Thu, Jul 3, 2014 at 2:29 AM, Ravi Prasad raviprasa...@gmail.com wrote:
Hi ,
Can any one please help me to understand which version of Hive support
Spark and Shark
--
--
Regards,
RAVI PRASAD. T
Hi Jack,
1. Several previous instances of the 'key not valid?' error have been
attributed to memory issues, either memory allocated per executor or per
task, depending on the context. You can Google it to see some examples.
2. I think your case is similar, even though it's happening due to
Hello!
I want to play around with several different cluster settings and measure
performance for MLlib and GraphX, and was wondering if anybody here could
hit me up with datasets for these applications from 5GB onwards?
I'm mostly interested in SVM and Triangle Count, but would be glad for any
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions
For SVM there are a couple of ad-click prediction datasets of pretty large size.
For graph work, SNAP has large network data: https://snap.stanford.edu/data/
—
Sent from Mailbox
On Thu, Jul 3, 2014 at
Nick Pentreath wrote
Take a look at Kaggle competition datasets
- https://www.kaggle.com/competitions
I was looking for files in LIBSVM format and never found anything of a bigger
size on Kaggle. Most competitions I've seen need data processing and feature
generation, but maybe I have to take a
The Kaggle data is not in LIBSVM format, so you'd have to do some transformation.
The Criteo and KDD Cup datasets are, if I recall, fairly large. The Criteo ad
prediction data is around 2-3GB compressed, I think.
To my knowledge these are the largest binary classification datasets I've come
across
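Once a dataset is in LIBSVM format, loading it in MLlib is one line (path is illustrative):

import org.apache.spark.mllib.util.MLUtils

// returns an RDD[LabeledPoint]
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/criteo.libsvm")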
This will load the listed jars when the SparkContext is created.
In the case of the REPL, we define and import classes after the SparkContext is
created. According to the above-mentioned site, the Executor installs a class
loader in the 'addReplClassLoaderIfNeeded' method using the spark.repl.class.uri
configuration.
Then I will try to
Sergey,
On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov sma...@collective.com wrote:
On the other hand, under the hood the KafkaInputDStream which is created with
this KafkaUtils call calls ConsumerConnector.createMessageStreams, which
returns a Map[String, List[KafkaStream]] keyed by topic. It is,
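(Which is why a single receiver can cover several topics; a sketch, assuming an existing StreamingContext ssc and hypothetical topic names -- the Map is topic -> number of consumer threads:)

import org.apache.spark.streaming.kafka.KafkaUtils

val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group",
  Map("topic1" -> 1, "topic2" -> 1, "topic3" -> 1))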
Folks, I have a program derived from the Kafka streaming wordcount example
which works fine standalone.
Running on Mesos is not working so well. For starters, I get the error below:
No FileSystem for scheme: hdfs.
I've looked at lots of promising comments on this issue, so now I have -
*
...and a real subject line.
From: Steven Cox [s...@renci.org]
Sent: Thursday, July 03, 2014 9:21 PM
To: user@spark.apache.org
Subject:
Folks, I have a program derived from the Kafka streaming wordcount example
which works fine standalone.
Running on Mesos is
Are the hadoop configuration files on the classpath for your mesos
executors?
On Thu, Jul 3, 2014 at 6:45 PM, Steven Cox s...@renci.org wrote:
...and a real subject line.
--
*From:* Steven Cox [s...@renci.org]
*Sent:* Thursday, July 03, 2014 9:21 PM
*To:*
They weren't. They are now and the logs look a bit better - like perhaps some
serialization is completing that wasn't before.
But I still get the same error periodically. Other thoughts?
From: Soren Macbeth [so...@yieldbot.com]
Sent: Thursday, July 03, 2014 9:54
A common reason for the 'Joining ... is slow' message is that you're
joining VertexRDDs without having cached them first. This will cause Spark
to recompute them unnecessarily, and as a side effect, the same index will get
created twice, so GraphX won't be able to do an efficient zip join.
For example,
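(a sketch on a toy graph, since the original example was cut off):

import org.apache.spark.graphx._

val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)
graph.vertices.cache()

val msgs = graph.vertices.mapValues((id, attr) => attr + 1).cache()
// with both sides cached, the shared index is reused for an efficient zip join
val joined = graph.joinVertices(msgs)((id, oldAttr, msg) => oldAttr + msg)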
Would appreciate help on:
1. How to convert streaming RDD into JavaSchemaRDD
2. How to structure the driver program to do interactive SparkSQL
Using Spark 1.0 with Java.
I have streaming code that does updateStateByKey, resulting in a JavaPairDStream.
I am using JavaDStream::compute(time) to get
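A sketch of the overall shape in Scala (the Java analogues live on JavaSQLContext; stateDStream and the names here are hypothetical):

case class Record(key: String, count: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

stateDStream.foreachRDD { rdd =>
  // convert each micro-batch to a SchemaRDD and query it with SQL
  rdd.map { case (k, v) => Record(k, v) }.registerAsTable("state")
  sqlContext.sql("SELECT key, count FROM state WHERE count > 10")
    .collect().foreach(println)
}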
Most likely you are missing the hadoop configuration files (present in
conf/*.xml).
Thanks
Best Regards
On Fri, Jul 4, 2014 at 7:38 AM, Steven Cox s...@renci.org wrote:
They weren't. They are now and the logs look a bit better - like perhaps
some serialization is completing that wasn't
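If shipping the XML files to every executor is awkward, a hedged fallback is to register the HDFS implementation explicitly on the Hadoop configuration:

// assumption: hadoop-hdfs jars are on the classpath, only the mapping is missing
sc.hadoopConfiguration.set("fs.hdfs.impl",
  classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)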