Here's a Java version:
https://github.com/cloudera/parquet-examples/tree/master/MapReduce It shouldn't
be that hard to port it to Scala.
Thanks
Best Regards
On Mon, Mar 9, 2015 at 9:55 PM, Shuai Zheng szheng.c...@gmail.com wrote:
Hi All,
I have a lot of Parquet files, and I am trying to open them
Don't you think 1,000 partitions is too few for 160GB of data? You could also try
using KryoSerializer and enabling RDD compression.
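A minimal sketch of those two settings, assuming a SparkConf-based setup (the app name and partition count below are made-up examples):

import org.apache.spark.{SparkConf, SparkContext}

// Enable Kryo serialization and RDD compression before creating the context
val conf = new SparkConf()
  .setAppName("parquet-sort")   // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

// and use far more than 1000 partitions for ~160GB of input, e.g.:
// val repartitioned = rdd.repartition(4000)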
Thanks
Best Regards
On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x m...@spokeo.com wrote:
I'm basically running a sort using Spark. The Spark program will read from
HDFS,
Thanks for the quick reply.
I am running the application in YARN client mode.
I want to run the AM on the same node as the RM, in order to make use of the node
that would otherwise run the AM.
How can I get the AM to run on the same node as the RM?
On Tue, Mar 10, 2015 at 3:49 PM, Sean Owen so...@cloudera.com wrote:
Hi,
I need to develop a couple of UDAFs and use them in Spark SQL. While UDFs
can be registered as functions in HiveContext, I could not find any
documentation on how UDAFs can be registered in the HiveContext. So far
what I have found is to make a JAR file out of the developed UDAF class, and
I suppose you just provision enough resource to run both on that
node... but it really shouldn't matter. The RM and your AM aren't
communicating heavily.
On Tue, Mar 10, 2015 at 10:23 AM, Harika Matha matha.har...@gmail.com wrote:
Thanks for the quick reply.
I am running the application in
I'm using Spark 1.3.0 RC3 build with Hive support.
In Spark Shell, I want to reuse the HiveContext instance with different
warehouse locations. Below are the steps for my test (assume I have
loaded a file into table src).
==
15/03/10 18:22:59 INFO SparkILoop: Created sql context (with
Hi all,
I have a Spark cluster set up on YARN with 4 nodes (1 master and 3 slaves). When
I run an application, YARN chooses, at random, one Application Master from
among the slaves. This means that my final computation is being carried out
on only two slaves. This decreases the performance of the
This is more of an aside, but why repartition this data instead of letting
it define partitions naturally? You will end up with a similar number.
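For illustration only (not the poster's actual job), a sketch of sorting while letting the HDFS input define the partitions; the path and tab-delimited key are assumptions:

import org.apache.spark.SparkContext._   // pair-RDD implicits for pre-1.3 Spark

// Partitions follow the HDFS splits; no explicit repartition before the sort
val lines = sc.textFile("hdfs:///input/path")
val sorted = lines.map(line => (line.split("\t")(0), line)).sortByKey()
sorted.saveAsTextFile("hdfs:///output/path")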
On Mar 9, 2015 5:32 PM, mingweili0x m...@spokeo.com wrote:
I'm basically running a sort using Spark. The Spark program will read from
HDFS, sort
In YARN cluster mode, there is no Spark master, since YARN is your
resource manager. Yes, you could force your AM somehow to run on the
same node as the RM, but why -- what do you think is faster about that?
On Tue, Mar 10, 2015 at 10:06 AM, Harika matha.har...@gmail.com wrote:
Hi all,
I have Spark
Hi,
Can anyone give an idea about this?
Just did some Google searching; it seems related to the 2GB limitation on
block size: https://issues.apache.org/jira/browse/SPARK-1476.
The whole process is that:
1. load the data
2. convert each line of data into labeled points using some feature hashing
I think the workaround is clear:
use JDK 7 and implement your own saveAsRemoteWinText() using java.nio.file.Path.
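A rough, hedged sketch of what that could look like (the method name comes from the thread; the UNC path and driver-side collection are my assumptions):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.rdd.RDD

// Sketch: write an RDD of lines to a Windows UNC path from the driver via java.nio.file.
// toLocalIterator streams one partition at a time to the driver, so this only suits modest outputs.
def saveAsRemoteWinText(rdd: RDD[String], uncPath: String): Unit = {
  val path = Paths.get(uncPath)   // e.g. "\\\\server\\share\\out.txt" (hypothetical)
  val writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)
  try {
    rdd.toLocalIterator.foreach { line =>
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writer.close()
  }
}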
Yong
From: ningjun.w...@lexisnexis.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: RE: sc.textFile() on windows cannot access UNC path
Date: Tue, 10 Mar 2015 03:02:37
You can add the additional jar when submitting your job, something like:
./bin/spark-submit --jars xx.jar …
More options can be listed by just typing ./bin/spark-submit
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 10, 2015 8:48 PM
To: user@spark.apache.org
Subject: Does
Currently, Spark SQL doesn't provide an interface for developing custom UDTFs,
but it can work seamlessly with Hive UDTFs.
I am working on the UDTF refactoring for Spark SQL; hopefully it will provide a
Hive-independent UDTF interface soon after that.
From: shahab [mailto:shahab.mok...@gmail.com]
Sent:
Hi,
Does anyone know how to deploy a custom UDAF jar file in Spark SQL? Where
should I put the jar file so Spark SQL can pick it up and make it accessible
to Spark SQL applications?
I do not use spark-shell; instead I want to use it in a Spark application.
best,
/Shahab
Thanks TD and Jerry for the suggestions. I have done some experiments and worked
out a reasonable solution to the problem of spreading receivers across a set of
worker hosts. It would be a bit too tedious to document in an email, so I discuss
the solution in a blog:
Hi,
On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:
I am new to the SchemaRDD class, and I am trying to decide between using SQL
queries or Language Integrated Queries (
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
).
Can someone
Probably the cleanup work, like cleaning shuffle files and tmp files, costs too much
CPU. If we run Spark Streaming for a long time, lots of files
will be generated, so cleaning up these files before the app exits could be
time-consuming.
Thanks
Jerry
2015-03-11 10:43 GMT+08:00 Tathagata Das
I'm trying to play with the implementation of the least squares solver (Ax = b)
in mlmatrix.TSQR, where A is a 5*1024 matrix and b a 5*10 matrix.
It works, but I notice that it's 8 times slower than the implementation given
in the latest ampcamp:
Hi Harika,
Did you get any solution for this?
I want to use YARN, but the spark-ec2 script does not support it.
Thanks
-Roni
Spark SQL supports a subset of HiveQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com wrote:
From the archives of this user list, it seems that Spark SQL has yet to
achieve SQL-92
If you are using tools like SBT/Maven/Gradle/etc., they figure out all the
recursive dependencies and include them in the classpath. I haven't
touched Eclipse in years, so I am not sure off the top of my head what's
going on instead. Just in case you only downloaded the
spark-streaming_2.10.jar
Thanks Hao,
But my question concerns UDAFs (user-defined aggregate functions), not
UDTFs (user-defined table-generating functions).
I would appreciate it if you could point me to a starting point for UDAF
development in Spark.
Thanks
Shahab
On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote:
You have to include Scala libraries in the Eclipse dependencies.
TD
On Tue, Mar 10, 2015 at 10:54 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying out the streaming example as documented, and I am using Spark 1.2.1
streaming from Maven for Java.
When I add this code I get a compilation
Thank you for your reply.
1. Which version of Spark do you use now?
I use Spark 1.2.0. (CDH 5.3.1)
2. Why don't you check in the Web UI whether `productJavaRDD` and `userJavaRDD` are
cached or not?
I checked the Spark UI.
The task stopped at 1/2 (Succeeded/Total tasks).
Here is
How do I do that? I haven't used Scala before.
Also, the linking page doesn't mention that:
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking
On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote:
It means you do not have Scala library classes in your
Do you have event logging enabled?
That could be the problem. The Master tries to aggressively recreate the
web UI of the completed job from the event logs (when it is enabled),
causing the Master to stall.
I created a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-6270
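If that is indeed the cause, one hedged way to confirm is to turn event logging off for the streaming app and see whether the stall disappears:

import org.apache.spark.SparkConf

// Sketch: disable event logging for this application (app name is hypothetical)
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.eventLog.enabled", "false")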
On Tue, Mar 10,
Have you tried Apache Daemon?
http://commons.apache.org/proper/commons-daemon/procrun.html
From: Wang, Ningjun (LNG-NPV)
Date: Tuesday, March 10, 2015 at 11:47 PM
To: user@spark.apache.org
Subject: Is it possible to use windows service to start and stop spark
Hi All,
I am hoping someone has seen this issue before with S3, as I haven't been
able to find a solution to this problem.
When I try to save a text file to S3 under a subfolder, it only ever writes
out to the bucket-level folder
and produces block-level generated file names rather than my output
Hi,
On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote:
Intel has a prototype for doing this, SaiSai and Jason are the authors.
Probably you can ask them for some materials.
The github repository is here: https://github.com/intel-spark/stream-sql
Also, what I did is
Thanks Michael,
That helps. So, just to summarise: we should not make any assumptions
about Spark being fully compliant with any SQL standard until announced by
the community, and should maintain the status quo as you have suggested.
Regards,
Ravi.
On Tue, Mar 10, 2015 at 11:14 PM Michael
We are using a Spark standalone cluster on Windows 2008 R2. I can start Spark
clusters by opening a command prompt and running the following:
bin\spark-class.cmd org.apache.spark.deploy.master.Master
bin\spark-class.cmd org.apache.spark.deploy.worker.Worker
spark://mywin.mydomain.com:7077
I can stop
There isn't a great way currently. The best option is probably to convert
to scipy.sparse column vectors and add using scipy.
Joseph
On Mon, Mar 9, 2015 at 4:21 PM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Hi,
Sorry to ask this, but how do I compute the sum of 2 (or more) mllib
Hey,
Recently, we found in our cluster that when we kill a Spark Streaming
app, the whole cluster cannot respond for 10 minutes.
We investigated the master node, and found that the master process
consumes 100% CPU when we kill the Spark Streaming app.
How could this happen? Did
Hi All,
I need some help with a problem in PySpark which is causing a major issue.
Recently I've noticed that the behaviour of the Python daemons on the worker
nodes for compute-intensive tasks has changed from using all the available
cores to using only a single core. On each worker node, 8
Hi Holden
Thanks Holden for pointing me to the package. Indeed, the StreamingSuiteBase
trait hides a lot, especially regarding clock manipulation. Did you
encounter problems with concurrent test execution from SBT
(SPARK-2243)? I had to disable parallel execution and configure SBT to
use a separate JVM
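For reference, the kind of build.sbt settings I mean (sbt 0.13 syntax; a sketch, not the exact configuration from the thread):

// Run tests sequentially and in a forked JVM to avoid clashes between SparkContexts
parallelExecution in Test := false
fork in Test := true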
I am getting the following error. When I look at the sources it seems to be a
Scala source, but I am not sure why it's complaining about it.
The method map(Function<String,R>) in the type JavaDStream<String> is not
applicable for the arguments (new
PairFunction<String,String,Integer>(){})
And my code has
Hi,
I have a CDH 5.3.2 (Spark 1.2) cluster.
I am getting a local class incompatible exception for my Spark
application during an action.
All my classes are case classes (to the best of my knowledge).
I appreciate any help.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to
Hi All,
I am currently trying to write a very wide file into Parquet using Spark
SQL. I have records with 100K columns that I am trying to write out, but of
course I am running into space issues (out of memory: heap space). I was
wondering if there are any tweaks or workarounds for this.
I am
I navigated to the Maven dependencies and found the Scala library. I also found
Tuple2.class, and when I click on it in Eclipse I get "invalid LOC header
(bad signature)":
java.util.zip.ZipException: invalid LOC header (bad signature)
at java.util.zip.ZipFile.read(Native Method)
I am wondering if I should
On Tue, Mar 10, 2015 at 1:18 PM, Marcin Kuthan marcin.kut...@gmail.com
wrote:
Hi Holden
Thanks Holden for pointing me to the package. Indeed, the StreamingSuiteBase
trait hides a lot, especially regarding clock manipulation. Did you
encounter problems with concurrent test execution from SBT
Ah, that's a typo in the example: use words.mapToPair
I can make a little PR to fix that.
On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
I am getting the following error. When I look at the sources it seems to be a
Scala source, but I am not sure why it's complaining about
In Spark 1.2 I used to be able to do this:
scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
res30: org.apache.spark.sql.catalyst.types.DataType =
StructType(List(StructField(int,LongType,true)))
That is, the name of a column can be a keyword like int. This is no
I am new to the SchemaRDD class, and I am trying to decide between using SQL
queries or Language Integrated Queries (
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
).
Can someone tell me what is the main difference between the two approaches,
besides using
Or another option is to use Scala-IDE, which is built on top of Eclipse,
instead of pure Eclipse, so Scala comes with it.
Yong
From: so...@cloudera.com
Date: Tue, 10 Mar 2015 18:40:44 +
Subject: Re: Compilation error
To: mohitanch...@gmail.com
CC: t...@databricks.com;
I am using maven and my dependency looks like this, but this doesn't seem
to be working
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
See if you can import scala libraries in your project.
On Tue, Mar 10, 2015 at 11:32 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am using maven and my dependency looks like this, but this doesn't seem
to be working
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
Hello,
I'm new to EC2. I've set up a Spark cluster on EC2 and am using
persistent-hdfs with the data nodes mounting EBS. I launched my cluster
using spot instances:
./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z
us-east-1c --spark-version=1.2.0 --spot-price=.0321
Hi All,
I am trying to pass parameters to spark-shell when doing some tests:
spark-shell --driver-memory 512M --executor-memory 4G --master
spark://:7077 --conf spark.sql.parquet.compression.codec=snappy --conf
spark.sql.parquet.binaryAsString=true
This works fine on my local pc. And
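For what it's worth, the same two Parquet settings can also be applied from inside the shell on the SQLContext, which can help when checking whether the --conf flags are actually reaching the cluster (a hedged suggestion, not a confirmed fix):

// Inside spark-shell; sqlContext is the SQLContext instance the shell creates
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")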
Harika,
I think you can modify an existing spark-ec2 cluster to run YARN MapReduce;
not sure if this is what you are looking for.
To try:
1) logon to master
2) go into either ephemeral-hdfs/conf/ or persistent-hdfs/conf/
and add this to mapred-site.xml :
<property>
I ran the dependency command and see the following dependencies:
I only see org.scala-lang.
[INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT
[INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.2.0:compile
[INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] | |
From the archives of this user list, it seems that Spark SQL has yet to
achieve SQL-92 level. But a few things are still not clear.
1. This is from an old post dated Aug 09, 2014.
2. It clearly says that it doesn't support DDL and DML operations. Does
that mean all reads (select) are sql
It would be good if you could explain the entire use case: what kind of
requests, what sort of processing, etc.
Thanks
Best Regards
On Mon, Mar 9, 2015 at 11:18 PM, Tarun Garg bigdat...@live.com wrote:
Hi,
I have an existing web-based system which receives requests and processes
them. This
Are you using Spark SQL for the join? In that case I'm not quite sure you
have a lot of options to join on the nearest coordinate. If you are using
normal Spark code (by creating a key pair on lat,lon) you can apply
certain logic like trimming the lat,lon, etc. If you want more specific
computing
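As an illustration of the "trimming" idea (the grid size, field layout, and RDD names are assumptions), the sketch below pairs records that fall into the same coarse lat/lon cell and leaves the exact nearest-neighbour refinement to a later step:

import org.apache.spark.SparkContext._   // pair-RDD implicits for pre-1.3 Spark

// Key each record by a coarse grid cell: lat/lon rounded to two decimal places
def cellKey(lat: Double, lon: Double): (Long, Long) =
  (math.round(lat * 100), math.round(lon * 100))

// leftRdd and rightRdd are hypothetical RDD[(Double, Double, String)] inputs
val left  = leftRdd.map  { case (lat, lon, v) => (cellKey(lat, lon), v) }
val right = rightRdd.map { case (lat, lon, v) => (cellKey(lat, lon), v) }

// Candidates that share a cell; refine by true distance afterwards
val candidates = left.join(right)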
What I found from a quick search of the Spark source code (from my local
snapshot on January 25, 2015):
// Interval between each check for event log updates
private val UPDATE_INTERVAL_MS = conf.getInt("spark.history.fs.updateInterval",
  conf.getInt("spark.history.updateInterval", 10)) * 1000
I would expect a base trait for testing purposes in the Spark distribution.
ManualClock should be exposed as well, along with some documentation on how to
configure SBT to avoid problems with multiple Spark contexts. I'm
going to create an improvement proposal on the Spark issue tracker about it.
Right now I
Thank you Charles and Meethu.
On Tue, Mar 10, 2015 at 12:47 AM, Charles Feduke charles.fed...@gmail.com
wrote:
What I found from a quick search of the Spark source code (from my local
snapshot on January 25, 2015):
// Interval between each check for event log updates
private val
Hi,
I am trying to understand the Hadoop map method compared to the Spark map, and I
noticed that the Spark map only receives 3 arguments: 1) input value, 2) output
key, 3) output value; however, the Hadoop map has 4: 1) input key, 2)
input value, 3) output key, 4) output value. Is there any reason it was
Works now. I should have checked :)
On Tue, Mar 10, 2015 at 1:44 PM, Sean Owen so...@cloudera.com wrote:
Ah, that's a typo in the example: use words.mapToPair
I can make a little PR to fix that.
On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am getting
Thanks for reporting. This was a result of a change to our DDL parser that
resulted in types becoming reserved words. I've filed a JIRA and will
investigate if this is something we can fix.
https://issues.apache.org/jira/browse/SPARK-6250
On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe
They should have the same performance, as they are compiled down to the
same execution plan.
Note that starting in Spark 1.3, SchemaRDD is renamed DataFrame:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
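As a quick, hedged illustration of that equivalence, assuming a SchemaRDD called people with name and age columns, the sqlContext implicits imported, and people registered as a temp table (all of which are my assumptions), both of these should compile down to the same plan:

// SQL string query (assumes people.registerTempTable("people") has been called)
val bySql = sqlContext.sql("SELECT name FROM people WHERE age > 21")

// Language-integrated query on the SchemaRDD (Spark 1.2 DSL)
val byDsl = people.where('age > 21).select('name)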
On Tue, Mar 10, 2015 at 2:13
There are some techniques you can use if you geohash
(http://en.wikipedia.org/wiki/Geohash) the lat-lngs. They will naturally be
sorted by proximity (with some edge cases, so watch out). If you go the join
route, either by trimming the lat-lngs or geohashing them, you're essentially
grouping
import com.google.gson.{GsonBuilder, JsonParser}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
/**
* Examine the collected tweets and trains a model based on
Does anyone know how to get the HighlyCompressedMapStatus to compile?
I will try turning off kryo in 1.2.0 and hope things don't break. I want
to benefit from the MapOutputTracker fix in 1.2.0.
On Tue, Mar 3, 2015 at 5:41 AM, Imran Rashid iras...@cloudera.com wrote:
the scala syntax for
Hi Nitay,
Can you try using backticks to quote the column name? Like
org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<`int`:bigint>")?
Thanks,
Yin
On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust mich...@databricks.com
wrote:
Thanks for reporting. This was a result of a change
Does Spark Streaming also support SQL? Something like how Esper does CEP.
Oh, sorry, my bad. Currently Spark SQL doesn't provide a user interface for
UDAFs, but it can work seamlessly with Hive UDAFs (via HiveContext).
I am also working on the UDAF interface refactoring; after that we can provide
a custom interface for extension.
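For reference, a hedged sketch of registering an existing Hive UDAF class through HiveContext (the jar path, function name, class name, and the src table are hypothetical):

// Register a Hive UDAF from a jar and call it from Spark SQL (HiveContext, Spark 1.2/1.3)
hiveContext.sql("ADD JAR /path/to/my-udafs.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION my_agg AS 'com.example.hive.MyAggUDAF'")
val result = hiveContext.sql("SELECT key, my_agg(value) FROM src GROUP BY key")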
I am not so sure whether Hive supports changing the metastore after it is
initialized; I guess not. Spark SQL relies totally on the Hive Metastore in
HiveContext; probably that's why it doesn't work as expected for Q1.
BTW, in most cases, people configure the metastore settings in
hive-site.xml, and will not
I have a Hadoop InputFormat which reads records and produces
JavaPairRDD<String,String> locatedData where
_1() is a formatted version of the file location - like
12690,, 24386 .27523 ...
_2() is the data to be processed.
For historical reasons I want to convert _1() into an integer
Intel has a prototype for doing this, SaiSai and Jason are the authors.
Probably you can ask them for some materials.
From: Mohit Anchlia [mailto:mohitanch...@gmail.com]
Sent: Wednesday, March 11, 2015 8:12 AM
To: user@spark.apache.org
Subject: SQL with Spark Streaming
Does Spark Streaming also