Yes, the problem is that the Java API inadvertently requires an
Iterable return value, not an Iterator:
https://issues.apache.org/jira/browse/SPARK-3369 I think this can't be
fixed until Spark 2.x.
It seems possible to cheat and return a wrapper like the
IteratorIterable I posted in the JIRA. You
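For reference, a minimal sketch of what such a wrapper can look like (the actual IteratorIterable is in the JIRA; this one is illustrative, and like any such cheat it can only be traversed once):
import java.util.{Iterator => JIterator}
// Exposes a one-shot Iterator through the Iterable interface.
class IteratorIterable[T](it: JIterator[T]) extends java.lang.Iterable[T] {
  override def iterator(): JIterator[T] = it
}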
Hi Yana,
Thanks for your kind response. My question was indeed unclear.
What I want to do is join a state stream, which is the
*updateStateByKey* output
of the last run.
*updateStateByKey* is useful if the application logic doesn't (heavily) rely on
states, so that you can run the application without
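For context, a minimal hedged sketch of updateStateByKey keeping a running sum per key (pairStream and its types are assumed, and checkpointing must be enabled on the StreamingContext):
// pairStream: an assumed DStream[(String, Int)];
// requires ssc.checkpoint(...) to have been set.
val state = pairStream.updateStateByKey[Int] {
  (newValues: Seq[Int], running: Option[Int]) =>
    Some(running.getOrElse(0) + newValues.sum)
}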
Hi Yana,
So user quotas need another data store, one that can guarantee persistence
and afford frequent data updates/access. Is that correct?
Thanks,
Chia-Chun
2014-10-01 21:48 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com:
I don't think persist is meant for end-user usage. You might want to
This looks like you are just running your own program. To run Spark
programs, you use spark-submit. It has options that control the
executor and driver memory. The settings below are not affecting
Spark.
On Wed, Oct 1, 2014 at 10:21 PM, 陈韵竹 anny9...@gmail.com wrote:
Thanks Sean. This is how I
I am noticing disparities in behavior between the REPL and my standalone
program in terms of implicit conversion of an RDD to SchemaRDD.
In the REPL the following sequence works:
import sqlContext._
val mySchemaRDD = myNormalRDD.where(1=1)
However, when attempting something similar in a standalone
Hi,
Does anyone have a clue about sending messages to a specific vertex (not an
immediate neighbour) whose vId is stored in a property of the source vertex,
using the Pregel API?
More precisely, how can this be done in sendMessage()? Is it possible
to pass more general Triplets into the above function?
(Obviously we can do it
How about this?
Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;")
On Tue, Sep 30, 2014 at 5:33 PM, Andras Barjak
andras.bar...@lynxanalytics.com wrote:
Hi,
what is the correct Scala code to register an Array of this private Spark
class with Kryo?
I used this solution to get the class name correctly at runtime:
kryo.register(ClassTag(Class.forName("org.apache.spark.util.collection.CompactBuffer")).wrap.runtimeClass)
2014-10-02 12:50 GMT+02:00 Daniel Darabos daniel.dara...@lynxanalytics.com:
How about this?
Hi, here at Sematext we are almost done with Spark monitoring
http://www.sematext.com/spm/index.html
But we need one thing from Spark, something like
https://groups.google.com/forum/#!topic/storm-user/2fNCF341yqU in Storm:
a 'placeholder' in the java opts which Spark will fill in for each
executor,
Yes -- persist is more akin to caching -- it's telling Spark to materialize
that RDD for fast reuse but it's not meant for the end user to query/use
across processes, etc. (at least that's my understanding).
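In code terms it just marks the RDD for reuse within the same application (the storage level shown is illustrative, and myRdd is an assumed RDD):
import org.apache.spark.storage.StorageLevel
// Materialized on the first action; later actions in this process reuse it.
myRdd.persist(StorageLevel.MEMORY_ONLY)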
On Thu, Oct 2, 2014 at 4:04 AM, Chia-Chun Shih chiachun.s...@gmail.com
wrote:
Hi Yana,
Hi Mayur,
Thanks for your suggestion.
In fact, that's what I'm thinking about: to pass that data and return only the
percentage of outliers in a particular window.
I also have some doubts about implementing the outlier detection on an RDD as
you have suggested.
From what I understand, those
You don't need to do anything special to run in local mode from within
Eclipse. Just create a simple SparkConf and create a SparkContext from
that. I have unit tests which execute on a local SparkContext, and they
work from inside Eclipse as well as SBT.
val conf = new
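For reference, a minimal hedged sketch of that setup (the app name and the tiny assertion are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
// A plain local-mode context, runnable from Eclipse or SBT alike.
val conf = new SparkConf().setMaster("local").setAppName("unit-test")
val sc = new SparkContext(conf)
try {
  assert(sc.parallelize(1 to 10).count() == 10)
} finally {
  sc.stop()
}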
Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC
server that comes with Spark 1.1.0.
However, I observed that conditional functions do not work (I tried 'case'
and 'coalesce');
some string functions like 'concat' also did not work.
Is there a list of what's missing, or a
Hello Mark,
I am no expert but I can answer some of your questions.
On Oct 2, 2014 2:15 AM, Mark Mandel mark.man...@gmail.com wrote:
Hi,
So I'm super confused about how to take my Spark code and actually deploy
and run it on a cluster.
Let's assume I'm writing in Java, and we'll take a
If you are using textFile() to read data in, it also takes a parameter for
the minimum number of partitions to create. Would that not work for you?
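For example (the path and partition count are illustrative):
// Ask textFile for at least 16 partitions when reading.
val data = sc.textFile("hdfs:///path/to/input", 16)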
On Oct 2, 2014 7:00 AM, jamborta jambo...@gmail.com wrote:
Hi all,
I have been testing repartitioning to ensure that my algorithms get similar
Here is the specific code:
val sc = new SparkContext(s"local[$NWorkers]", "HBaseTestsSparkContext")
val ctx = new SQLContext(sc)
import ctx._
case class MyTable(col1: String, col2: Byte)
val myRows = ctx.sparkContext.parallelize(Range(1, 21).map { ix =>
MyTable(s"col1$ix",
That would work - I normally use Hive queries through Spark SQL; I
have not seen something like that there.
On Thu, Oct 2, 2014 at 3:13 PM, Ashish Jain ashish@gmail.com wrote:
If you are using textFile() to read data in, it also takes a parameter for
the minimum number of partitions to
Hi Tamas,
Can you try to set mapred.map.tasks and see if it works?
Thanks,
Yin
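For instance, a hedged sketch (hiveContext is an assumed HiveContext, and the value 8 is purely illustrative):
// Hive/Hadoop properties can be set through a SQL SET command.
hiveContext.sql("SET mapred.map.tasks=8")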
On Thu, Oct 2, 2014 at 10:33 AM, Tamas Jambor jambo...@gmail.com wrote:
That would work - I normally use Hive queries through Spark SQL; I
have not seen something like that there.
On Thu, Oct 2, 2014 at 3:13
Hi all,
I successfully implemented my algorithm in Scala but my team wants it in
Java. I have a problem with generics; can anyone help me?
I have a first JavaPairRDD with a structure like ((ean, key), [from, to,
value]):
* ean and key are Strings
* from and to are DateTimes
* value is a
I am seeing this same issue. Bumping for visibility.
Hi,
Currently the history server provides application details only for
successfully completed jobs (where the APPLICATION_COMPLETE file is
generated). However, (long-running) jobs that we terminate manually, or
failed jobs where the APPLICATION_COMPLETE file may not be generated, don't
show up on
You may want to take a look at this PR:
https://github.com/apache/spark/pull/1558
Long story short: while it's not a terrible idea to show running
applications, your particular case should be solved differently.
Applications are responsible for calling SparkContext.stop() at the
end of their run,
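To make that concrete, a minimal hedged sketch of the pattern the reply describes (the job body is illustrative):
import org.apache.spark.{SparkConf, SparkContext}
object MyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyJob"))
    try {
      // ... application logic (illustrative) ...
    } finally {
      // Stopping the context is what marks the application as complete,
      // so the history server will list the run as finished.
      sc.stop()
    }
  }
}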
Hi tsingfu,
I want to see metrics in Ganglia too,
but I don't understand this step:
./make-distribution.sh --tgz --skip-java-test -Phadoop-2.3 -Pyarn -Phive
-Pspark-ganglia-lgpl
Are you installing Hadoop, YARN, Hive AND Ganglia?
What if I want to install just Ganglia?
Can you suggest me
I'm seeing a lot of Akka timeouts which eventually lead to job failure in
spark streaming when removing blocks (Example stack trace below). It appears
to be related to these issues: SPARK-3015
https://issues.apache.org/jira/browse/SPARK-3015 and SPARK-3139
All,
I am having trouble getting a sequence file sorted. My sequence file is
(Text, Text), and when trying to sort it, Spark complains that it cannot,
because Text is not serializable. To get around this issue, I performed a
map on the sequence file to turn it into (String, String). I then
Hi,
I am sure you can use just the -Pspark-ganglia-lgpl switch to enable Ganglia.
The other profiles only add support for Hadoop, YARN, Hive et al. to the Spark
executable; there is no need to include them if you are not using them.
Cheers
k/
On Thu, Oct 2, 2014 at 12:29 PM, danilopds danilob...@gmail.com wrote:
Hi
Hi,
In Spark 1.1 HiveContext, I ran a command to create a partitioned table,
followed by a cache table command, and got a java.sql.SQLSyntaxErrorException:
Table/View 'PARTITIONS' does not exist. But cache table worked fine if the
table was not a partitioned table.
Can anybody confirm that caching of
OK Krishna Sankar,
In relation to this information on the Spark monitoring webpage:
"For sbt users, set the SPARK_GANGLIA_LGPL environment variable before
building. For Maven users, enable the -Pspark-ganglia-lgpl profile."
Do you know what I need to do to install with sbt?
Thanks.
Hi Yin,
Thanks for the reply. I found the section as well a couple of days ago and
managed to integrate es-hadoop with Spark SQL [1].
Cheers,
[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
On 10/2/14 6:32 PM, Yin Huai wrote:
Hi Costin,
I am answering
Hi,
I am trying to extract the number of distinct users from a file using Spark
SQL, but I am getting the following error:
ERROR Executor: Exception in task 1.0 in stage 8.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 1
I am following the code in examples/sql/RDDRelation.scala. My
The bug is likely in your data. Do you have lines in your input file that
do not contain the \t character? If so, .split will return only a single
element, and p(1) from the .map() is going to throw
java.lang.ArrayIndexOutOfBoundsException: 1
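A hedged sketch of one way to guard against such lines (inp_file and the tab separator come from this thread; the filter itself is illustrative):
// Keep only lines that actually split into at least two fields,
// so p(1) can never go out of bounds.
val rows = sc.textFile(inp_file)
  .map(_.split("\t"))
  .filter(_.length >= 2)
  .map(p => (p(0), p(1)))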
On Thu, Oct 2, 2014 at 3:35 PM, SK
-- Forwarded message --
From: Liquan Pei liquan...@gmail.com
Date: Thu, Oct 2, 2014 at 3:42 PM
Subject: Re: Spark SQL: ArrayIndexOutofBoundsException
To: SK skrishna...@gmail.com
There is only one place where you use index 1. One possible issue is that your
split result may have only one element
Thanks for the help. Yes, I did not realize that the first header line has a
different separator.
By the way, is there a way to drop the first line that contains the header?
Something along the following lines:
sc.textFile(inp_file)
.drop(1) // or tail() to drop the header
You can do a filter with startsWith?
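Something like this, assuming the header's first token is known ("userId" here is a hypothetical stand-in):
// Drop the header by filtering out the line that starts with it.
val data = sc.textFile(inp_file).filter(line => !line.startsWith("userId"))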
On Thu, Oct 2, 2014 at 4:04 PM, SK skrishna...@gmail.com wrote:
Thanks for the help. Yes, I did not realize that the first header line has a
different separator.
By the way, is there a way to drop the first line that contains the header?
Something along
Hello, I'm trying to use Spark to process a large number of files in S3.
I'm running into an issue that I believe is related to the high number of
files, and the resources required to build the listing within the driver
program. If anyone in the Spark community can provide insight or guidance,
it
I believe this is known as the Hadoop Small Files Problem, and it affects
Spark as well. The best approach I've seen to merging small files like this
is by using s3distcp, as suggested here
http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/,
as a pre-processing
Hi all,
I have a job that runs for about 15 mins, and at some point I get an error on
both nodes (all executors) saying:
14/10/02 23:14:38 WARN TaskSetManager: Lost task 80.0 in stage 3.0 (TID 253,
backend-tes): ExecutorLostFailure (executor lost)
In the end, it seems that the job recovers and
Here is the code in question:
import org.apache.hadoop.io.Text
// read in the hadoop sequence file to sort
val file = sc.sequenceFile(input, classOf[Text], classOf[Text])
// this is the code we would like to avoid: it maps the Hadoop Text input to
// Strings so that sortByKey will run
file.map { case (k, v) => (k.toString(), v.toString()) }.sortByKey()
Those logs you included are from the Spark executor processes, as opposed
to the YARN NodeManager processes.
If you don't think you have access to the NodeManager logs, I would try
setting spark.yarn.executor.memoryOverhead to something like 1024 or 2048
and seeing if that helps. If it does,
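If it helps, a hedged way to set that from code (the value is just a starting point to experiment with):
import org.apache.spark.SparkConf
// Reserve extra off-heap headroom (in MB) per executor for YARN's
// container memory check; 1024 is illustrative.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")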
Hi,
Would anybody know how to get the following information from HiveContext given
a Hive table name?
- partition key(s)
- table directory
- input/output format
I am new to Spark, and I have a couple of tables created using Parquet data, like:
CREATE EXTERNAL TABLE parquet_table (
COL1 string,
Hi,
I am trying to play around with Spark and Spark SQL.
I have logs being stored in HDFS in 10-minute windows. Each 10-minute
window could have as many as 10 files, with random names, of 2GB each.
Now, I want to run some analysis on these files. These files are parquet
files.
I am trying to run
parquetFile accepts a comma-separated list of files.
Also, unionAll does not write to disk. However, unless you are running a
recent version (compiled from master, since this was added in
https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd),
it's missing an optimization and
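For example, a hedged sketch of the comma-separated form (sqlContext and the paths are illustrative):
// One call reads both files; the list is passed as a single string.
val combined = sqlContext.parquetFile(
  "hdfs:///logs/window1.parquet,hdfs:///logs/window2.parquet")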
We actually leave all the DDL commands up to Hive, so there is no
programmatic way to access the things you are looking for.
On Thu, Oct 2, 2014 at 5:17 PM, Banias calvi...@yahoo.com.invalid wrote:
Hi,
Would anybody know how to get the following information from HiveContext
given a Hive table
This is hard to do in general, but you can get what you are asking for by
putting the following class in scope:
implicit class BetterRDD[A: scala.reflect.ClassTag](rdd:
org.apache.spark.rdd.RDD[A]) {
  def dropOne = rdd.mapPartitionsWithIndex((i, iter) =>
    if (i == 0 && iter.hasNext) { iter.next; iter } else iter)
}
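With that implicit in scope, dropping a header line would look something like this (the path is illustrative):
val noHeader = sc.textFile("data.csv").dropOne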
Hi Arun,
Have you found a solution? It seems that I have the same problem.
Thanks,
Have you found a solution to this problem? (Or at least a cause?)
thanks,
Cache table works with partitioned tables.
I guess you're experimenting with a default local metastore and the
metastore_db directory doesn't exist in the first place. In this case,
none of the metastore tables/views exist at first, and the error
message you saw is thrown when the PARTITIONS
Thanks Michael.
On Thursday, October 2, 2014 8:41 PM, Michael Armbrust mich...@databricks.com
wrote:
We actually leave all the DDL commands up to Hive, so there is no programmatic
way to access the things you are looking for.
On Thu, Oct 2, 2014 at 5:17 PM, Banias
I have rebuilt the package with -Phive
and copied hive-site.xml to conf (I am using hive-0.12).
When I run ./bin/spark-sql, I get a java.lang.NoSuchMethodError for every
command.
What am I missing here?
Could somebody share the right procedure to make it work?
java.lang.NoSuchMethodError:
Consider that there is some connection / external resource allocation that
needs to be accessed/mutated by each of the rows from within a single worker
thread. That connection should be opened only before the first row
is accessed, and closed only after the last row is completed.
It is my understanding that
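The usual way to get those per-partition open/close semantics is mapPartitions; here is a hedged sketch where rdd, openConnection, and process are all hypothetical stand-ins for the resource in question:
// Open one connection per partition, use it for every row in that
// partition, and close it afterwards. The .toList forces the rows to be
// processed before the finally block closes the connection (at the cost
// of materializing the partition in memory).
val results = rdd.mapPartitions { rows =>
  val conn = openConnection()
  try {
    rows.map(row => process(conn, row)).toList.iterator
  } finally {
    conn.close()
  }
}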