Re: What can be done if a FlatMapFunction generates more data than can be held in memory

2014-10-02 Thread Sean Owen
Yes, the problem is that the Java API inadvertently requires an Iterable return value, not an Iterator: https://issues.apache.org/jira/browse/SPARK-3369 I think this can't be fixed until Spark 2.x. It seems possible to cheat and return a wrapper like the IteratorIterable I posted in the JIRA. You
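For readers hitting this, a minimal sketch of the wrapper idea (the IteratorIterable attached to the JIRA may differ in detail): an Iterable whose iterator() simply hands back the underlying one-shot Iterator, so the flatMap output can be streamed rather than materialized.

    import scala.collection.JavaConverters._

    // Sketch only: wraps a one-shot Iterator so it can be returned where an
    // Iterable is required. The Iterable contract is bent: iterator() can only
    // be consumed once, which is fine for a single downstream pass.
    class IteratorIterable[T](underlying: Iterator[T]) extends java.lang.Iterable[T] {
      override def iterator(): java.util.Iterator[T] = underlying.asJava
    }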

Re: any code examples demonstrating spark streaming applications which depend on states?

2014-10-02 Thread Chia-Chun Shih
Hi Yana, Thanks for your kind response. My question was indeed unclear. What I want to do is join a state stream, which is the *updateStateByKey* output of the last run. *updateStateByKey* is useful if the application logic doesn't (heavily) rely on states, so that you can run the application without
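For readers following the thread, a minimal updateStateByKey sketch (the DStream, key and state types here are illustrative, not from the original post); the per-key state is carried from batch to batch and the resulting state stream can then be joined against other streams:

    import org.apache.spark.streaming.StreamingContext._

    // Assumes `ssc` is a StreamingContext and `events` a DStream[(String, Int)].
    ssc.checkpoint("hdfs:///tmp/checkpoints")   // required for stateful transformations

    val stateStream = events.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)  // running count per key
    }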

Re: persistent state for spark streaming

2014-10-02 Thread Chia-Chun Shih
Hi Yana, So user quotas need another data store, one that can guarantee persistence and handle frequent data updates/access. Is that correct? Thanks, Chia-Chun 2014-10-01 21:48 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com: I don't think persist is meant for end-user usage. You might want to

Re: still GC overhead limit exceeded after increasing heap space

2014-10-02 Thread Sean Owen
This looks like you are just running your own program. To run Spark programs, you use spark-submit. It has options that control the executor and driver memory. The settings below are not affecting Spark. On Wed, Oct 1, 2014 at 10:21 PM, 陈韵竹 anny9...@gmail.com wrote: Thanks Sean. This is how I

Implicit conversion RDD -> SchemaRDD

2014-10-02 Thread Stephen Boesch
I am noticing disparities in behavior between the REPL and my standalone program in terms of the implicit conversion of an RDD to SchemaRDD. In the REPL the following sequence works: import sqlContext._ val mySchemaRDD = myNormalRDD.where(1=1) However, when attempting the same in a standalone

How to send a message to a specific vertex via the Pregel API

2014-10-02 Thread Yifan LI
Hi, Does anyone have a clue about sending messages to a specific vertex (not an immediate neighbour) whose vId is stored in a property of the source vertex, in the Pregel API? More precisely, how can this be done in sendMessage()? Can more general Triplets be passed into that function? (Obviously we can do it

Re: registering Array of CompactBuffer to Kryo

2014-10-02 Thread Daniel Darabos
How about this? Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;") On Tue, Sep 30, 2014 at 5:33 PM, Andras Barjak andras.bar...@lynxanalytics.com wrote: Hi, what is the correct Scala code to register an Array of this private Spark class to Kryo?

Re: registering Array of CompactBuffer to Kryo

2014-10-02 Thread Andras Barjak
I used this solution to get the class name correctly at runtime: kryo.register(ClassTag(Class.forName("org.apache.spark.util.collection.CompactBuffer")).wrap.runtimeClass) 2014-10-02 12:50 GMT+02:00 Daniel Darabos daniel.dara...@lynxanalytics.com : How about this?
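Putting the two replies together, a hedged sketch of what the registration might look like inside a KryoRegistrator (the registrator class name is illustrative; the ClassTag wrapping is the trick Andras reported working):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator
    import scala.reflect.ClassTag

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // CompactBuffer is private to Spark, so look it up by name at runtime
        // and register both the class itself and its Array type.
        val compactBuffer = Class.forName("org.apache.spark.util.collection.CompactBuffer")
        kryo.register(compactBuffer)
        kryo.register(ClassTag(compactBuffer).wrap.runtimeClass)  // Array[CompactBuffer]
      }
    }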

Is there a way to provide individual property to each Spark executor?

2014-10-02 Thread Vladimir Tretyakov
Hi, here at Sematext we are almost done with Spark monitoring http://www.sematext.com/spm/index.html But we need one thing from Spark, something like https://groups.google.com/forum/#!topic/storm-user/2fNCF341yqU in Storm: a 'placeholder' in the Java opts which Spark fills in for each executor,

Re: persistent state for spark streaming

2014-10-02 Thread Yana Kadiyska
Yes -- persist is more akin to caching -- it's telling Spark to materialize that RDD for fast reuse, but it's not meant for the end user to query/use across processes, etc. (at least that's my understanding). On Thu, Oct 2, 2014 at 4:04 AM, Chia-Chun Shih chiachun.s...@gmail.com wrote: Hi Yana,
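A small illustration of that point (paths are made up): persist only speeds up reuse of the RDD by later actions in the same application; it is not a store that other processes can query.

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///logs/2014-10-02").persist(StorageLevel.MEMORY_AND_DISK)

    val errors = lines.filter(_.contains("ERROR")).count()  // computes and caches the RDD
    val total  = lines.count()                               // served from the cache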

Re: Spark Streaming for time consuming job

2014-10-02 Thread Eko Susilo
Hi Mayur, Thanks for your suggestion. In fact, that's what I'm thinking about: passing that data and returning only the percentage of outliers in a particular window. I also have some doubts about implementing the outlier detection on an RDD as you have suggested. From what I understand, those

Re: Spark inside Eclipse

2014-10-02 Thread Daniel Siegmann
You don't need to do anything special to run in local mode from within Eclipse. Just create a simple SparkConf and create a SparkContext from that. I have unit tests which execute on a local SparkContext, and they work from inside Eclipse as well as SBT. val conf = new
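The snippet is cut off above; it presumably continues along these lines. A minimal local-mode setup (app name and assertion are illustrative) that runs unmodified inside Eclipse or SBT:

    import org.apache.spark.{SparkConf, SparkContext}

    // Runs entirely in-process; no cluster or spark-submit required.
    val conf = new SparkConf().setMaster("local[2]").setAppName("eclipse-local-test")
    val sc = new SparkContext(conf)
    try {
      assert(sc.parallelize(1 to 100).reduce(_ + _) == 5050)
    } finally {
      sc.stop()   // always stop the context so repeated test runs don't collide
    }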

[SparkSQL] Function parity with Shark?

2014-10-02 Thread Yana Kadiyska
Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC server that comes with Spark 1.1.0. However, I observed that conditional functions do not work (I tried 'case' and 'coalesce'); some string functions like 'concat' also did not work. Is there a list of what's missing or a

Re: Confusion over how to deploy/run JAR files to a Spark Cluster

2014-10-02 Thread Ashish Jain
Hello Mark, I am no expert but I can answer some of your questions. On Oct 2, 2014 2:15 AM, Mark Mandel mark.man...@gmail.com wrote: Hi, So I'm super confused about how to take my Spark code and actually deploy and run it on a cluster. Let's assume I'm writing in Java, and we'll take a

Re: partition size for initial read

2014-10-02 Thread Ashish Jain
If you are using textFile() to read the data in, it also takes a parameter for the minimum number of partitions to create. Would that not work for you? On Oct 2, 2014 7:00 AM, jamborta jambo...@gmail.com wrote: Hi all, I have been testing repartitioning to ensure that my algorithms get similar
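For concreteness, a sketch of the overload being referred to (path and partition count are illustrative; the parameter is named minPartitions in Spark 1.x):

    // Ask Spark for at least 64 partitions when reading the text input.
    val data = sc.textFile("hdfs:///data/events", minPartitions = 64)
    println(data.partitions.length)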

Re: Implicit conversion RDD -> SchemaRDD

2014-10-02 Thread Stephen Boesch
Here is the specific code: val sc = new SparkContext(s"local[$NWorkers]", "HBaseTestsSparkContext") val ctx = new SQLContext(sc) import ctx._ case class MyTable(col1: String, col2: Byte) val myRows = ctx.sparkContext.parallelize((Range(1,21).map{ix => MyTable(s"col1$ix",
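A common cause of this disparity (hedged, since the full standalone program isn't shown) is that the case class is defined inside a method; the compiler then can't produce a TypeTag for it, and the implicit createSchemaRDD conversion silently doesn't apply. Moving the case class to the top level and importing the SQLContext implicits usually restores the REPL behaviour, roughly:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Top-level case class (not nested inside a method) so a TypeTag exists.
    case class MyTable(col1: String, col2: Byte)

    object SchemaRddExample {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("schema-rdd-example").setMaster("local[2]"))
        val ctx = new SQLContext(sc)
        import ctx.createSchemaRDD   // RDD -> SchemaRDD implicit conversion

        val myRows = sc.parallelize(Range(1, 21).map(ix => MyTable(s"col1$ix", ix.toByte)))
        myRows.registerTempTable("my_table")   // implicit conversion applies here
        ctx.sql("SELECT col1 FROM my_table WHERE col2 > 5").collect().foreach(println)
        sc.stop()
      }
    }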

Re: partition size for initial read

2014-10-02 Thread Tamas Jambor
That would work. I normally use Hive queries through Spark SQL; I have not seen something like that there. On Thu, Oct 2, 2014 at 3:13 PM, Ashish Jain ashish@gmail.com wrote: If you are using textFile() to read the data in, it also takes a parameter for the minimum number of partitions to

Re: partition size for initial read

2014-10-02 Thread Yin Huai
Hi Tamas, Can you try to set mapred.map.tasks and see if it works? Thanks, Yin On Thu, Oct 2, 2014 at 10:33 AM, Tamas Jambor jambo...@gmail.com wrote: That would work. I normally use Hive queries through Spark SQL; I have not seen something like that there. On Thu, Oct 2, 2014 at 3:13

Type problem in Java when using flatMapValues

2014-10-02 Thread Robin Keunen
Hi all, I successfully implemented my algorithm in Scala but my team wants it in Java. I have a problem with generics; can anyone help me? I have a first JavaPairRDD with a structure like ((ean, key), [from, to, value]) * ean and key are strings * from and to are DateTime * value is a

Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-02 Thread maddenpj
I am seeing this same issue. Bumping for visibility. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-job-has-an-issue-when-the-worker-reading-from-Kafka-is-killed-tp12595p15611.html Sent from the Apache Spark User List mailing list

Application details for failed and terminated jobs

2014-10-02 Thread SK
Hi, Currently the history server provides application details only for successfully completed jobs (where the APPLICATION_COMPLETE file is generated). However, (long-running) jobs that we terminate manually, or failed jobs where the APPLICATION_COMPLETE file may not be generated, don't show up on

Re: Application details for failed and terminated jobs

2014-10-02 Thread Marcelo Vanzin
You may want to take a look at this PR: https://github.com/apache/spark/pull/1558 Long story short: while not a terrible idea to show running applications, your particular case should be solved differently. Applications are responsible for calling SparkContext.stop() at the end of their run,

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread danilopds
Hi tsingfu, I want to see metrics in Ganglia too, but I don't understand this step: ./make-distribution.sh --tgz --skip-java-test -Phadoop-2.3 -Pyarn -Phive -Pspark-ganglia-lgpl Are you installing Hadoop, YARN, Hive AND Ganglia? What if I want to install just Ganglia? Can you suggest me

Block removal causes Akka timeouts

2014-10-02 Thread maddenpj
I'm seeing a lot of Akka timeouts which eventually lead to job failure in spark streaming when removing blocks (Example stack trace below). It appears to be related to these issues: SPARK-3015 https://issues.apache.org/jira/browse/SPARK-3015 and SPARK-3139

Sorting a Sequence File

2014-10-02 Thread jritz
All, I am having trouble getting a sequence file sorted. My sequence file is (Text, Text) and when trying to sort it, Spark complains that it cannot because Text is not serializable. To get around this issue, I performed a map on the sequence file to turn it into (String, String). I then

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread Krishna Sankar
Hi, I am sure you can use just the -Pspark-ganglia-lgpl switch to enable Ganglia. The other profiles only add support for Hadoop, YARN, Hive et al. to the Spark executable; no need to include them if one is not using them. Cheers k/ On Thu, Oct 2, 2014 at 12:29 PM, danilopds danilob...@gmail.com wrote: Hi

HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Du Li
Hi, In Spark 1.1 HiveContext, I ran a create partitioned table command followed by a cache table command and got a java.sql.SQLSyntaxErrorException: Table/View 'PARTITIONS' does not exist. But cache table worked fine if the table is not a partitioned table. Can anybody confirm that cache of

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread danilopds
OK Krishna Sankar. In relation to this information on the Spark monitoring webpage: For sbt users, set the SPARK_GANGLIA_LGPL environment variable before building. For Maven users, enable the -Pspark-ganglia-lgpl profile. Do you know what I need to do to build with sbt? Thanks. -- View this

Re: SparkSQL DataType mappings

2014-10-02 Thread Costin Leau
Hi Yin, Thanks for the reply. I found the section as well a couple of days ago and managed to integrate es-hadoop with Spark SQL [1]. Cheers, [1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html On 10/2/14 6:32 PM, Yin Huai wrote: Hi Costin, I am answering

Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread SK
Hi, I am trying to extract the number of distinct users from a file using Spark SQL, but I am getting the following error: ERROR Executor: Exception in task 1.0 in stage 8.0 (TID 15) java.lang.ArrayIndexOutOfBoundsException: 1 I am following the code in examples/sql/RDDRelation.scala. My

Re: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Michael Armbrust
The bug is likely in your data. Do you have lines in your input file that do not contain the \t character? If so, .split will only return a single element, and p(1) in the .map() will throw java.lang.ArrayIndexOutOfBoundsException: 1 On Thu, Oct 2, 2014 at 3:35 PM, SK

Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Liquan Pei
-- Forwarded message -- From: Liquan Pei liquan...@gmail.com Date: Thu, Oct 2, 2014 at 3:42 PM Subject: Re: Spark SQL: ArrayIndexOutofBoundsException To: SK skrishna...@gmail.com There is only one place you use index 1. One possible issue is that you may have only one element

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread SK
Thanks for the help. Yes, I did not realize that the first header line has a different separator. By the way, is there a way to drop the first line that contains the header? Something along the following lines: sc.textFile(inp_file) .drop(1) // or tail() to drop the header

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Sunny Khatri
You can do a filter with startsWith? On Thu, Oct 2, 2014 at 4:04 PM, SK skrishna...@gmail.com wrote: Thanks for the help. Yes, I did not realize that the first header line has a different separator. By the way, is there a way to drop the first line that contains the header? Something along
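A hedged sketch of that suggestion, continuing the earlier snippet (inp_file as before); the header prefix used here is a placeholder, so match it to whatever the first line of the file actually starts with:

    // Drop the header by filtering on a prefix only the header line has,
    // and guard the split so short lines can't cause index errors.
    val noHeader = sc.textFile(inp_file).filter(line => !line.startsWith("user_id"))
    val pairs    = noHeader.map(_.split("\t")).filter(_.length >= 2).map(p => (p(0), p(1)))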

Strategies for reading large numbers of files

2014-10-02 Thread Landon Kuhn
Hello, I'm trying to use Spark to process a large number of files in S3. I'm running into an issue that I believe is related to the high number of files, and the resources required to build the listing within the driver program. If anyone in the Spark community can provide insight or guidance, it

Re: Strategies for reading large numbers of files

2014-10-02 Thread Nicholas Chammas
I believe this is known as the Hadoop Small Files Problem, and it affects Spark as well. The best approach I've seen to merging small files like this is by using s3distcp, as suggested here http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/, as a pre-processing

how to debug ExecutorLostFailure

2014-10-02 Thread jamborta
Hi all, I have a job that runs for about 15 minutes; at some point I get an error on both nodes (all executors) saying: 14/10/02 23:14:38 WARN TaskSetManager: Lost task 80.0 in stage 3.0 (TID 253, backend-tes): ExecutorLostFailure (executor lost) In the end, it seems that the job recovers and

Re: Sorting a Sequence File

2014-10-02 Thread bhusted
Here is the code in question: //read in the hadoop sequence file to sort val file = sc.sequenceFile(input, classOf[Text], classOf[Text]) //this is the code we would like to avoid, which maps the Hadoop Text input to Strings so the sortByKey will run file.map{ case (k,v) => (k.toString(),
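Put together, a minimal version of that workaround (the output path is illustrative; input is the variable from the snippet above); converting Text to String before the shuffle sidesteps the fact that Hadoop's Text is not serializable:

    import org.apache.hadoop.io.Text
    import org.apache.spark.SparkContext._   // brings sortByKey into scope for pair RDDs

    val file   = sc.sequenceFile(input, classOf[Text], classOf[Text])
    val sorted = file.map { case (k, v) => (k.toString, v.toString) }.sortByKey()
    sorted.saveAsTextFile("hdfs:///output/sorted")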

Re: Help Troubleshooting Naive Bayes

2014-10-02 Thread Sandy Ryza
Those logs you included are from the Spark executor processes, as opposed to the YARN NodeManager processes. If you don't think you have access to the NodeManager logs, I would try setting spark.yarn.executor.memoryOverhead to something like 1024 or 2048 and seeing if that helps. If it does,
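For reference, one way to apply that suggestion (a sketch only; the value is in megabytes, and the property can equally be passed on the spark-submit command line with --conf):

    import org.apache.spark.{SparkConf, SparkContext}

    // Reserve extra off-heap headroom per YARN executor container.
    val conf = new SparkConf()
      .setAppName("naive-bayes-training")                    // name is illustrative
      .set("spark.yarn.executor.memoryOverhead", "1024")
    val sc = new SparkContext(conf)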

Getting table info from HiveContext

2014-10-02 Thread Banias
Hi, Would anybody know how to get the following information from HiveContext, given a Hive table name? - partition key(s) - table directory - input/output format I am new to Spark, and I have a couple of tables created using Parquet data, like: CREATE EXTERNAL TABLE parquet_table ( COL1 string,

Load multiple parquet file as single RDD

2014-10-02 Thread Mohnish Kodnani
Hi, I am trying to play around with Spark and Spark SQL. I have logs being stored in HDFS on a 10 minute window. Each 10 minute window could have as many as 10 files with random names of 2GB each. Now, I want to run some analysis on these files. These files are parquet files. I am trying to run

Re: Load multiple parquet file as single RDD

2014-10-02 Thread Michael Armbrust
parquetFile accepts a comma-separated list of files. Also, unionAll does not write to disk. However, unless you are running a recent version (compiled from master since this was added https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd) it's missing an optimization and
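A hedged sketch of both options Michael mentions (file paths and the sqlContext name are illustrative):

    // Option 1: parquetFile takes a comma-separated list of paths.
    val logs = sqlContext.parquetFile("hdfs:///logs/part-1.parquet,hdfs:///logs/part-2.parquet")

    // Option 2: load the files separately and union them; unionAll is a logical
    // operation on the plan and does not write anything back to disk.
    val combined = sqlContext.parquetFile("hdfs:///logs/part-1.parquet")
      .unionAll(sqlContext.parquetFile("hdfs:///logs/part-2.parquet"))
    combined.registerTempTable("logs")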

Re: Getting table info from HiveContext

2014-10-02 Thread Michael Armbrust
We actually leave all the DDL commands up to Hive, so there is no programmatic way to access the things you are looking for. On Thu, Oct 2, 2014 at 5:17 PM, Banias calvi...@yahoo.com.invalid wrote: Hi, Would anybody know how to get the following information from HiveContext given a Hive table

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Michael Armbrust
This is hard to do in general, but you can get what you are asking for by putting the following class in scope. implicit class BetterRDD[A: scala.reflect.ClassTag](rdd: org.apache.spark.rdd.RDD[A]) { def dropOne = rdd.mapPartitionsWithIndex((i, iter) => if(i == 0 && iter.hasNext) { iter.next; iter

Re: Any issues with repartition?

2014-10-02 Thread jamborta
Hi Arun, Have you found a solution? It seems that I have the same problem. Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-issues-with-repartition-tp13462p15654.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: new error for me

2014-10-02 Thread jamborta
Have you found a solution to this problem (or at least a cause)? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/new-error-for-me-tp10378p15655.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Cheng Lian
Cache table works with partitioned tables. I guess you're experimenting with a default local metastore and the metastore_db directory didn't exist in the first place. In that case, none of the metastore tables/views exist at first, and the error message you saw is thrown when the PARTITIONS

Re: Getting table info from HiveContext

2014-10-02 Thread Banias
Thanks Michael. On Thursday, October 2, 2014 8:41 PM, Michael Armbrust mich...@databricks.com wrote: We actually leave all the DDL commands up to hive, so there is no programatic way to access the things you are looking for. On Thu, Oct 2, 2014 at 5:17 PM, Banias

How to make ./bin/spark-sql work with hive?

2014-10-02 Thread Li HM
I have rebuilt the package with -Phive and copied hive-site.xml to conf (I am using hive-0.12). When I run ./bin/spark-sql, I get a java.lang.NoSuchMethodError for every command. What am I missing here? Could somebody share the right procedure to make it work? java.lang.NoSuchMethodError:

Setup/Cleanup for RDD closures?

2014-10-02 Thread Stephen Boesch
Consider a connection or external resource that needs to be allocated and then accessed/mutated for each row from within a single worker thread. That connection should be opened only before the first row is accessed and closed only after the last row is completed. It is my understanding that
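The usual pattern for this is mapPartitions, sketched below under the assumption that rdd is the RDD in question and the resource is a connection-like object; createConnection and lookup are hypothetical stand-ins for whatever the real resource provides:

    // One connection per partition: opened before the first row, closed after the last.
    val results = rdd.mapPartitions { rows =>
      val conn = createConnection()       // hypothetical setup, runs once per partition
      val out  = rows.map(row => conn.lookup(row)).toList   // force evaluation before closing
      conn.close()                        // cleanup after the last row of this partition
      out.iterator
    }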