Re: flume spark streaming receiver host random

2014-09-26 Thread Sean Owen
I think you may be missing a key word here. Are you saying that the machine has multiple interfaces and it is not using the one you expect or the receiver is not running on the machine you expect? On Sep 26, 2014 3:33 AM, centerqi hu cente...@gmail.com wrote: Hi all My code is as follows:

Re: Spark Streaming + Actors

2014-09-26 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please respond to my request below. Regards, Rajesh On Thu, Sep 25, 2014 at 11:38 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Can I use Actors in Spark Streaming based on event type? Could you please review the test program below and let me know if
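For reference, a minimal sketch of a custom actor receiver using Spark Streaming 1.1's actorStream API; the actor and stream names here are illustrative, not from the original program:

    import akka.actor.{Actor, Props}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.ActorHelper

    // Illustrative actor that pushes received events into the stream,
    // dispatching on the event's type.
    class EventActor extends Actor with ActorHelper {
      def receive = {
        case s: String => store(s) // forward one event to Spark
        case _         => ()       // other event types could be handled here
      }
    }

    object ActorStreamSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("actor-stream"), Seconds(10))
        val events = ssc.actorStream[String](Props[EventActor], "EventReceiver")
        events.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }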

Access by name in tuples in Scala with Spark

2014-09-26 Thread rzykov
Could you advise on the best practice for using named tuples in Scala with Spark RDDs? Currently we can access a field by number in a tuple: RDD.map{_._2} But we would like to see a construction such as: RDD.map{_.itemId} This would be helpful for debugging purposes. -- View this message in context:

Re: Access by name in tuples in Scala with Spark

2014-09-26 Thread Sean Owen
I think you are simply looking for a case class in Scala. It is a simple way to define an object with named, typed fields. On Fri, Sep 26, 2014 at 8:31 AM, rzykov rzy...@gmail.com wrote: Could you advise the best practice of using some named tuples in Scala with Spark RDD. Currently we can
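For illustration, a minimal runnable sketch of the case-class approach (the Order fields are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // A case class gives each field a name and a type.
    case class Order(itemId: Long, price: Double)

    object NamedFields {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("named-fields").setMaster("local"))
        val orders = sc.parallelize(Seq(Order(1L, 9.99), Order(2L, 5.00)))
        // Access by name instead of by tuple position (_._1, _._2):
        orders.map(_.itemId).collect().foreach(println)
        sc.stop()
      }
    }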

Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Haopu Wang
Hi, I'm querying a big table using Spark SQL. I see very long GC times in some stages. I wonder if I can improve this by tuning the storage parameter. The question is: the SchemaRDD has been cached with the cacheTable() function. So is the cached SchemaRDD part of the memory storage controlled by the

Re: Log hdfs blocks sending

2014-09-26 Thread Alexey Romanchuk
Hello Andrew! Thanks for the reply. Which logs and at what level should I check? Driver, master or worker? I found this on the master node, but there is only an ANY locality requirement. Here is the driver (Spark SQL) log - https://gist.github.com/13h3r/c91034307caa33139001 and one of the workers' logs -

Re: Job cancelled because SparkContext was shut down

2014-09-26 Thread jamborta
Just wanted to answer my question in case someone else runs into the same problem. It is related to the problem discussed here: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html and here: https://issues.apache.org/jira/browse/SPARK-2121

Re: Issue with Spark-1.1.0 and the start-thriftserver.sh script

2014-09-26 Thread Cheng Lian
Hi Helene, Thanks for the report. In Spark 1.1, we use a special exit code to indicate that |SparkSubmit| failed because of a class-not-found error. But unfortunately I chose a not-so-special exit code — 1… So whenever the process exits with 1 as its exit code, the |-Phive| error message is shown. A PR that

Re: Using one sql query's result inside another sql query

2014-09-26 Thread Cheng Lian
Hi Twinkle, The failure is caused by case sensitivity. The temp table actually stores the original un-analyzed logical plan, thus field names remain capitalized (F1, F2, etc.). I believe this issue has already been fixed by PR #2382 https://github.com/apache/spark/pull/2382. As a workaround, you

executorAdded event to DAGScheduler

2014-09-26 Thread praveen seluka
Can someone explain the motivation behind passing the executorAdded event to DAGScheduler? DAGScheduler does submitWaitingStages when the executorAdded method is called by TaskSchedulerImpl. I see some issue in the below code (TaskSchedulerImpl.scala): if (!executorsByHost.contains(o.host))

Re: executorAdded event to DAGScheduler

2014-09-26 Thread praveen seluka
Some corrections. On Fri, Sep 26, 2014 at 5:32 PM, praveen seluka praveen.sel...@gmail.com wrote: Can someone explain the motivation behind passing the executorAdded event to DAGScheduler? DAGScheduler does submitWaitingStages when the executorAdded method is called by TaskSchedulerImpl. I

Re: SparkSQL Thriftserver in Mesos

2014-09-26 Thread Cheng Lian
You can avoid installing Spark on each node by uploading the Spark distribution tarball to HDFS and setting |spark.executor.uri| to the HDFS location. This way, Mesos will download and extract the tarball before launching containers. Please refer to this Spark documentation page
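For reference, a minimal sketch of the setting being described; the Mesos master and HDFS path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Executors fetch the tarball from HDFS, so no per-node Spark install is needed.
    val conf = new SparkConf()
      .setAppName("mesos-thriftserver")
      .setMaster("mesos://zk://zk-host:2181/mesos") // placeholder Mesos master URL
      .set("spark.executor.uri", "hdfs://namenode:9000/dist/spark-1.1.0-bin-hadoop2.tgz")
    val sc = new SparkContext(conf)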

Re: executorAdded event to DAGScheduler

2014-09-26 Thread Nan Zhu
just a quick reply, we cannot start two executors on the same host for a single application in the standard deployment (one worker per machine). I'm not sure if it will create an issue when you have multiple workers on the same host, as submitWaitingStages is called everywhere and I never try

Re: executorAdded event to DAGScheduler

2014-09-26 Thread praveen seluka
In YARN, we can easily have multiple containers allocated on the same node. On Fri, Sep 26, 2014 at 6:05 PM, Nan Zhu zhunanmcg...@gmail.com wrote: just a quick reply, we cannot start two executors on the same host for a single application in the standard deployment (one worker per machine)

How to run hive scripts programmatically in Spark 1.1.0 ?

2014-09-26 Thread Sherine
I am unable to run Hive scripts programmatically in Spark 1.1.0 from the Hadoop prompt, but I can do it manually. Can anyone help me run Hive scripts programmatically on a Spark 1.1.0 cluster on EMR? Manual running steps:- hadoop@ip-10-151-71-224:~/tmpSpark/spark1.1/spark$ ./bin/spark-shell
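For reference, a minimal sketch of driving Hive statements programmatically through HiveContext on Spark 1.1; the table name and path are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveScriptRunner {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hive-script"))
        val hive = new HiveContext(sc) // sql() speaks HiveQL by default here
        hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
        hive.sql("LOAD DATA INPATH '/tmp/kv.txt' INTO TABLE src") // invented path
        hive.sql("SELECT key, count(*) FROM src GROUP BY key").collect().foreach(println)
        sc.stop()
      }
    }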

Re: problem with HiveContext inside Actor

2014-09-26 Thread Cheng Lian
This is reasonable, since the constructor that actually gets called is |Driver()| rather than |Driver(HiveConf)|. The former initializes the |conf| field by: |conf = SessionState.get().getConf() | And |SessionState.get()| reads a TSS (thread-specific storage) value. Thus executing SQL queries within another thread causes
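One workaround that follows from this, sketched under the assumption that the thread-local SessionState is the root cause: funnel every HiveContext call through one dedicated thread, so queries never run on arbitrary actor-dispatcher threads.

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    // All Hive work runs on a single thread, so the thread-local SessionState
    // established when the HiveContext was created is always visible.
    object HiveSerialExecutor {
      private val single = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
      def run[T](body: => T): Future[T] = Future(body)(single)
    }

    // From an actor (hiveContext assumed to be created on that same thread):
    // HiveSerialExecutor.run { hiveContext.sql("SELECT ...").collect() }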

Re: Systematic error when re-starting Spark stream unless I delete all checkpoints

2014-09-26 Thread Svend Vanderveken
Hi all, I apologise for re-posting this; I realise some mail systems are filtering all the code samples from the original post. I would greatly appreciate any pointer regarding this issue, which basically renders Spark Streaming non-fault-tolerant for us. Thanks in advance, S --- I experience

java.io.IOException Error in task deserialization

2014-09-26 Thread Arun Ahuja
Has anyone else seen this error in task deserialization? The task is processing a small amount of data and doesn't seem to have much data attached to the closure. I've only seen this with Spark 1.1. Job aborted due to stage failure: Task 975 in stage 8.0 failed 4 times, most recent failure: Lost

Re: Access file name in map function

2014-09-26 Thread Cheng Lian
If the size of each file is small, you may try |SparkContext.wholeTextFiles|. Otherwise you can try something like this: |val filenames: Seq[String] = ... val combined: RDD[(String, String)] = filenames.map { name => sc.textFile(name).map(line => name -> line) }.reduce(_ ++ _) | On 9/26/14

Re: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Cheng Lian
Yes it is. The in-memory storage used with |SchemaRDD| also uses |RDD.cache()| under the hood. On 9/26/14 4:04 PM, Haopu Wang wrote: Hi, I'm querying a big table using Spark SQL. I see very long GC time in some stages. I wonder if I can improve it by tuning the storage parameter. The
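For reference, a minimal sketch combining the two pieces discussed in this thread; the table name and fraction value are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf()
      .setAppName("cached-table")
      // Illustrative: the cached columnar data competes for this storage share.
      .set("spark.storage.memoryFraction", "0.5")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    sqlContext.cacheTable("big_table") // backed by RDD.cache() under the hood
    sqlContext.sql("SELECT count(*) FROM big_table").collect()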

mappartitions data size

2014-09-26 Thread jamborta
Hi all, I am using mapPartitions to do some heavy computing on subsets of the data. I have a dataset with about 1m rows, running on a 32-core cluster. Unfortunately, it seems that mapPartitions splits the data into two sets so it is only running on two cores. Is there a way to force it to split

Re: mappartitions data size

2014-09-26 Thread Daniel Siegmann
Use RDD.repartition (see here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD ). On Fri, Sep 26, 2014 at 10:19 AM, jamborta jambo...@gmail.com wrote: Hi all, I am using mappartitions to do some heavy computing on subsets of the data. I have a dataset with
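A minimal sketch of the suggestion, assuming the input ended up in too few partitions; the input path and per-row work are stand-ins:

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-demo"))
        val data = sc.textFile("hdfs:///data/rows") // stand-in input
        // Shuffle into 32 partitions so mapPartitions uses all 32 cores.
        val result = data.repartition(32).mapPartitions { iter =>
          iter.map(_.toUpperCase) // stand-in for the heavy computation
        }
        println(result.count())
        sc.stop()
      }
    }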

Re: SPARK UI - Details post job processiong

2014-09-26 Thread Matt Narrell
Yes, I’m running Hadoop’s Timeline server that does this for the YARN/Hadoop logs (and it works very nicely btw). Are you saying I can do the same for the SparkUI as well? Also, where do I set these Spark configurations, since this will be executed inside a YARN container? On the “client”

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread matthes
Thank you Jey, That is a nice introduction but it may be too old (Aug 21st, 2013). Note: If you keep the schema flat (without nesting), the Parquet files you create can be read by systems like Shark and Impala. These systems allow you to query Parquet files as tables using SQL-like syntax.

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread Frank Austin Nothaft
Hi Matthes, Can you post an example of your schema? When you refer to nesting, are you referring to optional columns, nested schemas, or tables where there are repeated values? Parquet uses run-length encoding to compress down columns with repeated values, which is the case that your example

Re: SPARK UI - Details post job processiong

2014-09-26 Thread Chester @work
I am working on a PR that allows one to send the same Spark listener event messages back to the application in YARN cluster mode. So far I have put this function in our application; our UI will receive and display the same Spark job event messages, such as progress, job start, completed, etc.

Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
Hi all, I'm using some functions from Breeze in a Spark job but I get the following build error: Error: scalac: bad symbolic reference. A signature in RandBasis.class refers to term math3 in package org.apache.commons which is not available. It may be completely missing from the current

Re: rsync problem

2014-09-26 Thread rapelly kartheek
Hi, This is the command I am using for submitting my application, SimpleApp: ./bin/spark-submit --class org.apache.spark.examples.SimpleApp --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar /text-data On Thu, Sep 25, 2014 at 6:52 AM, Tobias

Re: Build error when using spark with breeze

2014-09-26 Thread Ted Yu
spark-core's dependency on commons-math3 is @ test scope (core/pom.xml):

    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math3</artifactId>
      <version>3.3</version>
      <scope>test</scope>
    </dependency>

Adjusting the scope should solve the problem below. On Fri, Sep

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread matthes
Hi Frank, thanks a lot for your response, it is very helpful! Actually I'm trying to figure out whether the current Spark version supports repetition levels (https://blog.twitter.com/2013/dremel-made-simple-with-parquet), and now it looks good to me. It is very hard to find some good things about

How to do operations on multiple RDD's

2014-09-26 Thread Johan Stenberg
Hi, This is my first post to the email list so give me some feedback if I do something wrong. To do operations on two RDD's to produce a new one you can just use zipPartitions, but if I have an arbitrary number of RDD's that I would like to perform an operation on to produce a single RDD, how do

Re: Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
Thanks Ted. Can you tell me how to adjust the scope? On Fri, Sep 26, 2014 at 5:47 PM, Ted Yu yuzhih...@gmail.com wrote: spark-core's dependency on commons-math3 is @ test scope (core/pom.xml): <dependency> <groupId>org.apache.commons</groupId> <artifactId>commons-math3</artifactId>

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread Frank Austin Nothaft
Matthes, Ah, gotcha! Repeated items in Parquet seem to correspond to the ArrayType in Spark SQL. I only use Spark, but it does look like that should be supported in Spark SQL 1.1.0. I’m not sure though if you can apply predicates on repeated items from Spark SQL. Regards, Frank Austin
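For reference, a sketch of declaring a repeated (array) field with Spark SQL 1.1's programmatic schema API; the field names are illustrative:

    import org.apache.spark.sql._

    // A repeated field maps to a Parquet repeated group when written out.
    val schema = StructType(Seq(
      StructField("docId", LongType, nullable = false),
      StructField("links", ArrayType(StringType), nullable = true)))

    // Applied to an RDD[Row] and written as Parquet (sqlContext and rowRDD assumed):
    // val docs = sqlContext.applySchema(rowRDD, schema)
    // docs.saveAsParquetFile("docs.parquet")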

Re: Build error when using spark with breeze

2014-09-26 Thread Sean Owen
Shouldn't the user's application depend on commons-math3 itself if it uses it? It shouldn't require a Spark change. Maybe I misunderstand. On Fri, Sep 26, 2014 at 4:47 PM, Ted Yu yuzhih...@gmail.com wrote: spark-core's dependency on commons-math3 is @ test scope (core/pom.xml): <dependency>

Re: Build error when using spark with breeze

2014-09-26 Thread Ted Yu
You can use the runtime scope. See http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope Cheers On Fri, Sep 26, 2014 at 8:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Thanks Ted. Can you tell me how to adjust the scope? On Fri, Sep 26,

Re: Build error when using spark with breeze

2014-09-26 Thread Jaonary Rabarisoa
I solved the problem by including the commons-math3 package in my sbt dependencies, as Sean suggested. Thanks. On Fri, Sep 26, 2014 at 6:05 PM, Ted Yu yuzhih...@gmail.com wrote: You can use the runtime scope. See
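For reference, the sbt form of that fix is a one-liner; the version shown matches what Breeze pulls in, but verify it against your own build:

    // build.sbt: depend on commons-math3 directly, since spark-core no longer exposes it
    libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"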

Re: spark-ec2 script with Tachyon

2014-09-26 Thread mrm
Hi, Did you manage to figure this out? I would appreciate if you could share the answer. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-script-with-Tachyon-tp9996p15249.html Sent from the Apache Spark User List mailing list archive at

Re: how to run spark job on yarn with jni lib?

2014-09-26 Thread Marcelo Vanzin
I assume you did those things on all machines, not just on the machine launching the job? I've seen that workaround used successfully (well, actually, they copied the library to /usr/lib or something, but same idea). On Thu, Sep 25, 2014 at 7:45 PM, taqilabon g945...@gmail.com wrote: You're

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
I've had multiple jobs crash due to java.io.IOException: unexpected exception type; I've been running the 1.1 branch for some time and am now running the 1.1 release binaries. Note that I only use PySpark. I haven't kept detailed notes or the tracebacks around since there are other problems that

Re: Build error when using spark with breeze

2014-09-26 Thread Xiangrui Meng
We removed commons-math3 from dependencies to avoid version conflict with hadoop-common. hadoop-common-2.3+ depends on commons-math3-3.1.1, while breeze depends on commons-math3-3.3. 3.3 is not backward compatible with 3.1.1. So we removed it because the breeze functions we use do not touch

Re: How to do operations on multiple RDD's

2014-09-26 Thread Daniel Siegmann
There are numerous ways to combine RDDs. In your case, it seems you have several RDDs of the same type and you want to do an operation across all of them as if they were a single RDD. The way to do this is SparkContext.union or RDD.union, which have minimal overhead. The only difference between
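A minimal sketch of unioning an arbitrary number of RDDs:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object UnionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("union-demo").setMaster("local"))
        // Any number of RDDs of the same element type:
        val rdds: Seq[RDD[Int]] = (1 to 5).map(i => sc.parallelize(Seq(i, i * 10)))
        // One RDD covering all of them; no shuffle is involved.
        val combined: RDD[Int] = sc.union(rdds)
        println(combined.count()) // 10
        sc.stop()
      }
    }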

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Arun Ahuja
No, for me as well it is non-deterministic. It happens in a piece of code that does many filters and counts on a small set of records (~1k-10k). The original set is persisted in memory and we have a Kryo serializer set for it. The task itself takes in just a few filtering parameters. This with

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
FWIW I suspect that each count operation is an opportunity for you to trigger the bug, and each filter operation increases the likelihood of setting up the bug. I normally don't come across this error until my job has been running for an hour or two and had a chance to build up longer lineages

problem with spark-ec2 launch script Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-26 Thread Andy Davidson
Hi Davies, The real issue is about cluster management. I am new to the Spark world and am not a system administrator. It seems like the problem is with the spark-ec2 launch script: it is installing an old version of Python. In the meantime I am trying to figure out how I can manually install the

Re: problem with spark-ec2 launch script Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-26 Thread Nicholas Chammas
Are you able to use the regular PySpark shell on your EC2 cluster? That would be the first thing to confirm is working. I don’t know whether the version of Python on the cluster would affect whether IPython works or not, but if you want to try manually upgrading Python on a cluster launched by

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-26 Thread Brad Miller
What is the error? Could you file a JIRA for it? Turns out there are actually 3 separate errors (indicated below), one of which *silently returns the wrong value to the user*. Should I file a separate JIRA for each one? What level should I mark these as (critical, major, etc.)? I'm not sure

SF Scala: Spark and Machine Learning Videos

2014-09-26 Thread Alexy Khrabrov
Folks -- we're happy to share the videos of Spark talks made at SF Scala meetup (sfscala.org) and Scala By the Bay conference (scalabythebay.org). We thank Databricks for presenting and also sponsoring the first talk video, which was a joint event with SF Bay Area Machine Learning meetup.

Fwd: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Liquan Pei
-- Forwarded message -- From: Liquan Pei liquan...@gmail.com Date: Fri, Sep 26, 2014 at 1:33 AM Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction? To: Haopu Wang hw...@qilinsoft.com Hi Haopu, Internally, cacheTable on a

Communication between threads within a worker

2014-09-26 Thread lokesh.gidra
Hello, Can someone please explain how the various threads within a single worker (and hence a single JVM instance) communicate with each other? I mean, how do they send intermediate data/RDDs to each other? Is it done over the network? Please also point me to the location in the source code where I

Re: problem with spark-ec2 launch script Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-26 Thread Andy Davidson
Many many thanks Andy From: Nicholas Chammas nicholas.cham...@gmail.com Date: Friday, September 26, 2014 at 11:24 AM To: Andrew Davidson a...@santacruzintegration.com Cc: Davies Liu dav...@databricks.com, user@spark.apache.org user@spark.apache.org Subject: Re: problem with spark-ec2 launch

SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Du Li
Hi, I was loading data into a partitioned table on Spark 1.1.0 beeline-thriftserver. The table has complex data types such as map<string,string> and array<map<string,string>>. The query is like “insert overwrite table a partition (…) select …” and the select clause worked if run separately. However,
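For readers following along, a hypothetical DDL/query pair of the shape described; every name here is invented for illustration:

    // Sketch via HiveContext; table and column names are invented.
    hiveContext.sql(
      "CREATE TABLE a (f1 MAP<STRING,STRING>, f2 ARRAY<MAP<STRING,STRING>>) " +
      "PARTITIONED BY (dt STRING)")
    hiveContext.sql(
      "INSERT OVERWRITE TABLE a PARTITION (dt='2014-09-26') SELECT f1, f2 FROM source_table")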

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Du Li
It might be a problem when inserting into a partitioned table. It worked fine when the target table was unpartitioned. Can you confirm this? Thanks, Du On 9/26/14, 4:48 PM, Du Li l...@yahoo-inc.com.INVALID wrote: Hi, I was loading data into a partitioned table on Spark 1.1.0

Re: flume spark streaming receiver host random

2014-09-26 Thread centerqi hu
the receiver is not running on the machine I expect 2014-09-26 14:09 GMT+08:00 Sean Owen so...@cloudera.com: I think you may be missing a key word here. Are you saying that the machine has multiple interfaces and it is not using the one you expect or the receiver is not running on the

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Cheng Lian
Would you mind providing the DDL of this partitioned table together with the query you tried? The stacktrace suggests that the query was trying to cast a map into something else, which is not supported in Spark SQL. And I doubt whether Hive supports casting a complex type to some other type.

Re: problem with HiveContext inside Actor

2014-09-26 Thread Cheng Lian
This fix is reasonable, since the constructor that actually gets called is |Driver()| rather than |Driver(HiveConf)|. The former initializes the |conf| field by: |conf = SessionState.get().getConf() | And |SessionState.get()| reads a TSS (thread-specific storage) value. Thus executing SQL queries within another thread
