Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet to false, but ANALYZE NOSCAN still returns -1 for rawDataSize. Jianshi On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang wrote: > If I run ANALYZE without NOSCAN, then Hive can successfully get the size: > > parame

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > > 1. Copy csv files in current directory. > 2. Open spark-shell from this directory. > 3. Run "one_scala" file which will create object-files from csv-files in > current directory. > 4. Restart spar

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
If I run ANALYZE without NOSCAN, then Hive can successfully get the size: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589, COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296} Is Hive's PARQUET support broken? Jianshi On Fri, Dec 5, 2014 at 3:30 PM,

drop table if exists throws exception

2014-12-04 Thread Jianshi Huang
Hi, I got exception saying Hive: NoSuchObjectException(message: table not found) when running "DROP TABLE IF EXISTS " Looks like a new regression in Hive module. Anyone can confirm this? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
Sorry for the late follow-up. I used Hao's DESC EXTENDED command and found some clue: new (broadcast broken Spark build): parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} old (broadcast working Spark

Clarifications on Spark

2014-12-04 Thread Ajay
Hello, I work for an eCommerce company. Currently we are looking at building a Data warehouse platform as described below: DW as a Service | REST API | SQL On No SQL (Drill/Pig/Hive/Spark SQL) | No SQL databases (One or more. May be RDBMS directly too) | (Bulk load) My SQL Databas

Re: spark-ec2 Web UI Problem

2014-12-04 Thread Akhil Das
It's working http://ec2-54-148-248-162.us-west-2.compute.amazonaws.com:8080/ If it didn't install correctly, then you could try the spark-ec2 script with *--resume* I think Thanks Best Regards On Fri, Dec 5, 2014 at 3:11 AM, Xingwei Yang wrote: > Hi Guys: > > I have successfully installed apache-s

Re: how do you turn off info logging when running in local mode

2014-12-04 Thread Akhil Das
Yes, there is a way. Just add the following piece of code before creating the SparkContext. import org.apache.log4j.Logger import org.apache.log4j.Level Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) Thanks Best Regards On Fri, Dec 5, 2014 at 12:48 AM,
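For reference, a minimal standalone sketch of the same idea (app and object names are just illustrative):

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.{SparkConf, SparkContext}

    object QuietLocalApp {
      def main(args: Array[String]): Unit = {
        // Must run before the SparkContext is created, otherwise the startup INFO lines still appear
        Logger.getLogger("org").setLevel(Level.OFF)
        Logger.getLogger("akka").setLevel(Level.OFF)

        val sc = new SparkContext(new SparkConf().setAppName("quiet-app").setMaster("local[*]"))
        println(sc.parallelize(1 to 10).count())   // runs without the usual INFO chatter
        sc.stop()
      }
    }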

Re: Spark executor lost

2014-12-04 Thread Akhil Das
It says connection refused, just make sure the network is configured properly (open the ports between master and the worker nodes). If the ports are configured correctly, then i assume the process is getting killed for some reason and hence connection refused. Thanks Best Regards On Fri, Dec 5, 2

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Tobias, Find the csv and scala files; below are the steps: 1. Copy csv files in current directory. 2. Open spark-shell from this directory. 3. Run "one_scala" file which will create object-files from csv-files in current directory. 4. Restart spark-shell 5. a. Run "two_scala" file, while running it is

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread Akhil Das
Batch is the batch duration that you are specifying while creating the StreamingContext, so at the end of every batch's computation the data will get flushed to Cassandra, and why are you stopping your program with Ctrl + C? You can always specify the time with the sc.awaitTermination(Duration) Th
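A minimal sketch of where that batch interval is set and how awaitTermination replaces the Ctrl + C (the socket input source, host and port are placeholders, and print() stands in for the Cassandra write):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-batch-sketch").setMaster("local[2]")
    // Seconds(1) is the batch interval: each batch's results are flushed to the sink
    // (e.g. Cassandra) when that batch's computation finishes.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder input source
    lines.print()                                         // stand-in for saveToCassandra(...)

    ssc.start()
    ssc.awaitTermination()   // block here instead of killing the job; an overload takes a timeout in ms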

RDD.aggregate?

2014-12-04 Thread ll
can someone please explain how RDD.aggregate works? i looked at the average example done with aggregate() but i'm still confused about this function... much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434.html Sent from th
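For reference, the average computation the post refers to usually looks roughly like this (a sketch, assuming the spark-shell's sc):

    val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

    // The accumulator is (running sum, running count); aggregate needs two functions:
    // one to fold a value into a partition-local accumulator, one to merge accumulators.
    val (sum, count) = nums.aggregate((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1),          // seqOp: add one element
      (a, b)   => (a._1 + b._1, a._2 + b._2))        // combOp: merge two partitions
    val mean = sum / count   // 2.5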

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread m.sarosh
Hi Gerard/Akhil, By "how do I specify a batch" I was trying to ask that when does the data in the JavaDStream gets flushed into Cassandra table?. I read somewhere that the streaming data in batches gets written in Cassandra. This batch can be of some particular time, or one particular run. Th

Re: How to incrementally compile spark examples using mvn

2014-12-04 Thread MEETHU MATHEW
Hi all, I made some code changes in the mllib project and, as mentioned in the previous mails, I did mvn install -pl mllib. Now when I run a program in examples using run-example, the new code is not executing. Instead the previous code itself is running. But if I do an "mvn install" in the entire spark

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in Spark 1.2 release. We can try to debug this in master. On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang wrote: > I created a ticket for this: > > https://issues.apache.org/jira/browse/SPARK-4757 > > > Jianshi > > On Fri, Dec 5, 2014 at

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > > I have done so thats why spark is able to load objectfile [e.g. person_obj] > and spark has maintained serialVersionUID [person_obj]. > > Next time when I am trying to load another objectfile [e.g

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
I created a ticket for this: https://issues.apache.org/jira/browse/SPARK-4757 Jianshi On Fri, Dec 5, 2014 at 1:31 PM, Jianshi Huang wrote: > Correction: > > According to Liancheng, this hotfix might be the root cause: > > > https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Tobias, Thanks for the quick reply. Definitely, after a restart the case classes need to be defined again. I have done so, that's why spark is able to load objectfile [e.g. person_obj] and spark has maintained serialVersionUID [person_obj]. Next time when I am trying to load another objectfile [e.g. office

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Correction: According to Liancheng, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang wrote: > Looks like the datanucleus*.jar shouldn't appear in the hdfs path in > Yarn

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > > I have created objectfiles [person_obj,office_obj] from > csv[person_csv,office_csv] files using case classes[person,office] with API > (saveAsObjectFile) > > Now I restarted spark-shell and load

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in Yarn-client mode. Maybe this patch broke yarn-client. https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Jianshi On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang wrote: > Actually my HADOOP_CLASSPA

Re: SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
With some quick googling, I learnt that we can provide "distribute by " in Hive QL to distribute data based on a column's values. My question now is: if I use "distribute by id", will there be any performance improvements? Will I be able to avoid data movement in the shuffle (Exchange before the JOIN step)

Re: Window function by Spark SQL

2014-12-04 Thread Cheng Lian
Window functions are not supported yet, but there is a PR for it: https://github.com/apache/spark/pull/2953 On 12/5/14 12:22 PM, Dai, Kevin wrote: Hi, ALL How can I group by one column and order by another one, then select the first row for each group (which is just like window function doi
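Until that PR lands, one common workaround for this particular pattern (not from this thread, just a sketch with made-up field names) is a plain pair-RDD reduction:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    case class Record(groupCol: String, orderCol: Long, payload: String)

    // Keep, per group, the row with the smallest orderCol -- the same result a
    // ROW_NUMBER() OVER (PARTITION BY groupCol ORDER BY orderCol) = 1 query would give.
    def firstRowPerGroup(rows: RDD[Record]): RDD[Record] =
      rows
        .keyBy(_.groupCol)
        .reduceByKey((a, b) => if (a.orderCol <= b.orderCol) a else b)
        .values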

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Hi Tobias, Thanks Tobias for your response. I have created objectfiles [person_obj,office_obj] from csv[person_csv,office_csv] files using case classes[person,office] with API (saveAsObjectFile) Now I restarted spark-shell and load objectfiles using API(objectFile). *Once any of one object-cla

SparkContext.textfile() cannot load file using UNC path on windows

2014-12-04 Thread Ningjun Wang
SparkContext.textfile() cannot load file using UNC path on windows I run the following on Windows XP val conf = new SparkConf().setAppName("testproj1.ClassificationEngine").setMaster("local") val sc = new SparkContext(conf) sc.textFile(raw"\\10.209.128.150\TempShare\SvmPocData\reuters-two-categor

Window function by Spark SQL

2014-12-04 Thread Dai, Kevin
Hi, ALL How can I group by one column and order by another one, then select the first row for each group (which is just like window function doing) by SparkSQL? Best Regards, Kevin.

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > Is it a limitation that spark does not support more than one case class at > a > time. > What do you mean? I do not have the slightest idea what you *could* possibly mean by "to support a case class". T

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Actually my HADOOP_CLASSPATH has already been set to include /etc/hadoop/conf/* export HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase classpath) Jianshi On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang wrote: > Looks like somehow Spark failed to fin

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread bonnahu
Hi Ted, Here is the information about the Regions (Region Server / Region Count): http://regionserver1:60030/ = 44, http://regionserver2:60030/ = 39, http://regionserver3:60030/ = 55 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Loading-

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like somehow Spark failed to find the core-site.xml in /etc/hadoop/conf I've already set the following env variables: export YARN_CONF_DIR=/etc/hadoop/conf export HADOOP_CONF_DIR=/etc/hadoop/conf export HBASE_CONF_DIR=/etc/hbase/conf Should I put $HADOOP_CONF_DIR/* to HADOOP_CLASSPATH? Jia

SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Is it a limitation that spark does not support more than one case class at a time. Regards, Rahul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20415.html Sent from the Apache Spark User Lis

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread bonnahu
Hi, Here is the configuration of the cluster: Workers: 2 For each worker, Cores: 24 Total, 0 Used Memory: 69.6 GB Total, 0.0 B Used For the spark.executor.memory, I didn't set it, so it should be the default value 512M. How much space does one row only consisting of the 3 columns consume? the s

Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
I got the following error during Spark startup (Yarn-client mode): 14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar -> hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.

Re: Market Basket Analysis

2014-12-04 Thread Rohit Pujari
Sure, I'm looking to perform frequent item set analysis on a POS data set. Apriori is a classic algorithm used for such tasks. Since the Apriori implementation is not part of MLlib yet (see https://issues.apache.org/jira/browse/SPARK-4001), what are some other options/algorithms I could use to perfor

spark assembly jar caused "changed on src filesystem" error

2014-12-04 Thread Hu, Leo
Hi all when I execute: /spark-1.1.1-bin-hadoop2.4/bin/spark-submit --verbose --master yarn-cluster --class spark.SimpleApp --jars /spark-1.1.1-bin-hadoop2.4/lib/spark-assembly-1.1.1-hadoop2.4.0.jar --executor-memory 1G --num-executors 2 /spark-1.1.1-bin-hadoop2.4/testfile/simple-project-1.0.ja

Re: representing RDF literals as vertex properties

2014-12-04 Thread Ankur Dave
At 2014-12-04 16:26:50 -0800, spr wrote: > I'm also looking at how to represent literals as vertex properties. It seems > one way to do this is via positional convention in an Array/Tuple/List that is > the VD; i.e., to represent height, weight, and eyeColor, the VD could be a > Tuple3(Double, Dou

Re: scopt.OptionParser

2014-12-04 Thread Caron
Hi, I ran into the same problem posted in this thread earlier when I tried to write my own program: "$ spark-submit --class "someClass" --master local[4] target/scala-2.10/someclass_2.10-1.0.jar " gives me "Exception in thread "main" java.lang.NoClassDefFoundError: scopt/OptionParser" I tried yo

Re: Stateful mapPartitions

2014-12-04 Thread Tobias Pfeiffer
Hi, On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya wrote: > Is it possible to have some state across multiple calls to mapPartitions > on each partition, for instance, if I want to keep a database connection > open? > If you're using Scala, you can use a singleton object, this will exist once pe
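A sketch of that singleton-object approach (the JDBC URL, credentials, table and query are placeholders; the point is that the lazy val is initialised once per executor JVM and then reused across mapPartitions calls):

    import java.sql.{Connection, DriverManager}
    import org.apache.spark.rdd.RDD

    object ConnectionHolder {
      // Created the first time a task on this executor touches it, then reused.
      lazy val connection: Connection =
        DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb", "app", "secret")
    }

    def enrich(ids: RDD[Long]): RDD[(Long, String)] =
      ids.mapPartitions { iter =>
        val conn = ConnectionHolder.connection   // same connection on every call on this executor
        val st = conn.prepareStatement("SELECT name FROM users WHERE id = ?")
        iter.map { id =>
          st.setLong(1, id)
          val rs = st.executeQuery()
          val name = if (rs.next()) rs.getString(1) else ""
          rs.close()
          (id, name)
        }
      }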

Re: Market Basket Analysis

2014-12-04 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari wrote: > > I'd like to do market basket analysis using spark, what're my options? > To do it or not to do it ;-) Seriously, could you elaborate a bit on what you want to know? Tobias

Re: Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
I want to have a database connection per partition of the RDD, and then reuse that connection whenever mapPartitions is called, which results in compute being called on the partition. On Thu, Dec 4, 2014 at 11:07 AM, Paolo Platter wrote: > Could you provide some further details ? > What do you

representing RDF literals as vertex properties

2014-12-04 Thread spr
@ankurdave's concise code at https://gist.github.com/ankurdave/587eac4d08655d0eebf9, responding to an earlier thread (http://apache-spark-user-list.1001560.n3.nabble.com/How-to-construct-graph-in-graphx-tt16335.html#a16355) shows how to build a graph with multiple edge-types ("predicates" in RDF-sp

Getting all the results from MySQL table

2014-12-04 Thread gargp
Hi, I am a new spark and scala user. Was trying to use JdbcRDD to query a MySQL table. It needs a lowerbound and upperbound as parameters, but I want to get all the records from the table in a single query. Is there a way I can do that? -- View this message in context: http://apache-spark-user
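JdbcRDD always binds its two '?' placeholders to per-partition bounds, so one way to read the whole table (a sketch, not from this thread; connection details and table/column names are placeholders, and the JDBC driver must be on the classpath) is to make the bounds span the full key range and use a single partition:

    import java.sql.DriverManager
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    def loadAll(sc: SparkContext, minId: Long, maxId: Long): JdbcRDD[(Long, String)] =
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection("jdbc:mysql://db-host:3306/shop", "app", "secret"),
        "SELECT id, name FROM products WHERE id >= ? AND id <= ?",
        minId, maxId,   // e.g. the result of SELECT MIN(id), MAX(id) FROM products
        1,              // one partition = one query covering everything
        rs => (rs.getLong("id"), rs.getString("name")))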

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread Ted Yu
bonnahu: How many regions does your table have ? Are they evenly distributed ? Cheers On Thu, Dec 4, 2014 at 3:34 PM, Jörn Franke wrote: > Hi, > > What is your cluster setup? How mich memory do you have? How much space > does one row only consisting of the 3 columns consume? Do you run other >

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread Jörn Franke
Hi, What is your cluster setup? How much memory do you have? How much space does one row only consisting of the 3 columns consume? Do you run other stuff in the background? Best regards On 04.12.2014 23:57, "bonnahu" wrote: > I am trying to load a large Hbase table into SPARK RDD to run a Spar

printing mllib.linalg.vector

2014-12-04 Thread debbie
Basic question: What is the best way to loop through one of these and print their components? Convert them to an array? Thanks Deb

Re: Spark 1.1.0 Can not read snappy compressed sequence file

2014-12-04 Thread Stéphane Verlet
Yes, it is working with this in spark-env.sh: export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HO

Re: SQL query in scala API

2014-12-04 Thread Stéphane Verlet
Disclaimer: I am new at Spark. I did something similar in a prototype which works, but I did not test it at scale yet. val agg = users.mapValues(_ => 1).aggregateByKey(new CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp) class CustomAggregation() extends

Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread bonnahu
I am trying to load a large Hbase table into SPARK RDD to run a SparkSQL query on the entity. For an entity with about 6 million rows, it will take about 35 seconds to load it to RDD. Is it expected? Is there any way to shorten the loading process? I have been getting some tips from http://hbase.ap

Re: How to make symbol for one column in Spark SQL.

2014-12-04 Thread Tim Chou
... Thank you! I'm so stupid... This is the only thing I miss in the tutorial...orz Thanks, Tim 2014-12-04 16:49 GMT-06:00 Michael Armbrust : > You need to import sqlContext._ > > On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou wrote: > >> I have tried to use function where and filter in SchemaRDD. >

Re: How to make symbol for one column in Spark SQL.

2014-12-04 Thread Michael Armbrust
You need to import sqlContext._ On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou wrote: > I have tried to use function where and filter in SchemaRDD. > > I have build class for tuple/record in the table like this: > case class Region(num:Int, str1:String, str2:String) > > I also successfully create a Sc
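For context, a sketch of how this looks with the 1.1/1.2-era language-integrated query DSL (the local master and sample rows are just for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Region(num: Int, str1: String, str2: String)

    val sc = new SparkContext(new SparkConf().setAppName("symbol-dsl").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    // Brings in createSchemaRDD plus the implicits that turn a Symbol like 'num into a column reference
    import sqlContext._

    val regions = sc.parallelize(Seq(Region(1, "a", "b"), Region(2, "c", "d")))
    regions.registerTempTable("region")

    val results = sqlContext.sql("SELECT * FROM region")
    results.where('num === 1).collect().foreach(println)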

How to extend an one-to-one RDD of Spark that can be persisted?

2014-12-04 Thread Peng Cheng
In my project I extend a new RDD type that wraps another RDD and some metadata. The code I use is similar to FilteredRDD implementation: case class PageRowRDD( self: RDD[PageRow], @transient keys: ListSet[KeyLike] = ListSet() ){ override def getPartitions: A

How to make symbol for one column in Spark SQL.

2014-12-04 Thread Tim Chou
I have tried to use function where and filter in SchemaRDD. I have build class for tuple/record in the table like this: case class Region(num:Int, str1:String, str2:String) I also successfully create a SchemaRDD. scala> val results = sqlContext.sql("select * from region") results: org.apache.spa

Unable to run applications on clusters on EC2

2014-12-04 Thread Xingwei Yang
I think it is related to my previous questions, but I separate them. In my previous question, I could not connect to the WebUI even though I could log into the cluster without any problem. Also, I tried lynx localhost:8080 and I could get the information about the cluster; I could also use spark-sub

Re: Spark SQL table Join, one task is taking long

2014-12-04 Thread Veeranagouda Mukkanagoudar
Have you tried joins on regular RDD instead of schemaRDD? We have found that its 10 times faster than joins between schemaRDDs. val largeRDD = ... val smallRDD = ... largeRDD.join(smallRDD) // otherway JOIN would run for long. Only limitation i see with that implementation is regular RDD suppor

spark-ec2 Web UI Problem

2014-12-04 Thread Xingwei Yang
Hi Guys: I have successfully installed apache-spark on Amazon EC2 using the spark-ec2 command and I could log in to the master node. Here is the installation message: RSYNC'ing /etc/ganglia to slaves... ec2-54-148-197-89.us-west-2.compute.amazonaws.com Shutting down GANGLIA gmond: [FAILED] Start

Re: Spark SQL table Join, one task is taking long

2014-12-04 Thread Venkat Subramanian
Hi Cheng, Thank you very much for taking your time and providing a detailed explanation. I tried a few things you suggested and some more things. The ContactDetail table (8 GB) is the fact table and DAgents is the Dim table (<500 KB), reverse of what you are assuming, but your ideas still apply.

Re: SQL query in scala API

2014-12-04 Thread Arun Luthra
Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian wrote: > You may do this: > > table("users").groupBy('zip)('zip, count('user), countDistinct('user)) > > On 12/4/14 8:47 AM, Arun Luthra wrote: > > I'm wondering how to do this kind
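It is possible with plain RDDs, at the cost of writing the aggregation by hand. A sketch (not from this thread) over (zip, user) pairs; note that carrying a Set per key can get expensive for very high-cardinality groups:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Per zip: (number of records, number of distinct users)
    def userStatsByZip(pairs: RDD[(String, String)]): RDD[(String, (Long, Long))] =
      pairs
        .aggregateByKey((0L, Set.empty[String]))(
          { case ((n, users), user) => (n + 1, users + user) },     // fold one record into the accumulator
          { case ((n1, u1), (n2, u2)) => (n1 + n2, u1 ++ u2) })     // merge per-partition accumulators
        .mapValues { case (count, users) => (count, users.size.toLong) }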

Re: Any ideas why a few tasks would stall

2014-12-04 Thread akhandeshi
This did not work for me, that is, rdd.coalesce(200, forceShuffle). Does anyone have ideas on how to distribute your data evenly and co-locate partitions of interest? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-ideas-why-a-few-tasks-would-stall-tp

Re: Spark metrics for ganglia

2014-12-04 Thread danilopds
Hello Samudrala, Did you solve this issue about view metrics in Ganglia?? Because I have the same problem. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-metrics-for-ganglia-tp14335p20385.html Sent from the Apache Spark User List mailing lis

RE: Determination of number of RDDs

2014-12-04 Thread Kapil Malik
Regarding: Can we create such an array and then parallelize it? Parallelizing an array of RDDs -> i.e. RDD[RDD[x]] is not possible. RDD is not serializable. From: Deep Pradhan [mailto:pradhandeep1...@gmail.com] Sent: 04 December 2014 15:39 To: user@spark.apache.org Subject: Determination of numbe

Re: Can not see any spark metrics on ganglia-web

2014-12-04 Thread danilopds
I used the command below because I'm using Spark 1.0.2 built with SBT and it worked. SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_GANGLIA_LGPL=true sbt/sbt assembly -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-gangli

how do you turn off info logging when running in local mode

2014-12-04 Thread Ron Ayoub
I have not yet gotten to the point of running standalone. In my case I'm still working on the initial product and I'm running directly in Eclipse and I've compiled using the Spark maven project since the downloadable spark binaries require Hadoop. With that said, I'm running fine and I have thin

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread Davies Liu
Which version of Spark are you using? inferSchema() is improved to support empty dict in 1.2+, could you try the 1.2-RC1? Also, you can use applySchema(): from pyspark.sql import * fields = [StructField('field1', IntegerType(), True), StructField('field2', StringType(), True), Struc

Re: Spark SQL with a sorted file

2014-12-04 Thread Michael Armbrust
I'll add that some of our data formats will actually infer this sort of useful information automatically. Both parquet and cached in-memory tables keep statistics on the min/max value for each column. When you have predicates over these sorted columns, partitions will be eliminated if they can't pos

R: Stateful mapPartitions

2014-12-04 Thread Paolo Platter
Could you provide some further details? What do you need to do with the db connection? Paolo Sent from my Windows Phone From: Akshat Aranya Sent: 04/12/2014 18:57 To: user@spark.apache.org Subject: Statefu

Re: Spark executor lost

2014-12-04 Thread S. Zhou
Here is a sample exception I collected from a spark worker node: (there are many such errors across over work nodes). It looks to me that spark worker failed to communicate to executor locally. 14/12/04 04:26:37 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@spark-prod1.xxx:7079]

Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
Is it possible to have some state across multiple calls to mapPartitions on each partition, for instance, if I want to keep a database connection open?

How can a function get a TaskContext

2014-12-04 Thread Steve Lewis
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/TaskContext.java has a Java implementation of TaskContext with a very useful method /** * Return the currently active TaskContext. This can be called inside of * user functions to access contextual information about run
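A small usage sketch (Spark 1.2+, assuming the spark-shell's sc): tag each element with the id of the partition whose task produced it:

    import org.apache.spark.SparkContext._
    import org.apache.spark.TaskContext

    val tagged = sc.parallelize(1 to 100, 4).map { x =>
      val ctx = TaskContext.get()      // the currently active TaskContext, available inside user functions
      (ctx.partitionId(), x)
    }
    tagged.countByKey().foreach(println)   // roughly 25 elements per partition id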

Re: Any ideas why a few tasks would stall

2014-12-04 Thread Steve Lewis
Thanks - I found the same thing - calling boolean forceShuffle = true; myRDD = myRDD.coalesce(120,forceShuffle ); worked - there were 120 partitions but forcing a shuffle distributes the work I believe there is a bug in my code causing memory to accumulate as partitions grow in si

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread Gerard Maas
I guess he's already doing so, given the 'saveToCassandra' usage. What I don't understand is the question "how do I specify a batch". That doesn't make much sense to me. Could you explain further? -kr, Gerard. On Thu, Dec 4, 2014 at 5:36 PM, Akhil Das wrote: > You can use the datastax's Cassand

Failed to read chunk exception

2014-12-04 Thread Steve Lewis
I am running a large job using 4000 partitions - after running for four hours on a 16 node cluster it fails with the following message. The errors are in Spark code and seem to point to unreliability at the level of the disk - has anyone seen this, and does anyone know what is going on and how to fix it? Exception in

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread Akhil Das
You can use the datastax's Cassandra connector. Thanks Best Regards On Thu, Dec 4, 2014 at 8:21 PM, wrote: > Hi, > > > I have written the code below which is streaming data from kafka, and > printing to the co

Re: Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Sean Owen
You probably want to use combineByKey, and create an empty min queue for each key. Merge values into the queue if its size is < K. If >= K, only merge the value if it exceeds the smallest element; if so add it and remove the smallest element. This gives you an RDD of keys mapped to collections of
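A simplified variant of that idea (a sketch, not from this thread) using aggregateByKey with a small sorted buffer instead of an explicit priority queue; a bounded heap is the better choice for large K:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Keep only the k largest values per key; the accumulator never grows beyond k elements.
    def topKPerKey(rdd: RDD[(String, Double)], k: Int): RDD[(String, Vector[Double])] = {
      def insert(acc: Vector[Double], v: Double): Vector[Double] = {
        val merged = (acc :+ v).sorted          // ascending, smallest candidate first
        if (merged.size > k) merged.drop(merged.size - k) else merged
      }
      rdd
        .aggregateByKey(Vector.empty[Double])(
          insert,                               // fold a value into a partition-local top-k
          (a, b) => b.foldLeft(a)(insert))      // merge two partition-local top-k buffers
        .mapValues(_.reverse)                   // largest value first
    }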

Spark-Streaming: output to cassandra

2014-12-04 Thread m.sarosh
Hi, I have written the code below which is streaming data from kafka, and printing to the console. I want to extend this, and want my data to go into Cassandra table instead. JavaStreamingContext jssc = new JavaStreamingContext("local[4]", "SparkStream", new Duration(1000)); JavaPairReceiver

Market Basket Analysis

2014-12-04 Thread Rohit Pujari
Hello Folks: I'd like to do market basket analysis using spark, what're my options? Thanks, Rohit Pujari Solutions Architect, Hortonworks -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that

Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Theodore Vasiloudis
Hello everyone, I was wondering what is the most efficient way for retrieving the top K values per key in a (key, value) RDD. The simplest way I can think of is to do a groupByKey, sort the iterables and then take the top K elements for every key. But reduceByKey is an operation that can be ver

Re: How to Integrate openNLP with Spark

2014-12-04 Thread Nikhil
Did anyone get a chance to look at this? Please provide some help. Thanks Nikhil -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Integrate-openNLP-with-Spark-tp20117p20368.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Spark Streaming empty RDD issue

2014-12-04 Thread Shao, Saisai
Hi, According to my knowledge of current Spark Streaming Kafka connector, I think there's no chance for APP user to detect such kind of failure, this will either be done by Kafka consumer with ZK coordinator, either by ReceiverTracker in Spark Streaming, so I think you don't need to take care o

Re: Does filter on an RDD scan every data item ?

2014-12-04 Thread nsareen
I'm not sure sample is what i was looking for. As mentioned in another post above. this is what i'm looking for. 1) My RDD contains this structure. Tuple2. 2) Each CustomTuple is a combination of string id's e.g. CustomTuple.dimensionOne="AE232323" CustomTuple.dimensionTwo="BE232323" CustomTupl

Re: Does filter on an RDD scan every data item ?

2014-12-04 Thread nsareen
Thanks for the reply! To be honest, I was expecting spark to have some sort of Indexing for keys, which would help it locate the keys efficiently. I wasn't using Spark SQL here, but if it helps perform this efficiently, i can try it out, can you please elaborate, how will it be helpful in this sc

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread sahanbull
Hi Davies, Thanks for the reply The problem is I have empty dictionaries in my field3 as well. It gives me an error : Traceback (most recent call last): File "", line 1, in File "/root/spark/python/pyspark/sql.py", line 1042, in inferSchema schema = _infer_schema(first) File "/root

Re: map function

2014-12-04 Thread Yifan LI
Thanks, Paolo and Mark. :) > On 04 Dec 2014, at 11:58, Paolo Platter wrote: > > Hi, > > rdd.flatMap( e => e._2.map( i => ( i, e._1))) > > Should work, but I didn't test it so maybe I'm missing something. > > Paolo > > Sent from my Windows Phone > From: Yifan LI

Re: Example usage of StreamingListener

2014-12-04 Thread Hafiz Mujadid
Thanks Akhil, you are very helpful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Example-usage-of-StreamingListener-tp20357p20362.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: MLlib Naive Bayes classifier confidence

2014-12-04 Thread MariusFS
That was it, Thanks. (Posting here so people know it's the right answer in case they have the same need :) ). sowen wrote > Probabilities won't sum to 1 since this expression doesn't incorporate > the probability of the evidence, I imagine? it's constant across > classes so is usually excluded.

Re: Example usage of StreamingListener

2014-12-04 Thread Akhil Das
Here you go http://stackoverflow.com/questions/20950268/how-to-stop-spark-streaming-context-when-the-network-connection-tcp-ip-is-clos Thanks Best Regards On Thu, Dec 4, 2014 at 4:30 PM, Hafiz Mujadid wrote: > Hi! > > does anybody has some useful example of StreamingListener interface. When > a

Re: running Spark Streaming just once and stop it

2014-12-04 Thread Hafiz Mujadid
Hi Kal El! Have you managed to stop streaming after the first iteration? If yes, can you share example code? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/running-Spark-Streaming-just-once-and-stop-it-tp1382p20359.html Sent from the Apache Spark User Li

R: map function

2014-12-04 Thread Paolo Platter
Hi, rdd.flatMap( e => e._2.map( i => ( i, e._1))) Should work, but I didn't test it so maybe I'm missing something. Paolo Sent from my Windows Phone From: Yifan LI Sent: 04/12/2014 09:27 To: user@spark.apache.org

Example usage of StreamingListener

2014-12-04 Thread Hafiz Mujadid
Hi! does anybody has some useful example of StreamingListener interface. When and how can we use this interface to stop streaming when one batch of data is processed? Thanks alot -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Example-usage-of-StreamingLis

Re: GraphX Pregel halting condition

2014-12-04 Thread Ankur Dave
There's no built-in support for doing this, so the best option is to copy and modify Pregel to check the accumulator at the end of each iteration. This is robust and shouldn't be too hard, since the Pregel code is short and only uses public GraphX APIs. Ankur At 2014-12-03 09:37:01 -0800, Jay

Re: MLLib: loading saved model

2014-12-04 Thread manish_k
Hi Sameer, Your model recreation should be: val model = new LinearRegressionModel(weights, intercept) As you have already got weights for linear regression model using stochastic gradient descent, you just have to use LinearRegressionModel to construct new model. Other points to notice is that w
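A minimal sketch of the recreation step (the weight and intercept values here are placeholders for whatever was persisted earlier):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LinearRegressionModel

    val weights = Vectors.dense(0.5, -1.2, 3.4)   // previously saved weights
    val intercept = 0.1                           // previously saved intercept
    val model = new LinearRegressionModel(weights, intercept)

    val prediction = model.predict(Vectors.dense(1.0, 2.0, 3.0))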

Re: Determination of number of RDDs

2014-12-04 Thread Ankur Dave
At 2014-12-04 02:08:45 -0800, Deep Pradhan wrote: > I have a graph and I want to create RDDs equal in number to the nodes in > the graph. How can I do that? > If I have 10 nodes then I want to create 10 rdds. Is that possible in > GraphX? This is possible: you can collect the elements to the driv
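A sketch of that approach (the helper name is made up): the Array[RDD[_]] lives only on the driver, which is fine because the array itself is never shipped to executors:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def onePerVertex(sc: SparkContext, vertexIds: RDD[Long]): Array[RDD[Long]] = {
      val ids = vertexIds.collect()               // bring the (small) vertex list to the driver
      ids.map(id => sc.parallelize(Seq(id)))      // one single-element RDD per vertex
    }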

Re: map function

2014-12-04 Thread Mark Hamstra
rdd.flatMap { case (k, coll) => coll.map { elem => (elem, k) } } On Thu, Dec 4, 2014 at 1:26 AM, Yifan LI wrote: > Hi, > > I have a RDD like below: > (1, (10, 20)) > (2, (30, 40, 10)) > (3, (30)) > … > > Is there any way to map it to this: > (10,1) > (20,1) > (30,2) > (40,2) > (10,2) > (30,3) >
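A quick way to check it in the spark-shell (assuming the shell's sc):

    val rdd = sc.parallelize(Seq(
      (1, Seq(10, 20)),
      (2, Seq(30, 40, 10)),
      (3, Seq(30))))

    val inverted = rdd.flatMap { case (k, coll) => coll.map(elem => (elem, k)) }
    inverted.collect().foreach(println)
    // (10,1) (20,1) (30,2) (40,2) (10,2) (30,3)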

Determination of number of RDDs

2014-12-04 Thread Deep Pradhan
Hi, I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 rdds. Is that possible in GraphX? Like in the C language we have arrays of pointers. Do we have arrays of RDDs in Spark? Can we create such an array and t

SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
Hi All, I want to hash partition (and then cache) a schema RDD in way that partitions are based on hash of the values of a column ("ID" column in my case). e.g. if my table has "ID" column with values as 1,2,3,4,5,6,7,8,9 and spark.sql.shuffle.partitions is configured as 3, then there should be
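For the plain-RDD side of this, a sketch (the row type and partition count are illustrative): key the rows by the ID column, hash-partition, and cache. Two RDDs that share the same partitioner are co-partitioned, so a join on that key avoids reshuffling the cached side:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    case class TableRow(id: Int, payload: String)

    def partitionById(rows: RDD[TableRow]): RDD[(Int, TableRow)] =
      rows
        .keyBy(_.id)
        .partitionBy(new HashPartitioner(3))
        .persist(StorageLevel.MEMORY_AND_DISK)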

Re: Problem creating EC2 cluster using spark-ec2

2014-12-04 Thread Dave Challis
Fantastic, thanks for the quick fix! On 3 December 2014 at 22:11, Andrew Or wrote: > This should be fixed now. Thanks for bringing this to our attention. > > 2014-12-03 13:31 GMT-08:00 Andrew Or : > >> Yeah this is currently broken for 1.1.1. I will submit a fix later today. >> >> 2014-12-02 17:1

Re: MLLIB model export: PMML vs MLLIB serialization

2014-12-04 Thread manish_k
Hi Sourabh, I came across same problem as you. One workable solution for me was to serialize the parts of model that can be used again to recreate it. I serialize RDD's in my model using saveAsObjectFile with a time stamp attached to it in HDFS. My other spark application read from the latest stor

Re: [question]Where can I get the log file

2014-12-04 Thread Prannoy
Hi, You can access your logs in your /spark_home_directory/logs/ directory . cat the file names and you will get the logs. Thanks. On Thu, Dec 4, 2014 at 2:27 PM, FFeng [via Apache Spark User List] < ml-node+s1001560n20344...@n3.nabble.com> wrote: > I have wrote data to spark log. > I get it t

map function

2014-12-04 Thread Yifan LI
Hi, I have a RDD like below: (1, (10, 20)) (2, (30, 40, 10)) (3, (30)) … Is there any way to map it to this: (10,1) (20,1) (30,2) (40,2) (10,2) (30,3) … generally, for each element, it might be mapped to multiple. Thanks in advance! Best, Yifan LI

Re: Monitoring Spark

2014-12-04 Thread Sameer Farooqui
Are you running Spark in Local or Standalone mode? In either mode, you should be able to hit port 4040 (to see the Spark Jobs/Stages/Storage/Executors UI) on the machine where the driver is running. However, in local mode, you won't have a Spark Master UI on 7080 or a Worker UI on 7081. You can ma
