Subscription request

2014-10-22 Thread Sathya
Hi, Kindly subscribe me to the user group. Regards, Sathyanarayanan

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-22 Thread Akhil Das
Hi Buntu, You could do something similar to the following: val receiver_stream = new ReceiverInputDStream(ssc) { override def getReceiver(): Receiver[Nothing] = ??? //Whatever }.map((x: String) => (null, x)) val config = new Configuration() config.set("mongo.output.uri",
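A minimal sketch of the full write path, assuming the stream carries plain strings and a text output format is acceptable (the socket source, paths, and key/value classes below are placeholders standing in for the custom receiver, not Buntu's actual setup):

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SaveStreamToHadoop {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("save-stream"), Seconds(10))

    // Any DStream[String] works here; a socket stream stands in for the custom receiver.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines
      .map(line => (NullWritable.get(), new Text(line)))   // key/value pairs for the Hadoop API
      .foreachRDD { (rdd, time) =>
        // One output directory per batch, written via the new Hadoop output format API.
        rdd.saveAsNewAPIHadoopFile(
          s"hdfs:///tmp/stream-out/batch-${time.milliseconds}",
          classOf[NullWritable],
          classOf[Text],
          classOf[TextOutputFormat[NullWritable, Text]])
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```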

Re: Spark - HiveContext - Unstructured Json

2014-10-22 Thread Harivardan Jayaraman
For me inference is not an issue as compared to persistence. Imagine a Streaming application where the input is JSON whose format can vary from row to row and whose format I cannot pre-determine. I can use `sqlContext.jsonRDD` , but once I have the `SchemaRDD`, there is no way for me to update the

Re: rdd.checkpoint() : driver-side lineage??

2014-10-22 Thread harsh2005_7
Hi, I am no expert but my best guess is that it's a 'closure' problem. Spark internally serializes the closure of all the variables outside the map operation's scope that are used inside it, and runs a serialization check on the map task. Since class scala.util.Random is not serializable it
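As an illustration of the kind of fix that usually resolves this (a hedged sketch, not the original poster's code): keep the non-serializable object out of the captured scope and create it inside the task instead, for example once per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RandomInClosure {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo").setMaster("local[2]"))
    val data = sc.parallelize(1 to 100)

    // Problematic pattern: `rng` lives on the driver, gets pulled into the task closure,
    // and fails the serialization check when Random is not serializable:
    //   val rng = new scala.util.Random(42)
    //   data.map(x => x + rng.nextInt(10))

    // Workaround: build the Random inside the task, one instance per partition.
    val jittered = data.mapPartitions { iter =>
      val rng = new scala.util.Random(42)
      iter.map(x => x + rng.nextInt(10))
    }

    jittered.take(5).foreach(println)
    sc.stop()
  }
}
```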

SchemaRDD Convert

2014-10-22 Thread Dai, Kevin
Hi all, I have an RDD of a case class T, and T contains several primitive types and a Map. How can I convert this to a SchemaRDD? Best Regards, Kevin.

Python vs Scala performance

2014-10-22 Thread Marius Soutier
Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple of word count-like Scala jobs that essentially pull in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes to

RE: Python vs Scala performance

2014-10-22 Thread Ashic Mahtab
I'm no expert, but I looked into how the Python bits work a while back (was trying to assess what it would take to add F# support). It seems Python hosts a JVM inside of it, and talks to Scala Spark in that JVM. The Python server bit translates the Python calls to those in the JVM. The python

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
What version of Spark are you running? Some recent changes https://spark.apache.org/releases/spark-release-1-1-0.html to how PySpark works relative to Scala Spark may explain things. PySpark should not be that much slower, not by a stretch. On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not that... On 22.10.2014, at 13:02, Nicholas Chammas nicholas.cham...@gmail.com wrote: What version of Spark are you running? Some recent changes to how PySpark works relative to Scala Spark may explain things.

RE: SchemaRDD Convert

2014-10-22 Thread Cheng, Hao
You needn't do anything, the implicit conversion should do this for you. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L103

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
Thanks Matt, Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. My question still remains: Is it better to foreachRDD early in the process or do as much Dstream transformations before going
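To make the contrast concrete (a sketch with a made-up `parse` function, not Gerard's actual pipeline): both forms below do the same per-record work, and the DStream-level operators declared on the driver only assemble the plan that is later applied to each batch RDD.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

object DStreamVsForeachRDD {
  // Hypothetical parser, used only for illustration.
  def parse(line: String): (String, Int) = (line, line.length)

  def viaDStreamOps(lines: DStream[String]): Unit =
    lines.map(parse)                                       // transformations declared on the DStream
      .reduceByKey(_ + _)
      .foreachRDD(rdd => rdd.take(10).foreach(println))    // output action at the end

  def viaForeachRDD(lines: DStream[String]): Unit =
    lines.foreachRDD { rdd =>                              // same work, expressed per batch RDD
      rdd.map(parse)
        .reduceByKey(_ + _)
        .take(10)
        .foreach(println)
    }
}
```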

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
Total guess without knowing anything about your code: Do either of these two notes from the 1.1.0 release notes http://spark.apache.org/releases/spark-release-1-1-0.html affect things at all? - PySpark now performs external spilling during aggregations. Old behavior can be restored by

RE: spark sql query optimization , and decision tree building

2014-10-22 Thread Cheng, Hao
The “output” variable is actually a SchemaRDD; it provides lots of DSL API, see http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD 1) How to save result values of a query into a list? [CH:] val list: Array[Row] = output.collect, however get 1M records into

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, Yes, I can always reproduce the issue: about your workload, Spark configuration, JDK version and OS version? I ran SparkPi 1000 java -version java version 1.7.0_67 Java(TM) SE Runtime Environment (build 1.7.0_67-b01) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) cat

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
PS: Just to clarify my statement: Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. With feared RDD operations on the driver I meant to contrast an rdd action like rdd.collect that would

RE: Spark Hive Snappy Error

2014-10-22 Thread Shao, Saisai
Thanks a lot, I will try to reproduce this in my local settings and dig into the details, thanks for your information. BR Jerry From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] Sent: Wednesday, October 22, 2014 8:35 PM To: Shao, Saisai Cc: arthur.hk.c...@gmail.com; user

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, FYI, I use snappy-java-1.0.4.1.jar Regards Arthur On 22 Oct, 2014, at 8:59 pm, Shao, Saisai saisai.s...@intel.com wrote: Thanks a lot, I will try to reproduce this in my local settings and dig into the details, thanks for your information. BR Jerry From:

Re: Python vs Scala performance

2014-10-22 Thread Arian Pasquali
Interesting thread Marius, Btw, I'm curious about your cluster size. How small is it in terms of RAM and cores? Arian 2014-10-22 13:17 GMT+01:00 Nicholas Chammas nicholas.cham...@gmail.com: Total guess without knowing anything about your code: Do either of these two notes from the 1.1.0

confused about memory usage in spark

2014-10-22 Thread Darin McBeath
I have a PairRDD of type <String, String> which I persist to S3 (using the following code). JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes()); aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class, SequenceFileOutputFormat.class); class

Re: confused about memory usage in spark

2014-10-22 Thread Akhil Das
You can enable rdd compression (spark.rdd.compress); also you can use MEMORY_ONLY_SER (sc.sequenceFile[String,String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER)) to reduce the rdd size in memory. Thanks Best Regards On Wed, Oct 22, 2014 at 7:51 PM, Darin McBeath
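Put together, that suggestion might look roughly like the following (the bucket path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("cached-sequence-file")
  .set("spark.rdd.compress", "true")          // compress cached partitions stored in serialized form

val sc = new SparkContext(conf)

// Cache the pairs serialized instead of as deserialized Java objects.
val pairs = sc.sequenceFile[String, String]("s3n://somebucket/part-00000")
  .persist(StorageLevel.MEMORY_ONLY_SER)

println(pairs.count())
```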

[no subject]

2014-10-22 Thread Margusja
unsubscribe - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: confused about memory usage in spark

2014-10-22 Thread Sean Owen
One thing to remember is that Strings are composed of chars in Java, which take 2 bytes each. The encoding of the text on disk on S3 is probably something like UTF-8, which takes much closer to 1 byte per character for English text. This might explain the factor of ~2 difference. On Wed, Oct 22,

Re:

2014-10-22 Thread Ted Yu
See first section of http://spark.apache.org/community On Wed, Oct 22, 2014 at 7:42 AM, Margusja mar...@roo.ee wrote: unsubscribe - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail:

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Didn’t seem to help: conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12") sc = SparkContext(appName='app_name', conf=conf) but still taking as much time. On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote: Total guess without knowing

Rdd of Rdds

2014-10-22 Thread Tomer Benyamini
Hello, I would like to parallelize my work on multiple RDDs I have. I wanted to know if spark can support a foreach on an RDD of RDDs. Here's a java example: public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testapp");

Re: Rdd of Rdds

2014-10-22 Thread Sean Owen
No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an RDD of RDDs. On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini
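A small sketch of the two shapes suggested here (illustrative data only):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("rdd-of-lists").setMaster("local[2]"))

// Option 1: one RDD whose records are small collections.
val rddOfLists: RDD[List[Int]] = sc.parallelize(Seq(List(1, 2, 3), List(4, 5)))
val sums: RDD[Int] = rddOfLists.map(_.sum)                 // per-list work runs on the executors
println(sums.collect().toList)

// Option 2: a driver-side collection of RDDs, processed one by one or unioned.
val listOfRdds: Seq[RDD[Int]] = Seq(sc.parallelize(1 to 3), sc.parallelize(4 to 5))
val totals: Seq[Double] = listOfRdds.map(_.sum())          // the driver loops; each sum is a job
println(totals)

val combined: RDD[Int] = sc.union(listOfRdds)              // or merge them into one RDD
println(combined.count())
```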

Re: Python vs Scala performance

2014-10-22 Thread Eustache DIEMERT
Wild guess maybe, but do you decode the json records in Python? It could be much slower as the default lib is quite slow. If so try ujson [1] - a C implementation that is at least an order of magnitude faster. HTH [1] https://pypi.python.org/pypi/ujson 2014-10-22 16:51 GMT+02:00 Marius

Re: Spark as key/value store?

2014-10-22 Thread Akshat Aranya
Spark, in general, is good for iterating through an entire dataset again and again. All operations are expressed in terms of iteration through all the records of at least one partition. You may want to look at IndexedRDD ( https://issues.apache.org/jira/browse/SPARK-2365) that aims to improve

Re: Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-22 Thread Holden Karau
Hi Michael Campbell, Are you deploying against yarn or standalone mode? In yarn, try setting the shell variable SPARK_EXECUTOR_MEMORY=2G; in standalone, try setting SPARK_WORKER_MEMORY=2G. Cheers, Holden :) On Thu, Oct 16, 2014 at 2:22 PM, Michael Campbell michael.campb...@gmail.com wrote:

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread Daniil Osipov
You can use --spark-version argument to spark-ec2 to specify a GIT hash corresponding to the version you want to checkout. If you made changes that are not in the master repository, you can use --spark-git-repo to specify the git repository to pull down spark from, which contains the specified

Re: saveasSequenceFile with codec and compression type

2014-10-22 Thread Holden Karau
Hi gpatcham, If you want to save as a sequence file with a custom compression type you can use saveAsHadoopFile along with setting the mapred.output.compression.type on the jobconf. If you want to keep using saveAsSequenceFile, and the syntax is much nicer, you could also set that property on
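A sketch of both options, assuming a Snappy codec is available, block compression is wanted, and the paths are placeholders:

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("seqfile-compression"))
val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2")))

// Option 1: keep the saveAsSequenceFile syntax and set the compression type on the
// Hadoop configuration that Spark uses for output.
sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")
pairs.saveAsSequenceFile("hdfs:///tmp/seq-block-snappy", Some(classOf[SnappyCodec]))

// Option 2: use saveAsHadoopFile directly with an explicit JobConf.
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.output.compression.type", "BLOCK")
pairs.map { case (k, v) => (new Text(k), new Text(v)) }
  .saveAsHadoopFile("hdfs:///tmp/seq-explicit", classOf[Text], classOf[Text],
    classOf[SequenceFileOutputFormat[Text, Text]], jobConf, Some(classOf[SnappyCodec]))
```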

Re: create a Row Matrix

2014-10-22 Thread Yana Kadiyska
This works for me import org.apache.spark.mllib.linalg._ import org.apache.spark.mllib.linalg.distributed.RowMatrix val v1=Vectors.dense(Array(1d,2d)) val v2=Vectors.dense(Array(3d,4d)) val rows=sc.parallelize(List(v1,v2)) val mat=new RowMatrix(rows) val svd:
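For reference, a hedged guess at how the truncated last line might be completed (computing a small SVD; k = 2 is just an example value):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val sc = new SparkContext(new SparkConf().setAppName("rowmatrix").setMaster("local[2]"))

val v1 = Vectors.dense(Array(1d, 2d))
val v2 = Vectors.dense(Array(3d, 4d))
val rows = sc.parallelize(List(v1, v2))
val mat = new RowMatrix(rows)

// Top-2 singular values/vectors; computeU = true also materializes U as a RowMatrix.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true)
println(svd.s)    // singular values
```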

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
In the master, you can easily profile your job and find the bottlenecks, see https://github.com/apache/spark/pull/2556 Could you try it and show the stats? Davies On Wed, Oct 22, 2014 at 7:51 AM, Marius Soutier mps@gmail.com wrote: It’s an AWS cluster that is rather small at the moment, 4

Re: spark sql query optimization , and decision tree building

2014-10-22 Thread sanath kumar
Thank you very much, two more small questions: 1) val output = sqlContext.sql("SELECT * From people") my output has 128 columns and a single row. How to find which column has the maximum value in a single row using Scala? 2) as each row has 128 columns, how to print each row into a text

Re: Spark Streaming Applications

2014-10-22 Thread Sameer Farooqui
Hi Saiph, Patrick McFadin and Helena Edelson from DataStax taught a tutorial at NYC Strata last week where they created a prototype Spark Streaming + Kafka application for time series data. You can see the code here: https://github.com/killrweather/killrweather On Tue, Oct 21, 2014 at 4:33 PM,

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote: Wild guess maybe, but do you decode the json records in Python ? it could be much slower as the default lib is quite slow. Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7 may be enough

Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Sadhan Sood
We want to run multiple instances of spark sql cli on our yarn cluster. Each instance of the cli is to be used by a different user. This looks non-optimal if each user brings up a different cli given how spark works on yarn by running executor processes (and hence consuming resources) on worker

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
Thanks Daniil! If I use --spark-git-repo, is there a way to specify the mvn command line parameters? Like the following: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests clean package -- View this

Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Hi Spark devs/users, One of the things we are investigating here at Netflix is whether Spark would suit us for our ETL needs, and one of our requirements is multi-tenancy. I did read the official doc http://spark.apache.org/docs/latest/job-scheduling.html and the book, but I'm still not clear on certain

Shuffle issues in the current master

2014-10-22 Thread DB Tsai
Hi all, With SPARK-3948, the Snappy PARSING_ERROR exception is gone, but I've hit another exception now. I've no clue what's going on; has anyone run into a similar issue? Thanks. This is the configuration I use. spark.rdd.compress true spark.shuffle.consolidateFiles true

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
Hi Ashwin, Let me try to answer to the best of my knowledge. On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar ashwinshanka...@gmail.com wrote: Here are my questions : 1. Sharing spark context : How exactly multiple users can share the cluster using same spark context ? That's not

streaming join sliding windows

2014-10-22 Thread Josh J
Hi, How can I join neighbor sliding windows in spark streaming? Thanks, Josh

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
It seems that this issue should be addressed by https://github.com/apache/spark/pull/2890 ? Am I right? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Oct 22, 2014 at 11:54 AM, DB

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
Or can it be solved by setting both of the following settings to true for now? spark.shuffle.spill.compress true spark.shuffle.compress true Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Setting only master heap

2014-10-22 Thread Keith Simmons
We've been getting some OOMs from the spark master since upgrading to Spark 1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the worker heap, which as far as I know is fine. Is there any setting which *only* increases the master heap size? Keith

Re: Shuffle issues in the current master

2014-10-22 Thread DB Tsai
PS, sorry for spamming the mailing list. Based on my knowledge, both spark.shuffle.spill.compress and spark.shuffle.compress default to true, so in theory, we should not run into this issue if we don't change any setting. Is there any other bug we might be running into? Thanks. Sincerely, DB Tsai

Re: Rdd of Rdds

2014-10-22 Thread Sonal Goyal
Another approach could be to create artificial keys for each RDD and convert to PairRDDs. So your first RDD becomes JavaPairRDD<Int, String> rdd1 with values 1,1; 1,2 and so on, and the second RDD, rdd2, becomes 2,a; 2,b; 2,c. You can union the two RDDs, groupByKey, countByKey etc and maybe achieve what
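The same idea in Scala, with illustrative values (whether a plain union/groupByKey fits depends on what is actually needed per inner RDD):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("keyed-union").setMaster("local[2]"))

// Tag each original RDD with an artificial key so they can live in one pair RDD.
val rdd1 = sc.parallelize(Seq("1", "2")).map(v => (1, v))
val rdd2 = sc.parallelize(Seq("a", "b", "c")).map(v => (2, v))

val combined = rdd1.union(rdd2)

combined.countByKey().foreach(println)    // e.g. (1,2), (2,3)
combined.groupByKey().collect().foreach { case (k, vs) => println(s"$k -> ${vs.toList}") }
```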

ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, I just tried sample PI calculation on Spark Cluster, after returning the Pi result, it shows ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(m37,35662) not found ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://m33:7077

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Thanks Marcelo, that was helpful ! I had some follow up questions : That's not something you might want to do usually. In general, a SparkContext maps to a user application My question was basically this. In this http://spark.apache.org/docs/latest/job-scheduling.html page in the official doc,

Does SQLSpark support Hive built in functions?

2014-10-22 Thread shahab
Hi, I just wonder if SparkSQL supports Hive built-in functions (e.g. from_unixtime) or any of the functions pointed out here : ( https://cwiki.apache.org/confluence/display/Hive/Tutorial) best, /Shahab

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Yeah we’re using Python 2.7.3. On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote: Wild guess maybe, but do you decode the json records in Python ? it could be much slower as the

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Can’t install that on our cluster, but I can try locally. Is there a pre-built binary available? On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote: In the master, you can easily profile you job, find the bottlenecks, see https://github.com/apache/spark/pull/2556 Could you try

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote: That's not something you might want to do usually. In general, a SparkContext maps to a user application My question was basically this. In this page in the official doc, under Scheduling within an application

Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote: No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an

Re: Does SQLSpark support Hive built in functions?

2014-10-22 Thread Michael Armbrust
Yes, when using a HiveContext. On Wed, Oct 22, 2014 at 2:20 PM, shahab shahab.mok...@gmail.com wrote: Hi, I just wonder if SparkSQL supports Hive built-in functions (e.g. from_unixtime) or any of the functions pointed out here : ( https://cwiki.apache.org/confluence/display/Hive/Tutorial)
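A minimal sketch, assuming Spark was built with Hive support; the table and column names are made up for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Event(ts: Long)

val sc = new SparkContext(new SparkConf().setAppName("hive-builtins"))
val hiveContext = new HiveContext(sc)
import hiveContext.createSchemaRDD

// Register a throwaway table, then call a Hive built-in through the HiveQL parser.
sc.parallelize(Seq(Event(1414000000L))).registerTempTable("events")
hiveContext.sql("SELECT from_unixtime(ts), upper('spark') FROM events")
  .collect().foreach(println)
```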

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
Sorry, there is not; you can try cloning it from GitHub and building it from scratch, see [1]. [1] https://github.com/apache/spark Davies On Wed, Oct 22, 2014 at 2:31 PM, Marius Soutier mps@gmail.com wrote: Can’t install that on our cluster, but I can try locally. Is there a pre-built binary

Re: Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Michael Armbrust
The JDBC server is what you are looking for: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server On Wed, Oct 22, 2014 at 11:10 AM, Sadhan Sood sadhan.s...@gmail.com wrote: We want to run multiple instances of spark sql cli on our yarn cluster. Each

Re: Setting only master heap

2014-10-22 Thread Sameer Farooqui
Hi Keith, Would be helpful if you could post the error message. Are you running Spark in Standalone mode or with YARN? In general, the Spark Master is only used for scheduling and it should be fine with the default setting of 512 MB RAM. Is it actually the Spark Driver's memory that you

Re: SchemaRDD Convert

2014-10-22 Thread Yin Huai
The implicit conversion function mentioned by Hao is createSchemaRDD in SQLContext/HiveContext. You can import it by doing val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Or new org.apache.spark.sql.hive.HiveContext(sc) for HiveContext import sqlContext.createSchemaRDD On Wed, Oct
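Tying it back to the original question (a case class with primitives and a Map), a sketch assuming the Map's key/value types are themselves supported, e.g. String to Int:

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class T(id: Int, name: String, props: Map[String, Int])

val sc = new SparkContext(new SparkConf().setAppName("schema-rdd-convert").setMaster("local[2]"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD     // the implicit RDD[case class] -> SchemaRDD conversion

val rdd = sc.parallelize(Seq(T(1, "a", Map("x" -> 1)), T(2, "b", Map("y" -> 2))))

val schemaRDD: org.apache.spark.sql.SchemaRDD = rdd   // implicit conversion happens here
schemaRDD.registerTempTable("t")
sqlContext.sql("SELECT name FROM t WHERE id = 1").collect().foreach(println)
```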

Solving linear equations

2014-10-22 Thread Martin Enzinger
Hi, I'm wondering how to use MLlib for solving equation systems following this pattern: 2*x1 + x2 + 3*x3 + ... + xn = 0 x1 + 0*x2 + 3*x3 + ... + xn = 0 ... 0*x1 + x2 + 0*x3 + ... + xn = 0 I definitely still have some reading to do to really understand the direct solving

Re: Solving linear equations

2014-10-22 Thread Debasish Das
Hi Martin, This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;] and x is what you want to find. B is 0 in this case. For MLlib, this is normally the label: basically create a LabeledPoint where the label is always 0. Use MLlib's linear regression and solve the following
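A sketch of that encoding on a toy three-variable system (label is always 0.0, intercept left at its default of disabled; for a homogeneous system the all-zero vector is a trivial solution, so this mainly illustrates the LabeledPoint setup rather than a practical solver):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val sc = new SparkContext(new SparkConf().setAppName("linear-system").setMaster("local[2]"))

// Each equation a1*x1 + a2*x2 + a3*x3 = 0 becomes a LabeledPoint with label 0.0.
val equations = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 0.0))))

val model = LinearRegressionWithSGD.train(equations, 200)
println(model.weights)   // approximate least-squares solution vector
```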

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-22 Thread freedafeng
I modified the pom files in my private repo to use those parameters as default to solve the problem. But after the deployment, I found the installed version is not the customized version, but an official one. Could anyone please give a hint on how spark-ec2 works with Spark from private repos? --

Re: Shuffle issues in the current master

2014-10-22 Thread Aaron Davidson
You may be running into this issue: https://issues.apache.org/jira/browse/SPARK-4019 You could check by having 2000 or fewer reduce partitions. On Wed, Oct 22, 2014 at 1:48 PM, DB Tsai dbt...@dbtsai.com wrote: PS, sorry for spamming the mailing list. Based my knowledge, both
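A hedged way to run that check, using a synthetic pair RDD in place of the real one:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("partition-check").setMaster("local[4]"))

// Synthetic stand-in for whatever pair RDD feeds the shuffle in the real job.
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 5000, 1))

// Cap the reduce side at 2000 partitions and see whether the failure disappears.
val counts = pairs.reduceByKey(_ + _, 2000)
println(counts.count())
```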

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Andy Davidson
On a related note, how are you submitting your job? I have a simple streaming proof of concept and noticed that everything runs on my master. I wonder if I do not have enough load for spark to push tasks to the slaves. Thanks Andy From: Daniel Mahler dmah...@gmail.com Date: Monday, October

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, I have managed to resolve it; it was caused by a wrong setting. Please ignore this. Regards Arthur On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up

Spark: Order by Failed, java.lang.NullPointerException

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, I got java.lang.NullPointerException. Please help! sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 10").collect().foreach(println); 2014-10-23 08:20:12,024 INFO

version mismatch issue with spark breeze vector

2014-10-22 Thread Yang
I'm trying to submit a simple test code through spark-submit. The first portion of the code works fine, but some calls to the breeze vector library fail: 14/10/22 17:36:02 INFO CacheManager: Partition rdd_1_0 not found, computing it 14/10/22 17:36:02 ERROR Executor: Exception in task 0.0 in stage 0.0

Re: version mismatch issue with spark breeze vector

2014-10-22 Thread Holden Karau
Hi Yang, It looks like your build file is a different version than the version of Spark you are running against. I'd try building against the same version of spark as you are running your application against (1.1.0). Also what is your assembly/shading configuration for your build? Cheers,
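For example, with sbt the dependency block might look roughly like this, pinning the same version the cluster runs and marking Spark as provided so the assembly doesn't drag in a second copy (assembly/shading plugin settings omitted):

```scala
// build.sbt (sketch)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided"  // pulls in the matching breeze
)
```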

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Aaron Davidson
Another wild guess, if your data is stored in S3, you might be running into an issue where the default jets3t properties limits the number of parallel S3 connections to 4. Consider increasing the max-thread-counts from here: http://www.jets3t.org/toolkit/configuration.html. On Tue, Oct 21, 2014

SparkSQL , best way to divide data into partitions?

2014-10-22 Thread raymond
Hi, I have a json file that can be loaded by sqlContext.jsonFile into a table, but this table is not partitioned. I then wish to transform this table into a partitioned table, say on the field “date” etc. What will be the best approach to do this? It seems in Hive this is usually

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, Please find the attached file.

hive timestamp column always returns null

2014-10-22 Thread tridib
Hello Experts, I created a table using the spark-sql CLI. No Hive is installed. I am using spark 1.1.0. create table date_test(my_date timestamp) row format delimited fields terminated by ' ' lines terminated by '\n' LOCATION '/user/hive/date_test'; The data file has the following data: 2014-12-11

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi, May I know where to configure Spark to load libhadoop.so? Regards Arthur On 23 Oct, 2014, at 11:31 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, Please find the attached file. lsof.rtf my spark-default.xml # Default system properties included when running

Workaround for SPARK-1931 not compiling

2014-10-22 Thread Arpit Kumar
Hi all, I am new to spark/graphx and am trying to use partitioning strategies in graphx on spark 1.0.0 The workaround I saw on the main page seems not to compile. The code I added was def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = { val

Re: Spark as key/value store?

2014-10-22 Thread Hajime Takase
Thanks! On Thu, Oct 23, 2014 at 10:56 AM, Akshat Aranya aara...@gmail.com wrote: Yes, that is a downside of Spark's design in general. The only way to share data across consumers of data is by having a separate entity that owns the Spark context. That's the idea behind Ooyala's job server.

RE: Spark Hive Snappy Error

2014-10-22 Thread Shao, Saisai
It seems you just added the snappy library into your classpath: export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar But Spark itself depends on snappy-0.2.jar. Is there any possibility that this problem is caused by a different version of snappy? Thanks Jerry From:

scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Ryan Williams
I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full gist of an innocuous test command (mvn test

About Memory usage in the Spark UI

2014-10-22 Thread Haopu Wang
Hi, please take a look at the attached screen-shot. I wonder what the Memory Used column means. I give 2GB memory to the driver process and 12GB memory to the executor process. Thank you!

how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-22 Thread Yang
during tests, I often modify my code a little bit and want to see the result. but spark-submit requires the full fat-jar, which takes quite a lot of time to build. I just need to run in --master local mode. is there a way to run it without rebuilding the fat jar? thanks Yang

Re: spark 1.1.0/yarn hang

2014-10-22 Thread Tian Zhang
We have narrowed this hanging issue down to the calliope package that we used to create RDD from reading cassandra table. The calliope native RDD interface seems hanging and I have decided to switch to the calliope cql3 RDD interface. -- View this message in context:

Re: how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-22 Thread Mohit Jaggi
I think you can give a list of jars - not just one - to spark-submit, so build only the one whose source code has changed. On Wed, Oct 22, 2014 at 10:29 PM, Yang tedd...@gmail.com wrote: during tests, I often modify my code a little bit and want to see the result. but spark-submit