Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread Akhil Das
Before doing saveAsParquetFile, you can call repartition and provide a decent number, which will determine the total number of output files generated. Thanks Best Regards On Mon, Nov 3, 2014 at 1:12 PM, ag007 agre...@mac.com wrote: Hi there, I have a pySpark job that is simply taking a
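
For reference, a minimal sketch of this suggestion (hypothetical names; Spark 1.1-era SQL API, where applySchema produces a SchemaRDD):

    // repartition the row RDD before applying the schema, so the Parquet
    // output is written as ~8 part-files instead of many tiny ones
    val schemaRdd = sqlContext.applySchema(rows.repartition(8), schema)
    schemaRdd.saveAsParquetFile("hdfs:///data/out.parquet")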

To find distances to reachable source vertices using GraphX

2014-11-03 Thread Madabhattula Rajesh Kumar
Hi All, I'm trying to understand the example program linked below. When I run it, I'm getting java.lang.NullPointerException at the highlighted line. https://gist.github.com/ankurdave/4a17596669b36be06100 val updatedDists =

Re: SQL COUNT DISTINCT

2014-11-03 Thread Bojan Kostic
Hi Michael, Thanks for the response. I tested with the query you sent me, and it runs much faster. Old query stats by phases: 3.2 min, 17 s. Your query stats by phases: 0.3 s, 16 s, 20 s. But will this improvement also apply when you want to count distinct on 2 or more fields: SELECT COUNT(f1),

GraphX : Vertices details in Triangles

2014-11-03 Thread Madabhattula Rajesh Kumar
Hi All, I'm new to GraphX and am working through the Triangle Count use cases. I'm able to get the number of triangles in a graph, but I'm not able to collect the vertex details of each triangle. For example: I'm playing with one of the GraphX graph examples. Vertices and edges: val vertexArray = Array( (1L,

Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread ag007
Thanks Akhil, Am I right in saying that the repartition will spread the data randomly, so I lose chronological order? I really just want the csv -> parquet conversion in the same order it came in. If I set repartition to 1, will this not be random? cheers, Ag -- View this message in context:

Re: about aggregateByKey and standard deviation

2014-11-03 Thread Kamal Banga
I don't think it can be done directly with .aggregateByKey(), because we will need the count of keys (for the average). Maybe we can use .countByKey(), which returns a map, and .foldByKey(0)(_+_) (or aggregateByKey()), which gives the sum of values per key. I'm not sure myself how to proceed. Regards On Fri, Oct 31,
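
For what it's worth, aggregateByKey can carry the count inside its accumulator, so a single pass does work. A minimal sketch, assuming an RDD[(String, Double)] named data (population standard deviation):

    // per-key (count, sum, sum of squares)
    val stats = data.aggregateByKey((0L, 0.0, 0.0))(
      (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
      (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
    val meanAndStdDev = stats.mapValues { case (n, sum, sumSq) =>
      val mean = sum / n
      (mean, math.sqrt(sumSq / n - mean * mean))
    }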

How does the number of partitions affect performance?

2014-11-03 Thread shahab
Hi, I just wonder how the number of partitions affects performance in Spark! Is it just the parallelism (more partitions, more parallel sub-tasks) that improves performance, or are there other considerations? In my case, I run a couple of map/reduce jobs on the same dataset two times with two

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Bharath Ravi Kumar
The result was no different with saveAsHadoopFile. In both cases, I can see that I've misinterpreted the API docs. I'll explore the APIs a bit further for ways to save the iterable as chunks rather than one large text/binary. It might also help to clarify this aspect in the API docs. For those

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Bharath Ravi Kumar
I also realized from your description of saveAsText that the API is indeed behaving as expected, i.e. it is appropriate (though not optimal) for the API to construct a single string out of the value. If the value turns out to be large, the user of the API needs to reconsider the implementation

Re: OOM with groupBy + saveAsTextFile

2014-11-03 Thread Sean Owen
Yes, that's the same thing really. You're still writing a huge value as part of one single (key,value) record. The value exists in memory in order to be written to storage. Although there aren't hard limits, in general, keys and values aren't intended to be huge, like, hundreds of megabytes. You

Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I did some simple experiments with Impala and Spark, and Impala came out ahead. But it's also less flexible: it couldn't handle irregular schemas, didn't support JSON, and so on. On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote: I agree. My personal experience with Spark

Hive Context and Mapr

2014-11-03 Thread Addanki, Santosh Kumar
Hi, We are currently using the MapR distribution. To read files from the file system we specify: test = sc.textFile("mapr/mycluster/user/mapr/test.csv") This works fine from the SparkContext. But ... currently we are trying to create a table in Hive using the HiveContext from Spark.

Schema RDD and saveAsTable in hive

2014-11-03 Thread Addanki, Santosh Kumar
Hi, I have a SchemaRDD created like below: schemaTransactions = sqlContext.applySchema(transactions, schema) When I try to save the SchemaRDD as a table using schemaTransactions.saveAsTable("transactions"), I get the error below: Py4JJavaError: An error occurred while calling

Re: How does the number of partitions affect performance?

2014-11-03 Thread Sean Owen
Yes partitions matter. Usually you can use the default, which will make a partition per input split, and that's usually good, to let one task process one block of data, which will all be on one machine. Reasons I could imagine why 9 partitions is faster than 7: Probably: Your cluster can execute

Re: How does the number of partitions affect performance?

2014-11-03 Thread shahab
Thanks Sean for the very useful comments. I now better understand what could be the reasons my evaluations are messed up. best, /Shahab On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen so...@cloudera.com wrote: Yes partitions matter. Usually you can use the default, which will make a partition per

unsubscribe

2014-11-03 Thread Karthikeyan Arcot Kuppusamy
hi - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: unsubscribe

2014-11-03 Thread Akhil Das
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org Thanks Best Regards On Mon, Nov 3, 2014 at 5:53 PM, Karthikeyan Arcot Kuppusamy karthikeyan...@zanec.com wrote: hi - To unsubscribe, e-mail:

Bug in DISK related Storage level?

2014-11-03 Thread James
Hello, I am trying to load a very large graph to run a GraphX algorithm, and the graph does not fit in memory. I found that if I use the DISK_ONLY or MEMORY_AND_DISK_SER storage level, the program hits OOM, but if I use MEMORY_ONLY_SER, it does not. Thus I want to know what kind of

Dynamically switching Nr of allocated core

2014-11-03 Thread RodrigoB
Hi all, I can't seem to find a clear answer in the documentation. Does the standalone cluster support dynamic assignment of the nr of allocated cores to an application once another app stops? I'm aware that we can have core sharing if we use Mesos between active applications depending on the nr of

Re: Spark cluster stability

2014-11-03 Thread jatinpreet
Great! Thanks for the information. I will try it out. - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-cluster-stability-tp17929p17956.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark Kafka Performance

2014-11-03 Thread Eduardo Costa Alfaia
Hi Guys, Could anyone explain to me how Kafka works with Spark? I am using JavaKafkaWordCount.java as a test, and the command line is: ./run-example org.apache.spark.streaming.examples.JavaKafkaWordCount spark://192.168.0.13:7077 computer49:2181 test-consumer-group unibs.it 3 and like a

Fwd: GraphX : Vertices details in Triangles

2014-11-03 Thread Madabhattula Rajesh Kumar
Hi All, I'm new to GraphX and am working through the Triangle Count use cases. I'm able to get the number of triangles in a graph, but I'm not able to collect the vertex details of each triangle. For example: I'm playing with one of the GraphX graph examples. Vertices and edges: val vertexArray = Array( (1L,

Spark job resource allocation best practices

2014-11-03 Thread Romi Kuntsman
Hello, I have a Spark 1.1.0 standalone cluster, with several nodes, and several jobs (applications) being scheduled at the same time. By default, each Spark job takes up all available CPUs. This way, when more than one job is scheduled, all but the first are stuck in WAITING. On the other hand,

Re: Spark job resource allocation best practices

2014-11-03 Thread Akhil Das
Have a look at scheduling pools: https://spark.apache.org/docs/latest/job-scheduling.html. If you want more sophisticated resource allocation, then you are better off using a cluster manager like Mesos or YARN. Thanks Best Regards On Mon, Nov 3, 2014 at 9:10 PM, Romi Kuntsman r...@totango.com
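
For context, the static-partitioning knob those docs describe for standalone mode is spark.cores.max, which stops one application from grabbing every CPU. A sketch (the value 4 is illustrative):

    # spark-defaults.conf: cap each application so others can run concurrently
    spark.cores.max 4

    // or per application (Scala):
    import org.apache.spark.SparkConf
    val conf = new SparkConf().set("spark.cores.max", "4")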

Re: Spark job resource allocation best practices

2014-11-03 Thread Romi Kuntsman
So, as said there, static partitioning is used in Spark's standalone and YARN modes, as well as the coarse-grained Mesos mode. That leaves us only with Mesos, where there is dynamic sharing of CPU cores. It says when the application is not running tasks on a machine, other applications may run

Cincinnati, OH Meetup for Apache Spark

2014-11-03 Thread Darin McBeath
Let me know if you are interested in participating in a meetup in Cincinnati, OH to discuss Apache Spark. We currently have 4-5 different companies expressing interest but would like a few more. Darin.

random shuffle streaming RDDs?

2014-11-03 Thread Josh J
Hi, Is there a nice or optimal method to randomly shuffle spark streaming RDDs? Thanks, Josh

Re: Dynamically switching Nr of allocated core

2014-11-03 Thread Romi Kuntsman
I didn't notice your message and asked the same question, in the thread titled Spark job resource allocation best practices. Adding a specific case to your example: 1 - There are 12 cores available in the cluster 2 - I start app B with all cores - gets 12 3 - I start app A - it needs

Key-Value decomposition

2014-11-03 Thread david
Hi, I'm a newbie in Spark and face the following use case: val data = Array(("A", "1;2;3")) val rdd = sc.parallelize(data) // Something here to produce an RDD of (key, value) pairs: // ("A", 1), ("A", 2), ("A", 3) Does anybody know how to do this? Thanks -- View this message in

Re: Key-Value decomposition

2014-11-03 Thread Ganelin, Ilya
Very straightforward: you want to use cartesian. If you have two RDDs - RDD_1("A") and RDD_2(1, 2, 3) - then RDD_1.cartesian(RDD_2) will generate the cross product between the two RDDs, and you will have RDD_3(("A", 1), ("A", 2), ("A", 3)) On 11/3/14, 11:38 AM, david david...@free.fr wrote: Hi, I'm a
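
A minimal sketch of cartesian on RDDs like those (illustrative values):

    val rdd1 = sc.parallelize(Seq("A"))
    val rdd2 = sc.parallelize(Seq(1, 2, 3))
    // cross product of the two RDDs: Array(("A",1), ("A",2), ("A",3))
    rdd1.cartesian(rdd2).collect()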

Re: random shuffle streaming RDDs?

2014-11-03 Thread Sean Owen
I think the answer will be the same in streaming as in the core. You want a random permutation of an RDD? In general RDDs don't have an ordering at all -- except when you sort, for example -- so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Do

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
When I'm outputting the RDDs to an external source, I would like the RDDs to be output in a random shuffle so that even the order is random. So far what I understand is that RDDs do have a type of order, in that the order for Spark Streaming RDDs would be the order in which Spark Streaming

Accumulables in transformation operations

2014-11-03 Thread Jorge Lopez-Malla
Hi, I'm reading the O'Reilly book Learning Spark and I have a doubt: does the accumulator fault-tolerance caveat still apply only to transformation operations (vs. actions)? Is this behaviour also expected if we use accumulables? Thanks in advance. Jorge López-Malla Matute Big Data Developer Avenida de Europa, 26.

Re: random shuffle streaming RDDs?

2014-11-03 Thread Jay Vyas
A use case would be helpful. Batches of RDDs from streams are going to have a temporal ordering in terms of when they are processed in a typical application, but maybe you could shuffle the way batch iterations work. On Nov 3, 2014, at 11:59 AM, Josh J joshjd...@gmail.com wrote: When

Re: akka connection refused bug, fix?

2014-11-03 Thread freedafeng
Does anyone have experience or advice on fixing this problem? Highly appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/akka-connection-refused-bug-fix-tp17764p17972.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread Davies Liu
Before saveAsParquetFile(), you can call coalesce(N); then you will have N files, and it will keep the order as before (repartition() will not). On Mon, Nov 3, 2014 at 1:16 AM, ag007 agre...@mac.com wrote: Thanks Akhil, Am I right in saying that the repartition will spread the data randomly so I
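
A minimal sketch of the coalesce approach (Spark 1.1-era API, assuming a SchemaRDD named schemaRdd):

    // coalesce merges partitions without a shuffle, so row order is preserved
    schemaRdd.coalesce(1).saveAsParquetFile("hdfs:///data/out.parquet")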

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-03 Thread Davies Liu
On Sun, Nov 2, 2014 at 1:35 AM, jan.zi...@centrum.cz wrote: Hi, I am using Spark on Yarn, particularly Spark in Python. I am trying to run: myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json") How many files do you have, and what is the average size of each file? myrdd.getNumPartitions()

Re: random shuffle streaming RDDs?

2014-11-03 Thread Sean Owen
If you iterated over an RDD's partitions, I'm not sure that in practice you would find the order matches the order they were received. The receiver is replicating data to another node or node as it goes and I don't know much is guaranteed about that. If you want to permute an RDD, how about a

NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile

2014-11-03 Thread Terry Siu
I just built the 1.2 snapshot current as of commit 76386e1a23c using: $ ./make-distribution.sh --tgz --name my-spark --skip-java-test -DskipTests -Phadoop-2.4 -Phive -Phive-0.13.1 -Pyarn I drop in my Hive configuration files into the conf directory, launch spark-shell, and then create my

MLlib - Naive Bayes Java example bug

2014-11-03 Thread Dariusz Kobylarz
Hi, I noticed a bug in the sample Java code on the MLlib - Naive Bayes docs page: http://spark.apache.org/docs/1.1.0/mllib-naive-bayes.html In the filter: double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { @Override public Boolean

ParquetFilters and StringType support for GT, GTE, LT, LTE

2014-11-03 Thread Terry Siu
Is there any reason why StringType is not a supported type for the GT, GTE, LT, LTE operations? I was previously able to have a predicate where my column type was a string and execute a filter with one of the above operators in SparkSQL without any problems. However, I synced up to the latest code this

Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile

2014-11-03 Thread Kousuke Saruta
Hi Terry I think the issue you mentioned will be resolved by following PR. https://github.com/apache/spark/pull/3072 - Kousuke (2014/11/03 10:42), Terry Siu wrote: I just built the 1.2 snapshot current as of commit 76386e1a23c using: $ ./make-distribution.sh —tgz —name my-spark

Re: MLlib - Naive Bayes Java example bug

2014-11-03 Thread Sean Owen
Yes, good catch. I also think the 1.0 * is suboptimal as a cast to double. I searched for similar issues and didn't see any. Open a PR -- I'm not even sure this is enough to warrant a JIRA, but feel free to open one as well. On Mon, Nov 3, 2014 at 6:46 PM, Dariusz Kobylarz darek.kobyl...@gmail.com wrote:

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-03 Thread jan.zikes
I have 3 datasets; in all of them the average file size is 10-12 KB. I am able to run my code on the dataset with 70K files, but I am not able to run it on the datasets with 1.1M and 3.8M files. On Sun, Nov 2, 2014 at 1:35 AM,

Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile

2014-11-03 Thread Terry Siu
Thanks, Kousuke. I'll wait till this pull request makes it into the master branch. -Terry From: Kousuke Saruta saru...@oss.nttdata.co.jp Date: Monday, November 3, 2014 at 11:11 AM To: Terry Siu terry@smartfocus.com,

Re: Cannot instantiate hive context

2014-11-03 Thread Pala M Muthaia
Thanks Akhil. I realized that earlier, and I thought mvn -Phive should have captured and included all these dependencies. In any case, I proceeded with that, included other such dependencies that were missing, and finally hit the Guava version mismatch issue. (Spark with Guava 14 vs Hadoop/Hive

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.) This sounds promising. Where can I read more about the space (memory and network overhead) and time complexity of sortBy? On
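
For reference, a minimal sketch of that sortBy-with-salt idea (hypothetical RDD named rdd; a fresh seed each run yields a different permutation):

    import scala.util.Random
    val seed = Random.nextLong()
    // sort by a salted hash of each element to get a pseudo-random order
    val shuffled = rdd.sortBy(x => (x, seed).hashCode)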

Model characterization

2014-11-03 Thread Sameer Tilak
Hi All, I have been using the LinearRegression model of MLlib and am very pleased with its scalability and robustness. Right now, we are just calculating the MSE of our model. We would like to characterize the performance of our model, and I was wondering about adding support for computing things such as confidence

Any Replicated RDD in Spark?

2014-11-03 Thread Shuai Zheng
Hi All, I have spent the last two years on Hadoop but am new to Spark. I am planning to move one of my existing systems to Spark to get some enhanced features. My question is: if I try to do a map-side join (something similar to the Replicated keyword in Pig), how can I do it? Is there any way to declare a

Re: CANNOT FIND ADDRESS

2014-11-03 Thread akhandeshi
no luck :(! Still observing the same behavior! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637p17988.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Memory limitation on EMR Node?

2014-11-03 Thread Shuai Zheng
Hi, I am planning to run Spark on EMR, and my application might take a lot of memory. On EMR, I know there is a hard limit of 16 GB physical memory on an individual mapper/reducer (otherwise I will get an exception; this is confirmed by the AWS EMR team, at least it is the spec at this moment).

with Spark Streaming spark-submit, don't see output after ssc.start()

2014-11-03 Thread spr
I have a Spark Streaming program that works fine if I execute it via sbt runMain com.cray.examples.spark.streaming.cyber.StatefulDhcpServerHisto -f /Users/spr/Documents/.../tmp/ -t 10 but if I start it via $S/bin/spark-submit --master local[12] --class StatefulNewDhcpServers

Re: Any Replicated RDD in Spark?

2014-11-03 Thread Matei Zaharia
You need to use broadcast followed by flatMap or mapPartitions to do map-side joins (in your map function, you can look at the hash table you broadcast and see what records match it). Spark SQL also does it by default for tables smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
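
A minimal sketch of the broadcast map-side join described here (hypothetical pair RDDs; the small side must fit in driver and executor memory):

    // collect the small table to the driver and broadcast it to all executors
    val lookup = sc.broadcast(smallRdd.collect().toMap)
    // inner-join semantics: keep only records whose key is in the broadcast map
    val joined = bigRdd.flatMap { case (k, v) =>
      lookup.value.get(k).map(w => (k, (v, w)))
    }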

Re: --executor-cores cannot change vcores in yarn?

2014-11-03 Thread Gen
Hi, Well, I didn't find the original documentation, but according to http://qnalist.com/questions/2791828/about-the-cpu-cores-and-cpu-usage , vcores are not physical CPU cores but virtual cores. And I used the top command

Re: ParquetFilters and StringType support for GT, GTE, LT, LTE

2014-11-03 Thread Michael Armbrust
That sounds like a regression. Could you open a JIRA with steps to reproduce (https://issues.apache.org/jira/browse/SPARK)? We'll want to fix this before the 1.2 release. On Mon, Nov 3, 2014 at 11:04 AM, Terry Siu terry@smartfocus.com wrote: Is there any reason why StringType is not a

Re: SQL COUNT DISTINCT

2014-11-03 Thread Michael Armbrust
On Mon, Nov 3, 2014 at 12:45 AM, Bojan Kostic blood9ra...@gmail.com wrote: But will this improvement also affect when you want to count distinct on 2 or more fields: SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile Unfortunately I think this

OOM - Requested array size exceeds VM limit

2014-11-03 Thread akhandeshi
I am running local (client). My VM has 16 CPUs/108 GB RAM. My configuration is as follows: spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+UseCompressedOops -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+DisableExplicitGC -XX:MaxPermSize=1024m spark.daemon.memory=20g

Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile

2014-11-03 Thread Michael Armbrust
It is merged! On Mon, Nov 3, 2014 at 12:06 PM, Terry Siu terry@smartfocus.com wrote: Thanks, Kousuke. I’ll wait till this pull request makes it into the master branch. -Terry From: Kousuke Saruta saru...@oss.nttdata.co.jp Date: Monday, November 3, 2014 at 11:11 AM To: Terry Siu

Spark streaming job failed due to java.util.concurrent.TimeoutException

2014-11-03 Thread Bill Jay
Hi all, I have a Spark Streaming job that consumes data from Kafka and performs some simple operations on the data. This job runs in an EMR cluster with 10 nodes. The batch size I use is 1 minute, and it takes around 10 seconds to generate the results that are inserted into a MySQL database.

is spark a good fit for sequential machine learning algorithms?

2014-11-03 Thread ll
I'm struggling with implementing a few algorithms with Spark; hope to get help from the community. Most machine learning algorithms today are sequential, while Spark is all about parallelism. It seems to me that using Spark doesn't actually help much, because in most cases you can't

Re: ParquetFilters and StringType support for GT, GTE, LT, LTE

2014-11-03 Thread Terry Siu
Done. https://issues.apache.org/jira/browse/SPARK-4213 Thanks, -Terry From: Michael Armbrust mich...@databricks.com Date: Monday, November 3, 2014 at 1:37 PM To: Terry Siu terry@smartfocus.com Cc:

Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread ag007
David, that's exactly what I was after :) Awesome, thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-are-only-6-20MB-in-size-tp17935p18002.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Deleting temp dir Exception

2014-11-03 Thread Josh
Hi, I've written a short scala app to perform word counts on a text file and am getting the following exception as the program completes (after it prints out all of the word counts). Exception in thread delete Spark temp dir

Re: To find distances to reachable source vertices using GraphX

2014-11-03 Thread Ankur Dave
The NullPointerException seems to be because edge.dstAttr is null, which might be due to SPARK-3936 https://issues.apache.org/jira/browse/SPARK-3936. Until that's fixed, I edited the Gist with a workaround. Does that fix the problem? Ankur http://www.ankurdave.com/ On Mon, Nov 3, 2014 at 12:23

Snappy and spark 1.1

2014-11-03 Thread Aravind Srinivasan
Team, We are running a build of Spark 1.1.1 for Hadoop 2.2. We can't get the code to read LZO or Snappy files in YARN; it fails to find the native libs. I have tried many different ways of defining the lib path - LD_LIBRARY_PATH, --driver-class-path, spark.executor.extraLibraryPath in

Spark Streaming - Most popular Twitter Hashtags

2014-11-03 Thread Harold Nguyen
Hi all, I was just reading this nice documentation here: http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html And got to the end of it, which says: Note that there are more efficient ways to get the top 10 hashtags. For example, instead of sorting the entire of
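
For what it's worth, one such more efficient approach is a bounded top-k per batch rather than a full sort. A sketch, assuming a DStream[(String, Long)] named hashtagCounts:

    hashtagCounts.foreachRDD { rdd =>
      // top(10) keeps a bounded heap per partition instead of sorting everything
      val top10 = rdd.top(10)(Ordering.by[(String, Long), Long](_._2))
      println(top10.mkString("\n"))
    }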

IllegalStateException: unread block data

2014-11-03 Thread freedafeng
Hello there, Just set up an EC2 cluster with no HDFS, Hadoop, or HBase whatsoever. Just installed Spark to read/process data from an HBase in a different cluster. Spark was built against the HBase/Hadoop versions in the remote (EC2) HBase cluster, which are 0.98.1 and 2.3.0 respectively. But I

Re: hadoop_conf_dir when running spark on yarn

2014-11-03 Thread Tobias Pfeiffer
Hi, On Mon, Nov 3, 2014 at 1:29 PM, Amey Chaugule ambr...@gmail.com wrote: I thought that only applied when you're trying to run a job using spark-submit or in the shell... And how are you starting your Yarn job, if not via spark-submit? Tobias

Re: different behaviour of the same code

2014-11-03 Thread Tobias Pfeiffer
Hi, On Fri, Oct 31, 2014 at 4:31 PM, lieyan lie...@yahoo.com wrote: The code are here: LogReg.scala http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/LogReg.scala Then I click the Run button of the IDEA, and I get the following error message errlog.txt

cannot import name accumulators in python 2.7

2014-11-03 Thread felixgao
I am running Python 2.7.3 and IPython Notebook 2.1.0. I installed Spark in my home directory. '/home/felix/spark-1.1.0/python/lib/py4j-0.8.1-src.zip', '/home/felix/spark-1.1.0/python', '', '/opt/bluekai/python/src/bk', '/usr/local/lib/python2.7/dist-packages/setuptools-6.1-py2.7.egg',

Re: with SparkStreeaming spark-submit, don't see output after ssc.start()

2014-11-03 Thread Steve Reinhardt
From: Tobias Pfeiffer t...@preferred.jp Am I right that you are actually executing two different classes here? Yes, I realized after I posted that I was calling 2 different classes, though they are in the same JAR. I went back and tried it again with the same class

Re: Spark SQL takes unexpected time

2014-11-03 Thread Shailesh Birari
Yes, I am using Spark 1.1.0 and have used rdd.registerTempTable(). I tried adding sqlContext.cacheTable(), but it took 59 seconds (more than earlier). I also tried changing the schema to use the Long data type in some fields, but it seems the conversion takes more time. Is there any way to specify an index?

How to make sure a ClassPath is always shipped to workers?

2014-11-03 Thread Peng Cheng
I have a Spark application that deserializes a 'Seq[Page]' object, saves it to HDFS/S3, and reads it back on another worker to be used elsewhere. The serialization and deserialization use the same serializer as Spark itself (read from SparkEnv.get.serializer.newInstance()). However I sporadically get this

Re: How to make sure a ClassPath is always shipped to workers?

2014-11-03 Thread Peng Cheng
Sorry, it's a timeout duplicate; please remove it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-a-ClassPath-is-always-shipped-to-workers-tp18018p18020.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: different behaviour of the same code

2014-11-03 Thread lieyan
You are right. You pointed out the very cause of my problem. Thanks. I have to specify the path to my jar file. The solution can be found in an earlier post. http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-simple-Spark-job-on-cluster-td932.html -- View

avro + parquet + vectorstring + NullPointerException while reading

2014-11-03 Thread Michael Albert
Greetings! I'm trying to use Avro and Parquet with the following schema: { "name": "TestStruct", "namespace": "bughunt", "type": "record", "fields": [ { "name": "string_array", "type": { "type": "array", "items": "string" } } ] } The writing

Cleaning/transforming json befor converting to SchemaRDD

2014-11-03 Thread Daniel Mahler
I am trying to convert terabytes of JSON log files into Parquet files, but I need to clean the data a little first. I end up doing the following: txt = sc.textFile(inpath).coalesce(800) val json = (for { line <- txt JObject(child) = parse(line) child2 = (for {

Executor Log Rotation Is Not Working?

2014-11-03 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.1, and I have the following logs that keep growing: /opt/spark-1.1.0-bin-cdh4/work/app-20141029175309-0005/2/stderr I think it is the executor log, so I set up the following options in spark-defaults.conf: spark.executor.logs.rolling.strategy time
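
For reference, the related executor-log rolling options in Spark 1.1 look roughly like this in spark-defaults.conf (values illustrative):

    spark.executor.logs.rolling.strategy          time
    spark.executor.logs.rolling.time.interval     daily
    spark.executor.logs.rolling.maxRetainedFiles  3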

Re: is spark a good fit for sequential machine learning algorithms?

2014-11-03 Thread Xiangrui Meng
Many ML algorithms are sequential because they were not designed to be parallel. However, ML is not driven by algorithms in practice, but by data and applications. As datasets get bigger and bigger, some algorithms get revised to work in parallel, like SGD and matrix factorization. MLlib tries

Re: To find distances to reachable source vertices using GraphX

2014-11-03 Thread Madabhattula Rajesh Kumar
Thank you Ankur for your help and support!!! On Tue, Nov 4, 2014 at 5:24 AM, Ankur Dave ankurd...@gmail.com wrote: The NullPointerException seems to be because edge.dstAttr is null, which might be due to SPARK-3936 https://issues.apache.org/jira/browse/SPARK-3936. Until that's fixed, I

netty on classpath when using spark-submit

2014-11-03 Thread Tobias Pfeiffer
Hi, I tried hard to get a version of netty into my jar file created with sbt assembly that works with all my libraries. Now I managed that and was really happy, but it seems like spark-submit puts an older version of netty on the classpath when submitting to a cluster, such that my code ends up

Re: --executor-cores cannot change vcores in yarn?

2014-11-03 Thread Shekhar Bansal
If you are using the capacity scheduler in YARN: by default the YARN capacity scheduler uses DefaultResourceCalculator. DefaultResourceCalculator considers only memory while allocating containers. You can use DominantResourceCalculator; it considers both memory and CPU. In capacity-scheduler.xml set
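
The setting being described, as it would appear in capacity-scheduler.xml (standard YARN property):

    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>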

Re: Spark job resource allocation best practices

2014-11-03 Thread Akhil Das
Yes, I believe Mesos is the right choice for you. http://mesos.apache.org/documentation/latest/mesos-architecture/ Thanks Best Regards On Mon, Nov 3, 2014 at 9:33 PM, Romi Kuntsman r...@totango.com wrote: So, as said there, static partitioning is used in Spark's standalone and YARN modes, as

Got java.lang.SecurityException: class javax.servlet.FilterRegistration's when running job from intellij Idea

2014-11-03 Thread Jaonary Rabarisoa
Hi all, I have a Spark job that I build with sbt and can run without any problem with sbt run. But when I run it inside IntelliJ IDEA I get the following error: Exception encountered when invoking run on a nested suite - class javax.servlet.FilterRegistration's signer information does not

Re: Cannot instantiate hive context

2014-11-03 Thread Akhil Das
Not quite sure, but moving the Guava 11 jar to the first position in the classpath may solve this issue. Thanks Best Regards On Tue, Nov 4, 2014 at 1:47 AM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Thanks Akhil. I realized that earlier, and I thought mvn -Phive should have captured and

Re: Key-Value decomposition

2014-11-03 Thread david
Hi, But I've only one RDD. Here is a more complete example: my RDD is something like ("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1") And I expect the following result: (A,1), (A,2), (A,3), (B,2), (B,5), (B,6), (C,3), (C,2), (C,1) Any idea how I can achieve this? Thanks
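
For reference, a minimal sketch of one way to get there with flatMap, assuming the values are semicolon-delimited strings:

    val rdd = sc.parallelize(Array(("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")))
    // split each value and emit one (key, element) pair per element
    val decomposed = rdd.flatMap { case (k, v) => v.split(";").map(x => (k, x)) }
    // (A,1), (A,2), (A,3), (B,2), (B,5), (B,6), (C,3), (C,2), (C,1)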