Re: Using Cassandra as an input stream from Java

2013-12-04 Thread Lucas Fernandes Brunialti
Hi all, this should work: JavaPairRDD<ByteBuffer, SortedMap<ByteBuffer, IColumn>> casRdd = context.newAPIHadoopRDD(job.getConfiguration(), ColumnFamilyInputFormat.class.asSubclass(org.apache.hadoop.mapreduce.InputFormat.class), ByteBuffer.class, SortedMap.class); I have translated the word count written in Scala to Java, I just can't sen
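For reference, a minimal Scala sketch of the same pattern, modeled on the CassandraTest example bundled with Spark; the keyspace/column-family configuration via ConfigHelper is elided, and the names used below are only illustrative:

    import java.nio.ByteBuffer
    import java.util.SortedMap
    import org.apache.cassandra.db.IColumn
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "CassandraSketch")
    val job = new Job()
    // ConfigHelper.setInputColumnFamily(job.getConfiguration(), keyspace, columnFamily)
    // and the other Cassandra connection settings are elided here.

    val casRdd = sc.newAPIHadoopRDD(
      job.getConfiguration(),
      classOf[ColumnFamilyInputFormat],
      classOf[ByteBuffer],
      classOf[SortedMap[ByteBuffer, IColumn]])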

Re: Using Cassandra as an input stream from Java

2013-12-04 Thread Pulasthi Supun Wickramasinghe
Hi Tal, Just checking if you have added your code to GitHub :). If you have, could you point me to it? Best Regards, Pulasthi On Thu, Nov 28, 2013 at 11:54 PM, Patrick Wendell wrote: > Tal - that would be great to have open sourced if you can do it! > > On Thu, Nov 28, 2013 at 1:00 AM, Pulasthi

RE: Spark over YARN

2013-12-04 Thread Liu, Raymond
YARN alpha API support is already there. If you mean the YARN stable API in Hadoop 2.2, it will probably be in 0.8.1. Best Regards, Raymond Liu From: Pranay Tonpay [mailto:pranay.ton...@impetus.co.in] Sent: Thursday, December 05, 2013 12:53 AM To: user@spark.incubator.apache.org Subject: Spark over

Re: Removing broadcasts

2013-12-04 Thread Matei Zaharia
Hey Roman, It looks like that pull request was never migrated to the Apache GitHub, but I like the idea. If you migrate it over, we can merge in something like this. In terms of the API, I’d just add an unpersist() method on each Broadcast object. Matei On Dec 3, 2013, at 6:00 AM, Roman Pastukh
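For context, the usage being proposed might look like the following sketch; note that unpersist() here is the hypothetical method under discussion in this thread, not something that exists in the 0.8 API:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "BroadcastSketch")
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(s => lookup.value(s))
      .reduce(_ + _)

    // Hypothetical: remove the broadcast's blocks from the cluster once it is no longer needed.
    lookup.unpersist()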

Question about saveAsTextFile on DStream

2013-12-04 Thread Parth Patil
Hi Friends, I am new to Spark and Spark Streaming. I am trying to save a DStream to files but can't figure out how to do it with the provided methods on DStream (saveAsTextFiles). Following is what I am trying to do. E.g., a DStream of type DStream[(String, String)]: ("file1.txt", "msg_a"), ("file1.txt", "m
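For reference, a minimal sketch of the built-in method mentioned above; saveAsTextFiles writes one output directory per batch using the given prefix and suffix (writing to per-key files such as "file1.txt" would instead need a custom foreachRDD). The socket source and paths below are hypothetical:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[2]", "DStreamSaveSketch", Seconds(10))

    // Any DStream[(String, String)] would do; a socket source is used here only as a placeholder.
    val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, "msg_a"))

    // Produces one directory per batch, named like out-<batch time>.txt
    pairs.saveAsTextFiles("hdfs:///tmp/out", "txt")

    ssc.start()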

Re: Pre-built Spark for Windows 8.1

2013-12-04 Thread Matei Zaharia
Hey Adrian, Ideally you shouldn’t use Cygwin to run on Windows — use the .cmd scripts we provide instead. Cygwin might be made to work but we haven’t tried to do this so far so it’s not supported. If you can fix it, that would of course be welcome. Also, the deploy scripts don’t work on Window

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Christopher Nguyen
Matt, we've done 1TB linear models in 2-3 minutes on 40 node clusters (30GB/node, just enough to hold all partitions simultaneously in memory). You can do with fewer nodes if you're willing to slow things down. Some of our TB benchmark numbers are available in my Spark Summit slides. Sorry I'm on

Pre-built Spark for Windows 8.1

2013-12-04 Thread Adrian Bonar
Separate from my previous thread about building a distribution of Spark on Win8, I am also trying to run the pre-built binaries, with little success. I downloaded and extracted "spark-0.8.0-incubating-bin-hadoop1" to d:\spark and attempted to start a master, with the following error: $ sh bin/start

Spark build/standalone on Windows 8.1

2013-12-04 Thread Adrian Bonar
I am trying to bring up a standalone cluster on Windows 8.1/Windows Server 2013 and I am having trouble getting the make-distribution script to complete successfully. It seems to change a bunch of permissions in /dist and then tries to write to it unsuccessfully. I assume this is not expected b

Re: GroupingComparator in Spark.

2013-12-04 Thread Josh Rosen
It looks like OrderedRDDFunctions ( https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.OrderedRDDFunctions), which defines sortByKey(), is constructed with an implicit Ordered[K], so you could explicitly construct an OrderedRDDFunctions with your own Ordered. You
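One way to get that effect without touching OrderedRDDFunctions directly is to give the key type its own Ordered implementation that only looks at the part you want sorted; sortByKey then picks it up through the usual implicits. The key class and field names below are hypothetical:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Order only by sortPart; groupPart is ignored for sorting purposes.
    case class MyKey(groupPart: String, sortPart: Int) extends Ordered[MyKey] {
      def compare(that: MyKey): Int = this.sortPart compare that.sortPart
    }

    val sc = new SparkContext("local", "CustomOrderSketch")
    val data = sc.parallelize(Seq(
      (MyKey("a", 3), "x"), (MyKey("b", 1), "y"), (MyKey("a", 2), "z")))

    // Sorted by sortPart only: (b,1), (a,2), (a,3)
    val sorted = data.sortByKey().collect()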

Re: GroupingComparator in Spark.

2013-12-04 Thread Reynold Xin
Spark's expressiveness allows you to do this fairly easily on your own. sortByKey is implemented in a few lines of code, so it would be fairly easy to implement your own sortByKey to do that. Replace the partitioner in sortByKey with a hash partitioner on the key, and then define a separate way t
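A sketch of that recipe, under the assumption that the key is a (group, sequence) pair; the partition count and class names are made up for illustration. Partition on the group half of the key, then sort each partition on the sequence half:

    import org.apache.spark.{Partitioner, SparkContext}
    import org.apache.spark.SparkContext._

    // Route records by the group part of the key only.
    class GroupPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = {
        val group = key.asInstanceOf[(String, Int)]._1
        ((group.hashCode % numPartitions) + numPartitions) % numPartitions   // non-negative modulo
      }
    }

    val sc = new SparkContext("local", "SecondarySortSketch")
    val records = sc.parallelize(Seq((("a", 3), "x"), (("a", 1), "y"), (("b", 2), "z")))

    val result = records
      .partitionBy(new GroupPartitioner(8))
      .mapPartitions(iter => iter.toArray.sortBy(_._1._2).iterator)   // sort within each partition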

Memory configuration of standalone clusters

2013-12-04 Thread Andrew Ash
Hello, I have a few questions about configuring memory usage on standalone clusters. Can someone help me out? 1) The terms "slave" in ./bin/start-slaves.sh and "worker" in the docs seem to be used interchangeably. Are they the same? 2) On a worker/slave, is there only one JVM running that has
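For context, a minimal sketch of the two knobs that usually interact here in 0.8-era standalone mode: SPARK_WORKER_MEMORY in conf/spark-env.sh bounds how much memory a worker may hand out to executors on its machine, while spark.executor.memory is what each application's executor JVM actually requests (the values below are hypothetical):

    import org.apache.spark.SparkContext

    // Must be set before the SparkContext is created.
    System.setProperty("spark.executor.memory", "4g")

    val sc = new SparkContext("spark://master:7077", "MemoryConfigSketch")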

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Ceriel Jacobs
Thanks for your answer. The partitioning function is not that important. What is important is that I only sort the partitions, not the complete RDD. Your suggestion to use rdd.distinct.coalesce(32).mapPartitions(p => sorted(p)) sounds nice, and I had indeed seen the coalesce method and the mapParti

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matei Zaharia
These were EC2 clusters, so the machines were smaller than modern machines. You can definitely have 1 TB datasets on 10 nodes too. Actually if you’re curious about hardware configuration, take a look at http://spark.incubator.apache.org/docs/latest/hardware-provisioning.html. Also, regarding Sp

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Mark Hamstra
You can easily accomplish the hashcode % 32 partitioning using Spark's partitionBy: val strings: RDD[String] = ... strings.map(s => (s, 1)).partitionBy(new HashPartitioner(32)).keys On Wed, Dec 4, 2013 at 11:05 AM, Andrew Ash wrote: > How important is it that they're partitioned on hashcode() % 32 rather > th
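Putting the suggestions in this thread together, a rough sketch of the whole pipeline the original question asks for (hash into 32 partitions, then de-duplicate and sort each partition independently rather than sorting the whole RDD; the input path is hypothetical):

    import org.apache.spark.{HashPartitioner, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local", "PartitionSortSketch")
    val strings = sc.textFile("hdfs:///path/to/input")

    val sortedPartitions = strings
      .map(s => (s, 1))                                    // pair up so partitionBy applies
      .partitionBy(new HashPartitioner(32))                // effectively hashCode % 32
      .keys
      .mapPartitions(iter => iter.toArray.distinct.sorted.iterator)   // dedup + sort per partition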

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matt Cheah
I'm reading the paper now, thanks. It states that 100-node clusters were used. Is it typical in the field to have 100-node clusters at the 1TB scale? We were expecting to be using ~10 nodes. I'm still pretty new to cluster computing, so I'm just not sure how people have set these up. -Matt Cheah Fr

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Andrew Ash
How important is it that they're partitioned on hashcode() % 32 rather than Spark's default partitioning? In Scala, you should be able to do this with rdd.distinct.coalesce(32).mapPartitions(p => sorted(p)). I'm not sure what your end goal is here, but if it's just to sort a bunch of data and remove

Fwd: GroupingComparator in Spark.

2013-12-04 Thread Archit Thakur
Hi, Was just curious. In Hadoop, you have the flexibility to choose your own class for the SortComparator and GroupingComparator. I have figured out that there are functions like sortByKey and reduceByKey. But what if I want to customize which part of the key I want to use for sorting and which part for

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matei Zaharia
Yes, check out the Shark paper for example: https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/ The numbers on that benchmark are for Shark. Matei On Dec 3, 2013, at 3:50 PM, Matt Cheah wrote: > Hi everyone, > > I notice the benchmark page for AMPLab provides so

Re: Persisting MatrixFactorizationModel

2013-12-04 Thread Evan R. Sparks
Ah, actually - I just remembered that the user and product features of the model are RDDs, so - you might be better off saving those components to HDFS and then at load time reading them back in and creating a new MatrixFactorizationModel. Sorry for the confusion! Note, the above solution only wo
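A sketch of that approach, assuming the MatrixFactorizationModel constructor is accessible in your Spark version and with hypothetical HDFS paths; "model" is the result of ALS.train in the training program, and "sc" is the SparkContext of the program that reloads it:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // In the training program: write the two feature RDDs out as object files.
    model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
    model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

    // In the client program: read them back and rebuild the model.
    val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
    val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
    val rank = 10   // must match the rank used at training time
    val reloaded = new MatrixFactorizationModel(rank, userFeatures, productFeatures)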

Re: Persisting MatrixFactorizationModel

2013-12-04 Thread Aslan Bekirov
I thought to convert the model to an RDD, save it to HDFS, and then load it back. I will try your method. Thanks a lot. On Wed, Dec 4, 2013 at 7:41 PM, Evan R. Sparks wrote: > The model is serializable - so you should be able to write it out to disk > and load it up in another program. > > See, e.g. - htt

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Ceriel Jacobs
Thanks for your answer. But the problem is that I only want to sort the 32 partitions, individually, not the complete input. So yes, the output has to consist of 32 partitions, each sorted. Ceriel Jacobs On 12/04/2013 06:30 PM, Ashish Rangole wrote: I am not sure if 32 partitions is a hard l

Re: Persisting MatrixFactorizationModel

2013-12-04 Thread Evan R. Sparks
The model is serializable - so you should be able to write it out to disk and load it up in another program. See, e.g. - https://gist.github.com/ramn/5566596 (Note, I haven't tested this particular example, but it looks alright). Spark makes use of this type of Scala (and Kryo, etc.) serializatio

Persisting MatrixFactorizationModel

2013-12-04 Thread Aslan Bekirov
Hi All, I am creating a model by calling the train method of ALS: val model = ALS.train(ratings.) I need to persist this model, use it from different clients, and enable clients to make predictions using it. In other words, persist and reload this model. Any suggestions, please? BR, Aslan
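For reference, a minimal sketch of the training call being elided above, with hypothetical rank, iteration and lambda values (0.8-era MLlib API):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val sc = new SparkContext("local", "ALSSketch")

    // Hypothetical (user, product, rating) triples.
    val ratings = sc.parallelize(Seq(Rating(1, 10, 5.0), Rating(1, 20, 1.0), Rating(2, 10, 4.0)))

    val model = ALS.train(ratings, 10, 20, 0.01)   // rank = 10, 20 iterations, lambda = 0.01

    val score = model.predict(2, 20)               // predicted rating of product 20 by user 2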

Re: Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Ashish Rangole
I am not sure if 32 partitions is a hard limit that you have. Unless you have a strong reason to use only 32 partitions, please try providing the second optional argument (numPartitions) to the reduceByKey and sortByKey methods, which will parallelize these reduce operations. A number 3x the number of
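A sketch of what passing that optional argument looks like; the value 96 is only an illustration of the "3x the number of cores" rule of thumb:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local", "NumPartitionsSketch")
    val pairs = sc.parallelize(Seq(("a", 2), ("b", 1), ("a", 3)))

    // Both shuffle operations accept an explicit partition count.
    val reduced = pairs.reduceByKey(_ + _, 96)
    val sorted  = pairs.sortByKey(true, 96)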

Re: Spark over YARN

2013-12-04 Thread Josh Rosen
Spark has had YARN support since 0.6.0: https://spark.incubator.apache.org/docs/latest/running-on-yarn.html On Wed, Dec 4, 2013 at 8:52 AM, Pranay Tonpay wrote: > Hi, > > Is there a release where Spark over YARN targeted for ? I presume, it’s in > progress at the moment.. > > Pls correct me if

Spark over YARN

2013-12-04 Thread Pranay Tonpay
Hi, Is there a release that Spark over YARN is targeted for? I presume it's in progress at the moment.. Pls correct me if my info is outdated. Thx pranay

Splitting into partitions and sorting the partitions ... how to do that?

2013-12-04 Thread Ceriel Jacobs
Hi, I am a novice to Spark, and need some help with the following problem: I have a JavaRDD<String> strings; which is potentially large, hundreds of GBs, and I need to split them into 32 partitions, by means of hashcode() % 32, and then sort these partitions, and also remove duplicates. I am having