Hi all,
This should work:
JavaPairRDD<ByteBuffer, SortedMap<ByteBuffer, IColumn>> casRdd =
context.newAPIHadoopRDD(job.getConfiguration(),
ColumnFamilyInputFormat.class.asSubclass(org.apache.hadoop.mapreduce.InputFormat.class),
ByteBuffer.class, SortedMap.class);
I have translated the word count written in Scala to Java; I just can't
sen
Hi Tal,
Just checking if you have added your code to GitHub :). If you have, could
you point me to it.
Best Regards,
Pulasthi
On Thu, Nov 28, 2013 at 11:54 PM, Patrick Wendell wrote:
> Tal - that would be great to have open sourced if you can do it!
>
> On Thu, Nov 28, 2013 at 1:00 AM, Pulasthi
YARN alpha API support is already there. If you mean the YARN stable API in Hadoop
2.2, it will probably be in 0.8.1.
Best Regards,
Raymond Liu
From: Pranay Tonpay [mailto:pranay.ton...@impetus.co.in]
Sent: Thursday, December 05, 2013 12:53 AM
To: user@spark.incubator.apache.org
Subject: Spark over
Hey Roman,
It looks like that pull request was never migrated to the Apache GitHub, but I
like the idea. If you migrate it over, we can merge in something like this. In
terms of the API, I’d just add an unpersist() method on each Broadcast object.
Matei
On Dec 3, 2013, at 6:00 AM, Roman Pastukh
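For reference, a hypothetical sketch of how such an unpersist() could be used. This method did not exist in the Spark release discussed in this thread, and lookupTable is a made-up name:
// Hypothetical API sketch of the proposed Broadcast.unpersist();
// not the API of the Spark release discussed here.
val lookup = sc.broadcast(lookupTable)
rdd.map(x => lookup.value.getOrElse(x, 0)).sum()
lookup.unpersist() // proposed: drop the broadcast's cached blocks on the executors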
Hi Friends,
I am new to Spark and Spark Streaming. I am trying to save a DStream to
a file but can't figure out how to do it with the provided methods on DStream
(saveAsTextFiles). The following is what I am trying to do,
e.g. with a DStream of type DStream[(String, String)]:
("file1.txt", "msg_a"), ("file1.txt", "m
Hey Adrian,
Ideally you shouldn’t use Cygwin to run on Windows — use the .cmd scripts we
provide instead. Cygwin might be made to work, but we haven’t tried to do this
so far, so it’s not supported. If you can fix it, that would of course be
welcome.
Also, the deploy scripts don’t work on Window
Matt, we've done 1TB linear models in 2-3 minutes on 40-node clusters
(30GB/node, just enough to hold all partitions simultaneously in memory).
You can do it with fewer nodes if you're willing to slow things down.
Some of our TB benchmark numbers are available in my Spark Summit slides.
Sorry I'm on
Separate from my previous thread about building a distribution of Spark on
Win8, I am also trying to run the pre-built binaries, with little success. I
downloaded and extracted "spark-0.8.0-incubating-bin-hadoop1" to d:\spark and
attempted to start a master with the following error:
$ sh bin/start
I am trying to bring up a standalone cluster on Windows 8.1/Windows Server 2013
and I am having trouble getting the make-distribution script to complete
successfully. It seems to change a bunch of permissions in /dist and then tries
to write to it unsuccessfully. I assume this is not expected b
It looks like OrderedRDDFunctions (
https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.OrderedRDDFunctions),
which defines sortByKey(), is constructed with an implicit Ordered[K], so you
could explicitly construct an OrderedRDDFunctions with your own Ordered.
You
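A minimal sketch of that idea, assuming a composite key where only the first field should drive the sort (PartialKey, its field names, and pairs are all made up):
import org.apache.spark.SparkContext._

// A made-up key type that compares on `primary` only, so sortByKey
// orders by the first field and ignores the second.
case class PartialKey(primary: String, secondary: Int) extends Ordered[PartialKey] {
  def compare(that: PartialKey): Int = primary.compare(that.primary)
}

// Assumes pairs: RDD[(PartialKey, String)]
val sorted = pairs.sortByKey()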
Spark's expressiveness allows you to do this fairly easily on your own.
sortByKey is implemented in a few lines of code. It would be fairly easy to
implement your own sortByKey to do that. Replace the partitioner in
sortByKey with a hash partitioner on the key, and then define a
separate way t
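A sketch of that approach, assuming pairs: RDD[(String, String)] (the variable names are placeholders):
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// Shuffle by hash of the key into 32 partitions, then sort each
// partition locally instead of doing a total sort across partitions.
val locallySorted = pairs
  .partitionBy(new HashPartitioner(32))
  .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator, preservesPartitioning = true)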
Hello,
I have a few questions about configuring memory usage on standalone
clusters. Can someone help me out?
1) The terms "slave" in ./bin/start-slaves.sh and "worker" in the docs seem
to be used interchangeably. Are they the same?
2) On a worker/slave, is there only one JVM running that has
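For reference, the standalone-mode memory knobs live in conf/spark-env.sh; a sketch using the variable names from the docs of that era (the values are placeholders, so check the current docs before copying):
# conf/spark-env.sh (standalone mode)
export SPARK_WORKER_MEMORY=16g    # total memory a worker can allocate to executors
export SPARK_WORKER_INSTANCES=1   # number of worker JVMs to run per machine
export SPARK_DAEMON_MEMORY=512m   # memory for the master/worker daemons themselves
export SPARK_MEM=8g               # memory used by each application's executor JVMs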
Thanks for your answer.
The partitioning function is not that important. What is important is that I only
sort the partitions,
not the complete RDD. Your suggestion to use
rdd.distinct.coalesce(32).mapPartitions(p => p.toSeq.sorted.iterator)
sounds nice, and I had indeed seen the coalesce method and the mapParti
These were EC2 clusters, so the machines were smaller than modern machines. You
can definitely have 1 TB datasets on 10 nodes too. Actually if you’re curious
about hardware configuration, take a look at
http://spark.incubator.apache.org/docs/latest/hardware-provisioning.html.
Also, regarding Sp
You can easily accomplish the hash-mod-32 partitioning using Spark's partitionBy:
import org.apache.spark.HashPartitioner
val strings: RDD[String] = ...
// pair each string with a dummy value, shuffle into 32 hash partitions, keep the keys
strings.map(s => (s, 1)).partitionBy(new HashPartitioner(32)).keys
On Wed, Dec 4, 2013 at 11:05 AM, Andrew Ash wrote:
> How important is it that they're partitioned on hashcode() % 32 rather
> th
I'm reading the paper now, thanks. It states 100-node clusters were used. Is
this typical in the field, to have 100-node clusters at the 1TB scale? We were
expecting to be using ~10 nodes.
I'm still pretty new to cluster computing, so just not sure how people have set
these up.
-Matt Cheah
Fr
How important is it that they're partitioned on hashcode() % 32 rather than
Spark's default partitioning?
In Scala, you should be able to do this with
rdd.distinct.coalesce(32).mapPartitions(p => p.toSeq.sorted.iterator)
I'm not sure what your end goal is here, but if it's just sort a bunch of
data and remove
Hi,
Was just curious: in Hadoop, you have the flexibility to choose your own
class for SortComparator and GroupingComparator. I have figured out that
there are functions like sortByKey and reduceByKey.
But what if I want to customize which part of the key I want to use for sorting
and which part for
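There is no direct comparator hook, but one way to sketch the equivalent, assuming records: RDD[((String, Int), Double)] where the String part should drive grouping and the Int part the ordering (all names and the combine policy are made up):
// Re-key on just the grouping fields, then reduce; as an example policy,
// keep the record with the smallest "sort" component within each group.
val byGroup = records.map { case ((group, sortPart), v) => (group, (sortPart, v)) }
val reduced = byGroup.reduceByKey { (a, b) => if (a._1 <= b._1) a else b }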
Yes, check out the Shark paper for example:
https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
The numbers on that benchmark are for Shark.
Matei
On Dec 3, 2013, at 3:50 PM, Matt Cheah wrote:
> Hi everyone,
>
> I notice the benchmark page for AMPLab provides so
Ah, actually - I just remembered that the user and product features of the
model are RDDs, so you might be better off saving those components to
HDFS and then, at load time, reading them back in and creating a new
MatrixFactorizationModel. Sorry for the confusion!
Note, the above solution only wo
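A sketch of that suggestion, with made-up paths; note the MatrixFactorizationModel constructor may not be public in every release, so this is only the general shape:
// Save the model's components to HDFS...
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// ...and rebuild the model later, in another program.
val rank = 10 // must match the rank used at training time
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
val reloaded = new MatrixFactorizationModel(rank, userFeatures, productFeatures)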
I thought to convert the model to an RDD and save it to HDFS, and then load it.
I will try your method. Thanks a lot.
On Wed, Dec 4, 2013 at 7:41 PM, Evan R. Sparks wrote:
> The model is serializable - so you should be able to write it out to disk
> and load it up in another program.
>
> See, e.g. - htt
Thanks for your answer. But the problem is that I only want to sort the 32
partitions, individually,
not the complete input. So yes, the output has to consist of 32 partitions,
each sorted.
Ceriel Jacobs
On 12/04/2013 06:30 PM, Ashish Rangole wrote:
I am not sure if 32 partitions is a hard l
The model is serializable - so you should be able to write it out to disk
and load it up in another program.
See, e.g. - https://gist.github.com/ramn/5566596 (Note, I haven't tested
this particular example, but it looks alright).
Spark makes use of this type of Scala (and Kryo, etc.) serializatio
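In the same spirit as that gist, a minimal, untested sketch of plain Java serialization for any serializable object (as the later reply in this thread notes, a model that holds RDDs will not round-trip this way):
import java.io._

// Write a serializable object to a local file.
def save[T](obj: T, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(obj) finally out.close()
}

// Read it back and cast to the expected type.
def load[T](path: String): T = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[T] finally in.close()
}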
Hi All,
I am creating a model by calling the train method of ALS:
val model = ALS.train(ratings, ...)
I need to persist this model, use it from different clients, and enable
clients to make predictions using this model. In other words, persist and
reload this model.
Any suggestions, please?
BR,
Aslan
I am not sure if 32 partitions is a hard limit that you have.
Unless you have a strong reason to use only 32 partitions, please try
providing the second optional
argument (numPartitions) to the reduceByKey and sortByKey methods, which will
parallelize these reduce operations.
A number 3x the number of
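For example, assuming pairs: RDD[(String, Int)] and a cluster where 96 is roughly 3x the total core count:
// Pass numPartitions explicitly so the shuffle runs with more reduce tasks.
val reduced = pairs.reduceByKey(_ + _, 96)
val sorted = reduced.sortByKey(ascending = true, numPartitions = 96)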
Spark has had YARN support since 0.6.0:
https://spark.incubator.apache.org/docs/latest/running-on-yarn.html
On Wed, Dec 4, 2013 at 8:52 AM, Pranay Tonpay
wrote:
> Hi,
>
> Is there a release that Spark over YARN is targeted for? I presume it’s in
> progress at the moment..
>
> Pls correct me if
Hi,
Is there a release that Spark over YARN is targeted for? I presume it's in
progress at the moment..
Please correct me if my info is outdated.
Thx
pranay
Hi,
I am a novice to Spark and need some help with the following problem:
I have a
JavaRDD<String> strings;
which is potentially large, hundreds of GBs, and I need to split it
into 32 partitions, by means of hashCode() % 32, and then sort these partitions
and also remove duplicates. I am having
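Pulling together the suggestions from this thread, a sketch in Scala (the question uses the Java API, but the shape is the same; strings is assumed to be an RDD[String]):
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// Dedupe, hash into 32 partitions, then sort each partition locally.
val result = strings
  .distinct()
  .map(s => (s, 1))
  .partitionBy(new HashPartitioner(32))
  .keys
  .mapPartitions(iter => iter.toSeq.sorted.iterator)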