Substring in Spark SQL

2014-08-04 Thread Tom
…that substr is supported by HiveQL, but not by Spark SQL, correct? Thanks! Tom -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Substring-in-Spark-SQL-tp11373.html Sent from the Apache Spark User List mailing list archive at Nabble.com
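For reference, HiveQL's substr(str, pos[, len]) uses 1-based positions and accepts a negative pos that counts back from the end of the string. A minimal pure-Python sketch of those semantics (the function name is mine, not a Spark or Hive API):

```python
def hive_substr(s, pos, length=None):
    # HiveQL positions are 1-based; a negative pos counts back from the end.
    if pos >= 1:
        start = pos - 1
    else:
        start = max(len(s) + pos, 0)
    end = len(s) if length is None else start + max(length, 0)
    return s[start:end]

# hive_substr('abcde', 2)     -> 'bcde'
# hive_substr('abcde', 2, 2)  -> 'bc'
# hive_substr('abcde', -2)    -> 'de'
```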

Re: Retrieve dataset of Big Data Benchmark

2014-07-17 Thread Tom
the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). I guess the files are publicly available, but only to registered AWS users, so I caved in and registered for the service. Using the credentials that I got I was able to download the files using the local spark shell. Thanks! Tom

Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Tom
Hi Burak, Thank you for your pointer, it is really helping out. I do have some follow-up questions though. After looking at the Big Data Benchmark page https://amplab.cs.berkeley.edu/benchmark/ (section "Run this benchmark yourself"), I was expecting the following combination of files: Sets:

Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Tom
…], in the Amazon cluster. Is there a way I can download this without being a user of the Amazon cluster? I tried bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./ but it asks for an AWS Access Key ID and Secret Access Key, which I do not have. Thanks in advance, Tom
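For context, the two usual ways to supply S3N credentials that this thread refers to, besides embedding them in the URI as above, are as Hadoop properties or on the SparkContext. A sketch with placeholder key values:

```
# Option 1: pass the s3n credentials as Hadoop properties on the command line
bin/hadoop distcp \
  -D fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY \
  -D fs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY \
  s3n://big-data-benchmark/pavlo/text/tiny/ ./tiny

# Option 2: set them on the Hadoop configuration inside spark-shell
#   sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
#   sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
```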

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Tom Vacek
Spark gives you four of the classical collectives: broadcast, reduce, scatter, and gather. There are also a few additional primitives, mostly based on a join. Spark is certainly less optimized than MPI for these, but maybe that isn't such a big deal. Spark has one theoretical disadvantage

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tom Graves
to.  But they shouldn't have overlapped as far as both being up at the same time. Is that the case you are seeing?  Generally you want to look at why the first application attempt fails. Tom On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com wrote: I tested an application on RC-10

Re: Spark LIBLINEAR

2014-05-16 Thread Tom Vacek
I've done some comparisons with my own implementation of TRON on Spark. From a distributed computing perspective, it does 2x more local work per iteration than LBFGS, so the parallel isoefficiency is improved slightly. I think the truncated Newton solver holds some potential because there have

Re: Spark on Yarn - A small issue !

2014-05-14 Thread Tom Graves
of all node managers. Thus, this is not applicable to hosted clusters). Tom On Monday, May 12, 2014 9:38 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi All,  I wanted to launch Spark on Yarn, interactive - yarn client mode. With default settings of yarn-site.xml and spark-env.sh, i

Re: configure spark history server for running on Yarn

2014-05-05 Thread Tom Graves
either go to the RM UI to link to the spark history UI or go directly to the spark history server ui. Tom On Thursday, May 1, 2014 7:09 PM, Jenny Zhao linlin200...@gmail.com wrote: Hi, I have installed spark 1.0 from the branch-1.0, build went fine, and I have tried running the example
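For context, wiring up the Spark 1.0 history server generally comes down to event logging plus pointing the server at the log directory; a sketch with placeholder paths:

```
# spark-defaults.conf (the HDFS path is a placeholder)
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-logs

# start the history server against the same directory
./sbin/start-history-server.sh hdfs:///spark-logs
```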

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is large and doesn't change. I think this is a very good solution to what is being asked for. On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell i...@ianoconnell.com wrote: A mutable map in an object should do what your

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
I'm not sure what I said came through. RDD zip is not hacky at all, as it only depends on a user not changing the partitioning. Basically, you would keep your losses as an RDD[Double] and zip those with the RDD of examples, and update the losses. You're doing a copy (and GC) on the RDD of
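A plain-Python analogue of the pattern Tom describes (names and the update rule are illustrative, not Spark API; in Spark the two collections would be co-partitioned RDDs combined with RDD.zip, so only the small losses RDD is recreated each iteration):

```python
# Large, immutable base data: reused across iterations, never copied.
examples = [1.0, 2.0, 3.0, 4.0]

# Small state: a fresh list of losses is produced each iteration.
losses = [0.0] * len(examples)

def update(example, loss):
    # Hypothetical per-element loss update.
    return loss + example * example

# One iteration: zip the state with the base data, build new state only.
losses = [update(x, l) for x, l in zip(examples, losses)]
```

The base data is never rewritten; only the zipped-in state is regenerated, which is what keeps GC pressure low.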

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Right, they are zipped at each iteration. On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen chesterxgc...@yahoo.com wrote: Tom, Are you suggesting two RDDs, one with loss and another for the rest info, using zip to tie them together, but do update on loss RDD (copy)? Chester Sent from

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Ian, I tried playing with your suggestion, but I get a task not serializable error (and some obvious things didn't fix it). Can you get that working? On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek minnesota...@gmail.com wrote: As to your last line: I've used RDD zipping to avoid GC since

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
to. For instance, will RDDs of the same size usually get partitioned to the same machines - thus not triggering any cross machine aligning, etc. We'll explore it, but I would still very much like to see more direct worker memory management besides RDDs. On Mon, Apr 28, 2014 at 10:26 AM, Tom

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Tom Vacek
Here are some out-of-the-box ideas: If the elements lie in a fairly small range and/or you're willing to work with limited precision, you could use counting sort. Moreover, you could iteratively find the median using bisection, which would be associative and commutative. It's easy to think of
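A single-machine sketch of the bisection idea: each step only needs a count-below-threshold, which is an associative and commutative aggregate and therefore maps directly onto a distributed reduce. Function and variable names are mine:

```python
def median_by_bisection(xs, lo, hi, tol=1e-9):
    # Find the median of xs (odd length) by bisecting on the value range
    # [lo, hi]. The only global aggregate per step is a count, which is
    # associative and commutative, so it parallelizes as a reduce.
    target = len(xs) // 2  # rank of the median for odd-length input
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        below = sum(1 for x in xs if x <= mid)  # the distributed step
        if below > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```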

internship opportunity

2014-04-22 Thread Tom Vacek
Thomson Reuters is looking for a graduate (or possibly advanced undergraduate) summer intern in Eagan, MN. This is a chance to work on an innovative project exploring how big data sets can be used by professionals such as lawyers, scientists and journalists. If you're subscribed to this mailing

Re: Huge matrix

2014-04-12 Thread Tom V
should be able to distribute the things needed to make a recommendation (either the centroids or the attributes matrix), and just break up the work based on the users you want to generate recommendations for. I hope this helps. Tom On Sat, Apr 12, 2014 at 11:35 AM, Xiaoli Li lixiaolima

Re: Spark 1.0.0 release plan

2014-04-04 Thread Tom Graves
Do we have a list of things we really want to get in for 1.X? Perhaps move any jira out to a 1.1 release if we aren't targeting them for 1.0. It might be nice to send out reminders when these dates are approaching. Tom On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com

Re: Pig on Spark

2014-03-06 Thread Tom Graves
helped out with this prototype over Twitter’s hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one. Tom On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote: Hi everyone, We are using

<    1   2