Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Diana Carroll
I'm just getting started with Spark SQL and DataFrames in 1.3.0. I notice that the Spark API shows a different syntax for referencing columns in a dataframe than the Spark SQL Programming Guide. For instance, the API docs for the select method show this: df.select($"colA", $"colB") Whereas the
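
Both styles shown in the docs work; a minimal Scala sketch (assuming a Spark 1.3 SQLContext named sqlContext and an existing DataFrame df) of the equivalent column-reference forms:

    // The $"col" syntax needs the implicits import; the other two forms do not.
    import sqlContext.implicits._

    df.select($"colA", $"colB")          // ColumnName via the $ string interpolator
    df.select(df("colA"), df("colB"))    // Column returned by df.apply("colA")
    df.select("colA", "colB")            // plain column-name strings

All three produce the same plan; the $/Column forms matter when you need column methods such as $"colA".alias("a") or comparisons inside filters.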

Spark 1.0 docs out of sync?

2014-06-30 Thread Diana Carroll
I'm hoping someone can clear up some confusion for me. When I view the Spark 1.0 docs online (http://spark.apache.org/docs/1.0.0/) they are different than the docs which are packaged with the Spark 1.0.0 download (spark-1.0.0.tgz). In particular, in the online docs, there's a single merged Spark

NotSerializableException in Spark Streaming

2014-05-14 Thread Diana Carroll
Hey all, trying to set up a pretty simple streaming app and getting some weird behavior. First, a non-streaming job that works fine: I'm trying to pull out lines of a log file that match a regex, for which I've set up a function: def getRequestDoc(s: String): String = { "KBDOC-[0-9]*".r.find
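
The helper is cut off by the archive preview; a hedged sketch of what such a function typically looks like (findFirstIn is an assumption about the elided call):

    // Assumed completion of the truncated helper: return the first KBDOC-nnn id in the line, or "" if none.
    def getRequestDoc(s: String): String = {
      "KBDOC-[0-9]*".r.findFirstIn(s).getOrElse("")
    }

If a helper like this lives inside a non-serializable enclosing class, closures that call it drag the whole instance along and can raise NotSerializableException; moving it into a standalone object (or making the class Serializable) is the usual fix.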

Re: logging in pyspark

2014-05-14 Thread Diana Carroll
with external storage systems. > > > Do you really want to log something for each element of your RDD? > > Nick > > > On Tue, May 6, 2014 at 3:31 PM, Diana Carroll wrote: > >> What should I do if I want to log something as part of a task? >> >> This is what

logging in pyspark

2014-05-06 Thread Diana Carroll
What should I do if I want to log something as part of a task? This is what I tried. To set up a logger, I followed the advice here: http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off logger = logging.getLogger("py4j") logger.setLevel(logging.INFO) logger.addHandler(logging.StreamHa

Re: performance improvement on second operation...without caching?

2014-05-05 Thread Diana Carroll
, Ethan Jewett wrote: >>> >>>> I believe Spark caches RDDs it has memory for regardless of whether you >>>> actually call the 'cache' method on the RDD. The 'cache' method just tips >>>> off Spark that the RDD should have higher priority. At le

when to use broadcast variables

2014-05-02 Thread Diana Carroll
Anyone have any guidance on using a broadcast variable to ship data to workers vs. an RDD? Like, say I'm joining web logs in an RDD with user account data. I could keep the account data in an RDD or if it's "small", a broadcast variable instead. How small is small? Small enough that I know it c
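
A hedged Scala sketch of the broadcast side of that trade-off, assuming the account data fits comfortably in executor memory and that webLogs is an RDD of (userId, logLine) pairs (the accounts.csv path and field layout are made up):

    // Ship the small account table to every worker once, then "join" locally in a map,
    // avoiding a shuffle of the large web-log RDD.
    val accounts: Map[String, String] = sc.textFile("accounts.csv")
      .map { line => val fields = line.split(","); (fields(0), fields(1)) }   // userId -> accountName
      .collect()
      .toMap
    val accountsBc = sc.broadcast(accounts)

    val joined = webLogs.map { case (userId, logLine) =>
      (userId, accountsBc.value.get(userId), logLine)   // local lookup, Option[accountName]
    }

The alternative is keeping the accounts as an RDD and calling webLogs.join(accountsRdd), which shuffles both sides; the broadcast version generally wins when the small side fits in memory on every executor, and the RDD join wins when it does not.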

Re: the spark configuage

2014-04-30 Thread Diana Carroll
I'm guessing your shell stopping when it attempts to connect to the RM is not related to that warning. You'll get that message out of the box from Spark if you don't have HADOOP_HOME set correctly. I'm using CDH 5.0 installed in default locations, and got rid of the warning by setting HADOOP_HOME

Re: running SparkALS

2014-04-28 Thread Diana Carroll
e generic... > > > On Mon, Apr 28, 2014 at 10:47 AM, Diana Carroll wrote: > >> Thanks, Deb. But I'm looking at org.apache.spark.examples.SparkALS, >> which is not in the mllib examples, and does not take any file parameters. >> >> I don't see the clas

Re: running SparkALS

2014-04-28 Thread Diana Carroll
osv solveseems like > that's the bottleneck step...but making double to float is one way to scale > it even further... > > Thanks. > Deb > > > > On Mon, Apr 28, 2014 at 10:30 AM, Diana Carroll wrote: > >> Hi everyone. I'm trying to run some of

running SparkALS

2014-04-28 Thread Diana Carroll
Hi everyone. I'm trying to run some of the Spark example code, and most of it appears to be undocumented (unless I'm missing something). Can someone help me out? I'm particularly interested in running SparkALS, which wants parameters: M U F iter slices. What are these variables? They appear to

Re: checkpointing without streaming?

2014-04-21 Thread Diana Carroll
t clears dependencies. You might need checkpoint to cut a > long lineage in iterative algorithms. -Xiangrui > > On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll > wrote: > > I'm trying to understand when I would want to checkpoint an RDD rather > than > > just persist to

checkpointing without streaming?

2014-04-21 Thread Diana Carroll
I'm trying to understand when I would want to checkpoint an RDD rather than just persist to disk. Every reference I can find to checkpoint relates to Spark Streaming. But the method is defined in the core Spark library, not Streaming. Does it exist solely for streaming, or are there circumstance
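
A minimal sketch of the difference, assuming an iterative job whose lineage grows on every pass (paths and iteration counts are illustrative):

    // persist() keeps the data around but retains the full lineage, so recovering a lost
    // partition replays the whole chain; checkpoint() writes the RDD to reliable storage
    // and truncates the lineage.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")     // required before calling checkpoint()

    var current = sc.textFile("data.txt")
    for (i <- 1 to 100) {
      current = current.map(line => line + "!")        // stand-in for one iteration of real work
      if (i % 10 == 0) {
        current.persist()                              // keep it around for reuse
        current.checkpoint()                           // cut the lineage every 10 iterations
        current.count()                                // an action forces the checkpoint to happen
      }
    }

Outside of streaming, checkpointing mainly pays off in long iterative lineages (ALS, PageRank, graph algorithms), where recomputing hundreds of stages after a failure would cost more than re-reading from HDFS; for a one-pass job, persisting to disk is usually enough.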

partitioning of small data sets

2014-04-15 Thread Diana Carroll
I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions: [image: Inline image 1] Is this cor
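
A short sketch of what is going on (the file name is illustrative): textFile takes a minimum-partitions argument that defaults to min(defaultParallelism, 2), so even a tiny single file usually comes back as two partitions unless you ask otherwise.

    val tiny = sc.textFile("tiny.txt")        // minPartitions defaults to min(defaultParallelism, 2)
    println(tiny.partitions.length)           // typically prints 2, even for a 2.6 KB file

    val single = sc.textFile("tiny.txt", 1)   // request a single partition explicitly
    println(single.partitions.length)         // prints 1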

using Kryo with pyspark?

2014-04-14 Thread Diana Carroll
I'm looking at the Tuning Guide suggestion to use Kryo instead of default serialization. My questions: Does pyspark use Java serialization by default, as Scala spark does? If so, then... can I use Kryo with pyspark instead? The instructions say I should register my classes with the Kryo Seriali

Re: network wordcount example

2014-03-31 Thread Diana Carroll
Not sure what data you are sending in. You could try calling "lines.print()" instead which should just output everything that comes in on the stream. Just to test that your socket is receiving what you think you are sending. On Mon, Mar 31, 2014 at 12:18 PM, eric perler wrote: > Hello > > i ju

streaming: code to simulate a network socket data source

2014-03-28 Thread Diana Carroll
If you are learning about Spark Streaming, as I am, you've probably used netcat "nc" as mentioned in the Spark Streaming programming guide. I wanted something a little more useful, so I modified the ClickStreamGenerator code to make a very simple script that simply reads a file off disk and passes
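
For anyone wanting to do the same, a toy Scala sketch of that idea (not the actual modified ClickStreamGenerator; the port, file name, and delay are made up):

    import java.io.PrintWriter
    import java.net.ServerSocket
    import scala.io.Source

    // Listen on a port and, once a client (e.g. a Spark socketTextStream) connects,
    // replay the lines of a file over the socket with a short pause between lines.
    object FileOverSocket {
      def main(args: Array[String]): Unit = {
        val server = new ServerSocket(9999)
        val socket = server.accept()
        val out = new PrintWriter(socket.getOutputStream, true)
        for (line <- Source.fromFile("weblog.txt").getLines()) {
          out.println(line)
          Thread.sleep(200)                   // throttle so each batch has something to process
        }
        socket.close()
        server.close()
      }
    }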

Re: spark streaming and the spark shell

2014-03-28 Thread Diana Carroll
Thanks, Tathagata. This and your other reply on awaitTermination are very helpful. Diana On Thu, Mar 27, 2014 at 4:40 PM, Tathagata Das wrote: > Very good questions! Responses inline. > > TD > > On Thu, Mar 27, 2014 at 8:02 AM, Diana Carroll > wrote: > > I'm worki

spark streaming and the spark shell

2014-03-27 Thread Diana Carroll
I'm working with spark streaming using spark-shell, and hoping folks could answer a few questions I have. I'm doing WordCount on a socket stream: import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.Seconds var s
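
For reference, a hedged completion of that WordCount (the preview is truncated; the host, port, and batch interval are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // In spark-shell, build the StreamingContext from the existing SparkContext (sc).
    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()   // block until ssc.stop() is called or an error stops the streams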

spark streaming: what is awaitTermination()?

2014-03-27 Thread Diana Carroll
The API docs for ssc.awaitTermination say simply "Wait for the execution to stop. Any exceptions that occurs during the execution will be thrown in this thread." Can someone help me understand what this means? What causes execution to stop? Why do we need to wait for that to happen? I tried rem

Re: streaming questions

2014-03-26 Thread Diana Carroll
Thanks, Tathagata, very helpful. A follow-up question below... On Wed, Mar 26, 2014 at 2:27 PM, Tathagata Das wrote: > > > *Answer 3:* You can do something like > wordCounts.foreachRDD((rdd: RDD[...], time: Time) => { > if (rdd.take(1).size == 1) { > // There exists at least one element i
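
Filled out slightly, the pattern quoted above looks roughly like this (the output path is illustrative; rdd.take(1) is just a cheap emptiness test):

    wordCounts.foreachRDD((rdd, time) => {
      if (rdd.take(1).nonEmpty) {
        // This batch produced at least one element; only then do the expensive work.
        rdd.saveAsTextFile("counts-" + time.milliseconds)
      }
    })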

streaming questions

2014-03-26 Thread Diana Carroll
I'm trying to understand Spark streaming, hoping someone can help. I've kinda-sorta got a version of Word Count running, and it looks like this: import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ object StreamingWordCount { def m

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
which definitely refers to the spark install. On Monday, March 24, 2014, Nan Zhu wrote: > I found that I never read the document carefully and I never find that > Spark document is suggesting you to use Spark-distributed sbt.. > > Best, > > -- > Nan Zhu > > &g

Re: Writing RDDs to HDFS

2014-03-24 Thread Diana Carroll
Ongen: I don't know why your process is hanging, sorry. But I do know that the way saveAsTextFile works is that you give it a path to a directory, not a file. The "file" is saved in multiple parts, corresponding to the partitions. (part-0, part-1 etc.) (Presumably it does this because i
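
A small sketch of that behavior, plus the usual trick for getting one file when the data is small (assuming rdd is the RDD being written; paths are illustrative):

    rdd.saveAsTextFile("hdfs:///user/me/output")   // creates a directory containing one part file per partition

    // If a single output file is really needed and the data fits in one partition:
    rdd.coalesce(1).saveAsTextFile("hdfs:///user/me/output-single")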

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
> > In either case though, it's not a Spark-specific issue...Hopefully > some of all this helps. > > On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll > wrote: > > Yeah, that's exactly what I did. Unfortunately it doesn't work: > > > > $SPARK_HOME/s

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
h_to_spark/sbt > > To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile > or ??? depending on the system you are using. If you are on Windows, sorry, > I can't offer any help there ;) > > Ognen > > > On 3/24/14, 3:16 PM, Diana Carroll wrote: >

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
ll/ where you can put your multiple > .scala files which will then have to be a part of a package > com.github.dianacarroll (you can just put that as your first line in each > of these scala files). I am new to Java/Scala so this is how I do it. More > educated Java/Scala programmers may

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
t manually from http://www.scala-sbt.org/ On Mon, Mar 24, 2014 at 4:00 PM, Nan Zhu wrote: > Hi, Diana, > > See my inlined answer > > -- > Nan Zhu > > > On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote: > > Has anyone successfully followed the instruct

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
ing with scala 2.10 as the example shows). But for that > part of the guide, it's not any different than building a scala app. > > On Mon, Mar 24, 2014 at 3:44 PM, Diana Carroll > wrote: > > Has anyone successfully followed the instructions on the Quick Start > page of &

quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Has anyone successfully followed the instructions on the Quick Start page of the Spark home page to run a "standalone" Scala application? I can't, and I figure I must be missing something obvious! I'm trying to follow the instructions here as close to "word for word" as possible: http://spark.apa
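
For comparison, the standalone-app part of the guide boils down to roughly this (a sketch; the Spark/Scala versions and file contents are illustrative, with the build definition shown as comments above the app):

    // build.sbt (project root):
    //   name := "Simple Project"
    //   version := "1.0"
    //   scalaVersion := "2.10.4"
    //   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

    // src/main/scala/SimpleApp.scala:
    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Simple App").setMaster("local"))
        val logData = sc.textFile("README.md")
        println("Lines with 'a': " + logData.filter(_.contains("a")).count())
        sc.stop()
      }
    }

With that layout, sbt package (or sbt run, given the local master above) is all the guide really requires, and a system-wide sbt install works as well as any Spark-bundled one.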

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
; http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreaderis > one example. > > Matei > > On Mar 18, 2014, at 7:49 AM, Diana Carroll wrote: > > Well, if anyone is still following this, I've gotten the following code > working which in theory should

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
fashion, without every > materializing the whole tree. > http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreaderis > one example. > > Matei > > On Mar 18, 2014, at 7:49 AM, Diana Carroll wrote: > > Well, if anyone is still following this, I'

Re: example of non-line oriented input data?

2014-03-18 Thread Diana Carroll
'Panama'}] BUT I'm a bit concerned about the construction of the string "s". How big can my file be before converting it to a string becomes problematic? On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll wrote: > Thanks, Matei. > > In the context of this discu

Re: example of non-line oriented input data?

2014-03-18 Thread Diana Carroll
ut > sum(x) inside a list above. > > In practice mapPartitions is most useful if you want to share some data or > work across the elements. For example maybe you want to load a lookup table > once from an external file and then check each element in it, or sum up a > bunch of eleme

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
]# join back each file into a single > string > > The glom() method groups all the elements of each partition of an RDD into > an array, giving you an RDD of arrays of objects. If your input is small > files, you always have one partition per file. > > There's also m

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
an also use mapPartitions on it to group together the lines because each > input file will end up being a single dataset partition (or map task). This > will let you concatenate the lines in each file and parse them as one XML > object. > > Matei > > On Mar 17, 2014, at 9:52 AM, D

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
m = logData.map(lambda s: (json.loads(s)['key'], > len(concatenate_paragraphs(json.loads(s)['paragraphs'] > > > tm = tm.reduceByKey(lambda _, x: _ + x) > > > op = tm.collect() > for key, num_words in op: > > print 'state: %s, num_words:

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
wouldn't have to parse/deserialize the massive document; > it would just have to track open/closed tags/braces to know when to insert > a newline. > > Then you'd just open the line-delimited result and deserialize the > individual objects/nodes with map(). > > Nick >

example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
Has anyone got a working example of a Spark application that analyzes data in a non-line-oriented format, such as XML or JSON? I'd like to do this without re-inventing the wheel...anyone care to share? Thanks! Diana
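
One approach from this thread, summarized as a Scala sketch (the thread's code was in Python; the directory path and one-document-per-file layout are assumptions): when every input file is a single small XML document, each file ends up in its own partition, so mapPartitions can reassemble and parse it as a unit.

    import scala.xml.XML

    // data/xml/ is assumed to hold many small files, each containing exactly one XML document.
    val docs = sc.textFile("data/xml")
      .mapPartitions { lines =>
        val whole = lines.mkString("\n")       // glue the lines of one file back together
        if (whole.trim.isEmpty) Iterator.empty
        else Iterator(XML.loadString(whole))   // parse the whole file as one XML element
      }

Later Spark releases also added sc.wholeTextFiles, which returns (fileName, fileContent) pairs directly and avoids the per-partition reassembly.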

sample data for pagerank?

2014-03-13 Thread Diana Carroll
I'd like to play around with the PageRank example included with Spark, but I can't find any included sample data to work with. Am I missing it? Anyone got a sample file to share? Thanks, Diana

building Spark docs

2014-03-12 Thread Diana Carroll
Hi all. I needed to build the Spark docs. The basic instructions to do this are in spark/docs/README.md but it took me quite a bit of playing around to actually get it working on my system. In case this is useful to anyone else, thought I'd post. This is what I did to build the docs on a CentOS