I'm just getting started with Spark SQL and DataFrames in 1.3.0.
I notice that the Spark API docs show a different syntax for referencing
columns in a DataFrame than the Spark SQL Programming Guide does.
For instance, the API docs for the select method show this:
df.select($"colA", $"colB")
Whereas the
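For reference, as far as I can tell the two notations are just different ways of building the same Column objects. A minimal Scala sketch (assuming a SQLContext named sqlContext is in scope and df has these columns):

import org.apache.spark.sql.functions.col
import sqlContext.implicits._   // brings the $"..." syntax into scope

// Three equivalent ways of referring to the same columns:
df.select($"colA", $"colB")
df.select(df("colA"), df("colB"))
df.select(col("colA"), col("colB"))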
I'm hoping someone can clear up some confusion for me.
When I view the Spark 1.0 docs online (http://spark.apache.org/docs/1.0.0/)
they are different than the docs which are packaged with the Spark 1.0.0
download (spark-1.0.0.tgz).
In particular, in the online docs, there's a single merged Spark
Hey all, trying to set up a pretty simple streaming app and getting some
weird behavior.
First, a non-streaming job that works fine: I'm trying to pull out lines
of a log file that match a regex, for which I've set up a function:
def getRequestDoc(s: String): String = {
  // return the first KBDOC-nnn id found in the line, or "" if there isn't one
  "KBDOC-[0-9]*".r.findFirstIn(s).getOrElse("")
}
with external storage systems.
>
>
> Do you really want to log something for each element of your RDD?
>
> Nick
>
>
> On Tue, May 6, 2014 at 3:31 PM, Diana Carroll wrote:
>
>> What should I do if I want to log something as part of a task?
>>
>> This is what
What should I do if I want to log something as part of a task?
This is what I tried. To set up a logger, I followed the advice here:
http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
import logging

logger = logging.getLogger("py4j")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())
, Ethan Jewett wrote:
>>>
>>>> I believe Spark caches RDDs it has memory for regardless of whether you
>>>> actually call the 'cache' method on the RDD. The 'cache' method just tips
>>>> off Spark that the RDD should have higher priority. At le
Anyone have any guidance on using a broadcast variable to ship data to
workers vs. an RDD?
Like, say I'm joining web logs in an RDD with user account data. I could
keep the account data in an RDD or if it's "small", a broadcast variable
instead. How small is small? Small enough that I know it c
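To make the comparison concrete, here's roughly what I mean (a sketch only; the Account/LogLine types and all the names are made up):

// Option 1: plain RDD-to-RDD join -- shuffles both sides.
// logsRdd: RDD[(String, LogLine)], accountsRdd: RDD[(String, Account)]
val joined = logsRdd.join(accountsRdd)

// Option 2: broadcast the "small" side and look it up map-side.
val accountsMap = accountsRdd.collectAsMap()   // pulled back to the driver first
val accountsBc = sc.broadcast(accountsMap)
val joined2 = logsRdd.map { case (userId, line) =>
  (userId, (line, accountsBc.value.get(userId)))
}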
I'm guessing your shell stopping when it attempts to connect to the RM is
not related to that warning. You'll get that message out of the box from
Spark if you don't have HADOOP_HOME set correctly. I'm using CDH 5.0
installed in default locations, and got rid of the warning by setting
HADOOP_HOME
e generic...
>
>
> On Mon, Apr 28, 2014 at 10:47 AM, Diana Carroll wrote:
>
>> Thanks, Deb. But I'm looking at org.apache.spark.examples.SparkALS,
>> which is not in the mllib examples, and does not take any file parameters.
>>
>> I don't see the clas
osv solve seems like
> that's the bottleneck step...but making double to float is one way to scale
> it even further...
>
> Thanks.
> Deb
>
>
>
> On Mon, Apr 28, 2014 at 10:30 AM, Diana Carroll wrote:
>
>> Hi everyone. I'm trying to run some of
Hi everyone. I'm trying to run some of the Spark example code, and most of
it appears to be undocumented (unless I'm missing something). Can someone
help me out?
I'm particularly interested in running SparkALS, which wants parameters:
M U F iter slices
What are these variables? They appear to
t clears dependencies. You might need checkpoint to cut a
> long lineage in iterative algorithms. -Xiangrui
>
> On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll
> wrote:
> > I'm trying to understand when I would want to checkpoint an RDD rather
> than
> > just persist to
I'm trying to understand when I would want to checkpoint an RDD rather than
just persist to disk.
Every reference I can find to checkpoint related to Spark Streaming. But
the method is defined in the core Spark library, not Streaming.
Does it exist solely for streaming, or are there circumstance
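For concreteness, the kind of iterative pattern I have in mind (a sketch; the checkpoint directory, input, and per-iteration update are all placeholders):

import org.apache.spark.rdd.RDD

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var ranks: RDD[(String, Double)] = sc.textFile("links.txt").map(line => (line, 1.0))
for (i <- 1 to 30) {
  ranks = ranks.mapValues(_ * 0.85 + 0.15)   // stand-in for the real update step
  ranks.persist()
  if (i % 10 == 0) {
    ranks.checkpoint()   // truncates the lineage so it doesn't grow without bound
  }
  ranks.count()          // an action forces evaluation (and the checkpoint)
}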
I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb
Given the size, and that it is a single file, I assumed it would only be in
a single partition. But when I cache it, I can see in the Spark App UI
that it actually splits it into two partitions:
[image: Spark App UI screenshot showing the cached file split into two partitions]
Is this cor
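If I'm reading the API right, textFile takes a minPartitions argument that defaults to 2 (or the default parallelism, whichever is smaller), which would explain the split even for a tiny file. A sketch of checking and overriding it (file name is a placeholder):

val tiny = sc.textFile("tinyfile.txt")
println(tiny.partitions.length)              // 2 with the default minPartitions

val onePart = sc.textFile("tinyfile.txt", 1) // ask for a single partition
println(onePart.partitions.length)           // 1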
I'm looking at the Tuning Guide suggestion to use Kryo instead of default
serialization. My questions:
Does pyspark use Java serialization by default, as Scala spark does? If
so, then...
can I use Kryo with pyspark instead? The instructions say I should
register my classes with the Kryo Seriali
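For what it's worth, the registration pattern the Tuning Guide shows for Scala looks roughly like this (a sketch; MyClass is a stand-in for a real record type, and I still don't know how this maps onto pyspark):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class MyClass(id: Int, name: String)   // stand-in for whatever gets shipped around

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")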
Not sure what data you are sending in. You could try calling
"lines.print()" instead which should just output everything that comes in
on the stream. Just to test that your socket is receiving what you think
you are sending.
On Mon, Mar 31, 2014 at 12:18 PM, eric perler wrote:
> Hello
>
> i ju
If you are learning about Spark Streaming, as I am, you've probably used
netcat "nc" as mentioned in the spark streaming programming guide. I
wanted something a little more useful, so I modified the
ClickStreamGenerator code to make a very simple script that simply reads a
file off disk and passes
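The rough shape of it, in case it's useful (a sketch, not the exact script; the file name and port come from the command line, and the sleep is just to throttle the replay):

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

object FileStreamer {
  def main(args: Array[String]) {
    val Array(fileName, port) = args
    val server = new ServerSocket(port.toInt)
    while (true) {
      // Wait for a client (e.g. a socketTextStream receiver) to connect,
      // then replay the file to it line by line.
      val socket = server.accept()
      val out = new PrintWriter(socket.getOutputStream, true)
      for (line <- Source.fromFile(fileName).getLines()) {
        out.println(line)
        Thread.sleep(100)
      }
      socket.close()
    }
  }
}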
Thanks, Tathagata. This and your other reply on awaitTermination are very
helpful.
Diana
On Thu, Mar 27, 2014 at 4:40 PM, Tathagata Das
wrote:
> Very good questions! Responses inline.
>
> TD
>
> On Thu, Mar 27, 2014 at 8:02 AM, Diana Carroll
> wrote:
> > I'm worki
I'm working with spark streaming using spark-shell, and hoping folks could
answer a few questions I have.
I'm doing WordCount on a socket stream:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
var s
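The whole thing is roughly this (a sketch of what I run in spark-shell; the host and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 4444)
val wordCounts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()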
The API docs for ssc.awaitTermination say simply "Wait for the execution to
stop. Any exceptions that occurs during the execution will be thrown in
this thread."
Can someone help me understand what this means? What causes execution to
stop? Why do we need to wait for that to happen?
I tried rem
Thanks, Tathagata, very helpful. A follow-up question below...
On Wed, Mar 26, 2014 at 2:27 PM, Tathagata Das
wrote:
>
>
> *Answer 3:*You can do something like
> wordCounts.foreachRDD((rdd: RDD[...], time: Time) => {
>if (rdd.take(1).size == 1) {
> // There exists at least one element i
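Spelled out, that pattern comes to roughly this (a sketch; wordCounts is the DStream from the earlier word count and the output path is a placeholder):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
  if (rdd.take(1).size == 1) {
    // This batch has at least one element, so it's worth writing out.
    rdd.saveAsTextFile("output/wordcounts-" + time.milliseconds)
  }
})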
I'm trying to understand Spark streaming, hoping someone can help.
I've kinda-sorta got a version of Word Count running, and it looks like
this:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
object StreamingWordCount {
def m
which definitely refers
to the spark install.
On Monday, March 24, 2014, Nan Zhu wrote:
> I found that I never read the document carefully and I never find that
> Spark document is suggesting you to use Spark-distributed sbt..
>
> Best,
>
> --
> Nan Zhu
>
>
&g
Ongen:
I don't know why your process is hanging, sorry. But I do know that the
way saveAsTextFile works is that you give it a path to a directory, not a
file. The "file" is saved in multiple parts, corresponding to the
partitions. (part-0, part-1 etc.)
(Presumably it does this because i
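If a single output file is really needed, the usual workaround is to collapse to one partition before saving (a sketch; paths are placeholders, and it forces everything through one task, so only sensible for small results):

val data = sc.textFile("input.txt").map(_.toUpperCase)

// Normal save: a directory of part-00000, part-00001, ... files.
data.saveAsTextFile("output_dir")

// Collapse to one partition first to get a single part file.
data.coalesce(1).saveAsTextFile("output_single")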
>
> In either case though, it's not a Spark-specific issue...Hopefully
> some of all this helps.
>
> On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll
> wrote:
> > Yeah, that's exactly what I did. Unfortunately it doesn't work:
> >
> > $SPARK_HOME/s
h_to_spark/sbt
>
> To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile
> or ??? depending on the system you are using. If you are on Windows, sorry,
> I can't offer any help there ;)
>
> Ognen
>
>
> On 3/24/14, 3:16 PM, Diana Carroll wrote:
>
ll/ where you can put your multiple
> .scala files which will then have to be a part of a package
> com.github.dianacarroll (you can just put that as your first line in each
> of these scala files). I am new to Java/Scala so this is how I do it. More
> educated Java/Scala programmers may
t manually from http://www.scala-sbt.org/
On Mon, Mar 24, 2014 at 4:00 PM, Nan Zhu wrote:
> Hi, Diana,
>
> See my inlined answer
>
> --
> Nan Zhu
>
>
> On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:
>
> Has anyone successfully followed the instruct
ing with scala 2.10 as the example shows). But for that
> part of the guide, it's not any different than building a scala app.
>
> On Mon, Mar 24, 2014 at 3:44 PM, Diana Carroll
> wrote:
> > Has anyone successfully followed the instructions on the Quick Start
> page of
&
Has anyone successfully followed the instructions on the Quick Start page
of the Spark home page to run a "standalone" Scala application? I can't,
and I figure I must be missing something obvious!
I'm trying to follow the instructions here as close to "word for word" as
possible:
http://spark.apa
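For reference, this is the shape of what I'm building (a sketch along the lines of the Quick Start; the input path and master are placeholders), together with an sbt build file that adds spark-core as a dependency, then sbt package:

/* SimpleApp.scala */
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local")   // or whatever master you're targeting
    val sc = new SparkContext(conf)
    val logData = sc.textFile("README.md").cache()
    val numSparks = logData.filter(_.contains("Spark")).count()
    println("Lines with Spark: " + numSparks)
  }
}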
; http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreader is
> one example.
>
> Matei
>
> On Mar 18, 2014, at 7:49 AM, Diana Carroll wrote:
>
> Well, if anyone is still following this, I've gotten the following code
> working which in theory should
fashion, without ever
> materializing the whole tree.
> http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreader is
> one example.
>
> Matei
>
> On Mar 18, 2014, at 7:49 AM, Diana Carroll wrote:
>
> Well, if anyone is still following this, I'
'Panama'}]
BUT I'm a bit concerned about the construction of the string "s". How big
can my file be before converting it to a string becomes problematic?
On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll wrote:
> Thanks, Matei.
>
> In the context of this discu
ut
> sum(x) inside a list above.
>
> In practice mapPartitions is most useful if you want to share some data or
> work across the elements. For example maybe you want to load a lookup table
> once from an external file and then check each element in it, or sum up a
> bunch of eleme
]# join back each file into a single
> string
>
> The glom() method groups all the elements of each partition of an RDD into
> an array, giving you an RDD of arrays of objects. If your input is small
> files, you always have one partition per file.
>
> There's also m
an also use mapPartitions on it to group together the lines because each
> input file will end up being a single dataset partition (or map task). This
> will let you concatenate the lines in each file and parse them as one XML
> object.
>
> Matei
>
> On Mar 17, 2014, at 9:52 AM, D
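Concretely, the mapPartitions approach comes out something like this (a Scala sketch, with a placeholder path and scala.xml standing in for whatever parser is actually used):

// With many small input files there is one partition per file, so gluing a
// partition's lines back together recovers each document as one string.
val files = sc.textFile("xmldir/*.xml")
val docs = files.mapPartitions(lines => Iterator(lines.mkString("\n")))
val titles = docs.map(doc => scala.xml.XML.loadString(doc) \\ "title")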
tm = logData.map(lambda s: (json.loads(s)['key'],
> len(concatenate_paragraphs(json.loads(s)['paragraphs']))))
>
>
> tm = tm.reduceByKey(lambda a, b: a + b)
>
>
> op = tm.collect()
> for key, num_words in op:
>
>     print 'state: %s, num_words: %s' % (key, num_words)
wouldn't have to parse/deserialize the massive document;
> it would just have to track open/closed tags/braces to know when to insert
> a newline.
>
> Then you'd just open the line-delimited result and deserialize the
> individual objects/nodes with map().
>
> Nick
>
Has anyone got a working example of a Spark application that analyzes data
in a non-line-oriented format, such as XML or JSON? I'd like to do this
without re-inventing the wheel...anyone care to share? Thanks!
Diana
I'd like to play around with the Page Rank example included with Spark but
I can't find that any sample data to work with is included. Am I missing
it? Anyone got a sample file to share?
Thanks,
Diana
Hi all. I needed to build the Spark docs. The basic instructions to do
this are in spark/docs/README.md but it took me quite a bit of playing
around to actually get it working on my system.
In case this is useful to anyone else, thought I'd post. This is what I
did to build the docs on a CentOS