how to check whether spill over to hard drive happened or not

2017-05-06 Thread Zeming Yu
hi,

I'm running PySpark on my local PC in standalone mode.

After running a PySpark window function on a dataframe, I did a groupby query
on the dataframe.
The groupby query turned out to be very slow (10+ minutes on a small data
set).
I then cached the dataframe and re-ran the same query. The query remained
very slow.

I could also hear noise from the hard drive - I assume the PC was busy
reading from and writing to the hard drive. Is this an indication that the
dataframe has spilled over to the hard drive?

What's the best method for monitoring what's happening? And how can I prevent
this from happening?
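
I'm guessing the Spark UI (http://localhost:4040 for a local session) would
show this. A minimal sketch of what I could try - the input path and column
name below are made up:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("spill-check")
         .getOrCreate())

df = spark.read.parquet("flights.parquet")  # made-up input path

# Persist with an explicit storage level; the Storage tab of the Spark UI
# then shows how much of the DataFrame sits in memory versus on disk.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # materialize the cache

# For the groupby itself, the Stages tab reports "Shuffle Spill (Memory)"
# and "Shuffle Spill (Disk)" per stage whenever spilling occurs.
df.groupBy("origin").count().show()  # "origin" is a made-up column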

Thanks!


Re: take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
OK. I've worked it out.

from pyspark.sql.functions import col

df.withColumn('diff', col('A') - col('B'))
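
For comparison, the UDF version (a rough sketch) is generally slower, since
each row round-trips through the Python worker:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# Same result via a Python UDF; the built-in column expression above avoids
# this per-row serialization and is normally faster.
diff_udf = udf(lambda a, b: a - b, DoubleType())
df.withColumn('diff', diff_udf(col('A'), col('B')))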

On Sun, May 7, 2017 at 11:49 AM, Zeming Yu  wrote:

> Say I have the following dataframe with two numeric columns A and B,
> what's the best way to add a column showing the difference between the two
> columns?
>
> +-+--+
> |A| B|
> +-+--+
> |786.31999|786.12|
> |   786.12|786.12|
> |   786.42|786.12|
> |   786.72|786.12|
> |   786.92|786.12|
> |   786.92|786.12|
> |   786.72|786.12|
> |   786.72|786.12|
> |   827.72|786.02|
> |   827.72|786.02|
> +-+--+
>
>
> I could probably figure out how to do this via a UDF, but is a UDF
> generally slower?
>
>
> Thanks!
>
>


take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
Say I have the following dataframe with two numeric columns A and B, what's
the best way to add a column showing the difference between the two columns?

+-+--+
|A| B|
+-+--+
|786.31999|786.12|
|   786.12|786.12|
|   786.42|786.12|
|   786.72|786.12|
|   786.92|786.12|
|   786.92|786.12|
|   786.72|786.12|
|   786.72|786.12|
|   827.72|786.02|
|   827.72|786.02|
+-+--+


I could probably figure out how to do this via a UDF, but is a UDF generally
slower?


Thanks!


Re: Join streams Apache Spark

2017-05-06 Thread tencas
There is a Spark Streaming example of the classic word count using the
Apache Kafka connector:

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

(maybe you already know)

The point is: what are the benefits of using Kafka instead of a lighter
solution like yours? Maybe somebody here could help us. Anyway, when I try it
out, I'll give you feedback.

On the other hand, have you got, by any chance, the same script written in
Scala, Python, or Java?
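
For the Kafka example itself, a rough Python equivalent would be something
like this (a sketch, assuming the Kafka 0.8-style receiver and a local
ZooKeeper; the topic and group names are made up):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batches

# Receiver-based Kafka stream via ZooKeeper; topic, group, and address
# are assumptions.
kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-group",
                              {"sensor-topic": 1})
lines = kvs.map(lambda kv: kv[1])  # drop the message key

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()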









Re: Join streams Apache Spark

2017-05-06 Thread saulshanabrook
Would love to hear if you try it out. I was also considering that. I
recently changed to using the file-based streaming input. I made another Go
script that lets me connect over TCP and writes each newline it receives to a
new file in a folder. Then Spark can read the files from that folder.
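
On the Spark side, reading that folder is roughly this (a sketch; the folder
path is made up):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "FileStreamExample")
ssc = StreamingContext(sc, 5)  # 5-second batches

# Each new file dropped into the folder becomes part of the next batch;
# "incoming/" is a made-up path.
lines = ssc.textFileStream("incoming/")
lines.pprint()

ssc.start()
ssc.awaitTermination()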

On Sat, May 6, 2017 at 2:38 PM tencas [via Apache Spark User List] <
ml+s1001560n2865...@n3.nabble.com> wrote:

> Thanks @saulshanabrook, I'll have a look at it.
>
> I think Apache Kafka could be an alternative solution, but I haven't
> checked it yet.
>





Re: Join streams Apache Spark

2017-05-06 Thread tencas
Thanks @saulshanabrook, I'll have a look at it.

I think Apache Kafka could be an alternative solution, but I haven't checked
it yet.







Re: Join streams Apache Spark

2017-05-06 Thread saulshanabrook
I wrote a server in Go that allows many TCP connections for incoming data on
one port, writing each line to the client listening on another port. One
environment variable sets the port clients should connect to in order to send
data to Spark (the sensors in your case), and another sets the port that
Spark should connect to in order to listen for data.
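
In Python terms, the relay does roughly the following (a sketch, assuming
made-up ports and a single downstream Spark client):

import socket
import threading

IN_PORT = 9000   # sensors connect here and send lines (made-up port)
OUT_PORT = 9001  # Spark's socketTextStream connects here (made-up port)

# Wait for the single downstream client (Spark) before accepting sensors.
out_srv = socket.socket()
out_srv.bind(("", OUT_PORT))
out_srv.listen(1)
spark_conn, _ = out_srv.accept()

def forward(conn):
    # Relay every line this sensor sends on to the Spark connection.
    with conn, conn.makefile("r") as f:
        for line in f:
            spark_conn.sendall(line.encode())

in_srv = socket.socket()
in_srv.bind(("", IN_PORT))
in_srv.listen(5)
while True:
    conn, _ = in_srv.accept()
    threading.Thread(target=forward, args=(conn,), daemon=True).start()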

If anyone knows a simpler way of doing this with some existing software, I
would love to know about it.

If you are interested in this code, I would be happy to clean it up and
release it with some documentation.







Re: Structured Streaming + initialState

2017-05-06 Thread Patrick McGloin
The initial state is stored in a Parquet file, which is effectively a static
Dataset. I've seen there is a Jira open for full joins on streaming plus
static Datasets for Structured Streaming (SPARK-20002). So once that Jira is
completed, this would be possible.
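
As a sketch of the stream-to-static join that does run today (the paths,
schema reuse, and join key are assumptions on my side):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Static initial state from Parquet (made-up path).
initial = spark.read.parquet("initial_state.parquet")

# Streaming side; file sources need an explicit schema, reused here from
# the static side purely for illustration.
events = spark.readStream.schema(initial.schema).parquet("events/")

# A left-outer stream-to-static join works; the full join case is what
# SPARK-20002 tracks.
joined = events.join(initial, on="key", how="left_outer")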

For mapGroupsWithState it would be great if you could provide an
initialState Dataset with Key -> State initial values.

On 5 May 2017 at 23:49, Tathagata Das  wrote:

> Can you explain how your initial state is stored? Is it in a file, or in
> a database?
> If it's in a database, then when you initialize the GroupState, you can
> fetch it from the database.
>
> On Fri, May 5, 2017 at 7:35 AM, Patrick McGloin wrote:
>
>> Hi all,
>>
>> With Spark Structured Streaming, is there a possibility to set an
>> "initial state" for a query?
>>
>> Joining a streaming Dataset with a static Dataset does not support full
>> joins.
>>
>> Using mapGroupsWithState to create a GroupState does not support an
>> initialState (as the Spark Streaming StateSpec did).
>>
>> Are there any plans to add support for initial states?  Or is there
>> already a way to do so?
>>
>> Best regards,
>> Patrick
>>
>
>