Re: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-17 Thread Ofir Manor
Just to add - the latest version is 0.8.3, it seems to support 3.3:
"Support Spark 3.3 / Scala 2.12 , Spark 3.4 / Scala 2.12 and Scala 2.13, Spark 
3.5 / Scala 2.12 and Scala 2.13"
Releases · graphframes/graphframes (github.com): https://github.com/graphframes/graphframes/releases
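
For anyone wanting to sanity-check a build against Spark 3.3 (e.g. AWS Glue 4), a minimal smoke-test sketch - the package coordinate in the comment is an assumption on my side, please verify it against the releases page above:

// The package is usually attached at submit time; the exact coordinate below is an
// assumption - check it against the releases page, e.g.:
//   spark-submit --packages graphframes:graphframes:0.8.3-spark3.3-s_2.12 ...
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object GraphFramesSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graphframes-smoke-test").getOrCreate()
    import spark.implicits._
    // Tiny smoke test: a two-vertex graph with a single edge
    val vertices = Seq(("a", "Alice"), ("b", "Bob")).toDF("id", "name")
    val edges = Seq(("a", "b", "follows")).toDF("src", "dst", "relationship")
    val g = GraphFrame(vertices, edges)
    println(g.edges.count())
    spark.stop()
  }
}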
   Ofir

From: Russell Jurney 
Sent: Friday, March 15, 2024 11:43 PM
To: brad.boil...@fcc-fac.ca.invalid 
Cc: user@spark.apache.org 
Subject: [External] Re: [GraphFrames Spark Package]: Why is there not a 
distribution for Spark 3.3?

There is an implementation for Spark 3, but GraphFrames isn't released often 
enough to match every point version. It supports Spark 3.4. Try it - it will 
probably work. https://spark-packages.org/package/graphframes/graphframes

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com 
LI FB 
datasyndrome.com Book a time on 
Calendly


On Fri, Jan 12, 2024 at 7:55 AM Boileau, Brad  
wrote:

Hello,



I was hoping to use a distribution of GraphFrames for AWS Glue 4, which runs Spark 3.3, but I could not find a distribution for Spark 3.3 at this location:



https://spark-packages.org/package/graphframes/graphframes



Do you have any advice on the best compatible version to use for Spark 3.3?



Sincerely,



Brad Boileau

Senior Product Architect / Architecte produit sénior
Farm Credit Canada | Financement agricole Canada
1820 Hamilton Street / 1820, rue Hamilton

Regina SK  S4P 2B8

Tel/Tél. : 306-359, C/M: 306-737-8900

fcc.ca / fac.ca

FCC social media / 
Médias sociaux FAC





Re: tuning - Spark data serialization for cache() ?

2017-08-07 Thread Ofir Manor
Thanks a lot for the quick pointer!
So, is the advice I linked to in the official Spark 2.2 documentation
misleading? Are you saying that Spark 2.2 does not use Java
serialization? And is the tip to switch to Kryo also outdated?
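
(For reference, a minimal sketch - assuming Spark 2.x - of the settings that govern the DataFrame in-memory columnar cache, which is a separate layer from spark.serializer:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-tuning-sketch").getOrCreate()
// These govern the SQL in-memory columnar cache, which is separate from the Java/Kryo
// serializers used for shuffles (the values shown are just the documented defaults):
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")  // per-column compression
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")  // rows per column batch

// Placeholder dataframe; cache, materialize, then check its size on the UI's Storage tab
val df = spark.range(0, 1000000L).selectExpr("id", "id % 100 as bucket")
df.cache()
df.count()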

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Aug 7, 2017 at 8:47 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
wrote:

> For Dataframe (and Dataset), cache() already uses fast
> serialization/deserialization with data compression schemes.
>
> We already identified some performance issues regarding cache(). We are
> working for alleviating these issues in https://issues.apache.org/
> jira/browse/SPARK-14098.
> We expect that these PRs will be integrated into Spark 2.3.
>
> Kazuaki Ishizaki
>
>
>
> From: Ofir Manor <ofir.ma...@equalum.io>
> To: user <user@spark.apache.org>
> Date: 2017/08/08 02:04
> Subject: tuning - Spark data serialization for cache() ?
> --
>
>
>
> Hi,
> I'm using Spark 2.2, and have a big batch job, using dataframes (with
> built-in, basic types). It references the same intermediate dataframe
> multiple times, so I wanted to try to cache() that and see if it helps,
> both in memory footprint and performance.
>
> Now, the Spark 2.2 tuning page (
> *http://spark.apache.org/docs/latest/tuning.html*
> <http://spark.apache.org/docs/latest/tuning.html>) clearly says:
> 1. The default Spark serialization is Java serialization.
> 2. It is recommended to switch to Kryo serialization.
> 3. "Since Spark 2.0.0, we internally use Kryo serializer when shuffling
> RDDs with simple types, arrays of simple types, or string type".
>
> Now, I remember that at the 2.0 launch there was discussion of a third
> serialization format that is much more performant and compact (Encoders?),
> but it is not referenced in the tuning guide and its Scala doc is not very
> clear to me. Specifically, Databricks shared some graphs etc. of how much
> better it is than Kryo and Java serialization - see Encoders here:
>
> *https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html*
> <https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html>
>
> So, is that relevant to cache()? If so, how can I enable it - and is it
> for MEMORY_AND_DISK_ONLY or MEMORY_AND_DISK_SER?
>
> I tried to play with some other variations, like enabling Kryo per the
> tuning guide instructions, but didn't see any impact on the cached
> dataframe size (same tens of GBs in the UI). So any tips around that?
>
> Thanks.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: *+972-54-7801286* <%2B972-54-7801286> | Email:
> *ofir.ma...@equalum.io* <ofir.ma...@equalum.io>
>
>
>


tuning - Spark data serialization for cache() ?

2017-08-07 Thread Ofir Manor
Hi,
I'm using Spark 2.2, and have a big batch job, using dataframes (with
built-in, basic types). It references the same intermediate dataframe
multiple times, so I wanted to try to cache() that and see if it helps,
both in memory footprint and performance.

Now, the Spark 2.2 tuning page (
http://spark.apache.org/docs/latest/tuning.html) clearly says:
1. The default Spark serialization is Java serialization.
2. It is recommended to switch to Kryo serialization (sketched just below).
3. "Since Spark 2.0.0, we internally use Kryo serializer when shuffling
RDDs with simple types, arrays of simple types, or string type".
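
As a reference, the switch the guide describes is just configuration - a minimal sketch, assuming Spark 2.x (the class name in the commented-out line is a placeholder):

import org.apache.spark.sql.SparkSession

// Minimal sketch of the Kryo switch from the tuning guide
val spark = SparkSession.builder()
  .appName("kryo-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // .config("spark.kryo.classesToRegister", "com.example.MyClass")  // optional explicit registration
  .getOrCreate()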

Now, I remember that at the 2.0 launch there was discussion of a third
serialization format that is much more performant and compact (Encoders?),
but it is not referenced in the tuning guide and its Scala doc is not very
clear to me. Specifically, Databricks shared some graphs etc. of how much
better it is than Kryo and Java serialization - see Encoders here:
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html

So, is that relevant to cache()? If so, how can I enable it - and is
it for MEMORY_AND_DISK_ONLY
or MEMORY_AND_DISK_SER?
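
For concreteness, a minimal sketch of the two variants in question (the standard constants are MEMORY_AND_DISK and MEMORY_AND_DISK_SER, and Dataset.cache() defaults to MEMORY_AND_DISK in Spark 2.x):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("storage-level-sketch").getOrCreate()
// Placeholder dataframe standing in for the intermediate result
val df = spark.range(0, 1000000L).selectExpr("id", "id % 100 as bucket")

// Variant 1: plain cache() - for Datasets this defaults to MEMORY_AND_DISK
df.cache()
df.count()
df.unpersist()

// Variant 2: explicitly serialized storage level
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()
// Compare the two sizes on the Storage tab of the UI; since the SQL cache already
// uses its own columnar format, the difference may be small.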

I tried to play with some other variations, like enabling Kryo per the
tuning guide instructions, but didn't see any impact on the cached
dataframe size (same tens of GBs in the UI). So any tips around that?

Thanks.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


Re: Does spark 2.1.0 structured streaming support jdbc sink?

2017-04-10 Thread Ofir Manor
Also check SPARK-19478 <https://issues.apache.org/jira/browse/SPARK-19478> -
JDBC sink (seems to be waiting for a review)

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
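
For anyone landing on this thread later, a minimal sketch of the ForeachWriter workaround discussed below (Spark 2.1-era API; the URL, credentials, table and column layout are all placeholders):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Minimal sketch of a JDBC sink built on ForeachWriter
class JdbcSink(url: String, user: String, pass: String) extends ForeachWriter[Row] {
  var conn: Connection = _
  var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url, user, pass)
    stmt = conn.prepareStatement("INSERT INTO agg_table (grp, cnt) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    stmt.setString(1, row.getString(0))
    stmt.setLong(2, row.getLong(1))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

// Usage, given some aggregated streaming DataFrame:
//   aggDF.writeStream.outputMode("complete").foreach(new JdbcSink(url, user, pass)).start()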

On Mon, Apr 10, 2017 at 10:10 AM, Hemanth Gudela <hemanth.gud...@qvantel.com
> wrote:

> Many thanks Silvio for the link. That’s exactly what I’m looking for. ☺
>
> However there is no mentioning of checkpoint support for custom
> “ForeachWriter” in structured streaming. I’m going to test that now.
>
>
>
> Good question Gary, this is the mentioning in the link
> <https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html>
> .
>
> Often times we want to be able to write output of streams to external
> databases such as MySQL. At the time of writing, the Structured Streaming
> API does not support external databases as sinks; however, when it does,
> the API option will be as simple as .format("jdbc").start("jdbc:mysql/..").
>
>
> In the meantime, we can use the foreach sink to accomplish this. Let’s
> create a custom JDBC Sink that extends *ForeachWriter* and implements its
> methods.
>
>
>
> I’m not sure though if jdbc sink feature will be available in upcoming
> spark (2.2.0?) version or not.
>
> It would be good to know if someone has information about it.
>
>
>
> Thanks,
>
> Hemanth
>
>
>
> *From: *"lucas.g...@gmail.com" <lucas.g...@gmail.com>
> *Date: *Monday, 10 April 2017 at 8.24
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: Does spark 2.1.0 structured streaming support jdbc sink?
>
>
>
> Interesting, does anyone know if we'll be seeing the JDBC sinks in
> upcoming releases?
>
>
>
> Thanks!
>
>
>
> Gary Lucas
>
>
>
> On 9 April 2017 at 13:52, Silvio Fiorito <silvio.fior...@granturing.com>
> wrote:
>
> JDBC sink is not in 2.1. You can see here for an example implementation
> using the ForEachWriter sink instead: https://databricks.com/blog/
> 2017/04/04/real-time-end-to-end-integration-with-apache-
> kafka-in-apache-sparks-structured-streaming.html
>
>
>
>
>
> *From: *Hemanth Gudela <hemanth.gud...@qvantel.com>
> *Date: *Sunday, April 9, 2017 at 4:30 PM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Does spark 2.1.0 structured streaming support jdbc sink?
>
>
>
> Hello Everyone,
>
> I am new to Spark, especially spark streaming.
>
>
>
> I am trying to read an input stream from Kafka, perform windowed
> aggregations in spark using structured streaming, and finally write
> aggregates to a sink.
>
> -  MySQL as an output sink doesn’t seem to be an option, because
> this block of code throws an error
>
> streamingDF.writeStream.format("jdbc").start("jdbc:mysql…”)
>
> java.lang.UnsupportedOperationException: Data source jdbc does not
> support streamed writing
>
> This is strange because, this
> <http://rxin.github.io/talks/2016-02-18_spark_summit_streaming.pdf>
> document shows that jdbc is supported as an output sink!
>
>
>
> -  Parquet doesn’t seem to be an option, because it doesn’t
> support “complete” output mode, but only “append”. As I’m performing
> window aggregations in spark streaming, the output mode has to be
> “complete” and cannot be “append”.
>
>
>
> -  Memory and console sinks are good for debugging, but are not
> suitable for production jobs.
>
>
>
> So, please correct me if I’m missing something in my code to enable jdbc
> output sink.
>
> If jdbc output sink is not option, please suggest me an alternative output
> sink that suits my needs better.
>
>
>
> Or since structured streaming is still ‘alpha’, should I resort to spark
> dstreams to achieve my use case described above.
>
> Please suggest.
>
>
>
> Thanks in advance,
>
> Hemanth
>
>
>


Re: Structured Streaming - Can I start using it?

2017-03-14 Thread Ofir Manor
To add to what Michael said, my experience was that Structured Streaming in
2.0 was half-baked / alpha, but in 2.1 it is significantly more robust.
Also, a lot of its "missing functionality" was not available in Spark
Streaming either way.
HOWEVER, you mentioned that you are thinking about rewriting your existing
spark streaming code... May I ask why you need a rewrite? Do you have a
specific functional or performance issue? Some specific new use case or a
specific new API you want to leverage?
Changing an existing, working solution has its costs, both in dev time and
ops time (changes to monitoring, troubleshooting etc.), so I think you
should know what you want to achieve here and ask / prototype whether the
current release fits it.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Mar 13, 2017 at 9:45 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> I think it's very, very unlikely that it will get withdrawn.  The primary
> reason that the APIs are still marked experimental is that we like to have
> several releases before committing to interface stability (in particular
> the interfaces to write custom sources and sinks are likely to evolve).
> Also, there are currently quite a few limitations in the types of queries
> that we can run (i.e. multiple aggregations are disallowed, we don't
> support stream-stream joins yet).  In these cases though, we explicitly say
> it's not supported when you try to start your stream.
>
> For the use cases that are supported in 2.1 though (streaming ETL, event
> time aggregation, etc) I'll say that we have been using it in production
> for several months and we have customers doing the same.
>
> On Mon, Mar 13, 2017 at 11:21 AM, Gaurav1809 <gauravhpan...@gmail.com>
> wrote:
>
>> I read in spark documentation that Structured Streaming is still ALPHA in
>> Spark 2.1 and the APIs are still experimental. Shall I use it to rewrite
>> my existing spark streaming code? Looks like it is not yet production ready.
>> What happens if Structured Streaming project gets withdrawn?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Structured-Streaming-Can-I-start-
>> using-it-tp28488.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Ofir Manor
Just to add one concrete example regarding HDFS dependency.
Have a look at checkpointing
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
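
To make that concrete, a minimal sketch (DStream API of that era; host, port and paths are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("windowed-counts"), Seconds(10))
// Without this line pointing at a fault-tolerant store (HDFS/S3), the windowed
// count below will refuse to start.
ssc.checkpoint("hdfs:///checkpoints/windowed-counts")

val lines = ssc.socketTextStream("some-host", 9999)
val counts = lines.map((_, 1L))
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(30))
counts.print()
ssc.start()
ssc.awaitTermination()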

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> Hi Kant,
>
> I trust the following would be of use.
>
> Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
>
> In the heart of it and with reference to points you raised about HDFS, one
> needs to have a working knowledge of Hadoop Core System including HDFS,
> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
> Data is all about horizontal scaling with master and nodes (as opposed to
> vertical scaling like SQL Server running on a Host). and distributed data
> (by default data is replicated three times on different nodes for
> scalability and availability).
>
> Other members, including Sean, provided the limits on how far one can operate
> Spark in its own space. If you are going to deal with data (data in motion
> and data at rest), then you will need to interact with some form of storage
> and HDFS and compatible file systems like S3 are the natural choices.
>
> Zookeeper is not just about high availability. It is used in Spark
> Streaming with Kafka, it is also used with Hive for concurrency. It is also
> a distributed locking system.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 20:52, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality so that the non-distributed data that you are
>>> sending out from the driver is actually paying a role more like code (or at
>>> least parameters.)  What is sent from the driver to an Executor is then
>>> used (typically as seeds or parameters) to execute some procedure on the
>>> Worker node that generates the actual data on the Workers.  After that, you
>>> proceed to execute in a more typical fashion with Spark using the
>>> now-instantiated distributed data.
>>>
>>> But I don't get the sense that this meta-programming-ish style is really
>>> what the OP was aiming at.
>>>
>>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Without a distributed storage system, your application can only create
>>>> data on the driver and send it out to the workers, and collect data back
>>>> from the workers. You can't read or write data in a distributed way. There
>>>> are use cases for this, but pretty limited (unless you're running on 1
>>>> machine).
>>>>
>>>> I can't really imagine a serious use of (distributed) Spark without
>>>> (distributed) storage; in the same way, I don't think many apps exist that
>>>> don't read/write data.
>>>>
>>>> The premise here is not just replication, but partitioning data across
>>>> compute resources. With a distributed file system, your big input exists
>>>> across a bunch of machines and you can send the work to the pieces of data.
>>>>
>>>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>>> @Mich I understand why I would need Zookeeper. It is there for fault
>>>>> tolerance given that spark is a master-slave architecture and when a mater
>>>>> goes down zookeeper will run a leader election algorithm to elect a new
>>>>> leader however DevOps hate Zookeeper they would be much happier to go with
>>>>> etcd & consul and looks like if we mesos scheduler 

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Ofir Manor
BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
personally think both are great at this point).
But the original question was about Spark 2.0. Anyone has some insights
about Parquet-specific optimizations / limitations vs. ORC-specific
optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
beginning of the thread regarding Structured Streaming, but there was a
general claim that pre-2.0 Spark was missing many ORC optimizations, and
that some (all?) were added in 2.0.
I saw that a lot of related tickets were closed in 2.0, but it would be great
if someone closer to the details could explain.
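
Not an answer, but for anyone benchmarking the two formats themselves on 2.0, a minimal sketch of the obvious knobs to flip (both config keys exist; their defaults have moved between releases, so treat the values and paths as placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-vs-parquet-sketch")
  .enableHiveSupport()   // ORC in that era generally needed Hive support enabled
  .getOrCreate()

// Predicate-pushdown toggles for each format
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Paths and the filter column are placeholders
val df = spark.read.parquet("/data/input_parquet")
df.write.orc("/data/copy_orc")
df.write.parquet("/data/copy_parquet")

// Compare a selective scan on each copy (timings and sizes tell the story)
spark.read.orc("/data/copy_orc").filter("id = 42").count()
spark.read.parquet("/data/copy_parquet").filter("id = 42").count()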

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Like anything else your mileage varies.
>
> ORC with Vectorised query execution
> <https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution> 
> is
> the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
> with columnar indexes. To me that is cool. Parquet has been around and has
> its use case as well.
>
> I guess there is no hard and fast rule which one to use all the time. Use
> the one that provides best fit for the condition.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 09:18, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I see it more as a process of innovation and thus competition is good.
>> Companies just should not follow these religious arguments but try
>> themselves what suits them. There is more than software when using software
>> ;)
>>
>> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> And frankly this is becoming some sort of religious arguments now
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sbpothin...@gmail.com>
>> wrote:
>>
>>> It depends on what you are doing; here is a recent comparison of ORC and
>>> Parquet:
>>>
>>>
>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> Although it is from the ORC authors, I thought it was a fair comparison. We
>>> use ORC as the System of Record on our Cloudera HDFS cluster, and our
>>> experience so far is good.
>>>
>>> Parquet is backed by Cloudera, which has more installations of Hadoop.
>>> ORC is by Hortonworks, so the battle of file formats continues...
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <janardhan...@gmail.com>
>>> wrote:
>>>
>>> Seems like the Parquet format is better compared to ORC when the
>>> dataset is log data without nested structures? Is this a fair understanding?
>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>
>>>> Kudu has, from my impression, been designed to offer something
>>>> between HBase and Parquet for write-intensive loads - it is not faster for
>>>> warehouse-type querying compared to Parquet (merely slower, because that
>>>> is not its use case).   I assume this is still its strategy.
>>>>
>>>> For some scenarios it could make sense together with parquet and Orc.
>>>> Howev

Re: The Future Of DStream

2016-07-27 Thread Ofir Manor
For the 2.0 release, look for "Unsupported Operations" here:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Also, there are bigger gaps - like no Kafka support, no way to plug
user-defined sources or sinks etc

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Jul 27, 2016 at 11:24 AM, Chang Chen <baibaic...@gmail.com> wrote:

>
> I don't understand what kind of low-level control DStream offers that
> Structured Streaming does not.
>
> Thanks
> Chang
>
> On Wednesday, July 27, 2016, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> Yup, they will definitely coexist. Structured Streaming is currently
>> alpha and will probably be complete in the next few releases, but Spark
>> Streaming will continue to exist, because it gives the user more low-level
>> control. It's similar to DataFrames vs RDDs (RDDs are the lower-level API
>> for when you want control, while DataFrames do more optimizations
>> automatically by restricting the computation model).
>>
>> Matei
>>
>> On Jul 27, 2016, at 12:03 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>>
>> Structured Streaming in 2.0 is declared as alpha - plenty of bits still
>> missing:
>>
>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>> I assume that it will be declared stable / GA in a future 2.x release,
>> and then it will co-exist with DStream for quite a while before someone
>> will suggest to start a deprecation process that will eventually lead to
>> its removal...
>> As a user, I guess we will need to apply judgement about when to switch
>> to Structured Streaming - each of us have a different risk/value tradeoff,
>> based on our specific situation...
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen <baibaic...@gmail.com> wrote:
>>
>>> Hi guys
>>>
>>> Structured Streaming is coming with Spark 2.0, but I noticed that DStream
>>> is still here.
>>>
>>> What's the future of DStream - will it be deprecated and removed
>>> eventually? Or will it co-exist with Structured Streaming forever?
>>>
>>> Thanks
>>> Chang
>>>
>>>
>>
>>


Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread Ofir Manor
Hold the release! There is a minor documentation issue :)
But seriously, congrats all on this massive achievement!

Anyway, I think it would be very helpful to add a link to the Structured
Streaming Developer Guide (Alpha) to both the documentation home page and
the beginning of the "old" Spark Streaming Programming Guide, as I
think many users will look for it. I had a "deep link" to that page, so I
hadn't noticed until now that it is very hard to find. I'm referring to
this page:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html



Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Jul 27, 2016 at 9:00 AM, Reynold Xin <r...@databricks.com> wrote:

> Hi all,
>
> Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes
> 2500+ patches from 300+ contributors.
>
> To download Spark 2.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> http://spark.apache.org/releases/spark-release-2-0-0.html
>
>
> (note: it can take a few hours for everything to be propagated, so you
> might get 404 on some download links.  If you see any issues with the
> release notes or webpage *please contact me directly, off-list*)
>
>


Re: The Future Of DStream

2016-07-27 Thread Ofir Manor
Structured Streaming in 2.0 is declared as alpha - plenty of bits still
missing:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
I assume that it will be declared stable / GA in a future 2.x release, and
then it will co-exist with DStream for quite a while before someone will
suggest to start a deprecation process that will eventually lead to its
removal...
As a user, I guess we will need to apply judgement about when to switch to
Structured Streaming - each of us has a different risk/value tradeoff,
based on our specific situation...

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen <baibaic...@gmail.com> wrote:

> Hi guys
>
> Structured Streaming is coming with Spark 2.0, but I noticed that DStream is
> still here.
>
> What's the future of DStream - will it be deprecated and removed
> eventually? Or will it co-exist with Structured Streaming forever?
>
> Thanks
> Chang
>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ofir Manor
One additional point specific to Spark 2.0 - for the alpha Structured
Streaming API (only),  the file sink only supports Parquet format (I'm sure
that limitation will be lifted in a future release before Structured
Streaming is GA):
 "File sink - Stores the output to a directory. As of Spark 2.0, this
only supports Parquet file format, and Append output mode."

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
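
For completeness, a minimal sketch of that file sink (2.0-era API; source and paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-sink-sketch").getOrCreate()
// In 2.0 the file sink wrote Parquet in Append mode only
val streamingDF = spark.readStream.text("/data/incoming")

val query = streamingDF.writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/data/checkpoints/output")
  .outputMode("append")
  .start()
query.awaitTermination()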



Re: Spark, Scala, and DNA sequencing

2016-07-24 Thread Ofir Manor
Hi James,
BTW - if you are into analyzing DNA with Spark, you may also be interested
in ADAM:
   https://github.com/bigdatagenomics/adam
http://bdgenomics.org/

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Fri, Jul 22, 2016 at 10:31 PM, James McCabe <ja...@oranda.com> wrote:

> Hi!
>
> I hope this may be of use/interest to someone:
>
> Spark, a Worked Example: Speeding Up DNA Sequencing
>
>
> http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html
>
> James
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-07 Thread Ofir Manor
TD - this might not be the best forum, but (1) - batch left outer stream -
is always feasible under reasonable constraints, for example a window
constraint on the stream.

I think it would be super useful to have a central place in the 2.0 docs
that spells out what exactly is included, what is targeted to 2.1 and what
will likely be post 2.1...
I think that so far it is not well-communicated (and we are a couple of
weeks after the preview release) - as a user and potential early adopter I
have to constantly dig into the source code and pull requests trying to
decipher if I could use 2.0 APIs for my use case.
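
As one concrete example of what is allowed per the list below, a minimal sketch of a stream-batch inner join (2.0-era API; paths, schema and column names are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("stream-static-join-sketch").getOrCreate()

// Static (batch) side and streaming side
val users = spark.read.parquet("/data/dim_users")
val eventSchema = new StructType().add("user_id", StringType).add("value", LongType)
val events = spark.readStream.schema(eventSchema).json("/data/incoming_events")

// Stream-batch inner join - the supported case; some outer-join directions and
// stream-stream joins were not supported at the time (see the reply below).
val joined = events.join(users, "user_id")
val query = joined.writeStream.outputMode("append").format("console").start()
query.awaitTermination()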

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Tue, Jun 7, 2016 at 12:36 PM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> 1.  Not all types of joins are supported. Here is the list.
> - Right outer joins - stream-batch not allowed, batch-stream allowed
> - Left outer joins - batch-stream not allowed, stream-batch allowed
>  (reverse of Right outer join)
> - Stream-stream joins are not allowed
>
> In the cases of outer joins, the not-allowed-cases are fundamentally hard
> because to do them correctly, every time there is new data in the stream,
> all the past data in the stream needs to be processed. Since we cannot
> store an ever-increasing amount of data in memory, this is not feasible.
>
> 2. For the update mode, the timeline is Spark 2.1.
>
>
> TD
>
> On Mon, Jun 6, 2016 at 6:54 AM, raaggarw <raagg...@adobe.com> wrote:
>
>> Thanks
>> So,
>>
>> 1) For joins (stream-batch) - are all types of joins supported - I mean
>> inner, leftouter etc or specific ones?
>> Also what is the timeline for complete support - I mean stream-stream
>> joins?
>>
>> 2) So now outputMode is exposed via DataFrameWriter but will work in
>> specific cases as you mentioned? We were looking for delta & append output
>> modes for aggregation/groupBy. What is the timeline for that?
>>
>> Thanks
>> Ravi
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Timeline-for-supporting-basic-operations-like-groupBy-joins-etc-on-Streaming-DataFrames-tp27091p27093.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Does decimal(6,-2) exists on purpose?

2016-05-26 Thread Ofir Manor
Hi,
I was surprised to notice a negative scale on a decimal (Spark 1.6.1). To
reproduce:

scala> z.printSchema
root
 |-- price: decimal(6,2) (nullable = true)

scala> val a = z.selectExpr("round(price,-2)")
a: org.apache.spark.sql.DataFrame = [round(price,-2): decimal(6,-2)]


I expected the function to return decimal(6,0)
It doesn't immediately break anything for me, but I'm not performing
additional numeric manipulation on the results.

BTW - thinking about it, both round(price) and round(price,-2) might better
return decimal(4,0), not (6,0).
The input decimal was nnnn.nn and will become just nnnn.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Ofir Manor
Yuval,
Not sure what is in scope to land in 2.0, but there is another new infra bit
to manage state more efficiently, called the State Store, whose initial version
is already committed:
   SPARK-13809 - State Store: A new framework for state management for
computing Streaming Aggregates
https://issues.apache.org/jira/browse/SPARK-13809
The pull request eventually links to the design doc, which discusses the
limits of updateStateByKey and mapWithState and how they will be
handled...

At a quick glance at the code, it seems to be used already in streaming
aggregations.
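
As an illustration, a minimal sketch of the kind of streaming aggregation that exercises that state-management path (2.x-era API; source, schema and column names are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("stateful-agg-sketch").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("user_id", StringType)
  .add("eventTime", TimestampType)
val events = spark.readStream.schema(schema).json("/data/incoming")

// The per-window, per-key running counts are exactly the kind of streaming state
// that has to be kept between micro-batches.
val counts = events
  .groupBy(window($"eventTime", "10 minutes"), $"user_id")
  .count()

val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()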

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, May 16, 2016 at 11:33 AM, Yuval Itzchakov <yuva...@gmail.com> wrote:

> Also, re-reading the relevant part from the Structured Streaming
> documentation (
> https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x
> ):
> Discretized streams (aka dstream)
>
> Unlike Storm, dstream exposes a higher level API similar to RDDs. There
> are two main challenges with dstream:
>
>
> 1. Similar to Storm, it exposes a monotonic system (processing) time
> metric, and makes support for event time difficult.
> 2. Its APIs are tied to the underlying microbatch execution model, and as
> a result lead to inflexibilities such as changing the underlying batch
> interval would require changing the window size.
>
>
> RQ addresses the above:
>
>
> 1. RQ operations support both system time and event time.
> 2. RQ APIs are decoupled from the underlying execution model. As a matter
> of fact, it is possible to implement an alternative engine that is not
> microbatch-based for RQ.
> 3. In addition, due to the declarative specification of operations, RQ
> leverages a relational query optimizer and can often generate more
> efficient query plans.
>
>
> This doesn't seem to attack the actual underlying implementation for how
> things like "mapWithState" are going to be translated into RQ, and I think
> that's the hole that's causing my misunderstanding.
>
> On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>> Hi Ofir,
>> Thanks for the elaborated answer. I have read both documents, where they
>> do a light touch on infinite Dataframes/Datasets. However, they do not go
>> in depth as regards how existing transformations on DStreams, for
>> example, will be transformed into the Dataset APIs. I've been browsing the
>> 2.0 branch and have not yet been able to understand how they correlate.
>>
>> Also, placing SparkSession in the sql package seems like a peculiar
>> choice, since this is going to be the global abstraction over
>> SparkContext/StreamingContext from now on.
>>
>> On Sun, May 15, 2016, 23:42 Ofir Manor <ofir.ma...@equalum.io> wrote:
>>
>>> Hi Yuval,
>>> let me share my understanding based on similar questions I had.
>>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
>>> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
>>> (merging of Dataset and Dataframe - which is why it inherits all the
>>> SparkSQL goodness), while RDD seems as a low-level API only for special
>>> cases. The new Dataset should also support both batch and streaming -
>>> replacing (eventually) DStream as well. See the design docs in SPARK-13485
>>> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
>>> However, as you noted, not all will be fully delivered in 2.0. For
>>> example, it seems that streaming from / to Kafka using StructuredStreaming
>>> didn't make it (so far?) to 2.0 (which is a showstopper for me).
>>> Anyway, as far as I understand, you should be able to apply stateful
>>> operators (non-RDD) on Datasets (for example, the new event-time window
>>> processing SPARK-8360). The gap I see is mostly limited streaming sources /
>>> sinks migrated to the new (richer) API and semantics.
>>> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
>>> examples will align with the current offering...
>>>
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com>
>>> wrote:
>>>
>>>> I've been reading/watching videos about the upcoming Spark 2.0 release
>>>> which
>>>> brings us Structured St

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Ben,
I'm just a Spark user - but at least in March Spark Summit, that was the
main term used.
Taking a step back from the details, maybe this new post from Reynold is a
better intro to Spark 2.0 highlights
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

If you want to drill down, go to SPARK-8360 "Structured Streaming (aka
Streaming DataFrames)". The design doc (written by Reynold in March) is
very readable:
 https://issues.apache.org/jira/browse/SPARK-8360

Regarding directly querying (SQL) the state managed by a streaming process
- I don't know if that will land in 2.0 or only later.

Hope that helps,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Sun, May 15, 2016 at 11:58 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Ofir,
>
> I just recently saw the webinar with Reynold Xin. He mentioned the Spark
> Session unification efforts, but I don’t remember the DataSet for
> Structured Streaming aka Continuous Applications as he put it. He did
> mention streaming or unlimited DataFrames for Structured Streaming so one
> can directly query the data from it. Has something changed since then?
>
> Thanks,
> Ben
>
>
> On May 15, 2016, at 1:42 PM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>
> Hi Yuval,
> let me share my understanding based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
> (merging of Dataset and Dataframe - which is why it inherits all the
> SparkSQL goodness), while RDD seems as a low-level API only for special
> cases. The new Dataset should also support both batch and streaming -
> replacing (eventually) DStream as well. See the design docs in SPARK-13485
> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
> However, as you noted, not all will be fully delivered in 2.0. For
> example, it seems that streaming from / to Kafka using StructuredStreaming
> didn't make it (so far?) to 2.0 (which is a showstopper for me).
> Anyway, as far as I understand, you should be able to apply stateful
> operators (non-RDD) on Datasets (for example, the new event-time window
> processing SPARK-8360). The gap I see is mostly limited streaming sources /
> sinks migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
> examples will align with the current offering...
>
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com>
> wrote:
>
>> I've been reading/watching videos about the upcoming Spark 2.0 release
>> which
>> brings us Structured Streaming. One thing I've yet to understand is how
>> this
>> relates to the current state of working with Streaming in Spark with the
>> DStream abstraction.
>>
>> All examples I can find, in the Spark repository/different videos is
>> someone
>> streaming local JSON files or reading from HDFS/S3/SQL. Also, when
>> browsing
>> the source, SparkSession seems to be defined inside org.apache.spark.sql,
>> so
>> this gives me a hunch that this is somehow all related to SQL and the
>> likes,
>> and not really to DStreams.
>>
>> What I'm failing to understand is: Will this feature impact how we do
>> Streaming today? Will I be able to consume a Kafka source in a streaming
>> fashion (like we do today when we open a stream using KafkaUtils)? Will we
>> be able to do state-full operations on a Dataset[T] like we do today using
>> MapWithStateRDD? Or will there be a subset of operations that the catalyst
>> optimizer can understand such as aggregate and such?
>>
>> I'd be happy anyone could shed some light on this.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Hi Yuval,
let me share my understanding based on similar questions I had.
First, Spark 2.x aims to replace a whole bunch of its APIs with just two
main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
(merging of Dataset and Dataframe - which is why it inherits all the
SparkSQL goodness), while RDD seems as a low-level API only for special
cases. The new Dataset should also support both batch and streaming -
replacing (eventually) DStream as well. See the design docs in SPARK-13485
(unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
However, as you noted, not all will be fully delivered in 2.0. For example,
it seems that streaming from / to Kafka using StructuredStreaming didn't
make it (so far?) to 2.0 (which is a showstopper for me).
Anyway, as far as I understand, you should be able to apply stateful
operators (non-RDD) on Datasets (for example, the new event-time window
processing SPARK-8360). The gap I see is mostly limited streaming sources /
sinks migrated to the new (richer) API and semantics.
Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples
will align with the current offering...


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com> wrote:

> I've been reading/watching videos about the upcoming Spark 2.0 release
> which
> brings us Structured Streaming. One thing I've yet to understand is how
> this
> relates to the current state of working with Streaming in Spark with the
> DStream abstraction.
>
> All examples I can find, in the Spark repository/different videos is
> someone
> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
> the source, SparkSession seems to be defined inside org.apache.spark.sql,
> so
> this gives me a hunch that this is somehow all related to SQL and the
> likes,
> and not really to DStreams.
>
> What I'm failing to understand is: Will this feature impact how we do
> Streaming today? Will I be able to consume a Kafka source in a streaming
> fashion (like we do today when we open a stream using KafkaUtils)? Will we
> be able to do state-full operations on a Dataset[T] like we do today using
> MapWithStateRDD? Or will there be a subset of operations that the catalyst
> optimizer can understand such as aggregate and such?
>
> I'd be happy anyone could shed some light on this.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>