Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet.

Also it would help if you could share your query & more of the logs leading
up to the error.

On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup 
wrote:

> Hi Spark team,
>
>
>
> We ran into a dataframe issue after upgrading from spark 3.1 to 4.
>
>
>
> query_result.explain(extended=True)
>   File "…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py"
>     raise Py4JJavaError(
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.
> : java.lang.IllegalStateException: You hit a query analyzer bug. Please report
> your query to Spark user mailing list.
>   at org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:516)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:72)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
>   at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
>   at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
>   at scala.collect...
>
>
>
>
>
> Could you please let us know if this is already being looked at?
>
>
>
> Thanks,
>
> Anup
>


-- 
Cell : 425-233-8271


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest what are the differences in the
approach between this and Gluten?

On Tue, Feb 13, 2024 at 12:42 PM Chao Sun  wrote:

> Hi all,
>
> We are very happy to announce that Project Comet, a plugin to
> accelerate Spark query execution via leveraging DataFusion and Arrow,
> has now been open sourced under the Apache Arrow umbrella. Please
> check the project repo
> https://github.com/apache/arrow-datafusion-comet for more details if
> you are interested. We'd love to collaborate with people from the open
> source community who share similar goals.
>
> Thanks,
> Chao
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Holden Karau
So this sounds like a bug to me; in the help output for both
regular spark-submit and ./sbin/start-connect-server.sh we say:

"  --packages                  Comma-separated list of maven coordinates of
                               jars to include on the driver and executor
                               classpaths. Will search the local maven repo,
                               then maven central and any additional remote
                               repositories given by --repositories. The format
                               for the coordinates should be
                               groupId:artifactId:version."

If the behaviour is intentional for spark-connect it would be good to
understand why (and then also update the docs).
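For anyone hitting this in the meantime, here is a minimal sketch of passing a
coordinate through configuration instead of the --packages flag (the Iceberg
coordinate and version are just example values, and whether the dependency
actually reaches the executors under spark-connect is exactly the open question
in this thread):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # spark.jars.packages is the config equivalent of --packages; it should
    # resolve the coordinate and place it on driver and executor classpaths
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2")
    .getOrCreate()
)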

On Mon, Dec 4, 2023 at 3:33 PM Aironman DirtDiver 
wrote:

> The issue you're encountering with the iceberg-spark-runtime dependency
> not being properly passed to the executors in your Spark Connect server
> deployment could be due to a couple of factors:
>
>1.
>
>*Spark Submit Packaging:* When you use the --packages parameter in
>spark-submit, it only adds the JARs to the driver classpath. The
>executors still need to download and load the JARs separately. This
>can lead to issues if the JARs are not accessible from the executors, such
>as when running in a distributed environment like Kubernetes.
>2.
>
>*Kubernetes Container Image:* The Spark Connect server container image
>(xxx/spark-py:3.5-prd) might not have the iceberg-spark-runtime dependency
>pre-installed. This means that even if the JARs are available on the
>    driver, the executors won't have access to them.
>
> To address this issue, consider the following solutions:
>
>1.
>
>*Package Dependencies into Image:* As you mentioned, packaging the
>required dependencies into your container image is a viable option. This
>ensures that the executors have direct access to the JARs, eliminating
>the need for downloading or copying during job execution.
>2.
>
>*Use Spark Submit with --jars Option:* Instead of relying on --packages
>, you can explicitly specify the JARs using the --jars option in
>spark-submit. This will package the JARs into the Spark application's
>submission directory, ensuring that they are available to both the
>driver and executors.
>3.
>
>*Mount JARs as Shared Volume:* If the iceberg-spark-runtime dependency
>    is already installed on the cluster nodes, you can mount the JARs as a
>shared volume accessible to both the driver and executors. This avoids
>the need to package or download the JARs.
>Mounting JARs as a shared volume in your Spark Connect server
>deployment involves creating a shared volume that stores the JARs and then
>mounting that volume to both the driver and executor containers. Here's a
>step-by-step guide:
>
>Create a Shared Volume: Create a shared volume using a persistent
>storage solution like NFS, GlusterFS, or AWS EFS. Ensure that all cluster
>nodes have access to the shared volume.
>
>Copy JARs to Shared Volume: Copy the required JARs, including
>iceberg-spark-runtime, to the shared volume. This will make them accessible
>to both the driver and executor containers.
>
>Mount Shared Volume to Driver Container: In your Spark Connect server
>deployment configuration, specify the shared volume as a mount point for
>the driver container. This will make the JARs available to the driver.
>
>Mount Shared Volume to Executor Containers: In the Spark Connect
>server deployment configuration, specify the shared volume as a mount point
>for the executor containers. This will make the JARs available to the
>executors.
>
>Update Spark Connect Server Configuration: In your Spark Connect
>server configuration, ensure that the spark.sql.catalogImplementation
>property is set to iceberg. This will instruct Spark to use the Iceberg
>catalog implementation.
>
>By following these steps, you can successfully mount JARs as a shared
>volume in your Spark Connect server deployment, eliminating the need to
>package or download the JARs.
>4.
>
>*Use Spark Connect Server with Remote Resources:* Spark Connect Server
>    supports accessing remote resources, such as JARs stored in a distributed
>file system or a cloud storage service. By configuring Spark Connect
>Server to use remote resources, you can avoid packaging the
>dependencies into the container image.
>
> By implementing one of these solutions, you should be able to resolve the
> issue of the iceberg-spark-runtime dependency not being properly passed to
> the executors in your Spark Connect server deployment.
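> As a rough sketch of option 3 above (shared volume), and not a tested recipe:
> the volume/claim names, mount path, and jar file name below are assumptions,
> while the spark.kubernetes.*.volumes.* keys are the standard Spark-on-K8s
> volume mount options. The same settings can be passed as --conf flags to
> ./sbin/start-connect-server.sh.
>
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession.builder
>     # mount the same PVC into driver and executor pods
>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.jars.mount.path", "/opt/extra-jars")
>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.jars.options.claimName", "shared-jars-pvc")
>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.jars.mount.path", "/opt/extra-jars")
>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.jars.options.claimName", "shared-jars-pvc")
>     # "local://" tells Spark the jar is already present on the pod filesystem
>     .config("spark.jars", "local:///opt/extra-jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar")
>     .getOrCreate()
> )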
>
> Let me know if any of the proposals works for you.
>
> Alonso
>
> El lun, 4 dic 2023 a las 11:44, Xiaolong Wang
> () escribió:
>
>> Hi, Spark community,
>>
>> I encountered a weird bug when using Spark Connect server to integrate
>> with Iceberg. I added the 

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path
isolation there, so even if it does work it’s something you’d need to pay
attention to on upgrades. Class path isolation is tricky to get right.

On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde  wrote:

> Hello,
>
> We are using spark 3.5.0 and were wondering if the following is achievable
> using spark-core
>
> Our use case involves spinning up a spark cluster where the driver
> application loads user jars containing spark transformations at runtime. A
> single spark application can load multiple user jars ( same cluster ) that
> can have class path conflicts if care is not taken
>
> AFAIK, to get this right requires the Executor to be designed in a way
> that allows for class path isolation ( UDF, lambda expressions ). Ideally
> per Spark Session is what we want
>
> I know Spark connect has been designed this way but Spark connect is not
> an option for us at the moment. I had some luck using a private method
> inside spark called JobArtifactSet.withActiveJobArtifactState
>
> Is it sufficient for me to run the user code enclosed
> within JobArtifactSet.withActiveJobArtifactState to achieve my requirement?
>
> Thank you
>
>
> Faiz
>


Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :)

On Tue, Sep 12, 2023 at 8:14 PM bo yang  wrote:

> Hi Spark Friends,
>
> Anyone interested in using Golang to write Spark applications? We created a
> Spark Connect Go Client library.
> Would love to hear feedback/thoughts from the community.
>
> Please see the quick start guide
> 
> about how to use it. Following is a very short Spark Connect application in
> Go:
>
> func main() {
>   spark, _ := 
> sql.SparkSession.Builder.Remote("sc://localhost:15002").Build()
>   defer spark.Stop()
>
>   df, _ := spark.Sql("select 'apple' as word, 123 as count union all 
> select 'orange' as word, 456 as count")
>   df.Show(100, false)
>   df.Collect()
>
>   df.Write().Mode("overwrite").
>   Format("parquet").
>   Save("file:///tmp/spark-connect-write-example-output.parquet")
>
>   df = spark.Read().Format("parquet").
>   Load("file:///tmp/spark-connect-write-example-output.parquet")
>   df.Show(100, false)
>
>   df.CreateTempView("view1", true, false)
>   df, _ = spark.Sql("select count, word from view1 order by count")
> }
>
>
> Many thanks to Martin, Hyukjin, Ruifeng and Denny for creating and working
> together on this repo! Welcome more people to contribute :)
>
> Best,
> Bo
>
>


Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Holden Karau
What’s the version of the ES connector you are using?
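For reference, a hedged sketch of how the Spark 3.x connector is usually wired
up (the artifact version, host, and index names are assumptions; the full
format name shown is what the short "es" alias resolves to once the connector
is on the classpath):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # elasticsearch-spark-30_2.12 is the Spark 3.x line of the connector
    .config("spark.jars.packages",
            "org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a")], ["id", "val"])

(df.write
    .format("org.elasticsearch.spark.sql")       # full name behind the "es" alias
    .option("es.nodes", "elasticsearch-host")    # hypothetical host
    .option("es.resource", "my-index")           # hypothetical index
    .mode("append")
    .save())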

On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
wrote:

> Hi All,
>
> We're using Spark 2.4.x to write dataframe into the Elasticsearch index.
> As we're upgrading to Spark 3.3.0, it is throwing this error:
> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>
> Looking at a few responses on Stack Overflow, it seems this is not yet
> supported by elasticsearch-hadoop.
>
> Does anyone have experience with this? Or faced/resolved this issue in
> Spark 3?
>
> Thanks in advance!
>
> Regards
> Dipayan
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Holden Karau
So, from memory, if you disable shuffle tracking but enable shuffle block
decommissioning it should work.
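A rough sketch of that combination as I read it (the keys are the standard
dynamic allocation and decommissioning configs; whether a given version/K8s
setup accepts it is exactly what Mich reports below):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "false")  # tracking off
    .config("spark.decommission.enabled", "true")                        # decommissioning on
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")  # migrate shuffle blocks
    .getOrCreate()
)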

On Tue, Aug 8, 2023 at 4:13 AM Mich Talebzadeh 
wrote:

> Hm. I don't think it will work
>
> --conf spark.dynamicAllocation.shuffleTracking.enabled=false
>
> In Spark 3.4.1 running spark in k8s
>
> you get
>
> : org.apache.spark.SparkException: Dynamic allocation of executors
> requires the external shuffle service. You may enable this through
> spark.shuffle.service.enabled.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 7 Aug 2023 at 21:24, Holden Karau  wrote:
>
>> I think you need to set "spark.dynamicAllocation.shuffleTracking.enabled"
>> to false.
>>
>> On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh 
>> wrote:
>>
>>> Yes I have seen cases where the driver gone but a couple of executors
>>> hanging on. Sounds like a code issue.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 27 Jul 2023 at 15:01, Sergei Zhgirovski 
>>> wrote:
>>>
>>>> Hi everyone
>>>>
>>>> I'm trying to use pyspark 3.3.2.
>>>> I have these relevant options set:
>>>>
>>>> 
>>>> spark.dynamicAllocation.enabled=true
>>>> spark.dynamicAllocation.shuffleTracking.enabled=true
>>>> spark.dynamicAllocation.shuffleTracking.timeout=20s
>>>> spark.dynamicAllocation.executorIdleTimeout=30s
>>>> spark.dynamicAllocation.cachedExecutorIdleTimeout=40s
>>>> spark.executor.instances=0
>>>> spark.dynamicAllocation.minExecutors=0
>>>> spark.dynamicAllocation.maxExecutors=20
>>>> spark.master=k8s://https://k8s-api.<>:6443
>>>> 
>>>>
>>>> So I'm using kubernetes to deploy up to 20 executors
>>>>
>>>> then I run this piece of code:
>>>> 
>>>> df = spark.read.parquet("s3a://>>> files>")
>>>> print(df.count())
>>>> time.sleep(999)
>>>> 
>>>>
>>>> This works fine and as expected: during the execution ~1600 tasks are
>>>> completed, 20 executors get deployed and are being quickly removed after
>>>> the calculation is complete.
>>>>
>>>> Next, I add these to the config:
>>>> 
>>>> spark.decommission.enabled=true
>>>> spark.storage.decommission.shuffleBlocks.enabled=true
>>>> spark.storage.decommission.enabled=true
>>>> spark.storage.decommission.rddBlocks.enabled=true
>>>> 
>>>>
>>>> I repeat the experiment on an empty kubernetes cluster, so that no
>>>> actual pod eviction is occurring.
>>>>
>>>> This time executors deallocation is not working as expected: depending
>>>> on the run, after the job is complete, 0-3 executors out of 20 remain
>>>> present forever and never seem to get removed.
>>>>
>>>> I tried to debug the code and found out that inside the
>>>> 'ExecutorMonitor.timedOutExecutors' function those executors that never get
>>>> to be removed do not make it to the 'timedOutExecs' variable, because the
>>>> property 'hasActiveShuffle' remains 'true' for them.
>>>>
>>>> I'm a little stuck here trying to understand how all pod management,
>>>> shuffle tracking and decommissioning were supposed to be working together,
>>>> how to debug this and whether this is an expected behavior at all (to me it
>>>> is not).
>>>>
>>>> Thank you for any hints!
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Holden Karau
I think you need to set "spark.dynamicAllocation.shuffleTracking.enabled"
to false.

On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh 
wrote:

> Yes I have seen cases where the driver gone but a couple of executors
> hanging on. Sounds like a code issue.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 27 Jul 2023 at 15:01, Sergei Zhgirovski 
> wrote:
>
>> Hi everyone
>>
>> I'm trying to use pyspark 3.3.2.
>> I have these relevant options set:
>>
>> 
>> spark.dynamicAllocation.enabled=true
>> spark.dynamicAllocation.shuffleTracking.enabled=true
>> spark.dynamicAllocation.shuffleTracking.timeout=20s
>> spark.dynamicAllocation.executorIdleTimeout=30s
>> spark.dynamicAllocation.cachedExecutorIdleTimeout=40s
>> spark.executor.instances=0
>> spark.dynamicAllocation.minExecutors=0
>> spark.dynamicAllocation.maxExecutors=20
>> spark.master=k8s://https://k8s-api.<>:6443
>> 
>>
>> So I'm using kubernetes to deploy up to 20 executors
>>
>> then I run this piece of code:
>> 
>> df = spark.read.parquet("s3a://")
>> print(df.count())
>> time.sleep(999)
>> 
>>
>> This works fine and as expected: during the execution ~1600 tasks are
>> completed, 20 executors get deployed and are being quickly removed after
>> the calculation is complete.
>>
>> Next, I add these to the config:
>> 
>> spark.decommission.enabled=true
>> spark.storage.decommission.shuffleBlocks.enabled=true
>> spark.storage.decommission.enabled=true
>> spark.storage.decommission.rddBlocks.enabled=true
>> 
>>
>> I repeat the experiment on an empty kubernetes cluster, so that no actual
>> pod eviction is occurring.
>>
>> This time executors deallocation is not working as expected: depending on
>> the run, after the job is complete, 0-3 executors out of 20 remain present
>> forever and never seem to get removed.
>>
>> I tried to debug the code and found out that inside the
>> 'ExecutorMonitor.timedOutExecutors' function those executors that never get
>> to be removed do not make it to the 'timedOutExecs' variable, because the
>> property 'hasActiveShuffle' remains 'true' for them.
>>
>> I'm a little stuck here trying to understand how all pod management,
>> shuffle tracking and decommissioning were supposed to be working together,
>> how to debug this and whether this is an expected behavior at all (to me it
>> is not).
>>
>> Thank you for any hints!
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Holden Karau
Is there someone focused on streaming work these days who would want to
shepherd this?

On Sat, Feb 18, 2023 at 5:02 PM Dongjoon Hyun 
wrote:

> Thank you for considering me, but may I ask what makes you think to put me
> there, Mich? I'm curious about your reason.
>
> > I have put dongjoon.hyun as a shepherd.
>
> BTW, unfortunately, I cannot help you with that due to my on-going
> personal stuff. I'll adjust the JIRA first.
>
> Thanks,
> Dongjoon.
>
>
> On Sat, Feb 18, 2023 at 10:51 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> https://issues.apache.org/jira/browse/SPARK-42485
>>
>>
>> Spark Structured Streaming is a very useful tool in dealing with Event
>> Driven Architecture. In an Event Driven Architecture, there is generally a
>> main loop that listens for events and then triggers a call-back function
>> when one of those events is detected. In a streaming application the
>> application waits to receive the source messages in a set interval or
>> whenever they happen and reacts accordingly.
>>
>> There are occasions that you may want to stop the Spark program
>> gracefully. Gracefully meaning that the Spark application handles the last
>> streaming message completely and then terminates the application. This is
>> different from invoking interrupts such as CTRL-C.
>>
>> Of course one can terminate the process based on the following
>>
>>1. query.awaitTermination() # Waits for the termination of this
>>query, with stop() or with error
>>
>>
>>1. query.awaitTermination(timeoutMs) # Returns true if this query is
>>terminated within the timeout in milliseconds.
>>
>> So the first one above waits until an interrupt signal is received. The
>> second one will count the timeout and will exit when the timeout in
>> milliseconds is reached.
>>
>> The issue is that one needs to predict how long the streaming job needs
>> to run. Clearly any interrupt at the terminal or OS level (kill process)
>> may leave the processing terminated without a proper completion of the
>> streaming process.
>>
>> I have devised a method that allows one to terminate the spark
>> application internally after processing the last received message. Within
>> say 2 seconds of the confirmation of shutdown, the process will invoke a
>> graceful shutdown.
>>
>> This new feature proposes a solution that lets the topic finish the work for
>> the message currently being processed, waits for it to complete, and
>> shuts down the streaming process for that topic without loss of data or
>> orphaned transactions.
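>> As a rough illustration of the kind of pattern being proposed (not the
>> SPIP's actual implementation), a driver-side loop that checks an external
>> flag between micro-batches and only then asks the query to stop; the
>> flag-file path and the rate/console source and sink are placeholders:
>>
>> import os
>> import time
>> from pyspark.sql import SparkSession
>>
>> spark = SparkSession.builder.appName("graceful-shutdown-sketch").getOrCreate()
>>
>> df = spark.readStream.format("rate").load()        # placeholder source
>> query = df.writeStream.format("console").start()   # placeholder sink
>>
>> while query.isActive:
>>     if os.path.exists("/tmp/stop_streaming"):      # hypothetical shutdown flag
>>         query.stop()                               # ask the query to stop
>>         break
>>     time.sleep(2)
>>
>> query.awaitTermination()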
>>
>>
>> I have put dongjoon.hyun as a shepherd. Kindly advise me if that is the
>> correct approach.
>>
>> JIRA ticket https://issues.apache.org/jira/browse/SPARK-42485
>>
>> SPIP doc: TBC
>>
>> Discussion thread: in
>>
>> https://lists.apache.org/list.html?d...@spark.apache.org
>>
>>
>> Thanks.
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
Take a look at https://github.com/nielsbasjes/splittablegzip :D
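A hedged sketch of what using that codec from PySpark could look like (the
Maven coordinate, version, and input path are assumptions taken from that
project's docs, so verify them before relying on this):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "nl.basjes.hadoop:splittablegzip:1.3")
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    .getOrCreate()
)

df = spark.read.text("s3a://some-bucket/data/big-file.gz")  # hypothetical path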

On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello Holden,
>
>   Thank you for the response, but what is "splittable gzip"?
>
>  Best, Oliver
>
> On Tue, Dec 6, 2022 at 9:22 AM Holden Karau  wrote:
>
>> There is the splittable gzip Hadoop input format, maybe someone could
>> extend that to use support bgzip?
>>
>> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello Chris,
>>>
>>>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip,
>>> but to start reading from somewhere other than the beginning of the file,
>>> you would need to use an index to tell you where the blocks start.
>>> Originally, a Tabix index was used and is still the popular choice,
>>> although other types of indices also exist.
>>>
>>>  Best, Oliver
>>>
>>> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth 
>>> wrote:
>>>
>>>> Sorry, I misread that in the original email.
>>>>
>>>> This is my first time looking at bgzip. I see from the documentation
>>>> that it is putting some additional framing around gzip and producing a
>>>> series of small blocks, such that you can create an index of the file and
>>>> decompress individual blocks instead of the whole file. That's interesting,
>>>> because it could potentially support a splittable format. (Plain gzip isn't
>>>> splittable.)
>>>>
>>>> I also noticed that it states it is "compatible with" gzip. I tried a
>>>> basic test of running bgzip on a file, which produced a .gz output file,
>>>> and then running the same spark.read.text code sample from earlier. Sure
>>>> enough, I was able to read the data. This implies there is at least some
>>>> basic compatibility, so that you could read files created by bgzip.
>>>> However, that read would not be optimized in any way to take advantage of
>>>> an index file. There also would not be any way to produce bgzip-style
>>>> output like in the df.write.option code sample. To achieve either of those,
>>>> it would require writing a custom Hadoop compression codec to integrate
>>>> more closely with the data format.
>>>>
>>>> Chris Nauroth
>>>>
>>>>
>>>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
>>>> oliv...@broadinstitute.org> wrote:
>>>>
>>>>>
>>>>>  Hello,
>>>>>
>>>>>   Thanks for the response, but I mean compressed with bgzip
>>>>> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>>>>
>>>>>  Best, Oliver
>>>>>
>>>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth 
>>>>> wrote:
>>>>>
>>>>>> Hello Oliver,
>>>>>>
>>>>>> Yes, Spark makes this possible using the Hadoop compression codecs
>>>>>> and the Hadoop-compatible FileSystem interface [1]. Here is an example of
>>>>>> reading:
>>>>>>
>>>>>> df = spark.read.text("gs:///data/shakespeare-bz2")
>>>>>> df.show(10)
>>>>>>
>>>>>> This is using a test data set of the complete works of Shakespeare,
>>>>>> stored as text and compressed to a single .bz2 file. This code sample
>>>>>> didn't need to do anything special to declare that it's working with 
>>>>>> bzip2
>>>>>> compression, because the Hadoop compression codecs detect that the file 
>>>>>> has
>>>>>> a .bz2 extension and automatically assume it needs to be decompressed
>>>>>> before presenting it to our code in the DataFrame as text.
>>>>>>
>>>>>> On the write side, if you wanted to declare a particular kind of
>>>>>> output compression, you can do it with a write option like this:
>>>>>>
>>>>>> df.write.option("compression",
>>>>>> "org.apache.hadoop.io.compress.BZip2Codec").text("gs://>>>>> bucket>/data/shakespeare-bz2-copy")
>>>>>>
>>>>>> This writes the contents of the DataFrame, stored as text and
>>>>>> compress

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
There is the splittable gzip Hadoop input format, maybe someone could
extend that to use support bgzip?

On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello Chris,
>
>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but
> to start reading from somewhere other than the beginning of the file, you
> would need to use an index to tell you where the blocks start. Originally,
> a Tabix index was used and is still the popular choice, although other
> types of indices also exist.
>
>  Best, Oliver
>
> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth  wrote:
>
>> Sorry, I misread that in the original email.
>>
>> This is my first time looking at bgzip. I see from the documentation that
>> it is putting some additional framing around gzip and producing a series of
>> small blocks, such that you can create an index of the file and decompress
>> individual blocks instead of the whole file. That's interesting, because it
>> could potentially support a splittable format. (Plain gzip isn't
>> splittable.)
>>
>> I also noticed that it states it is "compatible with" gzip. I tried a
>> basic test of running bgzip on a file, which produced a .gz output file,
>> and then running the same spark.read.text code sample from earlier. Sure
>> enough, I was able to read the data. This implies there is at least some
>> basic compatibility, so that you could read files created by bgzip.
>> However, that read would not be optimized in any way to take advantage of
>> an index file. There also would not be any way to produce bgzip-style
>> output like in the df.write.option code sample. To achieve either of those,
>> it would require writing a custom Hadoop compression codec to integrate
>> more closely with the data format.
>>
>> Chris Nauroth
>>
>>
>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   Thanks for the response, but I mean compressed with bgzip
>>> , not bzip2.
>>>
>>>  Best, Oliver
>>>
>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth 
>>> wrote:
>>>
 Hello Oliver,

 Yes, Spark makes this possible using the Hadoop compression codecs and
 the Hadoop-compatible FileSystem interface [1]. Here is an example of
 reading:

 df = spark.read.text("gs:///data/shakespeare-bz2")
 df.show(10)

 This is using a test data set of the complete works of Shakespeare,
 stored as text and compressed to a single .bz2 file. This code sample
 didn't need to do anything special to declare that it's working with bzip2
 compression, because the Hadoop compression codecs detect that the file has
 a .bz2 extension and automatically assume it needs to be decompressed
 before presenting it to our code in the DataFrame as text.

 On the write side, if you wanted to declare a particular kind of output
 compression, you can do it with a write option like this:

 df.write.option("compression",
 "org.apache.hadoop.io.compress.BZip2Codec").text("gs://>>> bucket>/data/shakespeare-bz2-copy")

 This writes the contents of the DataFrame, stored as text and
 compressed to .bz2 files in the destination path.

 My example is testing with a GCS bucket (scheme "gs:"), but you can
 also switch the Hadoop file system interface to target other file systems
 like S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure
 the S3AFIleSystem, including how to pass credentials for access to the S3
 bucket [2].

 Note that for big data use cases, other compression codecs like Snappy
 are generally preferred for greater efficiency. (Of course, we're not
 always in complete control of the data formats we're given, so the support
 for bz2 is there.)

 [1]
 https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
 [2]
 https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

 Chris Nauroth


 On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
 oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   Is it possible to read/write a DataFrame from/to a set of bgzipped
> files? Can it read from/write to AWS S3? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network ,
> Flannick Lab , Broad Institute
> 
>

>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network 

Re: Dataproc serverless for Spark

2022-11-28 Thread Holden Karau
This sounds like a great question for the Google DataProc folks (I know
there was some interesting work being done around it but I left before it
was finished so I don't want to provide a possibly incorrect answer).

If you're a GCP customer, try reaching out to their support for details.

On Mon, Nov 21, 2022 at 1:47 PM Mich Talebzadeh 
wrote:

> I have not used standalone for a good while. The standard dataproc uses
> YARN as the resource manager. The vanilla dataproc is Google's answer to
> Hadoop on the cloud. Move your analytics workload from on-premise to Cloud
> with little effort with the same look and feel. Google then introduced  
> dynamic
> allocation of resources to cater for those apps that could not be easily
> migrated to Kubernetes (GKE). so the  doc states that  without dynamic
> allocation, it only asks for containers at the beginning of the job. With
> dynamic allocation, it will remove containers, or ask for new ones, as
> necessary. This is still using YARN. See here
> 
>
> 
>  This
> approach was not necessarily very successful, as adding executors
> dynamically for larger workloads could freeze the spark application itself.
> Reading the doc it says startup time for serverless is 60 seconds compared
> to dataproc on Compute engine (the one you setup your own spark cluster on
> dataproc tin boxes) of 90 seconds
>
> Dataproc serverless for Spark autoscaling
>  makes
> a reference to  "Dataproc Serverless autoscaling is the default behavior,
> and uses Spark dynamic resource allocation
> 
>  to
> determine whether, how, and when to scale your workload" So the key point
> is Not standalone mode but generally references to "Spark provides a
> mechanism to dynamically adjust the resources your application occupies
> based on the workload. This means that your application may give resources
> back to the cluster if they are no longer used and request them again later
> when there is demand. This feature is particularly useful if multiple
> applications share resources in your Spark cluster."
>
> Isn't this the standard Spark resource allocation? So why has this
> suddenly been elevated from Spark 3.2?
>
> Someone may give a more qualified answer here :)
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On Mon, 21 Nov 2022 at 17:32, Stephen Boesch  wrote:
>
>> Out of curiosity : are there functional limitations in Spark Standalone
>> that are of concern?  Yarn is more configurable for running non-spark
>> workloads and how to run multiple spark jobs in parallel. But for a single
>> spark job it seems standalone launches more quickly and does not miss any
>> features. Are there specific limitations you are aware of / run into?
>>
>> stephen b
>>
>> On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> I have not tested this myself but Google have brought up *Dataproc
>>> Serverless for Spark*. In a nutshell, Dataproc Serverless lets you run Spark batch
>>> workloads without requiring you to provision and manage your own cluster.
>>> Specify workload parameters, and then submit the workload to the Dataproc
>>> Serverless service. The service will run the workload on a managed compute
>>> infrastructure, autoscaling resources as needed. Dataproc Serverless
>>> charges apply only to the time when the workload is executing. Google
>>> Dataproc is similar to Amazon EMR
>>>
>>> So in short you don't need to provision your own Dataproc cluster etc.
>>> One thing I noticed from the release doc
>>> is that the
>>> resource management is *Spark based* as opposed to standard Dataproc,
>>> which is YARN based. It is available for Spark 3.2. My assumption is
>>> that by Spark based it means that spark is running in standalone mode. Has
>>> there been much improvement in release 3.2 for standalone mode?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or 

Re: Dynamic Scaling without Kubernetes

2022-10-26 Thread Holden Karau
So Spark can dynamically scale on YARN, but standalone mode becomes a bit
complicated — where do you envision Spark gets the extra resources from?

On Wed, Oct 26, 2022 at 12:18 PM Artemis User 
wrote:

> Has anyone tried to make a Spark cluster dynamically scalable, i.e.,
> adding a new worker node automatically to the cluster when no more
> executors are available when a new job is submitted?  We need to make the
> whole cluster on-prem and really lightweight, so standalone mode is
> preferred and no k8s if possible.   Any suggestion?  Thanks in advance!
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Holden Karau
I’ve used Argo for K8s scheduling; for a while it’s also what Kubeflow used
underneath for scheduling.

On Tue, Sep 6, 2022 at 10:01 AM Mich Talebzadeh 
wrote:

> Thank you all.
>
> Has anyone used Argo for k8s scheduler by any chance?
>
> On Tue, 6 Sep 2022 at 13:41, Bjørn Jørgensen 
> wrote:
>
>> "*JupyterLab is the next-generation user interface for Project Jupyter
>> offering all the familiar building blocks of the classic Jupyter Notebook
>> (notebook, terminal, text editor, file browser, rich outputs, etc.) in a
>> flexible and powerful user interface.*"
>> https://github.com/jupyterlab/jupyterlab
>>
>> You will find them both at https://jupyter.org
>>
>> man. 5. sep. 2022 kl. 23:40 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Thanks Bjorn,
>>>
>>> What are the differences and the functionality JupyterLab brings in on
>>> top of Jupyter notebook?
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 5 Sept 2022 at 20:58, Bjørn Jørgensen 
>>> wrote:
>>>
>>>> Jupyter notebook is replaced with jupyterlab :)
>>>>
>>>> man. 5. sep. 2022 kl. 21:10 skrev Holden Karau :
>>>>
>>>>>
>>>>>
>>>>> On Mon, Sep 5, 2022 at 9:00 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for that.
>>>>>>
>>>>>> How do you rate the performance of Jupyter W/Spark on K8s compared to
>>>>>> the same on  a cluster of VMs (example Dataproc).
>>>>>>
>>>>>> Also somehow a related question (may be naive as well). For example,
>>>>>> Google offers a lot of standard ML libraries for example built into a 
>>>>>> data
>>>>>> warehouse like BigQuery. What does the Jupyter notebook offer that others
>>>>>> don't?
>>>>>>
>>>>> Jupyter notebook doesn’t offer any particular set of libraries,
>>>>> although you can add your own to the container etc.
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 5 Sept 2022 at 12:47, Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc
>>>>>>> personally.
>>>>>>>
>>>>>>> The Spark K8s pod scheduler is now more pluggable for Yunikorn and
>>>>>>> Volcano can be used with less effort.
>>>>>>>
>>>>>>> On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> Has anyone got experience of running Jupyter on dataproc versus
>>>>>>>> Jupyter notebook on GKE (k8).
>>>>>>

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
On Mon, Sep 5, 2022 at 9:00 AM Mich Talebzadeh 
wrote:

> Thanks for that.
>
> How do you rate the performance of Jupyter W/Spark on K8s compared to the
> same on  a cluster of VMs (example Dataproc).
>
> Also somehow a related question (may be naive as well). For example,
> Google offers a lot of standard ML libraries for example built into a data
> warehouse like BigQuery. What does the Jupyter notebook offer that others
> don't?
>
Jupyter notebook doesn’t offer any particular set of libraries, although
you can add your own to the container etc.

>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 5 Sept 2022 at 12:47, Holden Karau  wrote:
>
>> I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc
>> personally.
>>
>> The Spark K8s pod scheduler is now more pluggable for Yunikorn and
>> Volcano can be used with less effort.
>>
>> On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>>
>>> Has anyone got experience of running Jupyter on dataproc versus Jupyter
>>> notebook on GKE (k8).
>>>
>>>
>>> I have not looked at this for a while but my understanding is that Spark
>>> on GKE/k8 is not yet performant. This is classic Spark with Python/Pyspark.
>>>
>>>
>>> Also I would like to know the state of spark with Volcano. Has progress
>>> been made on that front?
>>>
>>>
>>> Regards,
>>>
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc personally.

The Spark K8s pod scheduler is now more pluggable for Yunikorn and Volcano
can be used with less effort.
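A hedged sketch of what that pluggability looks like for Volcano, as I recall
it from the Spark-on-K8s docs (verify the feature-step class and scheduler name
against your Spark version before using):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.kubernetes.scheduler.name", "volcano")
    .config("spark.kubernetes.driver.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .config("spark.kubernetes.executor.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .getOrCreate()
)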

On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh 
wrote:

>
> Hi,
>
>
> Has anyone got experience of running Jupyter on dataproc versus Jupyter
> notebook on GKE (k8).
>
>
> I have not looked at this for a while but my understanding is that Spark
> on GKE/k8 is not yet performant. This is classic Spark with Python/Pyspark.
>
>
> Also I would like to know the state of spark with Volcano. Has progress
> been made on that front?
>
>
> Regards,
>
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Could we make it do the same sort of history server fallback approach?

On Tue, May 17, 2022 at 10:41 PM bo yang  wrote:

> It is like Web Application Proxy in YARN (
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
> to provide easy access for Spark UI when the Spark application is running.
>
> When running Spark on Kubernetes with S3, there is no YARN. The reverse
> proxy here is to behave like that Web Application Proxy. It will
> simplify settings to access Spark UI on Kubernetes.
>
>
> On Mon, May 16, 2022 at 11:46 PM wilson  wrote:
>
>> what's the advantage of using reverse proxy for spark UI?
>>
>> Thanks
>>
>> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>>
>>> Hi Spark Folks,
>>>
>>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>>> together with
>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>>> share here in case other people have similar need.
>>>
>>> The reverse proxy code is here:
>>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>>
>>> Let me know if anyone wants to use or would like to contribute.
>>>
>>> Thanks,
>>> Bo
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Oh that’s rad 

On Tue, May 17, 2022 at 7:47 AM bo yang  wrote:

> Hi Spark Folks,
>
> I built a web reverse proxy to access Spark UI on Kubernetes (working
> together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
> Want to share here in case other people have similar need.
>
> The reverse proxy code is here:
> https://github.com/datapunchorg/spark-ui-reverse-proxy
>
> Let me know if anyone wants to use or would like to contribute.
>
> Thanks,
> Bo
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Holden Karau
You can also put the GS access jar with your Spark jars — that’s what the
class not found exception is pointing you towards.
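A hedged sketch of that approach, wiring the GCS connector in at the session
level (the connector version is an assumption, and depending on packaging you
may need the shaded jar instead of the plain coordinate):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.11")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .getOrCreate()
)

df = spark.read.parquet("gs://some-bucket/some/path/")  # hypothetical bucket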

On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh 
wrote:

> BTW I also answered you in in stackoverflow :
>
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
> wrote:
>
>> You are trying to access a Google storage bucket gs:// from your local
>> host.
>>
>> It does not see it because spark-submit assumes that it is a local file
>> system on the host, which it is not.
>>
>> You need to mount gs:// bucket as a local file system.
>>
>> You can use the tool called gcsfuse
>> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is
>> an open source FUSE  adapter that allows
>> you to mount Cloud Storage buckets as file systems on Linux or macOS
>> systems. You can download gcsfuse from here
>> 
>>
>>
>> Pretty simple.
>>
>>
>> It will be installed as /usr/bin/gcsfuse and you can mount it by creating
>> a local mount file like /mnt/gs as root and give permission to others to
>> use it.
>>
>>
>> As a normal user that needs to access gs:// bucket (not as root), use
>> gcsfuse to mount it. For example I am mounting a gcs bucket called
>> spark-jars-karan here
>>
>>
>> Just use the bucket name itself
>>
>>
>> gcsfuse spark-jars-karan /mnt/gs
>>
>>
>> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>>
>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>
>> HTH
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>
>>> Hello All,
>>>
>>> I'm trying to access gcp buckets while running spark-submit from local,
>>> and running into issues.
>>>
>>> I'm getting error :
>>> ```
>>>
>>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>>> library for your platform... using builtin-java classes where applicable
>>> Exception in thread "main" 
>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>>> scheme "gs"
>>>
>>> ```
>>> I tried adding the --conf
>>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>>
>>> to the spark-submit command, but getting ClassNotFoundException
>>>
>>> Details are in stackoverflow :
>>>
>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>
>>> Any ideas on how to fix this ?
>>> tia !
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Holden Karau
We don’t block scaling up after node failure in classic Spark if that’s the
question.

On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh 
wrote:

> From what I can see in auto scaling setup, you will always need a min of
> two worker nodes as primary. It also states and I quote "Scaling primary
> workers is not recommended due to HDFS limitations which result in
> instability while scaling. These limitations do not exist for secondary
> workers". So the scaling comes with the secondary workers specifying the
> min and max instances. It also defaults to 2 minutes for the so-called auto
> scaling cooldown duration hence that delay observed. I presume task
> allocation to the new executors is FIFO for new tasks. This link
> 
> does some explanation on autoscaling.
>
> Handling Spot Node Loss and Spot Blocks in Spark Clusters
> "When the Spark AM receives the spot loss (Spot Node Loss or Spot Blocks)
> notification from the RM, it notifies the Spark driver. The driver then
> performs the following actions:
>
>1. Identifies all the executors affected by the upcoming node loss.
>2. Moves all of the affected executors to a decommissioning state, and
>no new tasks are scheduled on these executors.
>3. Kills all the executors after reaching 50% of the termination time.
>4. *Starts the failed tasks (if any) on other executors.*
>5. For these nodes, it removes all the entries of the shuffle data
>from the map output tracker on driver after reaching 90% of the termination
>time. This helps in preventing the shuffle-fetch failures due to spot loss.
>6. Recomputes the shuffle data from the lost node by stage
>resubmission and at the time shuffles data of spot node if required."
>7.
>8. So basically when a node fails classic spark comes into play and no
>
>    So basically when a node fails, classic Spark comes into play, no
>    new nodes are added (no rescaling), and tasks are redistributed among
>    the existing executors, as I read it?
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 4 Feb 2022 at 13:55, Sean Owen  wrote:
>
>> I have not seen stack traces under autoscaling, so not even sure what the
>> error in question is.
>> There is always delay in acquiring a whole new executor in the cloud as
>> it usually means a new VM is provisioned.
>> Spark treats the new executor like any other, available for executing
>> tasks.
>>
>> On Fri, Feb 4, 2022 at 4:28 AM Mich Talebzadeh 
>> wrote:
>>
>>> Thanks for the info.
>>>
>>> My concern has always been on how Spark handles autoscaling (adding new
>>> executors) when the load pattern changes.I have tried to test this with
>>> setting the following parameters (Spark 3.1.2 on GCP)
>>>
>>> spark-submit --verbose \
>>> ...
>>>   --conf spark.dynamicAllocation.enabled="true" \
>>>--conf spark.shuffle.service.enabled="true" \
>>>--conf spark.dynamicAllocation.minExecutors=2 \
>>>--conf spark.dynamicAllocation.maxExecutors=10 \
>>>--conf spark.dynamicAllocation.initialExecutors=4 \
>>>
>>> It is not very clear to me how Spark distributes tasks on the added
>>> executors, nor what the source of the delay is. As you have observed, there is
>>> a delay in adding new resources and allocating tasks. Is that process efficient?
>>>
>>> Thanks
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 4 Feb 2022 at 03:04, Maksim Grinman  wrote:
>>>
 It's actually on AWS EMR. The job bootstraps and runs fine -- the
 autoscaling group is to bring up a service that spark will be calling. Some
 code waits for the autoscaling group to come up before continuing
 processing in Spark, since the Spark cluster will need to make requests to
 the service in the autoscaling group. It takes several minutes for the
 service to come up, and during the wait, Spark starts to show these thread
 dumps, as presumably it thinks 

Re: Log4j 1.2.17 spark CVE

2021-12-12 Thread Holden Karau
My understanding is it only applies to log4j 2+ so we don’t need to do
anything.

On Sun, Dec 12, 2021 at 8:46 PM Pralabh Kumar 
wrote:

> Hi developers,  users
>
> Spark is built using log4j 1.2.17. Is there a plan to upgrade based on the
> recently detected CVE?
>
>
> Regards
> Pralabh kumar
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Choice of IDE for Spark

2021-10-01 Thread Holden Karau
Personally I like Jupyter notebooks for my interactive work and then once
I’ve done my exploration I switch back to emacs with either scala-metals or
Python mode.

I think the main takeaway is: do what feels best for you, there is no one
true way to develop in Spark.

On Fri, Oct 1, 2021 at 1:28 AM Mich Talebzadeh 
wrote:

> Thanks guys for your comments.
>
> I agree with you Florian that opening a terminal say in VSC allows you to
> run a shell script (an sh file) to submit your spark code, however, this
> really makes sense if your IDE is running on a Linux host submitting a job
> to a Kubernetes cluster or YARN cluster.
>
> For Python, I will go with PyCharm which is specific to the Python world.
> With Spark, I have used IntelliJ with Spark plug in on MAC for development
> work. Then created a JAR file, gzipped the whole project and scped to an
> IBM sandbox, untarred it and ran it with a pre-prepared shell with
> environment plugin for dev, test, staging etc.
>
> IDE is also useful for looking at csv, tsv type files or creating json
> from one form to another. For json validation, especially if the file is too
> large, you may be restricted from loading the file into a web json validator
> because of the risk of proprietary data being exposed. There is a tool
> called jq  (a lightweight and flexible
> command-line JSON processor), that comes pretty handy to validate json.
> Download and install it on OS and run it as
>
> zcat .tgz | jq
>
> That will validate the whole tarred and gzipped json file. Otherwise most
> of these IDE tools come with add-on plugins, for various needs. My
> preference would be to use the best available IDE for the job. VSC I would
> consider as a general purpose tool. If all fails, one can always use OS
> stuff like vi, vim, sed, awk etc 樂
>
>
> Cheers
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 1 Oct 2021 at 06:55, Florian CASTELAIN <
> florian.castel...@redlab.io> wrote:
>
>> Hello.
>>
>> Any "evolved" code editor allows you to create tasks (or builds, or
>> whatever they are called in the IDE you chose). If you do not find anything
>> that packages by default all you need, you could just create your own tasks.
>>
>>
>> *For yarn, one needs to open a terminal and submit from there. *
>>
>> You can create task(s) that launch your yarn commands.
>>
>>
>> *With VSC, you get stuff for working with json files but I am not sure
>> with a plugin for Python *
>>
>> In your json task configuration, you can launch whatever you want:
>> python, shell. I bet you could launch your favorite video game (just make a
>> task called "let's have a break" )
>>
>> Just to say, if you want everything exactly the way you want, I do not
>> think you will find an IDE that does it. You will have to customize it.
>> (correct me if wrong, of course).
>>
>> Have a good day.
>>
>> *[image: signature_299490615]* 
>>
>>
>>
>> [image: Banner] 
>>
>>
>>
>> *Florian CASTELAIN *
>> *Ingénieur Logiciel*
>>
>> 72 Rue de la République, 76140 Le Petit-Quevilly
>> 
>> m: +33 616 530 226
>> e: florian.castel...@redlab.io w: www.redlab.io
>>
>> --
>> *De :* Jeff Zhang 
>> *Envoyé :* jeudi 30 septembre 2021 13:57
>> *À :* Mich Talebzadeh 
>> *Cc :* user @spark 
>> *Objet :* Re: Choice of IDE for Spark
>>
>> IIRC, you want an IDE for pyspark on yarn ?
>>
>> Mich Talebzadeh  wrote on Thu, Sep 30, 2021 at 7:00 PM:
>>
>> Hi,
>>
>> This may look like a redundant question but it comes about because of the
>> advent of Cloud workstation usage like Amazon workspaces and others.
>>
>> With IntelliJ you are OK with Spark & Scala. With PyCharm you are fine
>> with PySpark and the virtual environment. Mind you as far as I know PyCharm
>> only executes spark-submit in local mode. For yarn, one needs to open a
>> terminal and submit from there.
>>
>> However, in Amazon workstation, you get Visual Studio Code
>>  (VSC, an MS product) and openoffice
>> installed. With VSC, you get stuff for working with json files but I am not
>> sure with a plugin for Python etc, will it be as good as PyCharm? Has
>> anyone used VSC in anger for Spark and if so what is the experience?
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or 

Drop-In Virtual Office Hour round 2 :)

2021-09-28 Thread Holden Karau
Hi Folks,

I'm going to do another drop-in virtual office hour and I've made a public
google calendar to track them so hopefully it's easier for folks to add
events
https://calendar.google.com/calendar/?cid=cXBubTY3Z2VzcmNjbnEzOWIzb3RyOWI1am9AZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
or ics feed at
https://calendar.google.com/calendar/ical/qpnm67gesrccnq39b3otr9b5jo%40group.calendar.google.com/public/basic.ics


Hope to see some of y'all there :)

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Drop-In Virtual Office Half-Hour

2021-09-20 Thread Holden Karau
Hey folks I'm doing my drop-in half-hour now -
http://meet.google.com/ccd-mkbd-gfv :)

On Mon, Sep 13, 2021 at 4:12 PM Holden Karau  wrote:

> Hi Folks,
>
> I'm going to experiment with a drop-in virtual half-hour office hour type
> thing next Monday, if you've got any burning Spark or general OSS questions
> you haven't had the time to ask anyone else I hope you'll swing by and join
> me. If no one comes with questions I'll tour some of the Spark on K8s code
> I've been working on recently -
> https://calendar.google.com/event?action=TEMPLATE=MG84MWJwaGc2YnNuNnFscHJ1c2ptb3NiMm8gaG9sZGVuLmthcmF1QG0=holden.karau%40gmail.com
>
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Drop-In Virtual Office Half-Hour

2021-09-17 Thread Holden Karau
Some folks have had difficulty adding the event so:

https://calendar.google.com/calendar/embed?src=qpnm67gesrccnq39b3otr9b5jo%40group.calendar.google.com=America%2FLos_Angeles

Holden - OSS Virtual Office Half Hour, Drop-Ins Welcome :)
Monday, Sep 20 · 2:30–3 PM Pacific time
Google Meet joining info
Video call link: https://meet.google.com/ccd-mkbd-gfv


On Mon, Sep 13, 2021 at 4:12 PM Holden Karau  wrote:

> Hi Folks,
>
> I'm going to experiment with a drop-in virtual half-hour office hour type
> thing next Monday, if you've got any burning Spark or general OSS questions
> you haven't had the time to ask anyone else I hope you'll swing by and join
> me. If no one comes with questions I'll tour some of the Spark on K8s code
> I've been working on recently -
> https://calendar.google.com/event?action=TEMPLATE=MG84MWJwaGc2YnNuNnFscHJ1c2ptb3NiMm8gaG9sZGVuLmthcmF1QG0=holden.karau%40gmail.com
>
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
Hmmm ok the Google calendar share link is still being difficult sorry
y’all. It’s going to be Monday at 2:30pm pacific time:

Holden - OSS Virtual Office Half Hour, Drop Ins Welcome :)
Monday, Sep 20 · 2:30–3 PM
Google Meet joining info
Video call link: https://meet.google.com/ccd-mkbd-gfv

On Mon, Sep 13, 2021 at 5:11 PM Holden Karau  wrote:

> Ah thanks for pointing that out. I changed the visibility on it to public
> so it should work now.
>
> On Mon, Sep 13, 2021 at 4:26 PM Gourav Sengupta 
> wrote:
>
>> Hi Holden,
>>
>> This is such a wonderful opportunity. Sadly when I click on the link it
>> says event not found.
>>
>> Regards,
>> Gourav
>>
>> On Tue, Sep 14, 2021 at 12:13 AM Holden Karau 
>> wrote:
>>
>>> Hi Folks,
>>>
>>> I'm going to experiment with a drop-in virtual half-hour office hour
>>> type thing next Monday, if you've got any burning Spark or general OSS
>>> questions you haven't had the time to ask anyone else I hope you'll swing
>>> by and join me. If no one comes with questions I'll tour some of the Spark
>>> on K8s code I've been working on recently -
>>> https://calendar.google.com/event?action=TEMPLATE=MG84MWJwaGc2YnNuNnFscHJ1c2ptb3NiMm8gaG9sZGVuLmthcmF1QG0=holden.karau%40gmail.com
>>>
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
Ah thanks for pointing that out. I changed the visibility on it to public
so it should work now.

On Mon, Sep 13, 2021 at 4:26 PM Gourav Sengupta 
wrote:

> Hi Holden,
>
> This is such a wonderful opportunity. Sadly when I click on the link it
> says event not found.
>
> Regards,
> Gourav
>
> On Tue, Sep 14, 2021 at 12:13 AM Holden Karau 
> wrote:
>
>> Hi Folks,
>>
>> I'm going to experiment with a drop-in virtual half-hour office hour type
>> thing next Monday, if you've got any burning Spark or general OSS questions
>> you haven't had the time to ask anyone else I hope you'll swing by and join
>> me. If no one comes with questions I'll tour some of the Spark on K8s code
>> I've been working on recently -
>> https://calendar.google.com/event?action=TEMPLATE=MG84MWJwaGc2YnNuNnFscHJ1c2ptb3NiMm8gaG9sZGVuLmthcmF1QG0=holden.karau%40gmail.com
>>
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
Hi Folks,

I'm going to experiment with a drop-in virtual half-hour office hour type
thing next Monday, if you've got any burning Spark or general OSS questions
you haven't had the time to ask anyone else I hope you'll swing by and join
me. If no one comes with questions I'll tour some of the Spark on K8s code
I've been working on recently -
https://calendar.google.com/event?action=TEMPLATE=MG84MWJwaGc2YnNuNnFscHJ1c2ptb3NiMm8gaG9sZGVuLmthcmF1QG0=holden.karau%40gmail.com


Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
Hi Y'all,

We had an initial meeting which went well, got some more context around
Volcano and its near-term roadmap. Talked about the impact around scheduler
deadlocking and some ways that we could potentially improve integration
from the Spark and Volcano sides, respectively. I'm going to start
creating some sub-issues under
https://issues.apache.org/jira/browse/SPARK-36057

If anyone is interested in being on the next meeting, please reach out and
I'll send an e-mail around to try and schedule a recurring sync that works
for folks.

Cheers,

Holden

On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:

> That's awesome, I'm just starting to get context around Volcano but maybe
> we can schedule an initial meeting for all of us interested in pursuing
> this to get on the same page.
>
> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>
>> Hi team,
>>
>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>> community also has such requirements :)
>>
>> Volcano provides several features for batch workload, e.g. fair-share,
>> queue, reservation, preemption/reclaim and so on.
>> It has been used in several production environments with Spark; if
>> necessary, I can give an overall introduction about Volcano's features and
>> those use cases :)
>>
>> -- Klaus
>>
>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to be diverse and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe from a technical point of view spending time and effort plus
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say I doubt whether such an approach and the so-called democratization
>>> of Spark on whatever platform really should be of great focus.
>>>
>>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A 
>>> fully
>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>> more recently other artefacts) for that past two years, and Spark on
>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>> beast that one can fully commoditize much like one can do with
>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>> effortlessly on these commercial platforms with whatever as a Service.
>>>
>>>
>>> Moreover, Spark (and I stand corrected) from the ground up has already a
>>> lot of resiliency and redundancy built in. It is truly an enterprise class
>>> product (requires enterprise class support) that will be difficult to
>>> commoditize with Kubernetes and expect the same performance. After all,
>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>> for the mass market. In short I can see commercial enterprises will work on
>>> these platforms, but maybe the great talents on the dev team should focus on
>>> stuff like the perceived limitation of SSS in dealing with chain of
>>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>>
>>>> I think these approaches are good, but there are limitations (eg
>>>> dynamic scaling) without us making changes inside of the Spark Kube
>>>> scheduler.
>>>>
>>>> Certainly whichever scheduler extensions we add support for we should
>>>> collaborate with the people developing those extensions insofar as they are
>>>> interested. My first place that I checked was #sig-scheduling which is
>>>> fairly quite on the Kubernetes slack but if there are more places to look
>>>> for folks interested in batch scheduling

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
That's awesome, I'm just starting to get context around Volcano but maybe
we can schedule an initial meeting for all of us interested in pursuing
this to get on the same page.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several production environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that one can fully commoditize much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms, but maybe the great talents on the dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quiet on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>>>> https://github.com/volcano-sh/volcano
>>>>
>>> <https://github.com/volcano-sh/volcano>
>>>>
>>>> There may be value-add in collaborating with such groups through CNCF
>>>> in order to have a c

Re: CVEs

2021-06-21 Thread Holden Karau
If you get to a point where you find something you think is highly likely to
be a valid vulnerability, the best path forward is likely reaching out to
private@ to figure out how to do a security release.

On Mon, Jun 21, 2021 at 4:42 PM Eric Richardson 
wrote:

> Thanks for the quick reply. Yes, since it is included in the jars, it is
> unclear to me whether it is used internally.
>
> I can substitute the jar in the distro to keep the scanner from finding it,
> but then it is unclear whether I could be breaking something or not.
> Given that 3.1.2 is the latest release, I guess you might expect that it
> would pass the scanners but I am not sure if that version spans 3.0.x and
> 3.1.x or not either.
>
> I can report findings in an issue where I am pretty darn sure it is a
> valid vulnerability if that is ok? That at least would raise the
> visibility.
>
> Will 3.2.x be Scala 2.13.x only or cross compiled with 2.12?
>
> I realize Spark is a beast so I just want to help if I can but also not
> create extra work if it is not useful for me or the Spark team/contributors.
>
> On Mon, Jun 21, 2021 at 3:43 PM Sean Owen  wrote:
>
>> Whether it matters really depends on whether the CVE affects Spark.
>> Sometimes it clearly could and so we'd try to back-port dependency updates
>> to active branches.
>> Sometimes it clearly doesn't and hey sometimes the dependency is updated
>> anyway for good measure (mostly to keep this off static analyzer reports)
>> but probably wouldn't backport.
>>
>> Jackson has been a persistent one but in this case Spark is already on
>> 2.12.x in master, and it wasn't clear last time I looked at those CVEs that
>> they can affect Spark itself. End user apps perhaps, but those apps can
>> supply their own Jackson.
>>
>> If someone had a legit view that this is potentially more serious I think
>> we could probably backport that update, but Jackson can be a little bit
>> tricky with compatibility IIRC so it would just bear some testing.
>>
>>
>> On Mon, Jun 21, 2021 at 5:27 PM Eric Richardson 
>> wrote:
>>
>>> Hi,
>>>
>>> I am working with Spark 3.1.2 and getting several vulnerabilities
>>> popping up. I am wondering if the Spark distros are scanned etc. and how
>>> people resolve these.
>>>
>>> For example. I am finding -
>>> https://nvd.nist.gov/vuln/detail/CVE-2020-25649
>>>
>>> This looks like it is fixed in 2.11.0 -
>>> https://github.com/FasterXML/jackson-databind/issues/2589 - but Spark
>>> supplies 2.10.0.
>>>
>>> Thanks,
>>> Eric
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python have their advantages and disadvantages with Spark. In my
experience, when performance is super important you’ll end up needing to do
some of your work in the JVM, but in many situations what matters most is
what your team and company are familiar with and the ecosystem of tooling
for your domain.

Since that can change so much between people and projects I think arguing
about the one true language is likely to be unproductive.

We’re all here because we want Spark and more broadly open source data
tooling to succeed — let’s keep that in mind. There is far too much stress
in the world, and I know I’ve sometimes used word choices I regret
especially this year. Let’s all take the weekend to do something we enjoy
away from Spark :)

On Sat, Oct 17, 2020 at 7:58 AM "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <
yur...@gmail.com> wrote:

> It seems that thread converted to holy war that has nothing to do with
> original question. If it is, it’s super disappointing
>
> Sent from iPhone
>
> > On 17 Oct 2020, at 15:53, Molotch  wrote:
> >
> > I would say the pros and cons of Python vs Scala is both down to Spark,
> the
> > languages in themselves and what kind of data engineer you will get when
> you
> > try to hire for the different solutions.
> >
> > With Pyspark you get less functionality and increased complexity with the
> > py4j java interop compared to vanilla Spark. Why would you want that?
> Maybe
> > you want the Python ML tools and have a clear use case, then go for it.
> If
> > not, avoid the increased complexity and reduced functionality of Pyspark.
> >
> > Python vs Scala? Idiomatic Python is a lesson in bad programming
> > habits/ideas, there's no other way to put it. Do you really want
> programmers
> > enjoying coding i such a language hacking away at your system?
> >
> > Scala might be far from perfect with the plethora of ways to express
> > yourself. But Python < 3.5 is not fit for anything except simple
> scripting
> > IMO.
> >
> > Doing exploratory data analysis in a Jupiter notebook, Pyspark seems
> like a
> > fine idea. Coding an entire ETL library including state management, the
> > whole kitchen including the sink, Scala everyday of the week.
> >
> >
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: REST Structured Steaming Sink

2020-07-01 Thread Holden Karau
On Wed, Jul 1, 2020 at 6:13 PM Burak Yavuz  wrote:

> I'm not sure having a built-in sink that allows you to DDOS servers is the
> best idea either
>
Do you think it would be used accidentally? If so we could have it with
default per server rate limits that people would have to explicitly tune.

> . foreachWriter is typically used for such use cases, not foreachBatch.
> It's also pretty hard to guarantee exactly-once, rate limiting, etc.
>

> Best,
> Burak
>
> On Wed, Jul 1, 2020 at 5:54 PM Holden Karau  wrote:
>
>> I think adding something like this (if it doesn't already exist) could
>> help make structured streaming easier to use; foreachBatch is not the best
>> API.
>>
>> On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim 
>> wrote:
>>
>>> I guess the method, query parameter, header, and the payload would
>>> be all different for almost every use case - that makes it hard to
>>> generalize and requires implementation to be pretty much complicated to be
>>> flexible enough.
>>>
>>> I'm not aware of any custom sink implementing REST so your best bet
>>> would be simply implementing your own with foreachBatch, but so someone
>>> might jump in and provide a pointer if there is something in the Spark
>>> ecosystem.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin 
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>>
>>>> We ingest a lot of restful APIs into our lake and I'm wondering if it is
>>>> at all possible to create a rest sink in structured streaming?
>>>>
>>>> For now I'm only focusing on restful services that have an incremental
>>>> ID so my sink can just poll for new data then ingest.
>>>>
>>>> I can't seem to find a connector that does this and my gut instinct
>>>> tells me it's probably because it isn't possible due to something
>>>> completely obvious that I am missing
>>>>
>>>> I know some RESTful API obfuscate the IDs to a hash of strings and that
>>>> could be a problem but since I'm planning on focusing on just numerical IDs
>>>> that just get incremented I think I won't be facing that issue
>>>>
>>>>
>>>> Can anyone let me know if this sounds like a daft idea? Will I need
>>>> something like Kafka or kinesis as a buffer and redundancy or am I
>>>> overthinking this?
>>>>
>>>>
>>>> I would love to bounce ideas with people who runs structured streaming
>>>> jobs in production
>>>>
>>>>
>>>> Kind regards
>>>> San
>>>>
>>>>
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: REST Structured Steaming Sink

2020-07-01 Thread Holden Karau
I think adding something like this (if it doesn't already exist) could help
make structured streaming easier to use; foreachBatch is not the best API.

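For anyone who wants to experiment in the meantime, a rough foreachBatch
sketch (the endpoint URL and checkpoint path are placeholders, `requests` is
assumed to be installed where the driver runs, and a real sink would still
need retries, rate limiting and idempotency):

```python
import json

import requests  # assumption: available on the driver
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-sink-sketch").getOrCreate()

# Toy source standing in for the incrementally ingested data.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

def post_batch(batch_df, batch_id):
    # foreachBatch hands us an ordinary DataFrame per micro-batch; collecting
    # it on the driver is only sensible for small batches and keeps this simple.
    rows = [json.loads(s) for s in batch_df.toJSON().collect()]
    # Hypothetical endpoint; batch_id is handy for server-side deduplication.
    requests.post("https://example.com/ingest",
                  json={"batch_id": batch_id, "rows": rows},
                  timeout=10)

query = (stream.writeStream
         .foreachBatch(post_batch)
         .option("checkpointLocation", "/tmp/rest-sink-checkpoint")  # placeholder
         .start())
```
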
On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim 
wrote:

> I guess the method, query parameter, header, and the payload would be all
> different for almost every use case - that makes it hard to generalize and
> requires implementation to be pretty much complicated to be flexible enough.
>
> I'm not aware of any custom sink implementing REST so your best bet would
> be simply implementing your own with foreachBatch, but so someone might
> jump in and provide a pointer if there is something in the Spark ecosystem.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin  wrote:
>
>> Hi All,
>>
>>
>> We ingest a lot of restful APIs into our lake and I'm wondering if it is
>> at all possible to create a rest sink in structured streaming?
>>
>> For now I'm only focusing on restful services that have an incremental ID
>> so my sink can just poll for new data then ingest.
>>
>> I can't seem to find a connector that does this and my gut instinct tells
>> me it's probably because it isn't possible due to something completely
>> obvious that I am missing
>>
>> I know some RESTful API obfuscate the IDs to a hash of strings and that
>> could be a problem but since I'm planning on focusing on just numerical IDs
>> that just get incremented I think I won't be facing that issue
>>
>>
>> Can anyone let me know if this sounds like a daft idea? Will I need
>> something like Kafka or kinesis as a buffer and redundancy or am I
>> overthinking this?
>>
>>
>> I would love to bounce ideas with people who runs structured streaming
>> jobs in production
>>
>>
>> Kind regards
>> San
>>
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


[ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Holden Karau
We are happy to announce the availability of Spark 2.4.6!

Spark 2.4.6 is a maintenance release containing stability, correctness, and
security fixes.
This release is based on the branch-2.4 maintenance branch of Spark. We
strongly recommend all 2.4 users to upgrade to this stable release.

To download Spark 2.4.6, head over to the download page:
http://spark.apache.org/downloads.html
Spark 2.4.6 is also available in Maven Central, PyPI, and CRAN.

Note that you might need to clear your browser cache or
use `Private`/`Incognito` mode, depending on your browser.

To view the release notes:
https://spark.apache.org/releases/spark-release-2.4.6.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.


Re: Spark API and immutability

2020-05-25 Thread Holden Karau
So even on RDDs, cache/persist mutate the RDD object. The important thing
for Spark is that the data represented in the RDD/DataFrame isn’t mutated.

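A tiny PySpark sketch of that distinction, assuming a local session:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

df = spark.range(5)
cached = df.cache()      # returns the receiver itself, now flagged for caching
print(cached is df)      # True: the handle was mutated, not replaced
print(df.storageLevel)   # reflects the new storage level (memory + disk by default)

# The data is never mutated: transformations always hand back a new DataFrame.
doubled = df.withColumn("doubled", F.col("id") * 2)
print(doubled is df)     # False
```

So `df.cache()` without assigning the result is valid, which is the pattern
that caught Chris out.
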
On Mon, May 25, 2020 at 10:56 AM Chris Thomas 
wrote:

>
> The cache() method on the DataFrame API caught me out.
>
> Having learnt that DataFrames are built on RDDs and that RDDs are
> immutable, when I saw the statement df.cache() in our codebase I thought
> ‘This must be a bug, the result is not assigned, the statement will have no
> affect.’
>
> However, I’ve since learnt that the cache method actually mutates the
> DataFrame object*. The statement was valid after all.
>
> I understand that the underlying user data is immutable, but doesn’t
> mutating the DataFrame object make the API a little inconsistent and harder
> to reason about?
>
> Regards
>
> Chris
>
>
> * (as does persist and rdd.setName methods. I expect there are others)
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-04-25 Thread Holden Karau
Also it’s ok if Spark and Flink evolve in different directions; we’re both
part of the same open source foundation. Sometimes being everything to
everyone isn’t as important as being the best at what you need.

I like to think of our relationship with other Apache projects as less
competitive and more cooperative, and focus on how can we make open source
data tools successful.

On Sat, Apr 25, 2020 at 11:40 AM  wrote:
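For reference, the continuous processing trigger mentioned in the reply quoted
below looks roughly like this in PySpark; a minimal sketch (the checkpoint
path is a placeholder, and only map-like operations and a few sinks are
supported in this mode):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-trigger-sketch").getOrCreate()

events = (spark.readStream
          .format("rate")               # toy source emitting rows at a fixed rate
          .option("rowsPerSecond", 10)
          .load())

# trigger(continuous=...) switches from micro-batching to the experimental
# continuous mode; the interval is a checkpoint interval, not a batch interval.
query = (events.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/continuous-demo")  # placeholder
         .trigger(continuous="1 second")
         .start())
```
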

> Zahid,
>
>
>
> Starting with Spark 2.3.0, the Spark team introduced an experimental
> feature called “Continuous Processing”[1][2] to enter that space, but in
> general, Spark streaming operates using micro-batches while Flink operates
> using the Continuous Flow Operator model.
>
>
>
> There are many resources online comparing the two but I am leaving you
> one[3] (old, but still relevant)  so you can start looking into it.
>
>
>
> Note that while I am not a subject expert, that’s the basic explanation.
> Until recently we were not competing with Flink in that space, so it
> explains why Flink was preferred at the time and why it would still be
> preferred today. We will catch up eventually.
>
>
>
> [1]
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
>
> [2]
> https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
>
> [3] https://www.youtube.com/watch?v=Dzx-iE6RN4w=emb_title
>
>
>
> *From:* Zahid Rahman 
> *Sent:* Saturday, April 25, 2020 7:36 AM
> *To:* joerg.stre...@posteo.de
> *Cc:* user 
> *Subject:* Re: Watch "Airbus makes more of the sky with Spark - Jesse
> Anderson & Hassene Ben Salem" on YouTube
>
>
>
> My motive is simple. I want you (Spark product experts and users) to
> challenge the reason given by Jesse Anderson for choosing Flink over Spark.
>
>
>
> You know the saying keep your friends close, keep your enemies even closer.
>
> The video  only has  328 views.
>
> It is a great educational tool to see a recent Use Case. Should be
> of compelling interest to anyone in this field. Commercial Companies do not
> often share or discuss their projects openly.
>
>
>
> Incidentally Heathrow is the busiest airport in the world.
>
> 1. Because the emailing facility completed my sentence.
>
>
>
> 2. I think at Heathrow the gap is less than two minutes.
>
>
>
>
>
> On Sat, 25 Apr 2020, 09:42 Jörg Strebel,  wrote:
>
> Hallo!
>
> Well, the title of the video is actually "Airbus makes more of the sky
> with Flink - Jesse Anderson & Hassene Ben Salem" and it talks about Apache
> Flink and specifically not about Apache Spark. They excluded Spark Streaming
> for high latency reasons.
>
> Why are you posting this video on a Spark mailing list?
>
> Regards
>
> J. Strebel
>
> Am 25.04.20 um 05:07 schrieb Zahid Rahman:
>
>
>
> https://youtu.be/sYlbD_OoHhs
>
>
> Backbutton.co.uk
>
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
>
> Make Use Method {MUM}
>
> makeuse.org
>
> --
>
> Jörg Strebel
>
> Aachener Straße 2 
> 
>
> 80804 München 
> 
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Copyright Infringment

2020-04-25 Thread Holden Karau
Glad it makes sense now :)

On Sat, Apr 25, 2020 at 11:25 AM Som Lima  wrote:

> The statement in the book makes sense now. It is based on the premise as
> detailed in paragraph 1).
>
> *unless your are reproducing significant portion of the code. *
>
>
>
>
>
> On Sat, 25 Apr 2020, 17:11 Holden Karau,  wrote:
>
>> I’m one of the authors.
>>
>> I think you’ve misunderstood the licenses here, but I am not a lawyer.
>> This is not legal advice but my understanding is:
>>
>> 1) Spark is Apache licensed so the code you make using Spark doesn’t need
>> to be open source (it’s not GPL)
>>
>> 2) If you want to use examples from the book and you aren’t using a
>> substantial portion of the code from the book go for it. If you are using a
>> substantial portion of the code please talk to O’Reilly (the publisher) for
>> permission.
>> If you look at the book’s example repo you can find information about the
>> license the individual examples are available under, most are Apache
>> licensed but some components examples are GPL licensed.
>>
>> I hope this helps and you’re able to use the examples in the book to get
>> your job done, and thanks for reading the book.
>>
>> On Sat, Apr 25, 2020 at 8:48 AM Som Lima  wrote:
>>
>>> The text is very clear on the issue of copyright infringement. Ask
>>> permission or you are committing an unlawful act.
>>>
>>> The words "significant portion" has not been quantified.
>>>
>>> So I have nothing to ask of the authors except may be to quantify.
>>> Quantification is a secondary issue.
>>>
>>> My reading of the text is that it applies to any spark user and not just
>>> me personally.
>>>
>>> The authors need to make clear to all spark users whether copyright
>>> infringement was their intent or not.
>>>
>>> The authors need to make clear to all spark users why should any
>>> development team share their Use Case in order to avoid falling on the
>>> wrong side
>>> of copyright infringement claims.
>>>
>>> I understand  you are also  a named author of a book on Apache usage.
>>>
>>> Perhaps you can share with us from your expertise  the need or your
>>> motivation  for the addendum to the Apache Spark online usage documents.
>>>
>>> Let me rephrase my question.
>>>
>>> Does any Spark User feel as I do this text is a violation of Apache
>>> foundation's  free licence agreement  ?
>>>
>>>
>>>
>>> On Sat, 25 Apr 2020, 16:18 Sean Owen,  wrote:
>>>
>>>> You'll want to ask the authors directly ; the book is not produced by
>>>> the project itself, so can't answer here.
>>>>
>>>> On Sat, Apr 25, 2020, 8:42 AM Som Lima  wrote:
>>>>
>>>>> At the risk of being removed from the emailing I would like a
>>>>> clarification because I do not want  to commit an unlawful act.
>>>>> Can you please clarify if I would be infringing copyright due to this
>>>>> text.
>>>>> *Book:  High Performance Spark *
>>>>> *authors: holden Karau Rachel Warren.*
>>>>> *page xii:*
>>>>>
>>>>> * This book is here to help you get your job done ... If for example
>>>>> code is offered with this book, you may use it in your programs and
>>>>> documentation. You do not need to contact us for permission unless your 
>>>>> are
>>>>> reproducing significant portion of the code. *
>>>>>
>>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Copyright Infringment

2020-04-25 Thread Holden Karau
I’m one of the authors.

I think you’ve misunderstood the licenses here, but I am not a lawyer. This
is not legal advice but my understanding is:

1) Spark is Apache licensed so the code you make using Spark doesn’t need to
be open source (it’s not GPL)

2) If you want to use examples from the book and you aren’t using a
substantial portion of the code from the book, go for it. If you are using a
substantial portion of the code, please talk to O’Reilly (the publisher) for
permission.
If you look at the book’s example repo you can find information about the
license the individual examples are available under; most are Apache
licensed, but some components’ examples are GPL licensed.

I hope this helps and you’re able to use the examples in the book to get your
job done, and thanks for reading the book.

On Sat, Apr 25, 2020 at 8:48 AM Som Lima  wrote:

> The text is very clear on the issue of copyright infringement. Ask
> permission or you are committing an unlawful act.
>
> The words "significant portion" has not been quantified.
>
> So I have nothing to ask of the authors except may be to quantify.
> Quantification is a secondary issue.
>
> My reading of the text is that it applies to any spark user and not just
> me personally.
>
> The authors need to make clear to all spark users whether copyright
> infringement was their intent or not.
>
> The authors need to make clear to all spark users why should any
> development team share their Use Case in order to avoid falling on the
> wrong side
> of copyright infringement claims.
>
> I understand  you are also  a named author of a book on Apache usage.
>
> Perhaps you can share with us from your expertise  the need or your
> motivation  for the addendum to the Apache Spark online usage documents.
>
> Let me rephrase my question.
>
> Does any Spark User feel as I do this text is a violation of Apache
> foundation's  free licence agreement  ?
>
>
>
> On Sat, 25 Apr 2020, 16:18 Sean Owen,  wrote:
>
>> You'll want to ask the authors directly ; the book is not produced by the
>> project itself, so can't answer here.
>>
>> On Sat, Apr 25, 2020, 8:42 AM Som Lima  wrote:
>>
>>> At the risk of being removed from the emailing I would like a
>>> clarification because I do not want  to commit an unlawful act.
>>> Can you please clarify if I would be infringing copyright due to this
>>> text.
>>> *Book:  High Performance Spark *
>>> *authors: holden Karau Rachel Warren.*
>>> *page xii:*
>>>
>>> * This book is here to help you get your job done ... If for example
>>> code is offered with this book, you may use it in your programs and
>>> documentation. You do not need to contact us for permission unless your are
>>> reproducing significant portion of the code. *
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Going it alone.

2020-04-16 Thread Holden Karau
I want to be clear: I believe the language in janethrope1's email is
unacceptable for the mailing list and possibly a violation of the Apache
code of conduct. I’m glad we don’t see messages like this often.

I know this is a stressful time for many of us, but let’s try and do our
best to not take it out on others.

On Wed, Apr 15, 2020 at 11:46 PM Subash Prabakar 
wrote:

> Looks like he had a very bad appraisal this year.. Fun fact : the coming
> year would be too :)
>
> On Thu, 16 Apr 2020 at 12:07, Qi Kang  wrote:
>
>> Well man, check your attitude, you’re way over the line
>>
>>
>> On Apr 16, 2020, at 13:26, jane thorpe 
>> wrote:
>>
>> F*U*C*K O*F*F
>> C*U*N*T*S
>>
>>
>> --
>> On Thursday, 16 April 2020 Kelvin Qin  wrote:
>>
>> No wonder I said why I can't understand what the mail expresses, it feels
>> like a joke……
>>
>>
>>
>>
>>
>> 在 2020-04-16 02:28:49,seemanto.ba...@nomura.com.INVALID 写道:
>>
>> Have we been tricked by a bot ?
>>
>>
>> *From:* Matt Smith 
>> *Sent:* Wednesday, April 15, 2020 2:23 PM
>> *To:* jane thorpe
>> *Cc:* dh.lo...@gmail.com; user@spark.apache.org; janethor...@aol.com;
>> em...@yeikel.com
>> *Subject:* Re: Going it alone.
>>
>>
>> *CAUTION EXTERNAL EMAIL:* DO NOT CLICK ON LINKS OR OPEN ATTACHMENTS THAT
>> ARE UNEXPECTED OR SENT FROM UNKNOWN SENDERS. IF IN DOUBT REPORT TO SPAM
>> SUBMISSIONS.
>>
>> This is so entertaining.
>>
>>
>> 1. Ask for help
>>
>> 2. Compare those you need help from to a lower order primate.
>>
>> 3. Claim you provided information you did not
>>
>> 4. Explain that providing any information would be "too revealing"
>>
>> 5. ???
>>
>>
>> Can't wait to hear what comes next, but please keep it up.  This is a
>> bright spot in my day.
>>
>>
>>
>> On Tue, Apr 14, 2020 at 4:47 PM jane thorpe 
>> wrote:
>>
>> I did write a long email in response to you.
>> But then I deleted it because I felt it would be too revealing.
>>
>>
>>
>>
>> --
>>
>> On Tuesday, 14 April 2020 David Hesson  wrote:
>>
>> I want to know  if Spark is headed in my direction.
>>
>> You are implying  Spark could be.
>>
>>
>>
>> What direction are you headed in, exactly? I don't feel as if anything
>> were implied when you were asked for use cases or what problem you are
>> solving. You were asked to identify some use cases, of which you don't
>> appear to have any.
>>
>>
>> On Tue, Apr 14, 2020 at 4:49 PM jane thorpe 
>> wrote:
>>
>> That's what  I want to know,  Use Cases.
>> I am looking for  direction as I described and I want to know  if Spark
>> is headed in my direction.
>>
>> You are implying  Spark could be.
>>
>> So tell me about the USE CASES and I'll do the rest.
>> --
>>
>> On Tuesday, 14 April 2020 yeikel valdes  wrote:
>>
>> It depends on your use case. What are you trying to solve?
>>
>>
>>
>>  On Tue, 14 Apr 2020 15:36:50 -0400 *janethor...@aol.com.INVALID
>>  *wrote 
>>
>> Hi,
>>
>> I consider myself to be quite good in Software Development especially
>> using frameworks.
>>
>> I like to get my hands  dirty. I have spent the last few months
>> understanding modern frameworks and architectures.
>>
>> I am looking to invest my energy in a product where I don't have to
>> relying on the monkeys which occupy this space  we call software
>> development.
>>
>> I have found one that meets my requirements.
>>
>> Would Apache Spark be a good Tool for me or  do I need to be a member of
>> a team to develop  products  using Apache Spark  ?
>>
>>
>>
>>
>>
>> PLEASE READ: This message is for the named person's use only. It may
>> contain confidential, proprietary or legally privileged information. No
>> confidentiality or privilege is waived or lost by any mistransmission. If
>> you receive this message in error, please delete it and all copies from
>> your system, destroy any hard copies and notify the sender. You must not,
>> directly or indirectly, use, disclose, distribute, print, or copy any part
>> of this message if you are not the intended recipient. Nomura Holding
>> America Inc., Nomura Securities International, Inc, and their respective
>> subsidiaries each reserve the right to monitor all e-mail communications
>> through its networks. Any views expressed in this message are those of the
>> individual sender, except where the message states otherwise and the sender
>> is authorized to state the views of such entity. Unless otherwise stated,
>> any pricing information in this message is indicative only, is subject to
>> change and does not constitute an offer to deal at any price quoted. Any
>> reference to the terms of executed transactions should be treated as
>> preliminary only and subject to our formal written confirmation.
>>
>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: SPARK Suitable IDE

2020-03-04 Thread Holden Karau
I work in emacs with ensime. I think really any IDE is ok, so go with the
one you feel most at home in.

On Wed, Mar 4, 2020 at 5:49 PM tianlangstudio
 wrote:

> We use IntelliJ IDEA, whether it's Java, Scala or Python
>
> TianlangStudio 
> Some of the biggest lies: I will start tomorrow/Others are better than
> me/I am not good enough/I don't have time/This is the way I am
> 
>
>
> --
> From: Zahid Rahman 
> Sent: Tuesday, March 3, 2020 06:43
> To: user 
> Subject: SPARK Suitable IDE
>
> Hi,
>
> Can you recommend a suitable IDE for Apache sparks from the list below or
> if you know a more suitable one ?
>
> Codeanywhere
> goormIDE
> Koding
> SourceLair
> ShiftEdit
> Browxy
> repl.it
> PaizaCloud IDE
> Eclipse Che
> Visual Studio Online
> Gitpod
> Google Cloud Shell
> Codio
> Codepen
> CodeTasty
> Glitch
> JSitor
> ICEcoder
> Codiad
> Dirigible
> Orion
> Codiva.io
> Collide
> Codenvy
> AWS Cloud9
> JSFiddle
> GitLab
> SLAppForge Sigma
> Jupyter
> CoCalc
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: PySpark Pandas UDF

2019-11-12 Thread Holden Karau
Thanks for sharing that. I think we should maybe add some checks around
this so it’s easier to debug. I’m CCing Bryan who might have some thoughts.

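Until such checks land, the workarounds on Spark 2.4.x are pinning pyarrow
below 0.15 (as Gal did) or asking newer pyarrow to emit the legacy Arrow IPC
format, per the compatibility note in the Spark 2.4 docs. A sketch of the
latter; the executor-env config is the standard mechanism, but verify it
against your own deployment:

```python
import os

from pyspark.sql import SparkSession

# Workaround for pandas UDFs breaking with pyarrow >= 0.15 on Spark 2.3.x/2.4.x:
# ask Arrow to keep writing the pre-0.15 IPC stream format that Spark expects.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"   # driver-side Python

spark = (SparkSession.builder
         .appName("pandas-udf-arrow-compat")
         # Propagate the same flag to executor-side Python workers; on YARN
         # cluster mode the application-master env may need it too
         # (assumption: adjust for your deployment).
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
```
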
On Tue, Nov 12, 2019 at 7:42 AM gal.benshlomo 
wrote:

> SOLVED!
> thanks for the help - I found the issue. it was the version of pyarrow
> (0.15.1) which apparently isn't currently stable. Downgrading it solved the
> issue for me
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: PySpark Pandas UDF

2019-11-10 Thread Holden Karau
Can you swap the write for a count just so we can isolate whether the problem
is in the write or the count?
Also, what’s the output path you’re using?

On Sun, Nov 10, 2019 at 7:31 AM Gal Benshlomo 
wrote:

>
>
> Hi,
>
>
>
> I’m using pandas_udf and not able to run it from cluster mode, even though
> the same code works on standalone.
>
>
>
> The code is as follows:
>
>
>
>
>
>
>
> schema_test = StructType([
> StructField("cluster", LongType()),
> StructField("name", StringType())
> ])
>
>
> @pandas_udf(schema_test, PandasUDFType.GROUPED_MAP)
> def test_foo(pd_df):
> print('\n\nSid is problematic\n\n')
> pd_df['cluster'] = 1
> return pd_df[['name', 'cluster']]
>
>
>
>
>
> department1 = Row(id='123456', name='Computer Science')
> department2 = Row(id='789012', name='Mechanical Engineering')
> users_data = spark.createDataFrame([department1, department2])
> res = users_data.groupby('id').apply(test_foo)
>
> res.write.parquet(RES_OUTPUT_PATH, mode='overwrite')
>
>
>
>
>
> the errors I’m getting are:
>
> ERROR FileFormatWriter: Aborting job c6eefb8c-c8d5-4236-82d7-298924b03b25.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 81
> in stage 1.0 failed 4 times, most recent failure: Lost task 81.3 in stage
> 1.0 (TID 192, 10.10.1.17, executor 1): java.lang.IllegalArgumentException
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
> at
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
> at
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
> at
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
> at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
> at
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
> at
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
> at
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
> at
> 

Re: Why Spark generates Java code and not Scala?

2019-11-10 Thread Holden Karau
If you look inside of the code generation, we generate Java code and compile
it with Janino. For interested folks, the conversation moved over to the dev@
list.

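For the curious, a quick way to look at that generated Java from PySpark (a
small sketch; the printed source is what gets handed to Janino):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codegen-peek").getOrCreate()
spark.range(10).createOrReplaceTempView("t")

# EXPLAIN CODEGEN prints the whole-stage-generated Java source for the plan.
spark.sql("EXPLAIN CODEGEN SELECT id * 2 AS doubled FROM t").show(truncate=False)
```
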
On Sat, Nov 9, 2019 at 10:37 AM Marcin Tustin
 wrote:

> What do you mean by this? Spark is written in a combination of Scala and
> Java, and then compiled to Java Byte Code, as is typical for both Scala and
> Java. If there's additional byte code generation happening, it's java byte
> code, because the platform runs on the JVM.
>
> On Sat, Nov 9, 2019 at 12:47 PM Bartosz Konieczny 
> wrote:
>
>> *This Message originated outside your organization.*
>> --
>> Hi there,
>>
>
>> Few days ago I got an intriguing but hard to answer question:
>> "Why Spark generates Java code and not Scala code?"
>> (https://github.com/bartosz25/spark-scala-playground/issues/18)
>>
>> Since I'm not sure about the exact answer, I'd like to ask you to confirm
>> or not my thinking. I was looking for the reasons in the JIRA and the
>> research paper "Spark SQL: Relational Data Processing in Spark" (
>> http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf) but
>> found nothing explaining why Java over Scala. The single task I found was
>> about why Scala and not Java but concerning data types (
>> https://issues.apache.org/jira/browse/SPARK-5193) That's why I'm writing
>> here.
>>
>> My guesses about choosing Java code are:
>> - Java runtime compiler libs are more mature and prod-ready than the
>> Scala's - or at least, they were at the implementation time
>> - Scala compiler tends to be slower than the Java's
>> https://stackoverflow.com/questions/3490383/java-compile-speed-vs-scala-compile-speed
>> - Scala compiler seems to be more complex, so debugging & maintaining it
>> would be harder
>> - it was easier to represent a pure Java OO design than mixed FP/OO in
>> Scala
>> ?
>>
>> Thank you for your help.
>>
>> --
>> Bartosz Konieczny
>> data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-11-01 Thread Holden Karau
On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris 
wrote:

> have you deactivated the spark.ui?
> I have read several threads explaining the UI can lead to OOM because it
> stores 1000 DAGs by default
>
>
> On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:
> > Dear List,
> >
> > I've observed some sort of memory leak when using pyspark to run ~100
> > jobs in local mode.  Each job is essentially a create RDD -> create DF
> > -> write DF sort of flow.  The RDD and DFs go out of scope after each
> > job completes, hence I call this issue a "memory leak."  Here's
> > pseudocode:
> >
> > ```
> > row_rdds = []
> > for i in range(100):
> >   row_rdd = spark.sparkContext.parallelize([{'a': i} for i in
> range(1000)])
> >   row_rdds.append(row_rdd)
> >
> > for row_rdd in row_rdds:
> >   df = spark.createDataFrame(row_rdd)
> >   df.persist()
> >   print(df.count())
> >   df.write.save(...) # Save parquet
> >   df.unpersist()
> >
> >   # Does not help:
> >   # del df
> >   # del row_rdd
> > ```

The connection between Python GC/del and JVM GC is perhaps a bit weaker
than we might like. There certainly could be a problem here, but it still
shouldn’t be getting to the OOM state.
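
A quick sketch of the knobs mentioned above, plus a way to nudge the JVM
collector from Python, in case that helps narrow things down (config names
are as of Spark 2.4; the retained-history values are placeholders, not
recommendations):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    # .config("spark.ui.enabled", "false")            # option 1: drop the UI entirely
    .config("spark.ui.retainedJobs", "50")            # option 2: keep the UI but
    .config("spark.ui.retainedStages", "100")         # shrink how much job/stage
    .config("spark.sql.ui.retainedExecutions", "50")  # history it holds on to
    .getOrCreate()
)

# Python's del / gc.collect() doesn't immediately translate into JVM collection;
# in local mode the driver JVM is also the executor, so an explicit nudge
# between jobs can help confirm whether the growth is just deferred collection.
spark.sparkContext._jvm.System.gc()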

>
> >
> > In my real application:
> >  * rows are much larger, perhaps 1MB each
> >  * row_rdds are sized to fit available RAM
> >
> > I observe that after 100 or so iterations of the second loop (each of
> > which creates a "job" in the Spark WebUI), the following happens:
> >  * pyspark workers have fairly stable resident and virtual RAM usage
> >  * java process eventually approaches resident RAM cap (8GB standard)
> > but virtual RAM usage keeps ballooning.
> >

Can you share what flags the JVM is launching with? Also which JVM(s) are
ballooning?

>
> > Eventually the machine runs out of RAM and the linux OOM killer kills
> > the java process, resulting in an "IndexError: pop from an empty
> > deque" error from py4j/java_gateway.py .
> >
> >
> > Does anybody have any ideas about what's going on?  Note that this is
> > local mode.  I have personally run standalone masters and submitted a
> > ton of jobs and never seen something like this over time.  Those were
> > very different jobs, but perhaps this issue is bespoke to local mode?
> >
> > Emphasis: I did try to del the pyspark objects and run python GC.
> > That didn't help at all.
> >
> > pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
> >
> > 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
> >
> > Cheers,
> > -Paul
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Loop through Dataframes

2019-10-06 Thread Holden Karau
So if you want to process the contents of a dataframe locally but not pull
all of the data back at once, toLocalIterator is probably what you're
looking for. It's still not great though, so maybe you can share the root
problem that you're trying to solve and folks might have some suggestions
there.
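
For example (sketched in PySpark here, although the question was about Scala;
the Scala Dataset API has the same toLocalIterator method), this walks the
rows on the driver one partition at a time instead of collect()ing everything:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(100000)

# toLocalIterator() streams one partition at a time to the driver rather than
# materialising the whole DataFrame the way collect() does.
total = 0
for row in df.toLocalIterator():
    total += row.id

print(total)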

On Sun, Oct 6, 2019 at 2:49 PM KhajaAsmath Mohammed 
wrote:

>
> Hi,
>
> What is the best approach to loop through 3 dataframes in scala based on
> some keys instead of using collect.
>
> Thanks,
> Asmath
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Announcing .NET for Apache Spark 0.5.0

2019-09-30 Thread Holden Karau
Congratulations on the release :)

On Mon, Sep 30, 2019 at 9:38 AM Terry Kim  wrote:

> We are thrilled to announce that .NET for Apache Spark 0.5.0 has just been
> released!
>
>
>
> Some of the highlights of this release include:
>
>- Delta Lake's *DeltaTable* APIs
>- UDF improvements
>- Support for Spark 2.3.4/2.4.4
>
> The release notes include
> the full list of features/improvements of this release.
>
>
>
> We would like to thank all those who contributed to this release.
>
>
>
> Thanks,
>
> Terry
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Release Apache Spark 2.4.4

2019-08-14 Thread Holden Karau
That looks like more of a feature than a bug fix unless I’m missing
something?

On Tue, Aug 13, 2019 at 11:58 PM Hyukjin Kwon  wrote:

> Adding Shixiong
>
> WDYT?
>
> 2019년 8월 14일 (수) 오후 2:30, Terry Kim 님이 작성:
>
>> Can the following be included?
>>
>> [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in
>> EpochTracker (to support Python UDFs)
>> <https://github.com/apache/spark/pull/24946>
>>
>> Thanks,
>> Terry
>>
>> On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Wed, Aug 14, 2019 at 12:52 PM Holden Karau 
>>> wrote:
>>>
>>>> +1
>>>> Does anyone have any critical fixes they’d like to see in 2.4.4?
>>>>
>>>> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>>>>
>>>>> Seems fine to me if there are enough valuable fixes to justify another
>>>>> release. If there are any other important fixes imminent, it's fine to
>>>>> wait for those.
>>>>>
>>>>>
>>>>> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
>>>>> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > Spark 2.4.3 was released three months ago (8th May).
>>>>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>>>>> `branch-24` since 2.4.3.
>>>>> >
>>>>> > It would be great if we can have Spark 2.4.4.
>>>>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>>>>> >
>>>>> > Last time, there was a request for K8s issue and now I'm waiting for
>>>>> SPARK-27900.
>>>>> > Please let me know if there is another issue.
>>>>> >
>>>>> > Thanks,
>>>>> > Dongjoon.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Holden Karau
+1
Does anyone have any critical fixes they’d like to see in 2.4.4?

On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:

> Seems fine to me if there are enough valuable fixes to justify another
> release. If there are any other important fixes imminent, it's fine to
> wait for those.
>
>
> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.4.3 was released three months ago (8th May).
> > As of today (13th August), there are 112 commits (75 JIRAs) in
> `branch-24` since 2.4.3.
> >
> > It would be great if we can have Spark 2.4.4.
> > Shall we start `2.4.4 RC1` next Monday (19th August)?
> >
> > Last time, there was a request for K8s issue and now I'm waiting for
> SPARK-27900.
> > Please let me know if there is another issue.
> >
> > Thanks,
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Holden Karau
+1

On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:

> +1 and the draft sounds good
>
> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>
>> Here is the draft announcement:
>>
>> ===
>> Plan for dropping Python 2 support
>>
>> As many of you already knew, Python core development team and many
>> utilized Python packages like Pandas and NumPy will drop Python 2 support
>> in or before 2020/01/01. Apache Spark has supported both Python 2 and 3
>> since Spark 1.4 release in 2015. However, maintaining Python 2/3
>> compatibility is an increasing burden and it essentially limits the use of
>> Python 3 features in Spark. Given the end of life (EOL) of Python 2 is
>> coming, we plan to eventually drop Python 2 support as well. The current
>> plan is as follows:
>>
>> * In the next major release in 2019, we will deprecate Python 2 support.
>> PySpark users will see a deprecation warning if Python 2 is used. We will
>> publish a migration guide for PySpark users to migrate to Python 3.
>> * We will drop Python 2 support in a future release in 2020, after Python
>> 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
>> * For releases that support Python 2, e.g., Spark 2.4, their patch
>> releases will continue supporting Python 2. However, after Python 2 EOL, we
>> might not take patches that are specific to Python 2.
>> ===
>>
>> Sean helped make a pass. If it looks good, I'm going to upload it to
>> Spark website and announce it here. Let me know if you think we should do a
>> VOTE instead.
>>
>> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>>
>>> I created https://issues.apache.org/jira/browse/SPARK-27884 to track
>>> the work.
>>>
>>> On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
>>> wrote:
>>>
 We don’t usually reference a future release on website

 > Spark website and state that Python 2 is deprecated in Spark 3.0

 I suspect people will then ask when is Spark 3.0 coming out then. Might
 need to provide some clarity on that.

>>>
>>> We can say the "next major release in 2019" instead of Spark 3.0. Spark
>>> 3.0 timeline certainly requires a new thread to discuss.
>>>
>>>


 --
 *From:* Reynold Xin 
 *Sent:* Thursday, May 30, 2019 12:59:14 AM
 *To:* shane knapp
 *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
 Fen; Xiangrui Meng; dev; user
 *Subject:* Re: Should python-2 be supported in Spark 3.0?

 +1 on Xiangrui’s plan.

 On Thu, May 30, 2019 at 7:55 AM shane knapp 
 wrote:

> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
> support for python2.7 for spark 2.x as the feature sets won't be 
> expanding.
>

> that being said, i will be cracking a bottle of champagne when i can
> delete all of the ansible and anaconda configs for python2.x.  :)
>

>>> On the development side, in a future release that drops Python 2 support
>>> we can remove code that maintains python 2/3 compatibility and start using
>>> python 3 only features, which is also quite exciting.
>>>
>>>

> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: How to preserve event order per key in Structured Streaming Repartitioning By Key?

2018-12-11 Thread Holden Karau
So it's been a while since I poked at the streaming code base, but I don't
think we make any promises about stable sort during repartition, and there
are notes in there about how some of these components should be re-written
into core. So even if we did have a stable sort I wouldn't depend on it
unless it was in the docs as a promise (implementation details can and will
change). It's possible I've just missed something in the docs though.


One possible solution I thought of initially, but which requires complete
output mode, would be to use a range partitioner rather than a hash
partitioner on the primary key you care about, with the second attribute
you want to keep in order (although you could get a split on the primary
key between partitions that way). Then you can apply a "global" sort which,
if it matches, should not have to do a second shuffle.

A kind of ugly approach that I think would work would be to first add
partition indexes to the elements, then re-partition, then do a groupBy +
custom UDAF which ensures the order within the partition. This is a little
ugly but doesn't depend too much on implementation details. You'd do the
aggregate on a window the same size as your input window, with no waiting
for late records.

That being said, while we don't support global sort operations on append or
update output modes for fairly clear reasons, it seems like it might be
reasonable to relax this and support sorting within partitions (i.e.
non-global), but that will require a code change and we can take that
discussion to the dev@ list.
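
For reference, here's what the RDD-level repartition-and-sort mentioned in the
question looks like in a plain (non-streaming) PySpark job - a sketch only,
since as noted you can't feed the resulting RDD back into writeStream; the
toy events and partition count are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# (key, ts, value) events; integer keys keep the partition function trivial.
events = [(1, 3, "x"), (1, 1, "y"), (2, 2, "z"), (2, 1, "w")]

sorted_rdd = (
    sc.parallelize(events)
      # sort key = (key, ts), payload = value
      .map(lambda e: ((e[0], e[1]), e[2]))
      .repartitionAndSortWithinPartitions(
          numPartitions=3,
          partitionFunc=lambda composite_key: composite_key[0],  # partition on key only
      )
)

# Each partition now holds whole keys, with records ordered by (key, ts).
print(sorted_rdd.glom().collect())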

On Mon, Dec 3, 2018 at 2:22 PM pmatpadi  wrote:

> I want to write a structured spark streaming Kafka consumer which reads
> data
> from a one partition Kafka topic, repartitions the incoming data by "key"
> to
> 3 spark partitions while keeping the messages ordered per key, and writes
> them to another Kafka topic with 3 partitions.
>
> I used Dataframe.repartition(3, $"key") which I believe uses
> HashPartitioner.
>
> When I executed the query with fixed-batch interval trigger type, I
> visually
> verified the output messages were in the expected order. My assumption is
> that order is not guaranteed on the resulting partition. I am looking to
> receive some affirmation or veto on my assumption in terms of code pointers
> in the spark code repo or documentation.
>
> I also tried using Dataframe.sortWithinPartitions, however this does not
> seem to be supported on streaming data frame without aggregation.
>
> One option I tried was to convert the Dataframe to RDD and apply
> repartitionAndSortWithinPartitions which repartitions the RDD according to
> the given partitioner and, within each resulting partition, sort records by
> their keys. In this case however, I cannot use the resulting RDD in the
> query.writestream operation to write the result in the output Kafka topic.
>
> 1. Is there a data frame repartitioning API that helps sort the
> repartitioned data in the streaming context?
> 2. Are there any other alternatives?
> 3. Does the default trigger type or fixed-interval trigger type for
> micro-batch execution provide any sort of message ordering guarantees?
> 4. Is there any ordering possible in the Continuous trigger type?
>
> Incoming data:
>
> 
>
> Code:
>
> case class KVOutput(key: String, ts: Long, value: String, spark_partition:
> Int)
>
> val df = spark.readStream.format("kafka")
>   .option("kafka.bootstrap.servers", kafkaBrokers.get)
>   .option("subscribe", Array(kafkaInputTopic.get).mkString(","))
>   .option("maxOffsetsPerTrigger",30)
>   .load()
>
> val inputDf = df.selectExpr("CAST(key AS STRING)","CAST(value AS STRING)")
> val resDf = inputDf.repartition(3, $"key")
>   .select(from_json($"value", schema).as("kv"))
>   .selectExpr("kv.key", "kv.ts", "kv.value")
>   .withColumn("spark_partition", spark_partition_id())
>   .select($"key", $"ts", $"value", $"spark_partition").as[KVOutput]
>   .sortWithinPartitions($"ts", $"value")
>   .select($"key".cast(StringType).as("key"),
> to_json(struct($"*")).cast(StringType).as("value"))
>
> val query = resDf.writeStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", kafkaBrokers.get)
>   .option("topic", kafkaOutputTopic.get)
>   .option("checkpointLocation", checkpointLocation.get)
>   .start()
>
> Error:
>
> When I submit this application, it fails with
> 
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-11-15 Thread Holden Karau
If folks are interested, while it's not on Amazon, I've got a live stream
of getting client mode with a Jupyter notebook to work on GCP/GKE:
https://www.youtube.com/watch?v=eMj0Pv1-Nfo=3=PLRLebp9QyZtZflexn4Yf9xsocrR_aSryx

On Wed, Oct 31, 2018 at 5:55 PM Zhang, Yuqi  wrote:

> Hi Li,
>
>
>
> Thank you very much for your reply!
>
>
>
> > Did you make the headless service that reflects the driver pod name?
>
> I am not sure but I used “app” in the headless service as selector which
> is the same app name for the StatefulSet that will create the spark driver
> pod.
>
> For your reference, I attached the yaml file for making headless service
> and StatefulSet. Could you please help me take a look at it if you have
> time?
>
>
>
> I appreciate for your help & have a good day!
>
>
>
> Best Regards,
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
>
>
>
>
>
>
>
> *From: *Li Gao 
> *Date: *Thursday, November 1, 2018 4:56
> *To: *"Zhang, Yuqi" 
> *Cc: *Gourav Sengupta , "user@spark.apache.org"
> , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> Hi Yuqi,
>
>
>
> Yes we are running Jupyter Gateway and kernels on k8s and using Spark
> 2.4's client mode to launch pyspark. In client mode your driver is running
> on the same pod where your kernel runs.
>
>
>
> I am planning to write some blog post on this on some future date. Did you
> make the headless service that reflects the driver pod name? Thats one of
> critical pieces we automated in our custom code that makes the client mode
> works.
>
>
>
> -Li
>
>
>
>
>
> On Wed, Oct 31, 2018 at 8:13 AM Zhang, Yuqi 
> wrote:
>
> Hi Li,
>
>
>
> Thank you for your reply.
>
> Do you mean running Jupyter client on k8s cluster to use spark 2.4?
> Actually I am also trying to set up JupyterHub on k8s to use spark, that’s
> why I would like to know how to run spark client mode on k8s cluster. If
> there is any related documentation on how to set up the Jupyter on k8s to
> use spark, could you please share with me?
>
>
>
> Thank you for your help!
>
>
>
> Best Regards,
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
>
>
>
>
>
>
>
> *From: *Li Gao 
> *Date: *Thursday, November 1, 2018 0:07
> *To: *"Zhang, Yuqi" 
> *Cc: *"gourav.sengu...@gmail.com" , "
> user@spark.apache.org" , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> Yuqi,
>
>
>
> Your error seems unrelated to headless service config you need to enable.
> For headless service you need to create a headless service that matches to
> your driver pod name exactly in order for spark 2.4 RC to work under client
> mode. We have this running for a while now using Jupyter kernel as the
> driver client.
>
>
>
> -Li
>
>
>
>
>
> On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
> wrote:
>
> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven't tried Glue or EMK, but I guess it's integrating Kubernetes on AWS
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is that I don't know
> how to run spark-shell on Kubernetes…
>
> Since Spark only supports client mode on k8s from version 2.4, which is not
> officially released yet, I would like to ask if there is more detailed
> documentation regarding the way to run spark-shell on a k8s cluster?
>
>
>
> Thank you in advance & best regards!
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>

Re: Is there any Spark source in Java

2018-11-03 Thread Holden Karau
Parts of it are indeed written in Java. You probably want to reach out to
the developers list to talk about changing Spark.

On Sat, Nov 3, 2018, 11:42 AM Soheil Pourbafrani  wrote:

> Hi, I want to customize some part of Spark. I was wondering if any Spark
> source is written in the Java language, or are all the sources in Scala?
>


Code review and Coding livestreams today

2018-10-12 Thread Holden Karau
I’ll be doing my regular weekly code review at 10am Pacific today -
https://youtu.be/IlH-EGiWXK8 with a look at the current RC, and in the
afternoon at 3pm Pacific I'll be doing some live coding around a WIP graceful
decommissioning PR -
https://youtu.be/4FKuYk2sbQ8
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Live Streamed Code Review today at 11am Pacific

2018-09-21 Thread Holden Karau
I'm going to be doing this again tomorrow, Friday the 21st, at 9am -
https://www.youtube.com/watch?v=xb2FsHaozVQ / http://twitch.tv/holdenkarau
:) As always if you have anything you want me to look at in particular send
me a message. https://github.com/apache/spark/pull/22275 (Arrow
out-of-order batches) is my current plan to start with :)

On Thu, Jul 19, 2018 at 11:38 PM Holden Karau  wrote:

> Heads up tomorrows Friday review is going to be at 8:30 am instead of 9:30
> am because I had to move some flights around.
>
> On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau 
> wrote:
>
>> This afternoon @ 3pm pacific I'll be looking at review tooling for Spark
>> & Beam https://www.youtube.com/watch?v=ff8_jbzC8JI.
>>
>> Next week's regular Friday code (this time July 20th @ 9:30am pacific)
>> review will once again probably have more of an ML focus for folks
>> interested in watching Spark ML PRs be reviewed -
>>  https://www.youtube.com/watch?v=aG5h99yb6XE
>> <https://www.youtube.com/watch?v=aG5h99yb6XE>
>>
>> Next week I'll have a live coding session with more of a Beam focus if
>> you want to see something a bit different (but still related since Beam
>> runs on Spark) with a focus on Python dependency management (which is a
>> thing we are also exploring in Spark at the same time) -
>> https://www.youtube.com/watch?v=Sv0XhS2pYqA on July 19th at 2pm pacific.
>>
>> P.S.
>>
>> You can follow more generally me holdenkarau on YouTube
>> <https://www.youtube.com/user/holdenkarau>
>> and holdenkarau on Twitch <https://www.twitch.tv/holdenkarau> to be
>> notified even when I forget to send out the emails (which is pretty often).
>>
>> This morning I did another live review session I forgot to ping to the
>> list about (
>> https://www.youtube.com/watch?v=M_lRFptcGTI=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=31
>>  )
>> and yesterday I did some live coding using PySpark and working on Sparkling
>> ML -
>> https://www.youtube.com/watch?v=kCnBDpNce9A=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=32
>>
>> On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau 
>> wrote:
>>
>>> Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and
>>> see how we validate Spark releases -
>>> https://www.twitch.tv/events/VAg-5PKURQeH15UAawhBtw /
>>> https://www.youtube.com/watch?v=1_XLrlKS26o . Tomorrow @ 12:30 live PR
>>> reviews & Monday live coding - https://youtube.com/user/holdenkarau &
>>> https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage
>>> more folks to help with RC validation & PR reviews :)
>>>
>>> On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau 
>>> wrote:
>>>
>>>> Next week is pride in San Francisco but I'm still going to do two quick
>>>> session. One will be live coding with Apache Spark to collect ASF diversity
>>>> information ( https://www.youtube.com/watch?v=OirnFnsU37A /
>>>> https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
>>>> pacific and the other will be the regular Friday code review (
>>>> https://www.youtube.com/watch?v=IAWm4OLRoyY /
>>>> https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am.
>>>>
>>>> On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau 
>>>> wrote:
>>>>
>>>>> I'll be doing another one tomorrow morning at 9am pacific focused on
>>>>> Python + K8s support & improved JSON support -
>>>>> https://www.youtube.com/watch?v=Z7ZEkvNwneU &
>>>>> https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)
>>>>>
>>>>> On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> If anyone wants to watch the recording:
>>>>>> https://www.youtube.com/watch?v=lugG_2QU6YU
>>>>>>
>>>>>> I'll do one next week as well - March 16th @ 11am -
>>>>>> https://www.youtube.com/watch?v=pXzVtEUjrLc
>>>>>>
>>>>>> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> If your curious in learning more about how Spark is developed, I’m
>>>>>>> going to expirement doing a live code review where folks can watch and 
>>>>>>> see
>>>>&g

Re: Use Arrow instead of Pickle without pandas_udf

2018-07-25 Thread Holden Karau
Not currently. What's the problem with pandas_udf for your use case?
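
For context, a minimal scalar pandas_udf looks like the sketch below (Spark
2.3/2.4-era API, pyarrow installed on the workers): the columns move between
the JVM and the Python worker as Arrow record batches rather than pickled
rows, and as far as I know that is the only supported way to get Arrow on
that path today.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The input/output of a scalar pandas_udf are pandas Series, transferred as
# Arrow record batches instead of pickled Row objects.
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1.0

df = spark.range(5).select(plus_one(col("id").cast("double")).alias("out"))
df.show()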

On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi 
wrote:

> Hi There,
>
>
> Is there a way to use Arrow format instead of Pickle but without using
> pandas_udf ?
>
>
> Thank for your help,
>
>
> Hichame
>



-- 
Twitter: https://twitter.com/holdenkarau


Live Code Reviews, Coding, and Dev Tools

2018-07-24 Thread Holden Karau
Tomorrow afternoon @ 3pm pacific I'll be doing some dev tools poking for
Beam and Spark - https://www.youtube.com/watch?v=6cTmC_fP9B0 for
mention-bot.

On Friday I'll be doing my normal code reviews -
https://www.youtube.com/watch?v=O4rRx-3PTiM

On Monday July 30th @ 9:30am I'll be doing some more coding using PySpark
for data processing - https://www.youtube.com/watch?v=tKFiWKRISdc

On Monday August 6th @ 10am - as some folks have been asking I'll be doing
some live coding on Apache Spark in Scala (exact topic TBD) -
https://www.youtube.com/watch?v=w4IAdU0aVwo

I've also tried to create events for this month on YT and twitch -
https://youtube.com/user/holdenkarau &
https://www.twitch.tv/holdenkarau/events

I'll probably fit some more things in as my conference schedule settles
down a bit. As always if you have any requests (for PRs reviewed or areas)
just give me a shout :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: Live Streamed Code Review today at 11am Pacific

2018-07-20 Thread Holden Karau
Heads up tomorrows Friday review is going to be at 8:30 am instead of 9:30
am because I had to move some flights around.

On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau  wrote:

> This afternoon @ 3pm pacific I'll be looking at review tooling for Spark &
> Beam https://www.youtube.com/watch?v=ff8_jbzC8JI.
>
> Next week's regular Friday code (this time July 20th @ 9:30am pacific)
> review will once again probably have more of an ML focus for folks
> interested in watching Spark ML PRs be reviewed - https://www.youtube.com/
> watch?v=aG5h99yb6XE <https://www.youtube.com/watch?v=aG5h99yb6XE>
>
> Next week I'll have a live coding session with more of a Beam focus if you
> want to see something a bit different (but still related since Beam runs on
> Spark) with a focus on Python dependency management (which is a thing we
> are also exploring in Spark at the same time) - https://www.youtube.com/
> watch?v=Sv0XhS2pYqA on July 19th at 2pm pacific.
>
> P.S.
>
> You can follow more generally me holdenkarau on YouTube
> <https://www.youtube.com/user/holdenkarau>
> and holdenkarau on Twitch <https://www.twitch.tv/holdenkarau> to be
> notified even when I forget to send out the emails (which is pretty often).
>
> This morning I did another live review session I forgot to ping to the
> list about ( https://www.youtube.com/watch?v=M_lRFptcGTI=
> PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=31 ) and yesterday I did some
> live coding using PySpark and working on Sparkling ML -
> https://www.youtube.com/watch?v=kCnBDpNce9A=
> PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=32
>
> On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau 
> wrote:
>
>> Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and
>> see how we validate Spark releases - https://www.twitch.tv/events/V
>> Ag-5PKURQeH15UAawhBtw / https://www.youtube.com/watch?v=1_XLrlKS26o .
>> Tomorrow @ 12:30 live PR reviews & Monday live coding -
>> https://youtube.com/user/holdenkarau & https://www.twitch.tv/holdenka
>> rau/events . Hopefully this can encourage more folks to help with RC
>> validation & PR reviews :)
>>
>> On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau 
>> wrote:
>>
>>> Next week is pride in San Francisco but I'm still going to do two quick
>>> session. One will be live coding with Apache Spark to collect ASF diversity
>>> information ( https://www.youtube.com/watch?v=OirnFnsU37A /
>>> https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
>>> pacific and the other will be the regular Friday code review (
>>> https://www.youtube.com/watch?v=IAWm4OLRoyY / https://www.tw
>>> itch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am.
>>>
>>> On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau 
>>> wrote:
>>>
>>>> I'll be doing another one tomorrow morning at 9am pacific focused on
>>>> Python + K8s support & improved JSON support -
>>>> https://www.youtube.com/watch?v=Z7ZEkvNwneU &
>>>> https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)
>>>>
>>>> On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau 
>>>> wrote:
>>>>
>>>>> If anyone wants to watch the recording: https://www.youtube
>>>>> .com/watch?v=lugG_2QU6YU
>>>>>
>>>>> I'll do one next week as well - March 16th @ 11am -
>>>>> https://www.youtube.com/watch?v=pXzVtEUjrLc
>>>>>
>>>>> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> If your curious in learning more about how Spark is developed, I’m
>>>>>> going to expirement doing a live code review where folks can watch and 
>>>>>> see
>>>>>> how that part of our process works. I have two volunteers already for
>>>>>> having their PRs looked at live, and if you have a Spark PR your working 
>>>>>> on
>>>>>> you’d like me to livestream a review of please ping me.
>>>>>>
>>>>>> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU
>>>>>> .
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Pyspark access to scala/java libraries

2018-07-15 Thread Holden Karau
If you want to see some examples, a library that shows a way to do it is
https://github.com/sparklingpandas/sparklingml, and High Performance Spark
also talks about it.
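
A rough sketch of the py4j pattern those cover, using the hypothetical MyStuff
class from the question below (the transform method and the jar path are
assumptions, not a real API): the JVM proxy lives on the driver and can't be
pickled into a lambda, so you expose a Dataset-to-Dataset method on the Scala
side, call it once from the driver, and wrap the result back up in Python.

from pyspark.sql import DataFrame, SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars", "/path/to/my-scala-lib.jar")  # jar containing MyStuff
    .getOrCreate()
)

# Driver-side only: build the JVM object through the py4j gateway.
my_stuff = spark.sparkContext._jvm.MyStuff(5)

def apply_my_stuff(df: DataFrame) -> DataFrame:
    # Call a (hypothetical) Scala method that takes and returns a Dataset[Row],
    # then wrap the returned Java DataFrame back into a Python DataFrame
    # (Spark 2.x-era wrapper).
    jdf = my_stuff.transform(df._jdf)
    return DataFrame(jdf, df.sql_ctx)

result = apply_my_stuff(spark.range(10))
result.show()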

On Sun, Jul 15, 2018, 11:57 AM <0xf0f...@protonmail.com.invalid> wrote:

> Check
> https://stackoverflow.com/questions/31684842/calling-java-scala-function-from-a-task
>
> ​Sent with ProtonMail Secure Email.​
>
> ‐‐‐ Original Message ‐‐‐
>
> On July 15, 2018 8:01 AM, Mohit Jaggi  wrote:
>
> > Trying again…anyone know how to make this work?
> >
> > > On Jul 9, 2018, at 3:45 PM, Mohit Jaggi mohitja...@gmail.com wrote:
> > >
> > > Folks,
> > >
> > > I am writing some Scala/Java code and want it to be usable from
> pyspark.
> > >
> > > For example:
> > >
> > > class MyStuff(addend: Int) {
> > >
> > > def myMapFunction(x: Int) = x + addend
> > >
> > > }
> > >
> > > I want to call it from pyspark as:
> > >
> > > df = ...
> > >
> > > mystuff = sc._jvm.MyStuff(5)
> > >
> > > df[‘x’].map(lambda x: mystuff.myMapFunction(x))
> > >
> > > How can I do this?
> > >
> > > Mohit.
> >
> > --
> >
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Live Streamed Code Review today at 11am Pacific

2018-07-13 Thread Holden Karau
This afternoon @ 3pm pacific I'll be looking at review tooling for Spark &
Beam https://www.youtube.com/watch?v=ff8_jbzC8JI.

Next week's regular Friday code (this time July 20th @ 9:30am pacific)
review will once again probably have more of an ML focus for folks
interested in watching Spark ML PRs be reviewed -
 https://www.youtube.com/watch?v=aG5h99yb6XE
<https://www.youtube.com/watch?v=aG5h99yb6XE>

Next week I'll have a live coding session with more of a Beam focus if you
want to see something a bit different (but still related since Beam runs on
Spark) with a focus on Python dependency management (which is a thing we
are also exploring in Spark at the same time) -
https://www.youtube.com/watch?v=Sv0XhS2pYqA on July 19th at 2pm pacific.

P.S.

You can more generally follow me (holdenkarau) on YouTube
<https://www.youtube.com/user/holdenkarau>
and holdenkarau on Twitch <https://www.twitch.tv/holdenkarau> to be
notified even when I forget to send out the emails (which is pretty often).

This morning I did another live review session I forgot to ping to the list
about (
https://www.youtube.com/watch?v=M_lRFptcGTI=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=31
)
and yesterday I did some live coding using PySpark and working on Sparkling
ML -
https://www.youtube.com/watch?v=kCnBDpNce9A=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw=32

On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau  wrote:

> Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and
> see how we validate Spark releases - https://www.twitch.tv/events/
> VAg-5PKURQeH15UAawhBtw / https://www.youtube.com/watch?v=1_XLrlKS26o .
> Tomorrow @ 12:30 live PR reviews & Monday live coding -
> https://youtube.com/user/holdenkarau & https://www.twitch.tv/
> holdenkarau/events . Hopefully this can encourage more folks to help with
> RC validation & PR reviews :)
>
> On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau 
> wrote:
>
>> Next week is pride in San Francisco but I'm still going to do two quick
>> session. One will be live coding with Apache Spark to collect ASF diversity
>> information ( https://www.youtube.com/watch?v=OirnFnsU37A /
>> https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
>> pacific and the other will be the regular Friday code review (
>> https://www.youtube.com/watch?v=IAWm4OLRoyY / https://www.tw
>> itch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am.
>>
>> On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau 
>> wrote:
>>
>>> I'll be doing another one tomorrow morning at 9am pacific focused on
>>> Python + K8s support & improved JSON support -
>>> https://www.youtube.com/watch?v=Z7ZEkvNwneU &
>>> https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)
>>>
>>> On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau 
>>> wrote:
>>>
>>>> If anyone wants to watch the recording: https://www.youtube
>>>> .com/watch?v=lugG_2QU6YU
>>>>
>>>> I'll do one next week as well - March 16th @ 11am -
>>>> https://www.youtube.com/watch?v=pXzVtEUjrLc
>>>>
>>>> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> If your curious in learning more about how Spark is developed, I’m
>>>>> going to expirement doing a live code review where folks can watch and see
>>>>> how that part of our process works. I have two volunteers already for
>>>>> having their PRs looked at live, and if you have a Spark PR your working 
>>>>> on
>>>>> you’d like me to livestream a review of please ping me.
>>>>>
>>>>> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


[ANNOUNCE] Apache Spark 2.1.3

2018-07-01 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.3!

Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1
maintenance branch of Spark. We strongly recommend all 2.1.x users to
upgrade to this stable release. The release notes are available at
http://spark.apache.org/releases/spark-release-2-1-3.html

To download Apache Spark 2.1.3 visit http://spark.apache.org/downloads.html.
This version of Spark is also available on Maven and PyPI.

We would like to acknowledge all community members for contributing patches
to this release.

Special thanks to Marcelo Vanzin for making the release, I'm just handling
the last few details this time.


Re: Live Streamed Code Review today at 11am Pacific

2018-06-27 Thread Holden Karau
Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and
see how we validate Spark releases -
https://www.twitch.tv/events/VAg-5PKURQeH15UAawhBtw /
https://www.youtube.com/watch?v=1_XLrlKS26o . Tomorrow @ 12:30 live PR
reviews & Monday live coding - https://youtube.com/user/holdenkarau &
https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage
more folks to help with RC validation & PR reviews :)

On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau  wrote:

> Next week is pride in San Francisco but I'm still going to do two quick
> session. One will be live coding with Apache Spark to collect ASF diversity
> information ( https://www.youtube.com/watch?v=OirnFnsU37A /
> https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
> pacific and the other will be the regular Friday code review (
> https://www.youtube.com/watch?v=IAWm4OLRoyY / https://www.
> twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am.
>
> On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau  wrote:
>
>> I'll be doing another one tomorrow morning at 9am pacific focused on
>> Python + K8s support & improved JSON support -
>> https://www.youtube.com/watch?v=Z7ZEkvNwneU &
>> https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)
>>
>> On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau 
>> wrote:
>>
>>> If anyone wants to watch the recording: https://www.youtube
>>> .com/watch?v=lugG_2QU6YU
>>>
>>> I'll do one next week as well - March 16th @ 11am -
>>> https://www.youtube.com/watch?v=pXzVtEUjrLc
>>>
>>> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> If your curious in learning more about how Spark is developed, I’m
>>>> going to expirement doing a live code review where folks can watch and see
>>>> how that part of our process works. I have two volunteers already for
>>>> having their PRs looked at live, and if you have a Spark PR your working on
>>>> you’d like me to livestream a review of please ping me.
>>>>
>>>> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Live Streamed Code Review today at 11am Pacific

2018-06-14 Thread Holden Karau
Next week is pride in San Francisco but I'm still going to do two quick
session. One will be live coding with Apache Spark to collect ASF diversity
information ( https://www.youtube.com/watch?v=OirnFnsU37A /
https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
pacific and the other will be the regular Friday code review (
https://www.youtube.com/watch?v=IAWm4OLRoyY /
https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am.

On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau  wrote:

> I'll be doing another one tomorrow morning at 9am pacific focused on
> Python + K8s support & improved JSON support -
> https://www.youtube.com/watch?v=Z7ZEkvNwneU &
> https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)
>
> On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau  wrote:
>
>> If anyone wants to watch the recording: https://www.youtube
>> .com/watch?v=lugG_2QU6YU
>>
>> I'll do one next week as well - March 16th @ 11am -
>> https://www.youtube.com/watch?v=pXzVtEUjrLc
>>
>> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
>> wrote:
>>
>>> Hi folks,
>>>
>>> If your curious in learning more about how Spark is developed, I’m going
>>> to expirement doing a live code review where folks can watch and see how
>>> that part of our process works. I have two volunteers already for having
>>> their PRs looked at live, and if you have a Spark PR your working on you’d
>>> like me to livestream a review of please ping me.
>>>
>>> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Live Streamed Code Review today at 11am Pacific

2018-06-07 Thread Holden Karau
I'll be doing another one tomorrow morning at 9am pacific focused on Python
+ K8s support & improved JSON support -
https://www.youtube.com/watch?v=Z7ZEkvNwneU &
https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)

On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau  wrote:

> If anyone wants to watch the recording: https://www.
> youtube.com/watch?v=lugG_2QU6YU
>
> I'll do one next week as well - March 16th @ 11am -
> https://www.youtube.com/watch?v=pXzVtEUjrLc
>
> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau  wrote:
>
>> Hi folks,
>>
>> If your curious in learning more about how Spark is developed, I’m going
>> to expirement doing a live code review where folks can watch and see how
>> that part of our process works. I have two volunteers already for having
>> their PRs looked at live, and if you have a Spark PR your working on you’d
>> like me to livestream a review of please ping me.
>>
>> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>>
>> Cheers,
>>
>> Holden :)
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Spark ML online serving

2018-06-06 Thread Holden Karau
At Spark Summit some folks were talking about model serving and we wanted
to collect requirements from the community.
-- 
Twitter: https://twitter.com/holdenkarau


Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Holden Karau
If it's one 33MB file which decompresses to 1.5GB, then there is also a
chance you need to split the inputs, since gzip is a non-splittable
compression format.
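
A sketch of what that looks like in practice (paths are the ones from the
thread; note the single gzip'd, whole-file JSON array is still parsed by one
task, so that task needs enough executor memory, or the file needs to be
decompressed and split up front):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine=true is needed because the file is one big JSON array, not JSON Lines.
sample_schema = (
    spark.read.option("multiLine", "true").json("/user/tmp/sampledatafile").schema
)

hdf = (
    spark.read
    .schema(sample_schema)
    .option("multiLine", "true")
    .json("/user/tmp/hugedatafile")
    # gzip is non-splittable, so everything lands in one partition on read;
    # repartition right away so downstream stages get some parallelism.
    .repartition(64)
)

hdf.write.mode("overwrite").parquet("/user/tmp/hugedatafile_parquet")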

On Tue, Jun 5, 2018 at 11:55 AM Anastasios Zouzias 
wrote:

> Are you sure that your JSON file has the right format?
>
> spark.read.json(...) expects a file where *each line is a json object*.
>
> My wild guess is that
>
> val hdf=spark.read.json("/user/tmp/hugedatafile")
> hdf.show(2) or hdf.take(1) gives OOM
>
> tries to fetch all the data into the driver. Can you reformat your input
> file and try again?
>
> Best,
> Anastasios
>
>
>
> On Tue, Jun 5, 2018 at 8:39 PM, raksja  wrote:
>
>> I have a json file which is a continuous array of objects of similar type
>> [{},{}...] for about 1.5GB uncompressed and 33MB gzip compressed.
>>
>> This is uploaded hugedatafile to hdfs and this is not a JSONL file, its a
>> whole regular json file.
>>
>>
>> [{"id":"1","entityMetadata":{"lastChange":"2018-05-11
>> 01:09:18.0","createdDateTime":"2018-05-11
>> 01:09:18.0","modifiedDateTime":"2018-05-11
>>
>> 01:09:18.0"},"type":"11"},{"id":"2","entityMetadata":{"lastChange":"2018-05-11
>> 01:09:18.0","createdDateTime":"2018-05-11
>> 01:09:18.0","modifiedDateTime":"2018-05-11
>>
>> 01:09:18.0"},"type":"11"},{"id":"3","entityMetadata":{"lastChange":"2018-05-11
>> 01:09:18.0","createdDateTime":"2018-05-11
>> 01:09:18.0","modifiedDateTime":"2018-05-11
>> 01:09:18.0"},"type":"11"}..]
>>
>>
>> I get OOM on executors whenever i try to load this into spark.
>>
>> Try 1
>> val hdf=spark.read.json("/user/tmp/hugedatafile")
>> hdf.show(2) or hdf.take(1) gives OOM
>>
>> Try 2
>> Took a small sampledatafile and got schema to avoid schema infering
>> val sampleSchema=spark.read.json("/user/tmp/sampledatafile").schema
>> val hdf=spark.read.schema(sampleSchema).json("/user/tmp/hugedatafile")
>> hdf.show(2) or hdf.take(1) stuck for 1.5 hrs and gives OOM
>>
>> Try 3
>> Repartition it after before performing action
>> gives OOM
>>
>> Try 4
>> Read about the https://issues.apache.org/jira/browse/SPARK-20980
>> completely
>> val hdf = spark.read.option("multiLine",
>> true)..schema(sampleSchema).json("/user/tmp/hugedatafile")
>> hdf.show(1) or hdf.take(1) gives OOM
>>
>>
>> Can any one help me here?
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> -- Anastasios Zouzias
> 
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: testing frameworks

2018-05-30 Thread Holden Karau
So Jesse has an excellent blog post on how to use it with Java
applications -
http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
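
For the PySpark side of this thread, if a heavier framework isn't wanted, a
bare-bones pytest setup (no extra plugins assumed) that drops into CI looks
roughly like this:

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="module")
def spark():
    # One local SparkSession shared across the test module.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_word_lengths(spark):
    df = spark.createDataFrame([("spark",), ("testing",)], ["word"])
    result = df.selectExpr("word", "length(word) AS len").collect()
    assert [(r.word, r.len) for r in result] == [("spark", 5), ("testing", 7)]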

On Wed, May 30, 2018 at 4:14 AM Spico Florin  wrote:

> Hello!
>   I'm also looking for unit testing spark Java application. I've seen the
> great work done in  spark-testing-base but it seemed to me that I could
> not use for Spark Java applications.
> Only spark scala applications are supported?
> Thanks.
> Regards,
>  Florin
>
> On Wed, May 23, 2018 at 8:07 AM, umargeek 
> wrote:
>
>> Hi Steve,
>>
>> you can try out pytest-spark plugin if your writing programs using pyspark
>> ,please find below link for reference.
>>
>> https://github.com/malexer/pytest-spark
>> 
>>
>> Thanks,
>> Umar
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
> --
Twitter: https://twitter.com/holdenkarau


Re: testing frameworks

2018-05-21 Thread Holden Karau
So I'm biased as the author of spark-testing-base, but I think it's pretty
OK. Are you looking for unit testing, integration testing, or something else?

On Mon, May 21, 2018 at 5:24 AM Steve Pruitt  wrote:

> Hi,
>
>
>
> Can anyone recommend testing frameworks suitable for Spark jobs.
> Something that can be integrated into a CI tool would be great.
>
>
>
> Thanks.
>
>
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Holden Karau
So, while it's not perfect, I have a guide focused on running custom Spark
on GKE
https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
and
if you want to run pre-built Spark on GKE there is a solutions article -
https://cloud.google.com/solutions/spark-on-kubernetes-engine which could
be relevant.

On Mon, Apr 30, 2018 at 7:51 PM, Eric Wang  wrote:

> Hello all,
>
> I've been trying to spark-submit a job to the Google Kubernetes Engine but
> I keep encountering a "Exception in thread "main"
> java.lang.IllegalArgumentException: Server properties file given at
> /opt/spark/work-dir/driver does not exist or is not a file."
> error. I'm unsure of how to even begin debugging this so any help would be
> greatly appreciated. I've attached the logs and the full spark-submit
> command I'm running here: https://gist.github.com/
> erkkel/c04a0b5ca60ad755cf62e9ad18e5b7ed
>
> For reference, I've been following this guide: https://apache-spark-
> on-k8s.github.io/userdocs/running-on-kubernetes.html
>
> Thanks,
> Eric
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Live Stream Code Reviews :)

2018-04-13 Thread Holden Karau
Thank you :)

Just a reminder, this is going to start in under 20 minutes. If anyone has a
PR they'd like reviewed live, please respond and I'll add it to the list
(otherwise I'll stick to the normal list of folks who have opted in to
live reviews).

On Thu, Apr 12, 2018 at 2:08 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> This is definitely one of the best messages ever in this group. The videos
> are absolutely fantastic in case anyone is trying to learn about
> contributing to SPARK, I had been through one of them. Just trying to
> repeat the steps in the video (without of course doing anything really
> stupid) makes a person learn quite a lot.
>
> Thanks a ton, Holden for the great help.
>
> Also if you click on the link to the video it does show within how many
> hours will the session be active so we do not have to worry about the time
> zone I guess.
>
> Regards,
> Gourav Sengupta
>
> On Thu, Apr 12, 2018 at 8:23 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Hi Y'all,
>>
>> If your interested in learning more about how the development process in
>> Apache Spark works I've been doing a weekly live streamed code review most
>> Fridays at 11am. This weeks will be on twitch/youtube (
>> https://www.twitch.tv/holdenkarau / https://www.youtube.com/watc
>> h?v=vGVSa9KnD80 ). If you have a PR into Spark (or a related project)
>> you'd like me to review live let me know and I'll add it to my queue.
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
Ah yes, really good point: 11am Pacific :)

On Thu, Apr 12, 2018 at 1:01 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> PST I believe, like last time.
> Works out to 9pm BST & 10pm CET if I'm correct.
>
> On Thu, Apr 12, 2018, 8:47 PM Matteo Olivi <matteooli...@gmail.com> wrote:
>
>> Hi,
>> 11 am in which timezone?
>>
>> On Thu, 12 Apr 2018 at 21:23, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> Hi Y'all,
>>>
>>> If your interested in learning more about how the development process in
>>> Apache Spark works I've been doing a weekly live streamed code review most
>>> Fridays at 11am. This weeks will be on twitch/youtube (
>>> https://www.twitch.tv/holdenkarau / https://www.youtube.com/
>>> watch?v=vGVSa9KnD80 ). If you have a PR into Spark (or a related
>>> project) you'd like me to review live let me know and I'll add it to my
>>> queue.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>


-- 
Twitter: https://twitter.com/holdenkarau


Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
Hi Y'all,

If you're interested in learning more about how the development process in
Apache Spark works, I've been doing a weekly live streamed code review most
Fridays at 11am. This week's will be on twitch/youtube (
https://www.twitch.tv/holdenkarau /
https://www.youtube.com/watch?v=vGVSa9KnD80 ). If you have a PR into Spark
(or a related project) you'd like me to review live let me know and I'll
add it to my queue.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Holden Karau
Super exciting! I look forward to digging through it this weekend.

On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:

> Excellent. You filled a missing link.
>
> Best,
> Passion
>
> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia 
> wrote:
>
>> Hi,
>>
>> Happy to announce the availability of Sparklens as an open source project.
>> It helps in understanding the scalability limits of Spark applications and
>> can be a useful guide on the path towards tuning applications for lower
>> runtime or cost.
>>
>> Please clone from here: https://github.com/qubole/sparklens
>> Old blogpost:
>> https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>
>> thanks,
>> rohitk
>>
>> PS: Thanks for the patience. It took couple of months to get back on
>> this.
>>
>>
>>
>>
>>
> --
Twitter: https://twitter.com/holdenkarau


Re: Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
If anyone wants to watch the recording:
https://www.youtube.com/watch?v=lugG_2QU6YU

I'll do one next week as well - March 16th @ 11am -
https://www.youtube.com/watch?v=pXzVtEUjrLc

On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Hi folks,
>
> If your curious in learning more about how Spark is developed, I’m going
> to expirement doing a live code review where folks can watch and see how
> that part of our process works. I have two volunteers already for having
> their PRs looked at live, and if you have a Spark PR your working on you’d
> like me to livestream a review of please ping me.
>
> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>
> Cheers,
>
> Holden :)
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
Hi folks,

If you're curious about learning more about how Spark is developed, I'm going
to experiment with doing a live code review where folks can watch and see how
that part of our process works. I have two volunteers already for having their
PRs looked at live, and if you have a Spark PR you're working on that you'd
like me to livestream a review of, please ping me.

The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.

Cheers,

Holden :)
-- 
Twitter: https://twitter.com/holdenkarau


Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread Holden Karau
You can also look at the shuffle file cleanup tricks we do inside of the
ALS algorithm in Spark.

On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp  wrote:

> have you looked at
> http://apache-spark-user-list.1001560.n3.nabble.com/Limit-
> Spark-Shuffle-Disk-Usage-td23279.html
>
> and the post mentioned there
> https://forums.databricks.com/questions/277/how-do-i-avoid-
> the-no-space-left-on-device-error.html
>
> also try compressing the output
> https://spark.apache.org/docs/latest/configuration.html#
> compression-and-serialization
> spark.shuffle.compress
>
> thanks
> Vijay
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Can spark handle this scenario?

2018-02-16 Thread Holden Karau
I'm not sure what you mean by it could be hard to serialize complex
operations?

Regardless I think the question is do you want to parallelize this on
multiple machines or just one?

On Feb 17, 2018 4:20 PM, "Lian Jiang"  wrote:

> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However,
> it could be hard to serialize complex operation for Spark to execute in
> parallel. IMHO, spark does not fit this scenario. Hope this makes sense.
>
> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha  wrote:
>
>> ** You do NOT need dataframes, I mean.
>>
>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha  wrote:
>>
>>> Hi
>>>
>>> Couple of suggestions:
>>>
>>> 1. Do not use Dataset, use Dataframe in this scenario. There is no
>>> benefit of dataset features here. Using Dataframe, you can write an
>>> arbitrary UDF which can do what you want to do.
>>> 2. In fact you do need dataframes here. You would be better off with RDD
>>> here. just create a RDD of symbols and use map to do the processing.
>>>
>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran 
>>> wrote:
>>>
 Do you only want to use Scala? Because otherwise, I think with pyspark
 and pandas read table you should be able to accomplish what you want to
 accomplish.

 Thank you,

 Irving Duran

 On 02/16/2018 06:10 PM, Lian Jiang wrote:

 Hi,

 I have a user case:

 I want to download S stock data from Yahoo API in parallel using
 Spark. I have got all stock symbols as a Dataset. Then I used below code to
 call Yahoo API for each symbol:



 case class Symbol(symbol: String, sector: String)

 case class Tick(symbol: String, sector: String, open: Double, close:
 Double)


 // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns
 Dataset[Tick]


 symbolDs.map { k =>

   pullSymbolFromYahoo(k.symbol, k.sector)

 }


 This statement cannot compile:


 Unable to find encoder for type stored in a Dataset.  Primitive types
 (Int, String, etc) and Product types (case classes) are supported by
 importing spark.implicits._  Support for serializing other types will
 be added in future releases.


 My questions are:


 1. As you can see, this scenario is not traditional dataset handling
 such as count, sql query... Instead, it is more like a UDF which apply
 random operation on each record. Is Spark good at handling such scenario?


 2. Regarding the compilation error, any fix? I did not find a
 satisfactory solution online.


 Thanks for help!






>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: pyspark+spacy throwing pickling exception

2018-02-15 Thread Holden Karau
So you left out the exception. On one hand I'm also not sure how well spacy
serializes, so to debug this I would start off by moving the nlp =
spacy.load('en') line inside of the function and see if it still fails.
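
For example, a minimal sketch of that suggestion, using mapPartitions so the
model is built once per partition on the executors instead of being pickled
from the driver (this reuses the description RDD and the f function from the
mail below):

def get_phrases_partition(rows):
    import spacy
    nlp = spacy.load('en')  # loaded on the executor, never serialized
    for row in rows:
        doc = nlp(str(row.desc))
        for chunk in doc.noun_chunks:
            yield chunk.text

description.rdd.mapPartitions(get_phrases_partition).foreach(f)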

On Thu, Feb 15, 2018 at 9:08 PM Selvam Raman  wrote:

> import spacy
>
> nlp = spacy.load('en')
>
>
>
> def getPhrases(content):
> phrases = []
> doc = nlp(str(content))
> for chunks in doc.noun_chunks:
> phrases.append(chunks.text)
> return phrases
>
> the above function will retrieve the noun phrases from the content and
> return list of phrases.
>
>
> def f(x) : print(x)
>
>
> description = 
> xmlData.filter(col("dcterms:description").isNotNull()).select(col("dcterms:description").alias("desc"))
>
> description.rdd.flatMap(lambda row: getPhrases(row.desc)).foreach(f)
>
> when i am trying to access getphrases i am getting below exception
>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>
-- 
Twitter: https://twitter.com/holdenkarau


FOSDEM mini-office hour?

2018-01-31 Thread Holden Karau
Hi Spark Friends,

If any folks are around for FOSDEM this year I was planning on doing a
coffee office hour on the last day after my talks. Maybe like 6pm?
I'm also going to see if any BEAM folks are around and interested :)

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: Spark Tuning Tool

2018-01-22 Thread Holden Karau
That's very interesting, and might also get some interest on the dev@ list
if it was open source.

On Tue, Jan 23, 2018 at 4:02 PM, Roger Marin  wrote:

> I'd be very interested.
>
> On 23 Jan. 2018 4:01 pm, "Rohit Karlupia"  wrote:
>
>> Hi,
>>
>> I have been working on making the performance tuning of spark
>> applications bit easier.  We have just released the beta version of the
>> tool on Qubole.
>>
>> https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>
>> This is not OSS yet but we would like to contribute it to OSS.  Fishing
>> for some interest in the community if people find this work interesting and
>> would like to try to it out.
>>
>> thanks,
>> Rohit Karlupia
>>
>>
>>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Access to Applications metrics

2017-12-05 Thread Holden Karau
I've done a SparkListener to record metrics for validation (it's a bit out
of date). Are you just looking to have graphing/alerting set up on the
Spark metrics?

On Tue, Dec 5, 2017 at 1:53 PM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> You can also get the metrics from the Spark application events log file.
>
>
>
> See https://www.slideshare.net/JayeshThakrar/apache-
> bigdata2017sparkprofiling
>
>
>
>
>
> *From: *"Qiao, Richard" 
> *Date: *Monday, December 4, 2017 at 6:09 PM
> *To: *Nick Dimiduk , "user@spark.apache.org" <
> user@spark.apache.org>
> *Subject: *Re: Access to Applications metrics
>
>
>
> It works to collect Job level, through Jolokia java agent.
>
>
>
> Best Regards
>
> Richard
>
>
>
>
>
> *From: *Nick Dimiduk 
> *Date: *Monday, December 4, 2017 at 6:53 PM
> *To: *"user@spark.apache.org" 
> *Subject: *Re: Access to Applications metrics
>
>
>
> Bump.
>
>
>
> On Wed, Nov 15, 2017 at 2:28 PM, Nick Dimiduk  wrote:
>
> Hello,
>
>
>
> I'm wondering if it's possible to get access to the detailed
> job/stage/task level metrics via the metrics system (JMX, Graphite, ).
> I've enabled the wildcard sink and I do not see them. It seems these values
> are only available over http/json and to SparkListener instances, is this
> the case? Has anyone worked on a SparkListener that would bridge data from
> one to the other?
>
>
>
> Thanks,
>
> Nick
>
>
>
>
> --
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Recommended way to serialize Hadoop Writables' in Spark

2017-12-03 Thread Holden Karau
So is there a reason you want to shuffle Hadoop types rather than the Java
types?

As for your specific question, for Kryo you also need to register your
serializers, did you do that?

On Sun, Dec 3, 2017 at 10:02 AM pradeepbaji  wrote:

> Hi,
>
> Is there any recommended way of serializing Hadoop Writables' in Spark?
> Here is my problem.
>
> Question1:
> I have a pair RDD which is created by reading a SEQ[LongWritable,
> BytesWritable]:
> RDD[(LongWritable, BytesWritable)]
>
> I have these two settings set in spark conf.
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.kryo.registrator=MyCustomRegistrator
>
> Inside the MyCustomRegistrator, I registered both LongWritable and
> BytesWritable classes.
> kryo.register(classOf[LongWritable])
> kryo.register(classOf[BytesWritable])
>
> The total size of the SEQ[LongWritable, BytesWritable] that I read to
> create
> the RDD[(LongWritable, BytesWritable)] is *800MB*. I have 10 executors in
> my
> job with 10GB of memory. I am performing reduceByKey operation on this RDD
> and I see very huge Shuffle writes of 10GB on each executor which doesn't
> make any sense. Also the reduceByKey stage is very very slow and sometimes
> executors throw OOM exception.
>
> Can someone explain this shuffle behavior in Spark? Why does Spark show
> 100GB of shuffle writes for 800MB of input data?
> Also when I convert RDD[(LongWritable,BytesWritable)] to RDD[Long,
> CustomObject] , the reduceByKey operation takes only 30 seconds to finish
> and is very fast.
>
> Question2:
> Now for the same job this time I wrote custom serializers for LongWritable
> and BytesWritable. Here is the code.
>
>
> import com.esotericsoftware.kryo.{Kryo, Serializer}
> import com.esotericsoftware.kryo.io.{Input, Output}
> import org.apache.hadoop.io.{BytesWritable, LongWritable}
> /**
>   * Kryo Custom Serializer for serializing LongWritable
>   */
> class LongWritableSerializer extends Serializer[LongWritable] {
>   override def write(kryo: Kryo, output: Output, obj: LongWritable): Unit =
> {
> output.writeLong(obj.get())
>   }
>   override def read(kryo: Kryo,
> input: Input,
> clazz: Class[LongWritable]): LongWritable = {
> val longVal = input.readLong()
> new LongWritable(longVal)
>   }
> }
> /**
>   * Kryo Custom Serializer for serializing BytesWritable
>   */
> class BytesWritableSerializer extends Serializer[BytesWritable] {
>   override def write(kryo: Kryo, output: Output, obj: BytesWritable): Unit
> =
> {
> val bytes = obj.getBytes
> output.writeInt(bytes.size)
> output.writeBytes(bytes)
>   }
>   override def read(kryo: Kryo,
> input: Input,
> clazz: Class[BytesWritable]): BytesWritable = {
> val length = input.readInt()
> val bytes = input.readBytes(length)
> new BytesWritable(bytes)
>   }
> }
>
>
>
> And then I registered these with Kryo inside MyCustomRegistrator.
> kryo.register(classOf[LongWritable], new LongWritableSerializer())
> kryo.register(classOf[BytesWritable], new BytesWritableSerializer())
>
> I still see the same behavior. Can someone also check this?
>
>
> Thanks,
> Pradeep
>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau


Re: Is Databricks REST API open source ?

2017-12-02 Thread Holden Karau
That API is not open source. There are some other options as separate
projects you can check out (like Livy, spark-jobserver, etc).

On Sat, Dec 2, 2017 at 8:30 PM kant kodali  wrote:

> HI All,
>
> Is REST API (https://docs.databricks.com/api/index.html) open source?
> where I can submit spark jobs over rest?
>
> Thanks!
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: NLTK with Spark Streaming

2017-11-26 Thread Holden Karau
So it’s certainly doable (it’s not super easy mind you), but until the
arrow udf release goes out it will be rather slow.
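
For example, a rough sketch of one way to wire it up from PySpark today
(lines_dstream is a made-up DStream of strings, and NLTK plus the corpora it
needs are assumed to be installed on every worker):

def tag_partition(texts):
    import nltk  # imported on the workers
    for text in texts:
        yield nltk.pos_tag(nltk.word_tokenize(text))

tagged = lines_dstream.mapPartitions(tag_partition)
tagged.pprint()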

On Sun, Nov 26, 2017 at 8:01 AM ashish rawat  wrote:

> Hi,
>
> Has someone tried running NLTK (python) with Spark Streaming (scala)? I
> was wondering if this is a good idea and what are the right Spark operators
> to do this? The reason we want to try this combination is that we don't
> want to run our transformations in python (pyspark), but after the
> transformations, we need to run some natural language processing operations
> and we don't want to restrict the functions data scientists' can use to
> Spark natural language library. So, Spark streaming with NLTK looks like
> the right option, from the perspective of fast data processing and data
> science flexibility.
>
> Regards,
> Ashish
>
-- 
Twitter: https://twitter.com/holdenkarau


What do you pay attention to when validating Spark jobs?

2017-11-21 Thread Holden Karau
Hi Folks,

I'm working on updating a talk and I was wondering if any folks in the
community wanted to share their best practices for validating your Spark
jobs? Are there any counters folks have found useful for
monitoring/validating your Spark jobs?

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Holden Karau
What command did you use to launch your Spark application? The
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
documentation suggests using spark-submit with the `--packages` flag to
include the required Kafka package. e.g.

./bin/spark-submit --packages
org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 ...



On Mon, Nov 20, 2017 at 3:07 PM, salemi  wrote:

> Hi All,
>
> we are trying to use DataFrames approach with Kafka 0.10 and PySpark 2.2.0.
> We followed the instruction on the wiki
> https://spark.apache.org/docs/latest/structured-streaming-
> kafka-integration.html.
> We coded something similar to the code below using Python:
> df = spark \
>   .read \
>   .format("kafka") \
>   .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
>   .option("subscribe", "topic1") \
>   .load()
>
> But we are getting the exception below. Does PySpark 2.2.0 support
> DataFrames with Kafka 0.10? If yes, what could be the root cause for the
> exception below?
>
> Thank you,
> Ali
>
> Exception:
> py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
> : java.lang.ClassNotFoundException: Failed to find data source: kafka.
> Please find packages at http://spark.apache.org/third-party-projects.html
> at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(
> DataSource.scala:549)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$
> lzycompute(DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(
> DataSource.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(
> DataSource.scala:195)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$
> lzycompute(DataSource.scala:87)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(
> DataSource.scala:87)
> at
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(
> StreamingRelation.scala:30)
> at
> org.apache.spark.sql.streaming.DataStreamReader.
> load(DataStreamReader.scala:150)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
> 62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(
> ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.
> java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$
> anonfun$apply$12.apply(DataSource.scala:533)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$
> anonfun$apply$12.apply(DataSource.scala:533)
> at scala.util.Try$.apply(Try.scala:192)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(
> DataSource.scala:533)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(
> DataSource.scala:533)
> at scala.util.Try.orElse(Try.scala:84)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(
> DataSource.scala:533)
> ... 18 more
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Use of Accumulators

2017-11-14 Thread Holden Karau
And where do you want to read the toggle back from? On the driver?
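
For what it's worth, a small sketch of the accumulator-as-flag pattern (the
names and the needs_fixup/fixup helpers are made up). Keep in mind the value
is only visible on the driver, only after an action has run, and updates
made inside transformations can be applied again if a task is retried:

changed = sc.accumulator(0)

def process(record):
    if needs_fixup(record):   # placeholder predicate
        changed.add(1)
        return fixup(record)  # placeholder transformation
    return record

result = rdd.map(process).collect()  # the action materializes the updates
if changed.value > 0:
    print("some records were modified")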

On Tue, Nov 14, 2017 at 3:52 AM Kedarnath Dixit <
kedarnath_di...@persistent.com> wrote:

> Hi,
>
>
>
> Inside the transformation if there is any change for the Variable’s
> associated data, I want to just toggle it saying there is some change while
> processing the data.
>
>
>
> Please let me know if we can runtime do this.
>
>
>
>
>
> Thanks!
>
> *~Kedar Dixit*
>
> Bigdata Analytics at Persistent Systems Ltd.
>
>
>
> *From:* Holden Karau [via Apache Spark User List] [
> mailto:ml+s1001560n29995...@n3.nabble.com
> <ml+s1001560n29995...@n3.nabble.com>]
> *Sent:* Tuesday, November 14, 2017 1:16 PM
> *To:* Kedarnath Dixit <kedarnath_di...@persistent.com>
>
>
> *Subject:* Re: Use of Accumulators
>
>
>
> So you want to set an accumulator to 1 after a transformation has fully
> completed? Or what exactly do you want to do?
>
>
>
> On Mon, Nov 13, 2017 at 9:47 PM vaquar khan <[hidden email]
> <http:///user/SendEmail.jtp?type=node=29995=0>> wrote:
>
> Confirmed ,you can use Accumulators :)
>
>
>
> Regards,
>
> Vaquar khan
>
> On Mon, Nov 13, 2017 at 10:58 AM, Kedarnath Dixit <[hidden email]
> <http:///user/SendEmail.jtp?type=node=29995=1>> wrote:
>
> Hi,
>
>
>
> We need some way to toggle the flag of  a variable in transformation.
>
>
>
> We are thinking to make use of spark  Accumulators for this purpose.
>
>
>
> Can we use these as below:
>
>
>
> Variables  -> Initial Value
>
>  Variable1 -> 0
>
>  Variable2 -> 0
>
>
>
> In one of the transformations if we need to make Variable2's value to 1.
> Can we achieve this using Accumulators? Please confirm.
>
>
>
> Thanks!
>
>
>
> With Regards,
>
> *~Kedar Dixit*
>
>
>
> [hidden email] <http:///user/SendEmail.jtp?type=node=29995=2> |
> @kedarsdixit | M +91 90499 15588 | T +91 (20) 6703 4783
>
> *Persistent Systems | **Partners In Innovation** | www.persistent.com
> <http://www.persistent.com>*
>
> DISCLAIMER
> ==
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>
> --
>
> Regards,
>
> Vaquar Khan
>
> +1 -224-436-0783
>
> Greater Chicago
>
> --
>
> Twitter: https://twitter.com/holdenkarau
>
>
> --
>
> *If you reply to this email, your message will be added to the discussion
> below:*
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Use-of-Accumulators-tp29975p29995.html
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Use of Accumulators

2017-11-13 Thread Holden Karau
So you want to set an accumulator to 1 after a transformation has fully
completed? Or what exactly do you want to do?

On Mon, Nov 13, 2017 at 9:47 PM vaquar khan  wrote:

> Confirmed ,you can use Accumulators :)
>
> Regards,
> Vaquar khan
>
> On Mon, Nov 13, 2017 at 10:58 AM, Kedarnath Dixit <
> kedarnath_di...@persistent.com> wrote:
>
>> Hi,
>>
>>
>> We need some way to toggle the flag of  a variable in transformation.
>>
>>
>> We are thinking to make use of spark  Accumulators for this purpose.
>>
>>
>> Can we use these as below:
>>
>>
>> Variables  -> Initial Value
>>
>>  Variable1 -> 0
>>
>>  Variable2 -> 0
>>
>>
>> In one of the transformations if we need to make Variable2's value to 1.
>> Can we achieve this using Accumulators? Please confirm.
>>
>>
>> Thanks!
>>
>>
>> With Regards,
>>
>> *~Kedar Dixit*
>>
>> kedarnath_di...@persistent.com | @kedarsdixit | M +91 90499 15588
>> <+91%2090499%2015588> | T +91 (20) 6703 4783 <+91%2020%206703%204783>
>>
>> *Persistent Systems | **Partners In Innovation** | www.persistent.com
>> *
>> DISCLAIMER
>> ==
>> This e-mail may contain privileged and confidential information which is
>> the property of Persistent Systems Ltd. It is intended only for the use of
>> the individual or entity to which it is addressed. If you are not the
>> intended recipient, you are not authorized to read, retain, copy, print,
>> distribute or use this message. If you have received this communication in
>> error, please notify the sender and delete all copies of this message.
>> Persistent Systems Ltd. does not accept any liability for virus infected
>> mails.
>>
>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> Greater Chicago
>
-- 
Twitter: https://twitter.com/holdenkarau


[ANNOUNCE] Apache Spark 2.1.2

2017-10-25 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.2!

Apache Spark 2.1.2 is a maintenance release, based on the branch-2.1
maintenance
branch of Spark. We strongly recommend all 2.1.x users to upgrade to this
stable release.

To download Apache Spark 2.1.2 visit http://spark.apache.org/downloads.html.
This version of Spark is also available on Maven, CRAN archive* & PyPI.

We would like to acknowledge all community members for contributing patches
to this release.

* SparkR can be manually downloaded from CRAN archive, and there are a few
more minor packaging issues to be fixed to have SparkR fully available in
CRAN, see SPARK-22344 for details.

-- 
Twitter: https://twitter.com/holdenkarau


Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
My assumption is it would be similar though: an in-memory sink of all of
your records would quickly overwhelm your cluster, but with an aggregation
it could be reasonable. But there might be additional reasons on top of that.

On Fri, Aug 18, 2017 at 11:44 AM Holden Karau <hol...@pigscanfly.ca> wrote:

> Ah yes I'm not sure about the workings of the memory sink.
>
> On Fri, Aug 18, 2017 at 11:36 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Holden,
>>
>> Thanks a lot for a bit more light on the topic. That however does not
>> explain why memory sink requires Complete for a checkpoint location to
>> work. The only reason I used Complete output mode was to meet the
>> requirements of memory sink and that got me thinking why would the
>> already-memory-hungry memory sink require yet another thing to get the
>> query working.
>>
>> On to exploring the bits...
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Aug 18, 2017 at 6:35 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>> > So performing complete output without an aggregation would require
>> building
>> > up a table of the entire input to write out at each micro batch. This
>> would
>> > get prohibitively expensive quickly. With an aggregation we just need to
>> > keep track of the aggregates and update them every batch, so the memory
>> > requirement is more reasonable.
>> >
>> > (Note: I don't do a lot of work in streaming so there may be additional
>> > reasons, but these are the ones I remember from when I was working on
>> > looking at integrating ML with SS).
>> >
>> > On Fri, Aug 18, 2017 at 5:25 AM Jacek Laskowski <ja...@japila.pl>
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Why is the requirement for a streaming aggregation in a streaming
>> >> query? What would happen if Spark allowed Complete without a single
>> >> aggregation? This is the latest master.
>> >>
>> >> scala> val q = ids.
>> >>  |   writeStream.
>> >>  |   format("memory").
>> >>  |   queryName("dups").
>> >>  |   outputMode(OutputMode.Complete).  // <-- memory sink supports
>> >> checkpointing for Complete output mode only
>> >>  |   trigger(Trigger.ProcessingTime(30.seconds)).
>> >>  |   option("checkpointLocation", "checkpoint-dir"). // <-- use
>> >> checkpointing to save state between restarts
>> >>  |   start
>> >> org.apache.spark.sql.AnalysisException: Complete output mode not
>> >> supported when there are no streaming aggregations on streaming
>> >> DataFrames/Datasets;;
>> >> Project [cast(time#10 as bigint) AS time#15L, id#6]
>> >> +- Deduplicate [id#6], true
>> >>+- Project [cast(time#5 as timestamp) AS time#10, id#6]
>> >>   +- Project [_1#2 AS time#5, _2#3 AS id#6]
>> >>  +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2,
>> >> _2#3]
>> >>
>> >>   at
>> >>
>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>> >>   at
>> >>
>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
>> >>   at
>> >>
>> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
>> >>   at
>> >>
>> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>> >>   at
>> >>
>> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:249)
>> >>   ... 57 elided
>> >>
>> >> Pozdrawiam,
>> >> Jacek Laskowski
>> >> 
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
>> >>
>> >> -
>> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>
>> > --
>> > Cell : 425-233-8271
>> > Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
Ah yes I'm not sure about the workings of the memory sink.

On Fri, Aug 18, 2017 at 11:36 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Holden,
>
> Thanks a lot for a bit more light on the topic. That however does not
> explain why memory sink requires Complete for a checkpoint location to
> work. The only reason I used Complete output mode was to meet the
> requirements of memory sink and that got me thinking why would the
> already-memory-hungry memory sink require yet another thing to get the
> query working.
>
> On to exploring the bits...
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Aug 18, 2017 at 6:35 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
> > So performing complete output without an aggregation would require
> building
> > up a table of the entire input to write out at each micro batch. This
> would
> > get prohibitively expensive quickly. With an aggregation we just need to
> > keep track of the aggregates and update them every batch, so the memory
> > requirement is more reasonable.
> >
> > (Note: I don't do a lot of work in streaming so there may be additional
> > reasons, but these are the ones I remember from when I was working on
> > looking at integrating ML with SS).
> >
> > On Fri, Aug 18, 2017 at 5:25 AM Jacek Laskowski <ja...@japila.pl> wrote:
> >>
> >> Hi,
> >>
> >> Why is the requirement for a streaming aggregation in a streaming
> >> query? What would happen if Spark allowed Complete without a single
> >> aggregation? This is the latest master.
> >>
> >> scala> val q = ids.
> >>  |   writeStream.
> >>  |   format("memory").
> >>  |   queryName("dups").
> >>  |   outputMode(OutputMode.Complete).  // <-- memory sink supports
> >> checkpointing for Complete output mode only
> >>  |   trigger(Trigger.ProcessingTime(30.seconds)).
> >>  |   option("checkpointLocation", "checkpoint-dir"). // <-- use
> >> checkpointing to save state between restarts
> >>  |   start
> >> org.apache.spark.sql.AnalysisException: Complete output mode not
> >> supported when there are no streaming aggregations on streaming
> >> DataFrames/Datasets;;
> >> Project [cast(time#10 as bigint) AS time#15L, id#6]
> >> +- Deduplicate [id#6], true
> >>+- Project [cast(time#5 as timestamp) AS time#10, id#6]
> >>   +- Project [_1#2 AS time#5, _2#3 AS id#6]
> >>  +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2,
> >> _2#3]
> >>
> >>   at
> >> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.
> org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$
> throwError(UnsupportedOperationChecker.scala:297)
> >>   at
> >> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.
> checkForStreaming(UnsupportedOperationChecker.scala:115)
> >>   at
> >> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(
> StreamingQueryManager.scala:232)
> >>   at
> >> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(
> StreamingQueryManager.scala:278)
> >>   at
> >> org.apache.spark.sql.streaming.DataStreamWriter.
> start(DataStreamWriter.scala:249)
> >>   ... 57 elided
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
> > --
> > Cell : 425-233-8271
> > Twitter: https://twitter.com/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
So performing complete output without an aggregation would require building
up a table of the entire input to write out at each micro batch. This would
get prohibitively expensive quickly. With an aggregation we just need to
keep track of the aggregates and update them every batch, so the memory
requirement is more reasonable.

(Note: I don't do a lot of work in streaming so there may be additional
reasons, but these are the ones I remember from when I was working on
looking at integrating ML with SS).
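
For example, a minimal PySpark sketch of the kind of query where Complete
mode is allowed, since only the running aggregate has to be kept around (the
socket source and memory sink are just for illustration):

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

counts = lines.groupBy("value").count()  # a streaming aggregation

query = (counts.writeStream
         .outputMode("complete")  # rewrites the (small) aggregate table each batch
         .format("memory")
         .queryName("counts")
         .start())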

On Fri, Aug 18, 2017 at 5:25 AM Jacek Laskowski  wrote:

> Hi,
>
> Why is the requirement for a streaming aggregation in a streaming
> query? What would happen if Spark allowed Complete without a single
> aggregation? This is the latest master.
>
> scala> val q = ids.
>  |   writeStream.
>  |   format("memory").
>  |   queryName("dups").
>  |   outputMode(OutputMode.Complete).  // <-- memory sink supports
> checkpointing for Complete output mode only
>  |   trigger(Trigger.ProcessingTime(30.seconds)).
>  |   option("checkpointLocation", "checkpoint-dir"). // <-- use
> checkpointing to save state between restarts
>  |   start
> org.apache.spark.sql.AnalysisException: Complete output mode not
> supported when there are no streaming aggregations on streaming
> DataFrames/Datasets;;
> Project [cast(time#10 as bigint) AS time#15L, id#6]
> +- Deduplicate [id#6], true
>+- Project [cast(time#5 as timestamp) AS time#10, id#6]
>   +- Project [_1#2 AS time#5, _2#3 AS id#6]
>  +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2,
> _2#3]
>
>   at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
>   at
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
>   at
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:249)
>   ... 57 elided
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Holden Karau
The memory overhead is based less on the total amount of data and more on
what you end up doing with the data (e.g. if you're doing a lot of off-heap
processing or using Python you need to increase it). Honestly most people
find this number for their job "experimentally" (e.g. they try a few
different things).
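
For reference, a rough sketch of where those knobs live (the values are
purely illustrative; as above, people usually settle on them experimentally):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.yarn.executor.memoryOverhead", "4096")  # MB of off-heap headroom
        .set("spark.memory.fraction", "0.6")  # heap share for execution + storage
        .set("spark.shuffle.io.preferDirectBufs", "false"))  # keep shuffle buffers on-heap

spark = (SparkSession.builder
         .config(conf=conf)
         .enableHiveSupport()
         .getOrCreate())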

On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri 
wrote:

> Ryan,
> Thank you for reply.
>
> For 2 TB of Data what should be the value of 
> spark.yarn.executor.memoryOverhead
> = ?
>
> with regards to this - i see issue at spark https://issues.apache.org/
> jira/browse/SPARK-18787 , not sure whether it works or not at Spark 2.0.1
>  !
>
> can you elaborate more for spark.memory.fraction setting.
>
> number of partitions = 674
> Cluster: 455 GB total memory, VCores: 288, Nodes: 17
> Given / tried memory config: executor-mem = 16g, num-executor=10, executor
> cores=6, driver mem=4g
>
> spark.default.parallelism=1000
> spark.sql.shuffle.partitions=1000
> spark.yarn.executor.memoryOverhead=2048
> spark.shuffle.io.preferDirectBufs=false
>
>
>
>
>
>
>
>
>
> On Wed, Aug 2, 2017 at 10:43 PM, Ryan Blue  wrote:
>
>> Chetan,
>>
>> When you're writing to a partitioned table, you want to use a shuffle to
>> avoid the situation where each task has to write to every partition. You
>> can do that either by adding a repartition by your table's partition keys,
>> or by adding an order by with the partition keys and then columns you
>> normally use to filter when reading the table. I generally recommend the
>> second approach because it handles skew and prepares the data for more
>> efficient reads.
>>
>> If that doesn't help, then you should look at your memory settings. When
>> you're getting killed by YARN, you should consider setting `
>> spark.shuffle.io.preferDirectBufs=false` so you use less off-heap memory
>> that the JVM doesn't account for. That is usually an easier fix than
>> increasing the memory overhead. Also, when you set executor memory, always
>> change spark.memory.fraction to ensure the memory you're adding is used
>> where it is needed. If your memory fraction is the default 60%, then 60% of
>> the memory will be used for Spark execution, not reserved whatever is
>> consuming it and causing the OOM. (If Spark's memory is too low, you'll see
>> other problems like spilling too much to disk.)
>>
>> rb
>>
>> On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Can anyone please guide me with above issue.
>>>
>>>
>>> On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
 Hello Spark Users,

 I have Hbase table reading and writing to Hive managed table where i
 applied partitioning by date column which worked fine but it has generated
 a large number of files across almost 700 partitions, but i wanted to use
 repartition to reduce File I/O by reducing the number of files inside each
 partition.

 *But i ended up with below exception:*

 ExecutorLostFailure (executor 11 exited caused by one of the running
 tasks) Reason: Container killed by YARN for exceeding memory limits. 14.0
 GB of 14 GB physical memory used. Consider boosting spark.yarn.executor.
 memoryOverhead.

 Driver memory=4g, executor mem=12g, num-executors=8, executor core=8

 Do you think below setting can help me to overcome above issue:

 spark.default.parallelism=1000
 spark.sql.shuffle.partitions=1000

 Because default max number of partitions are 1000.



>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Holden Karau
Hi wonderful Python + Spark folks,

I'm excited to announce that with Spark 2.2.0 we finally have PySpark
published on PyPI (see https://pypi.python.org/pypi/pyspark /
https://twitter.com/holdenkarau/status/885207416173756417). This has been a
long time coming (previous releases included pip installable artifacts that
for a variety of reasons couldn't be published to PyPI). So if you (or your
friends) want to be able to work with PySpark locally on your laptop you've
got an easier path getting started (pip install pyspark).
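
For example, after pip installing, something like this minimal sketch is
enough to spin up a local session and check that things work:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("pip-installed-pyspark")
         .getOrCreate())

spark.range(10).show()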

If you are setting up a standalone cluster your cluster will still need the
"full" Spark packaging, but the pip installed PySpark should be able to
work with YARN or an existing standalone cluster installation (of the same
version).

Happy Sparking y'all!

Holden :)


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Can we access files on Cluster mode

2017-06-24 Thread Holden Karau
addFile is supposed to not depend on a shared FS unless the semantics have
changed recently.
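
For example, a small PySpark sketch of the addFile/SparkFiles pattern being
discussed (using the path from the original mail; note that the path still
has to be readable from wherever the driver runs when addFile is called):

from pyspark import SparkFiles

sc.addFile("/home/sql/first.sql")  # ships the file to every node

def read_query(_):
    # SparkFiles.get resolves the local copy on whichever node runs the task
    with open(SparkFiles.get("first.sql")) as f:
        return f.read()

print(sc.parallelize([1]).map(read_query).first())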

On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri 
wrote:

> Hi Sudhir,
>
> I believe you have to use a shared file system that is accessible by all
> nodes.
>
>
> On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:
>
>
> I am new to Spark and i need some guidance on how to fetch files from
> --files option on Spark-Submit.
>
> I read on some forums that we can fetch the files from
> Spark.getFiles(fileName) and can use it in our code and all nodes should
> read it.
>
> But i am facing some issue
>
> Below is the command i am using
>
> spark-submit --deploy-mode cluster --class com.check.Driver --files
> /home/sql/first.sql test.jar 20170619
>
> so when i use SparkFiles.get(first.sql) , i should be able to read the
> file Path but it is throwing File not Found exception.
>
> I tried SpackContext.addFile(/home/sql/first.sql) and then
> SparkFiles.get(first.sql) but still the same error.
>
> Its working on the stand alone mode but not on cluster mode. Any help is
> appreciated.. Using Spark 2.1.0 and Scala 2.11
>
> Thanks.
>
>
> Regards,
> Sudhir K
>
>
>
> --
> Regards,
> Sudhir K
>
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Holden Karau
In non-streaming Spark, checkpoints aren't for inter-application recovery;
rather, you can think of them as doing a persist but to HDFS rather than to
each node's local memory / storage.
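
For example, a tiny sketch of what that looks like in practice (the paths
are made up); within one application the lineage is truncated after the
first action, but a brand new application will not automatically resume
from those files:

sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")

rdd = sc.textFile("input.txt").map(lambda line: (line.split("\t")[0], 1))
rdd.checkpoint()  # marked now, written out by the next action
rdd.count()       # lineage beyond the checkpoint is dropped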


On Fri, May 26, 2017 at 3:06 PM Priya  wrote:

> Hi,
>
> With nonstreaming spark application, did checkpoint the RDD and I could see
> the RDD getting checkpointed. I have killed the application after
> checkpointing the RDD and restarted the same application again immediately,
> but it doesn't seem to pick from checkpoint and it again checkpoints the
> RDD. Could anyone please explain why am I seeing this behavior, why it is
> not picking from the checkpoint and proceeding further from there on the
> second run of the same application. Would really help me understand spark
> checkpoint work flow if I can get some clarity on the behavior. Please let
> me know if I am missing something.
>
> [root@checkpointDir]# ls
> 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5  a4f14f43-e7c3-4f64-a980-8483b42bb11d
>
> [root@9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# ls -la
> total 0
> drwxr-xr-x. 3 root root  20 May 26 16:26 .
> drwxr-xr-x. 4 root root  94 May 26 16:24 ..
> drwxr-xr-x. 2 root root 133 May 26 16:26 rdd-28
>
> [root@priya-vm 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# cd rdd-28/
> [root@priya-vm rdd-28]# ls
> part-0  part-1  _partitioner
>
> Thanks
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpoint-nonstreaming-tp28712.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
So this Python side pipelining happens in a lot of places, which can make
debugging extra challenging. Some people work around this with persist,
which breaks the pipelining during debugging, but if you're interested in
more general Python debugging I've got a YouTube video on the topic which
could be a good intro (of course I'm pretty biased about that).
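
For example, a quick sketch of that persist trick applied to the logs
example from the mail below (each persist stops the Python-side function
composition, so toDebugString shows the separate steps, at the cost of
caching the intermediate RDDs):

logs_raw = sc.textFile("../log.txt")
filtered = logs_raw.filter(lambda x: 'INFO' not in x).persist()
mapped = filtered.map(lambda x: (x.split('\t')[1], 1)).persist()
counts = mapped.reduceByKey(lambda x, y: x + y)
print(counts.toDebugString())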

On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com>
wrote:

> Thanks for the quick answer, Holden!
>
> Are there any other tricks with PySpark which are hard to debug using UI
> or toDebugString?
>
> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> In PySpark the filter and then map steps are combined into a single
>> transformation from the JVM point of view. This allows us to avoid copying
>> the data back to Scala in between the filter and the map steps. The
>> debugging exeperience is certainly much harder in PySpark and I think is an
>> interesting area for those interested in contributing :)
>>
>> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>>
>>> This Scala code:
>>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>>  | filter(x => !x.contains("INFO")).
>>>  | map(x => (x.split("\t")(1), 1)).
>>>  | reduceByKey((x, y) => x + y)
>>>
>>> generated obvious lineage:
>>>
>>> (2) ShuffledRDD[4] at reduceByKey at :27 []
>>>  +-(2) MapPartitionsRDD[3] at map at :26 []
>>> |  MapPartitionsRDD[2] at filter at :25 []
>>> |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at
>>> :24 []
>>> |  big_data_specialization/log.txt HadoopRDD[0] at textFile at
>>> :24 []
>>>
>>> But Python code:
>>>
>>> logs = sc.textFile("../log.txt")\
>>>  .filter(lambda x: 'INFO' not in x)\
>>>  .map(lambda x: (x.split('\t')[1], 1))\
>>>  .reduceByKey(lambda x, y: x + y)
>>>
>>> generated something strange which is hard to follow:
>>>
>>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>>  +-(2) PairwiseRDD[10] at reduceByKey at
>>> :1 []
>>> |  PythonRDD[9] at reduceByKey at :1 []
>>> |  ../log.txt MapPartitionsRDD[8] at textFile at
>>> NativeMethodAccessorImpl.java:0 []
>>> |  ../log.txt HadoopRDD[7] at textFile at
>>> NativeMethodAccessorImpl.java:0 []
>>>
>>> Why is that? Does pyspark do some optimizations under the hood? This
>>> debug
>>> string is really useless for debugging.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Yours faithfully, Pavel Klemenkov.
>
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
In PySpark the filter and then map steps are combined into a single
transformation from the JVM point of view. This allows us to avoid copying
the data back to Scala in between the filter and the map steps. The
debugging experience is certainly much harder in PySpark and I think is an
interesting area for those interested in contributing :)

On Wed, May 10, 2017 at 7:33 AM pklemenkov  wrote:

> This Scala code:
> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>  | filter(x => !x.contains("INFO")).
>  | map(x => (x.split("\t")(1), 1)).
>  | reduceByKey((x, y) => x + y)
>
> generated obvious lineage:
>
> (2) ShuffledRDD[4] at reduceByKey at :27 []
>  +-(2) MapPartitionsRDD[3] at map at :26 []
> |  MapPartitionsRDD[2] at filter at :25 []
> |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at
> :24 []
> |  big_data_specialization/log.txt HadoopRDD[0] at textFile at
> :24 []
>
> But Python code:
>
> logs = sc.textFile("../log.txt")\
>  .filter(lambda x: 'INFO' not in x)\
>  .map(lambda x: (x.split('\t')[1], 1))\
>  .reduceByKey(lambda x, y: x + y)
>
> generated something strange which is hard to follow:
>
> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>  +-(2) PairwiseRDD[10] at reduceByKey at :1
> []
> |  PythonRDD[9] at reduceByKey at :1 []
> |  ../log.txt MapPartitionsRDD[8] at textFile at
> NativeMethodAccessorImpl.java:0 []
> |  ../log.txt HadoopRDD[7] at textFile at
> NativeMethodAccessorImpl.java:0 []
>
> Why is that? Does pyspark do some optimizations under the hood? This debug
> string is really useless for debugging.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
Sorry about that, hangouts on air broke in the first one :(

On Wed, Apr 26, 2017 at 8:41 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Uh i stayed online in the other link but nobody joined. Will follow
> transcript
> Kr
>
> On 26 Apr 2017 9:35 am, "Holden Karau" <hol...@pigscanfly.ca> wrote:
>
>> And the recording of our discussion is at https://www.youtube.com/wat
>> ch?v=2q0uAldCQ8M
>> A few of us have follow up things and we will try and do another meeting
>> in about a month or two :)
>>
>> On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> Urgh hangouts did something frustrating, updated link
>>> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
>>>
>>> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> The (tentative) link for those interested is
>>>> https://hangouts.google.com/hangouts/_/oyjvcnffejcjhi6qazf3lysypue .
>>>>
>>>> On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> So 14 people have said they are available on Tuesday the 25th at 1PM
>>>>> pacific so we will do this meeting then ( https://doodle.com/poll/69y6
>>>>> yab4pyf7u8bn ).
>>>>>
>>>>> Since hangouts tends to work ok on the Linux distro I'm running my
>>>>> default is to host this as a "hangouts-on-air" unless there are 
>>>>> alternative
>>>>> ideas.
>>>>>
>>>>> I'll record the hangout and if it isn't terrible I'll post it for
>>>>> those who weren't able to make it (and for next time I'll include more
>>>>> European friendly time options - Doodle wouldn't let me update it once
>>>>> posted).
>>>>>
>>>>> On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Hi Spark Users (+ Some Spark Testing Devs on BCC),
>>>>>>
>>>>>> Awhile back on one of the many threads about testing in Spark there
>>>>>> was some interest in having a chat about the state of Spark testing and
>>>>>> what people want/need.
>>>>>>
>>>>>> So if you are interested in joining an online (with maybe an IRL
>>>>>> component if enough people are SF based) chat about Spark testing please
>>>>>> fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn
>>>>>>
>>>>>> I think reasonable topics of discussion could be:
>>>>>>
>>>>>> 1) What is the state of the different Spark testing libraries in the
>>>>>> different core (Scala, Python, R, Java) and extended languages (C#,
>>>>>> Javascript, etc.)?
>>>>>> 2) How do we make these more easily discovered by users?
>>>>>> 3) What are people looking for in their testing libraries that we are
>>>>>> missing? (can be functionality, documentation, etc.)
>>>>>> 4) Are there any examples of well tested open source Spark projects
>>>>>> and where are they?
>>>>>>
>>>>>> If you have other topics that's awesome.
>>>>>>
>>>>>> To clarify this about libraries and best practices for people testing
>>>>>> their Spark applications, and less about testing Spark's internals
>>>>>> (although as illustrated by some of the libraries there is some strong
>>>>>> overlap in what is required to make that work).
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>> --
>>>>>> Cell : 425-233-8271 <(425)%20233-8271>
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Cell : 425-233-8271 <(425)%20233-8271>
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271 <(425)%20233-8271>
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
And the recording of our discussion is at
https://www.youtube.com/watch?v=2q0uAldCQ8M
A few of us have follow up things and we will try and do another meeting in
about a month or two :)

On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Urgh hangouts did something frustrating, updated link
> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
>
> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> The (tentative) link for those interested is https://hangouts.google.com
>> /hangouts/_/oyjvcnffejcjhi6qazf3lysypue .
>>
>> On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> So 14 people have said they are available on Tuesday the 25th at 1PM
>>> pacific so we will do this meeting then ( https://doodle.com/poll/69y6
>>> yab4pyf7u8bn ).
>>>
>>> Since hangouts tends to work ok on the Linux distro I'm running my
>>> default is to host this as a "hangouts-on-air" unless there are alternative
>>> ideas.
>>>
>>> I'll record the hangout and if it isn't terrible I'll post it for those
>>> who weren't able to make it (and for next time I'll include more European
>>> friendly time options - Doodle wouldn't let me update it once posted).
>>>
>>> On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Hi Spark Users (+ Some Spark Testing Devs on BCC),
>>>>
>>>> Awhile back on one of the many threads about testing in Spark there was
>>>> some interest in having a chat about the state of Spark testing and what
>>>> people want/need.
>>>>
>>>> So if you are interested in joining an online (with maybe an IRL
>>>> component if enough people are SF based) chat about Spark testing please
>>>> fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn
>>>>
>>>> I think reasonable topics of discussion could be:
>>>>
>>>> 1) What is the state of the different Spark testing libraries in the
>>>> different core (Scala, Python, R, Java) and extended languages (C#,
>>>> Javascript, etc.)?
>>>> 2) How do we make these more easily discovered by users?
>>>> 3) What are people looking for in their testing libraries that we are
>>>> missing? (can be functionality, documentation, etc.)
>>>> 4) Are there any examples of well tested open source Spark projects and
>>>> where are they?
>>>>
>>>> If you have other topics that's awesome.
>>>>
>>>> To clarify this about libraries and best practices for people testing
>>>> their Spark applications, and less about testing Spark's internals
>>>> (although as illustrated by some of the libraries there is some strong
>>>> overlap in what is required to make that work).
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Cell : 425-233-8271 <(425)%20233-8271>
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

