Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Takeshi Yamamuro
Congrats, all!

Bests,
Takeshi

On Fri, Jun 19, 2020 at 1:16 PM Felix Cheung 
wrote:

> Congrats
>
> --
> *From:* Jungtaek Lim 
> *Sent:* Thursday, June 18, 2020 8:18:54 PM
> *To:* Hyukjin Kwon 
> *Cc:* Mridul Muralidharan ; Reynold Xin <
> r...@databricks.com>; dev ; user <
> user@spark.apache.org>
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.0.0
>
> Great, thanks all for your efforts on the huge step forward!
>
> On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon  wrote:
>
> Yay!
>
> On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan wrote:
>
> Great job everyone ! Congratulations :-)
>
> Regards,
> Mridul
>
> On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:
>
> Hi all,
>
> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many
> of the innovations from Spark 2.x, bringing new ideas as well as continuing
> long-term projects that have been in development. This release resolves
> more than 3400 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-0.html
>
>
>
>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Felix Cheung
Congrats


From: Jungtaek Lim 
Sent: Thursday, June 18, 2020 8:18:54 PM
To: Hyukjin Kwon 
Cc: Mridul Muralidharan ; Reynold Xin ; 
dev ; user 
Subject: Re: [ANNOUNCE] Apache Spark 3.0.0

Great, thanks all for your efforts on the huge step forward!

On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
Yay!

On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
Great job everyone ! Congratulations :-)

Regards,
Mridul

On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin <r...@databricks.com> wrote:

Hi all,

Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of 
the innovations from Spark 2.x, bringing new ideas as well as continuing 
long-term projects that have been in development. This release resolves more 
than 3400 tickets.

We'd like to thank our contributors and users for their contributions and early 
feedback to this release. This release would not have been possible without you.

To download Spark 3.0.0, head over to the download page: 
http://spark.apache.org/downloads.html

To view the release notes: 
https://spark.apache.org/releases/spark-release-3-0-0.html





Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Jungtaek Lim
Great, thanks all for your efforts on the huge step forward!

On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon  wrote:

> Yay!
>
> On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan wrote:
>
>> Great job everyone ! Congratulations :-)
>>
>> Regards,
>> Mridul
>>
>> On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:
>>
>>> Hi all,
>>>
>>> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on
>>> many of the innovations from Spark 2.x, bringing new ideas as well as
>>> continuing long-term projects that have been in development. This release
>>> resolves more than 3400 tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 3.0.0, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-0-0.html
>>>
>>>
>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Hyukjin Kwon
Yay!

On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan wrote:

> Great job everyone ! Congratulations :-)
>
> Regards,
> Mridul
>
> On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:
>
>> Hi all,
>>
>> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on
>> many of the innovations from Spark 2.x, bringing new ideas as well as
>> continuing long-term projects that have been in development. This release
>> resolves more than 3400 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 3.0.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-0-0.html
>>
>>
>>
>>


Re: java.lang.ClassNotFoundException for s3a committer

2020-06-18 Thread Stephen Coy
Hi Murat Migdisoglu,

Unfortunately you need the secret sauce to resolve this.

It is necessary to check out the Apache Spark source code and build it with the 
right command line options. This is what I have been using:

dev/make-distribution.sh --name my-spark --tgz -Pyarn -Phadoop-3.2  -Pyarn 
-Phadoop-cloud -Dhadoop.version=3.2.1

This will add additional jars into the build.

Copy hadoop-aws-3.2.1.jar, hadoop-openstack-3.2.1.jar and 
spark-hadoop-cloud_2.12-3.0.0.jar into the “jars” directory of your Spark 
distribution. If you are paranoid you could copy/replace all the 
hadoop-*-3.2.1.jar files but I have not found that necessary.

You will also need to upgrade the version of Guava that ships in the Spark
distro, because Hadoop 3.2.1 bumped it from guava-14.0.1.jar to
guava-27.0-jre.jar. Otherwise you will get runtime ClassNotFound exceptions.

I have been using this combo for many months now with the Spark 3.0 
pre-releases and it has been working great.
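
For reference, here is a minimal sketch of the Spark-side settings that go with those jars; the class names come from the spark-hadoop-cloud module, but the session builder and bucket below are only an illustration, so please double-check them against the cloud integration docs for your build:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-example")                    // hypothetical app name
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// With the spark-hadoop-cloud and hadoop-aws jars on the classpath, parquet writes
// to s3a:// paths should go through the S3A committer rather than the default
// file output committer.
spark.range(10).write.parquet("s3a://my-bucket/committer-test")  // hypothetical bucket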

Cheers,

Steve C


On 19 Jun 2020, at 10:24 am, murat migdisoglu <murat.migdiso...@gmail.com> wrote:

Hi all
I've upgraded my test cluster to Spark 3 and changed my committer to "directory",
and I still get this error. The documentation is somewhat unclear on that.
Do I need to add a third-party jar to support the new committers?

java.lang.ClassNotFoundException: 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol


On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu <murat.migdiso...@gmail.com> wrote:
Hello all,
we have a Hadoop cluster (using YARN) with S3 as the filesystem and S3Guard
enabled.
We are using Hadoop 3.2.1 with Spark 2.4.5.

When I try to save a dataframe in parquet format, I get the following exception:
java.lang.ClassNotFoundException: 
com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My relevant spark configurations are as following:
"hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
"fs.s3a.committer.name":
 "magic",
"fs.s3a.committer.magic.enabled": true,
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",

While Spark Streaming fails with the exception above, Apache Beam succeeds in
writing the parquet files.
What might be the problem?

Thanks in advance


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare



Re: java.lang.ClassNotFoundException for s3a committer

2020-06-18 Thread murat migdisoglu
Hi all
I've upgraded my test cluster to Spark 3 and changed my committer to
"directory", and I still get this error. The documentation is somewhat
unclear on that.
Do I need to add a third-party jar to support the new committers?

java.lang.ClassNotFoundException:
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol


On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu 
wrote:

> Hello all,
> we have a Hadoop cluster (using YARN) with S3 as the filesystem and S3Guard
> enabled.
> We are using Hadoop 3.2.1 with Spark 2.4.5.
>
> When I try to save a dataframe in parquet format, I get the following
> exception:
> java.lang.ClassNotFoundException:
> com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
>
> My relevant spark configurations are as following:
>
> "hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
> "fs.s3a.committer.name": "magic",
> "fs.s3a.committer.magic.enabled": true,
> "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>
> While Spark Streaming fails with the exception above, Apache Beam succeeds in
> writing the parquet files.
> What might be the problem?
>
> Thanks in advance
>
>
> --
> "Talkers aren’t good doers. Rest assured that we’re going there to use
> our hands, not our tongues."
> W. Shakespeare
>


-- 
"Talkers aren’t good doers. Rest assured that we’re going there to use our
hands, not our tongues."
W. Shakespeare


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Mridul Muralidharan
Great job everyone ! Congratulations :-)

Regards,
Mridul

On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:

> Hi all,
>
> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many
> of the innovations from Spark 2.x, bringing new ideas as well as continuing
> long-term projects that have been in development. This release resolves
> more than 3400 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-0.html
>
>
>
>


Custom Metrics

2020-06-18 Thread Bryan Jeffrey
Hello.

We're using Spark 2.4.4.  We have a custom metrics sink consuming the
Spark-produced metrics (e.g. heap free, etc.).  I am trying to determine a
good mechanism to pass the Spark application name into the metrics sink.
Currently the application ID is included, but not the application name. Is
there a suggested mechanism?
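
One thing that might help (please verify on 2.4.4) is spark.metrics.namespace, which controls the prefix the sink sees; it defaults to the application ID but can be pointed at the application name instead. A minimal sketch, with the app name made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-streaming-app")                          // hypothetical app name
  // Metric names are prefixed with this namespace; by default it is the app id,
  // but it can be substituted with the app name.
  .config("spark.metrics.namespace", "${spark.app.name}")
  .getOrCreate()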

Thank you,

Bryan Jeffrey


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Gaetano Fabiano
Congratulations!
Celebrating!

Sent from my iPhone

> On 18 Jun 2020, at 20:38, Gourav Sengupta  wrote:
> 
> 
> CELEBRATIONS!!!
> 
>> On Thu, Jun 18, 2020 at 6:21 PM Reynold Xin  wrote:
>> Hi all,
>> 
>> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many 
>> of the innovations from Spark 2.x, bringing new ideas as well as continuing 
>> long-term projects that have been in development. This release resolves more 
>> than 3400 tickets.
>> 
>> We'd like to thank our contributors and users for their contributions and 
>> early feedback to this release. This release would not have been possible 
>> without you.
>> 
>> To download Spark 3.0.0, head over to the download page: 
>> http://spark.apache.org/downloads.html
>> 
>> To view the release notes: 
>> https://spark.apache.org/releases/spark-release-3-0-0.html
>> 
>> 
>> 


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Gourav Sengupta
CELEBRATIONS!!!

On Thu, Jun 18, 2020 at 6:21 PM Reynold Xin  wrote:

> Hi all,
>
> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many
> of the innovations from Spark 2.x, bringing new ideas as well as continuing
> long-term projects that have been in development. This release resolves
> more than 3400 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-0.html
>
>
>
>


[ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Reynold Xin
Hi all,

Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of 
the innovations from Spark 2.x, bringing new ideas as well as continuing 
long-term projects that have been in development. This release resolves more 
than 3400 tickets.

We'd like to thank our contributors and users for their contributions and early 
feedback to this release. This release would not have been possible without you.

To download Spark 3.0.0, head over to the download page: 
http://spark.apache.org/downloads.html

To view the release notes: 
https://spark.apache.org/releases/spark-release-3-0-0.html



Re: Reading TB of JSON file

2020-06-18 Thread Stephan Wehner
It's an interesting problem. What is the structure of the file? One big
array? One hash with many key-value pairs?

Stephan

On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri 
wrote:

> Hi Spark Users,
>
> I have a 50GB of JSON file, I would like to read and persist at HDFS so it
> can be taken into next transformation. I am trying to read as
> spark.read.json(path) but this is giving Out of memory error on driver.
> Obviously, I can't afford having 50 GB on driver memory. In general, what
> is the best practice to read large JSON file like 50 GB?
>
> Thanks
>


-- 
Stephan Wehner, Ph.D.
The Buckmaster Institute, Inc.
2150 Adanac Street
Vancouver BC V5L 2E7
Canada
Cell (604) 767-7415
Fax (888) 808-4655

Sign up for our free email course
http://buckmaster.ca/small_business_website_mistakes.html

http://www.buckmaster.ca
http://answer4img.com
http://loggingit.com
http://clocklist.com
http://stephansmap.org
http://benchology.com
http://www.trafficlife.com
http://stephan.sugarmotor.org (Personal Blog)
@stephanwehner (Personal Account)
VA7WSK (Personal call sign)


Re: Reading TB of JSON file

2020-06-18 Thread Gourav Sengupta
Hi,
So you have a single JSON record in multiple lines?
And all the 50 GB is in one file?

Regards,
Gourav

On Thu, 18 Jun 2020, 14:34 Chetan Khatri, 
wrote:

> It is dynamically generated and written at s3 bucket not historical data
> so I guess it doesn't have jsonlines format
>
> On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke  wrote:
>
>> Depends on the data types you use.
>>
>> Do you have in jsonlines format? Then the amount of memory plays much
>> less a role.
>>
>> Otherwise if it is one large object or array I would not recommend it.
>>
>> > On 18.06.2020 at 15:12, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>> >
>> > 
>> > Hi Spark Users,
>> >
>> > I have a 50GB of JSON file, I would like to read and persist at HDFS so
>> it can be taken into next transformation. I am trying to read as
>> spark.read.json(path) but this is giving Out of memory error on driver.
>> Obviously, I can't afford having 50 GB on driver memory. In general, what
>> is the best practice to read large JSON file like 50 GB?
>> >
>> > Thanks
>>
>


Re: GPU Acceleration for spark-3.0.0

2020-06-18 Thread Bobby Evans
"So if I am
going to use GPU in my job running on the spark , I still need to code the
map and reduce function in cuda or in c++ and then invoke them throught jni
or something like GPUEnabler , is that right ?"

Sort of. You could go through all of that work yourself, or you could use
the plugin that we are going to open source in the next few days. Go to
https://nvidia.com/spark and click on the "contact us" link; you should be
able to get the information you want that way. I know from the list of
Spark Summit talks that others are working on similar things too. Intel
has a talk about some of their efforts for columnar processing on FPGAs,
and I think SIMD instructions too, at least going off of their talk last
year.

It should be an exciting time for accelerated SQL in Spark.
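
To give a flavour of what "without code changes" might look like in practice, here is a rough, hypothetical sketch; the plugin class name and the rapids-specific flag below are my assumptions about how such a plugin would be wired into Spark 3.0's plugin mechanism (spark.plugins itself is a standard Spark 3.0 setting), so treat them as placeholders until the plugin is actually released:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gpu-sql-example")                          // hypothetical app name
  // Spark 3.0 loads driver/executor plugins listed in spark.plugins at startup;
  // the class name below is an assumption about what the plugin will ship.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")          // assumed plugin-specific switch
  .getOrCreate()

// Ordinary DataFrame/SQL code stays unchanged; the plugin decides which operators
// it can run on the GPU and falls back to the regular CPU path otherwise.
val df = spark.range(0L, 1000000L).selectExpr("id % 10 AS k", "id AS v")
df.groupBy("k").sum("v").show()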



On Wed, Jun 17, 2020 at 11:17 PM charles_cai <1620075...@qq.com> wrote:

> Bobby
>
> Thanks for your answer; it seems that I have misunderstood this paragraph
> on the website: *"GPU-accelerate your Apache Spark 3.0 data science
> pipelines—without code changes—and speed up data processing and model
> training while substantially lowering infrastructure costs."* So if I am
> going to use GPU in my job running on Spark, I still need to code the
> map and reduce functions in CUDA or C++ and then invoke them through JNI
> or something like GPUEnabler, is that right?
>
> thanks
> Charles
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
It is dynamically generated and written to an S3 bucket (not historical data), so
I guess it doesn't have JSON Lines format.

On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke  wrote:

> Depends on the data types you use.
>
> Do you have in jsonlines format? Then the amount of memory plays much less
> a role.
>
> Otherwise if it is one large object or array I would not recommend it.
>
> > On 18.06.2020 at 15:12, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
> >
> > 
> > Hi Spark Users,
> >
> > I have a 50GB of JSON file, I would like to read and persist at HDFS so
> it can be taken into next transformation. I am trying to read as
> spark.read.json(path) but this is giving Out of memory error on driver.
> Obviously, I can't afford having 50 GB on driver memory. In general, what
> is the best practice to read large JSON file like 50 GB?
> >
> > Thanks
>


Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
File is available at S3 Bucket.


On Thu, Jun 18, 2020 at 9:15 AM Patrick McCarthy 
wrote:

> Assuming that the file can be easily split, I would divide it into a
> number of pieces and move those pieces to HDFS before using spark at all,
> using `hdfs dfs` or similar. At that point you can use your executors to
> perform the reading instead of the driver.
>
> On Thu, Jun 18, 2020 at 9:12 AM Chetan Khatri 
> wrote:
>
>> Hi Spark Users,
>>
>> I have a 50GB of JSON file, I would like to read and persist at HDFS so
>> it can be taken into next transformation. I am trying to read as
>> spark.read.json(path) but this is giving Out of memory error on driver.
>> Obviously, I can't afford having 50 GB on driver memory. In general, what
>> is the best practice to read large JSON file like 50 GB?
>>
>> Thanks
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Reading TB of JSON file

2020-06-18 Thread nihed mbarek
Hi,

What is the size of one JSON document?

There is also the scan of your JSON to infer the schema; the overhead can
be huge. There are two solutions: define a schema and use it directly during
the load, or ask Spark to analyse only a small part of the JSON file (I don't
remember exactly how to do it).
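
For illustration, a rough sketch of both options in one place; the field names and paths are made up, and the sampling variant relies on the JSON reader's samplingRatio option:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("json-schema-example").getOrCreate()

// Option 1: supply the schema up front so Spark skips the inference scan entirely.
val schema = StructType(Seq(
  StructField("id", LongType),            // hypothetical fields
  StructField("payload", StringType)))
val dfWithSchema = spark.read.schema(schema).json("s3a://my-bucket/big.json")

// Option 2: let Spark infer the schema from only a fraction of the input.
val dfSampled = spark.read.option("samplingRatio", "0.01").json("s3a://my-bucket/big.json")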

Regards,


On Thu, Jun 18, 2020 at 3:12 PM Chetan Khatri 
wrote:

> Hi Spark Users,
>
> I have a 50GB of JSON file, I would like to read and persist at HDFS so it
> can be taken into next transformation. I am trying to read as
> spark.read.json(path) but this is giving Out of memory error on driver.
> Obviously, I can't afford having 50 GB on driver memory. In general, what
> is the best practice to read large JSON file like 50 GB?
>
> Thanks
>


-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: Reading TB of JSON file

2020-06-18 Thread Jörn Franke
It depends on the data types you use.

Is it in JSON Lines format? Then the amount of memory plays much less of a
role.

Otherwise, if it is one large object or array, I would not recommend it.
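
For illustration, roughly what the difference looks like on the reader side (paths are made up, and `spark` is an existing session as in spark-shell):

// JSON Lines: one complete object per line, so Spark can split the file across
// executors and never needs the whole 50 GB document in memory at once.
val linesDf = spark.read.json("s3a://my-bucket/records.jsonl")

// A single large object/array: multiLine mode treats each file as one record,
// which is the memory-hungry case I would avoid at this size.
val hugeDf = spark.read.option("multiLine", "true").json("s3a://my-bucket/one-big-doc.json")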

> On 18.06.2020 at 15:12, Chetan Khatri wrote:
> 
> 
> Hi Spark Users,
> 
> I have a 50GB of JSON file, I would like to read and persist at HDFS so it 
> can be taken into next transformation. I am trying to read as 
> spark.read.json(path) but this is giving Out of memory error on driver. 
> Obviously, I can't afford having 50 GB on driver memory. In general, what is 
> the best practice to read large JSON file like 50 GB?
> 
> Thanks

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
Assuming that the file can be easily split, I would divide it into a number
of pieces and move those pieces to HDFS before using spark at all, using
`hdfs dfs` or similar. At that point you can use your executors to perform
the reading instead of the driver.
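
A rough sketch of the Spark-side half, assuming the pieces are line-delimited JSON already sitting in an HDFS directory (paths are placeholders, and `spark` is an existing session as in spark-shell):

// Reading the directory of pieces lets each executor pull its own splits,
// instead of the driver trying to hold the whole file.
val df = spark.read.json("hdfs:///data/big_json_pieces/")

// Persist in a columnar format for the next transformation.
df.write.mode("overwrite").parquet("hdfs:///data/big_json_parquet/")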

On Thu, Jun 18, 2020 at 9:12 AM Chetan Khatri 
wrote:

> Hi Spark Users,
>
> I have a 50GB of JSON file, I would like to read and persist at HDFS so it
> can be taken into next transformation. I am trying to read as
> spark.read.json(path) but this is giving Out of memory error on driver.
> Obviously, I can't afford having 50 GB on driver memory. In general, what
> is the best practice to read large JSON file like 50 GB?
>
> Thanks
>


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
Hi Spark Users,

I have a 50GB of JSON file, I would like to read and persist at HDFS so it
can be taken into next transformation. I am trying to read as
spark.read.json(path) but this is giving Out of memory error on driver.
Obviously, I can't afford having 50 GB on driver memory. In general, what
is the best practice to read large JSON file like 50 GB?

Thanks


Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-18 Thread Jacek Laskowski
Hi Rachana,

> Should I go backward and use DStream-based Spark Streaming?

No. Never. It's no longer supported (and should really be removed from the
codebase once and for all - dreaming...).

Spark focuses on Spark SQL and Spark Structured Streaming as user-facing
modules for batch and streaming queries, respectively.

Please note that I'm not a PMC member or even a committer so I'm speaking
for myself only (not representing the project in an official way).

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Thu, Jun 18, 2020 at 12:03 AM Rachana Srivastava
 wrote:

> *Structured Streaming vs Spark Streaming (DStream)?*
>
> Which is recommended for system stability? Exactly-once is NOT the first
> priority. The first priority is a STABLE system.
>
> I need to make a decision soon. I need help. Here is the question
> again: should I go backward and use DStream-based Spark Streaming, write
> our own checkpointing, and go from there? At least we never encounter these
> metadata issues there.
>
> Thanks,
>
> Rachana
>
> On Wednesday, June 17, 2020, 02:02:20 PM PDT, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>
> Just in case if anyone prefers ASF projects then there are other
> alternative projects in ASF as well, alphabetically, Apache Hudi [1] and
> Apache Iceberg [2]. Both are recently graduated as top level projects.
> (DISCLAIMER: I'm not involved in both.)
>
> BTW it would be nice if we make the metadata implementation on file stream
> source/sink be pluggable - from what I've seen, plugin approach has been
> selected as the way to go whenever some part is going to be complicated and
> it becomes arguable whether the part should be handled in Spark project vs
> should be outside. e.g. checkpoint manager, state store provider, etc. It
> would open up chances for the ecosystem to play with the challenge "without
> completely re-writing the file stream source and sink", focusing on
> scalability for metadata in a long run query. Alternative projects
> described above will still provide more higher-level features and
> look attractive, but sometimes it may be just "using a sledgehammer to
> crack a nut".
>
> 1. https://hudi.apache.org/
> 2. https://iceberg.apache.org/
>
>
> On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das 
> wrote:
>
> Hello Rachana,
>
> Getting exactly-once semantics on files and making it scale to a very
> large number of files are very hard problems to solve. While Structured
> Streaming + built-in file sink solves the exactly-once guarantee that
> DStreams could not, it is definitely limited in other ways (scaling in
> terms of files, combining batch and streaming writes in the same place,
> etc). And solving this problem requires a holistic solution that is
> arguably beyond the scope of the Spark project.
>
> There are other projects that are trying to solve this file management
> issue. For example, Delta Lake (full disclosure, I am
> involved in it) was built to exactly solve this problem - get exactly-once
> and ACID guarantees on files, but also scale to handling millions of files.
> Please consider it as part of your solution.
>
>
>
>
> On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
>  wrote:
>
> I have written a simple Spark Structured Streaming app to move data from
> Kafka to S3. I found that in order to support the exactly-once guarantee, Spark
> creates a _spark_metadata folder, which ends up growing too large as the
> streaming app is SUPPOSED TO run FOREVER. But when the streaming app runs
> for a long time, the metadata folder grows so big that we start getting OOM
> errors. The only way to resolve the OOM is to delete the checkpoint and metadata
> folders and lose VALUABLE customer data.
>
> Spark has open JIRAs on this (SPARK-24295, SPARK-29995, and SPARK-30462).
> Spark Streaming was NOT broken like this. Is Spark Streaming a
> BETTER choice?
>
>
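
A minimal sketch of the kind of Kafka-to-S3 query being described; every broker, topic, and path below is a placeholder. The parquet sink's exactly-once bookkeeping is the _spark_metadata directory under the output path, and it is that log, together with the checkpoint, that keeps growing for an always-on query:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

// Read from Kafka as a stream.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")    // hypothetical broker
  .option("subscribe", "events")                       // hypothetical topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write to a file sink; the sink records committed files in
// <path>/_spark_metadata to provide exactly-once file output.
val query = input.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/events/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()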