[ https://issues.apache.org/jira/browse/SPARK-26647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grid updated SPARK-26647:
-------------------------
Environment: Spark Kubernetes {{2.4.0}} on GKE {{1.11.5-gke.5}}
    was: Using spark kubernetes {{2.4.0}} on gke {{1.11.5-gke.5}}

Description:
When using Spark on Kubernetes with the latest connector jar
{{[https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar]}}
(I don't know which connector version this corresponds to), I have a Spark job that writes about 10 GB of data to GCS using the DataFrame writer:

df.write.json(path_to_gcs_bucket)

The job and its stages are reported as complete, but I can still see part files being written in the background:

{{gs://mybucket/output/ZGM0YTg3Nzk2NDEwY2ViY2FhNTYwZTZi/part-00124-e86f3a48-72f7-4bf7-bdc4-328e97cdc7b1-c000.json}}

The job is marked as success while GCS writes are still going on in the background. The job stage should be updated/reported correctly and not be marked as {{success}} until the writes finish. Once the writes have completed, the SparkContext {{stop()}} call is reached and the job terminates.
> Spark Job marked as success when data is still being written to GCS
> -------------------------------------------------------------------
>
>                 Key: SPARK-26647
>                 URL: https://issues.apache.org/jira/browse/SPARK-26647
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>         Environment: Spark Kubernetes {{2.4.0}} on GKE {{1.11.5-gke.5}}
>            Reporter: Grid
>            Priority: Major
>         Attachments: 51244468-1971b700-197d-11e9-9682-f021f1bc64e7.png
>

--
This message was sent by Atlassian JIRA (v7.6.3#76005)