Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3832#issuecomment-68580794
  
    @tdas I've updated this PR and added a test case.  My test case performs 
actions inside of a `transform()` call to emulate what Streaming's `saveAsHadoopFiles` 
operation does.  Is this a valid use of `transform()`, or am I breaking the rules by 
having actions in my transform function?  My gut says that we shouldn't endorse 
/ recommend this for the same reason that we advise against using accumulators 
inside of map() tasks: the transform call might get evaluated multiple times if 
caching isn't used, which makes it possible to write programs whose behavior 
changes depending on whether caching is enabled.
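
    To make the pattern concrete, here's a minimal sketch of what the test does, 
assuming hypothetical stream and path names (this is not the actual test code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical setup; the point is the side-effecting action inside transform().
val conf = new SparkConf().setMaster("local[2]").setAppName("TransformActionSketch")
val ssc = new StreamingContext(conf, Seconds(1))
val input = ssc.socketTextStream("localhost", 9999)

val transformed = input.transform { rdd =>
  // An action inside the transform function: this runs every time the transform
  // is (re-)evaluated, so without caching it may execute more than once per batch.
  rdd.saveAsTextFile(s"/tmp/sketch-output-${System.currentTimeMillis()}")
  rdd  // return the RDD unchanged so downstream operations still work
}
transformed.print()
```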
    
    I wasn't able to get the existing "recovery with saveAsNewAPIHadoopFiles 
operation" test to fail, even though I discovered this bug while refactoring 
that test in my other PR.  I think the issue is that the failed 
`saveAsNewAPIHadoopFiles` jobs did not trigger a failure of the other actions / 
transformations in that batch, so we still got the correct output even though 
the batch completion event wasn't posted to the listener bus.  The current 
tests rely on wall-clock time to detect when batches have been processed and 
hence didn't detect that the batch completion event was missing.  I noticed 
that the StreamingListener API doesn't really have any events for job / batch 
failures, but that's a topic for a separate PR.
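
    As an aside, a test could observe batch completions through the listener bus 
instead of wall-clock time.  A rough sketch (the counter class here is made up, 
not an existing Spark helper):

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Hypothetical helper: counts batch-completion events posted to the listener bus,
// so a test can wait for N completions instead of sleeping for a fixed duration.
class BatchCompletionCounter extends StreamingListener {
  val numCompleted = new AtomicInteger(0)
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    numCompleted.incrementAndGet()
  }
}

// Usage (assuming `ssc` is a StreamingContext):
//   val counter = new BatchCompletionCounter
//   ssc.addStreamingListener(counter)
//   ssc.start()
//   // ... then poll counter.numCompleted instead of relying on wall-clock sleeps ...
```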
    
    I was about to write that this bug might not actually affect users who 
don't use `transform`, but it still impacts users in the partial-failure case 
where they've used `PairDStreamFunctions.saveAsHadoopFiles()` and a batch fails 
with partially-written output: an individual output _partition_ might be 
atomically committed to the output directory (i.e. if the file exists, then it 
has the right contents), but I think we can still wind up in a scenario where 
only a subset of the partitions are written and the non-empty output directory 
prevents the recovery from recomputing the missing partitions.
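
    To illustrate the failure mode I'm worried about, here's a rough sketch of the 
problematic state after recovery (the path and the existence check are illustrative 
only, not the actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical output directory left behind by a partially-failed batch.
val outputDir = new Path("hdfs:///user/example/stream-output/batch-1420000000000")
val fs = FileSystem.get(new Configuration())

if (fs.exists(outputDir)) {
  // Danger: the directory may hold only a subset of partitions (e.g. part-00000
  // but not part-00001).  If recovery treats an existing, non-empty directory as
  // "already written" (or refuses to write into it), the missing partitions are
  // never recomputed.
} else {
  // Recompute the batch and write its output.
}
```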

