[GitHub] spark pull request: [CORE] [SPARK-6593] Provide option for HadoopR...

srowen Mon, 30 Mar 2015 05:16:26 -0700

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5250#issuecomment-87653690
  
    I don't know if corrupted gzip files are such a common problem, but I'm not 
sure that would change the logic about where to fix things. It is a problem 
with the preceding ETL process, yes. Something else needs to explicitly check 
and/or fix the input first if this is a problem.
    
    My point is also that this change does not just address the proposed 
problem with gzip files. It treats any error as recoverable.
    
    It has nothing to do with inconsistent state. The problem is presenting a 
successful result that is silently missing input, which might not even be 
deterministic. This seems far more problematic than reliably failing fast 
and, yes, making you fix your upstream process.
    
    Hiding this behind a flag only goes so far. The flag has to be documented 
(or else how many people does it help?). It becomes a code path that has to be 
supported for a long time. It is presented to users as a fine thing to do when 
I don't believe it is. It's not the perfect being the enemy of the good, but 
the dangerous being the enemy of the good.
    
    This has nothing to do with telling people they can't use Spark, or that 
they have to fix an unfixable upstream process. This is about dealing with bad 
upstream data in the right place, and this change is not how to do it.
    
    Specifically: why not write a process that just opens a stream on each 
input file in turn and tries to read a handful of bytes? If that fails, delete 
the file or do what you like with it. This is maybe 10 lines of code in your 
driver.
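
    A rough sketch of that kind of pre-check, written in Python for brevity 
and assuming gzip files on a local filesystem. The names is_readable_gzip, 
split_inputs and probe_bytes are hypothetical and not part of the pull request; 
a real Spark driver would more likely open the files through the Hadoop 
filesystem and codec classes. Note that reading only a few bytes catches a bad 
header but not a file truncated further in.

```python
import gzip
import zlib


def is_readable_gzip(path, probe_bytes=16):
    """Return True if a few decompressed bytes can be read from `path`.

    Hypothetical helper: assumes `path` is a local gzip file.
    """
    try:
        with gzip.open(path, "rb") as stream:
            stream.read(probe_bytes)
        return True
    except (OSError, EOFError, zlib.error):
        # Bad magic number, truncated header, or a corrupt deflate stream.
        return False


def split_inputs(paths):
    """Partition `paths` into (readable, unreadable), preserving order."""
    good, bad = [], []
    for path in paths:
        (good if is_readable_gzip(path) else bad).append(path)
    return good, bad
```

The job would then be given only the first list, and the second can be logged, 
moved aside or deleted, so the decision about bad input stays explicit.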



