[GitHub] spark pull request: [CORE] [SPARK-6593] Provide option for HadoopR...

tigerquoll Mon, 30 Mar 2015 05:00:08 -0700

Github user tigerquoll commented on the pull request:

    https://github.com/apache/spark/pull/5250#issuecomment-87650287
  
    Hi Sean, 
    Thanks for your input - your views have helped me refine my thinking on the 
matter.  I believe that if you take a purist's point of view then yes, you can 
say the source of the problem (likely) lies with the data producer and should be 
fixed at the data producer's end. 
    
    The point is that this problem is affecting many Spark users right now, 
and many of them are not in control of the source system of the data they are 
analysing and are forced to 'make do' with what they have.  You call this 
solution a band-aid, and many ETL solutions are band-aids, but providing this 
functionality is useful and serves a purpose for the end user.
    
    Are you concerned that swallowing an exception could leave the Hadoop input 
libraries in an inconsistent state, causing more data corruption?  This will 
not happen, because swallowing the exception immediately ends the file-reading 
task, and the task reads no more data.
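
    To illustrate the control flow I mean, here is a minimal sketch in Python. 
It is not the code in this pull request and not Spark's or Hadoop's API: the 
names read_records, FakeReader and swallow_errors are hypothetical, and I am 
assuming a reader that returns None at end of input and raises IOError on a 
corrupt block.

```python
from typing import Callable, Iterator, Optional


def read_records(
    next_record: Callable[[], Optional[str]],
    swallow_errors: bool = False,  # hypothetical opt-in flag, off by default
) -> Iterator[str]:
    """Yield records until the reader is exhausted or fails."""
    while True:
        try:
            record = next_record()
        except IOError:
            if not swallow_errors:
                raise  # default behaviour: the failure terminates the task
            return  # opt-in: treat the failure as end of input, read nothing more
        if record is None:
            return  # normal end of input
        yield record


class FakeReader:
    """Stand-in reader that raises IOError on the call with index fail_at."""

    def __init__(self, records, fail_at):
        self.records = records
        self.fail_at = fail_at
        self.calls = 0  # how many times the reader has been asked for a record

    def __call__(self):
        i = self.calls
        self.calls += 1
        if i == self.fail_at:
            raise IOError("corrupt block")
        return self.records[i] if i < len(self.records) else None


reader = FakeReader(["a", "b", "c", "d"], fail_at=2)
recovered = list(read_records(reader, swallow_errors=True))
```

    With the flag enabled, the records read before the failure are kept and 
the reader is never called again after the failing call.  With the flag left 
at its default, the IOError propagates exactly as it does today.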
    
    Are you concerned that a swallowed exception indicates that something has 
potentially gone wrong earlier in the Hadoop input read, and that previously 
read data could have been corrupted?  The user already knows this is 
potentially the case, because running the application without this option 
enabled is what caused the application to terminate in the first place.
    
    The fact that we are being more permissive of potentially corrupt data is a 
show stopper for this being default behaviour - but I'm not proposing this be 
default behaviour.  I'm proposing this be a last-ditch option that an advanced 
user can knowingly enable when attempting to deal with corrupted data, with the 
understanding that their data could be made worse, but that most likely the 
corrupt data will simply be omitted. 
    
    The alternative is to tell users that their data is not suitable for being 
loaded into Spark, and that they should either use another tool or ask the 
owner of the data system to fix the data feeds and supply another data set 
some time in the future.  I know which option I would prefer if given the 
choice - don't let perfect be the enemy of good.




