Yea - I'd just add a bunch of columns. Doesn't seem like that big of a deal.
On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling <rnowl...@gmail.com> wrote:

> I'm considering a few approaches -- one of which is to provide new
> functions like mapLeft, mapRight, filterLeft, etc.
>
> But this all falls short with DataFrames. RDDs can easily be extended
> from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add
> special columns?
>
> On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> How about just using two fields, one boolean field to mark good/bad, and
>> another to get the source file?
>>
>> On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm working on an ETL task with Spark. As part of this work, I'd like
>>> to mark records with some info such as:
>>>
>>> 1. Whether the record is good or bad (e.g., Either)
>>> 2. Originating file and lines
>>>
>>> Part of my motivation is to prevent errors with individual records from
>>> stopping the entire pipeline. I'd also like to filter out and log bad
>>> records at various stages.
>>>
>>> I could use RDD[Either[T]] for everything, but that won't work for
>>> DataFrames. I was wondering if anyone has had a similar situation and if
>>> they found elegant ways to handle this?
>>>
>>> Thanks,
>>> RJ
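
For reference, the two ideas in this thread -- an Either-style wrapper carrying a good/bad marker, plus provenance fields (the same fields that would become extra columns in a DataFrame) -- can be sketched in plain Python with no Spark dependency. All names here (Record, parse_line, etc.) are hypothetical, just to illustrate the pattern:

```python
# Minimal sketch: wrap each record with a boolean good/bad flag and
# provenance (source file + line), so bad records can be filtered and
# logged at any stage instead of stopping the whole pipeline.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    value: Optional[int]   # parsed payload; None when parsing failed
    good: bool             # boolean good/bad marker
    source_file: str       # provenance: originating file
    line_no: int           # provenance: originating line

def parse_line(raw: str, source_file: str, line_no: int) -> Record:
    """Parse one raw line; never raises, so one bad record can't kill the job."""
    try:
        return Record(int(raw), True, source_file, line_no)
    except ValueError:
        return Record(None, False, source_file, line_no)

def filter_good(records):
    """Analogue of df.filter(col("good")) on a DataFrame with a 'good' column."""
    return [r for r in records if r.good]

def filter_bad(records):
    """Bad records keep their provenance, so they can be logged downstream."""
    return [r for r in records if not r.good]

raw_lines = ["1", "oops", "3"]
records = [parse_line(s, "input.csv", i) for i, s in enumerate(raw_lines, 1)]
good = filter_good(records)
bad = filter_bad(records)
```

With a DataFrame the same information would just live in two extra columns (e.g. a boolean `good` column and a `source_file` column), as suggested above, rather than in a wrapper type.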