Yea - I'd just add a bunch of columns. Doesn't seem like that big of a deal.
On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling <rnowl...@gmail.com> wrote:

> I'm considering a few approaches -- one of which is to provide new
> functions like mapLeft, mapRight, filterLeft, etc.
>
> But this all falls short with DataFrames. RDDs can easily be extended
> from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add
> special columns?
>
> On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> How about just using two fields, one boolean field to mark good/bad, and
>> another to get the source file?
>>
>> On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm working on an ETL task with Spark. As part of this work, I'd like
>>> to mark records with some info such as:
>>>
>>> 1. Whether the record is good or bad (e.g., Either)
>>> 2. Originating file and lines
>>>
>>> Part of my motivation is to prevent errors with individual records from
>>> stopping the entire pipeline. I'd also like to filter out and log bad
>>> records at various stages.
>>>
>>> I could use RDD[Either[T]] for everything, but that won't work for
>>> DataFrames. I was wondering if anyone has had a similar situation and if
>>> they found elegant ways to handle this?
>>>
>>> Thanks,
>>> RJ
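
For reference, the two ideas in this thread -- an Either-style wrapper carrying a good/bad marker, plus provenance fields (the same fields that would become extra columns in a DataFrame) -- can be sketched in plain Python with no Spark dependency. All names here (Record, parse_line, etc.) are hypothetical, just to illustrate the pattern:

```python
# Minimal sketch: wrap each record with a boolean good/bad flag and
# provenance (source file + line), so bad records can be filtered and
# logged at any stage instead of stopping the whole pipeline.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    value: Optional[int]   # parsed payload; None when parsing failed
    good: bool             # boolean good/bad marker
    source_file: str       # provenance: originating file
    line_no: int           # provenance: originating line

def parse_line(raw: str, source_file: str, line_no: int) -> Record:
    """Parse one raw line; never raises, so one bad record can't kill the job."""
    try:
        return Record(int(raw), True, source_file, line_no)
    except ValueError:
        return Record(None, False, source_file, line_no)

def filter_good(records):
    """Analogue of df.filter(col("good")) on a DataFrame with a 'good' column."""
    return [r for r in records if r.good]

def filter_bad(records):
    """Bad records keep their provenance, so they can be logged downstream."""
    return [r for r in records if not r.good]

raw_lines = ["1", "oops", "3"]
records = [parse_line(s, "input.csv", i) for i, s in enumerate(raw_lines, 1)]
good = filter_good(records)
bad = filter_bad(records)
```

With a DataFrame the same information would just live in two extra columns (e.g. a boolean `good` column and a `source_file` column), as suggested above, rather than in a wrapper type.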