[ 
https://issues.apache.org/jira/browse/BIGTOP-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218964#comment-14218964
 ] 

RJ Nowling commented on BIGTOP-1535:
------------------------------------

[~jayunit100], what do you mean by "in place processing of data?"  To use any 
of the data in Spark, we have to read it into memory and structure it into more 
useful data structures.  For example, customer and store details are repeated 
in every line.  There is also a little parsing logic needed to parse 
date/times back into the appropriate objects to do things like sort 
transactions by dates and times.
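
To make the restructuring described above concrete, here is a minimal sketch in 
plain Python (not Spark, and not the actual BPS code).  The column layout, the 
sample rows, and the normalize() function are all assumptions made up for 
illustration; the real generator output may look quite different.  It only 
shows the idea: collapse the repeated store and customer details into their 
own tables, parse the date/time strings into objects, and sort transactions by 
time.

```python
import csv
from datetime import datetime
from io import StringIO

# Hypothetical raw format (an assumption, not the real BPS generator output):
# store_id,store_zip,customer_id,customer_name,txn_id,txn_datetime,product
RAW = """\
1,46556,10,Alice,100,2014-11-19T14:30:00,dog food
1,46556,10,Alice,101,2014-11-18T09:15:00,cat litter
2,32601,11,Bob,102,2014-11-19T08:00:00,fish flakes
"""


def normalize(raw_text):
    """Split denormalized lines into store, customer, and transaction tables."""
    stores, customers, transactions = {}, {}, []
    for row in csv.reader(StringIO(raw_text)):
        store_id, store_zip, cust_id, cust_name, txn_id, ts, product = row
        # Store and customer details repeat on every line; keep one copy each.
        stores[int(store_id)] = store_zip
        customers[int(cust_id)] = cust_name
        transactions.append({
            "txn_id": int(txn_id),
            "store_id": int(store_id),
            "customer_id": int(cust_id),
            # Parse the timestamp string back into a datetime object.
            "when": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S"),
            "product": product,
        })
    # With real datetime objects, sorting by date and time is straightforward.
    transactions.sort(key=lambda t: t["when"])
    return stores, customers, transactions


stores, customers, transactions = normalize(RAW)
```

A Spark version would apply the same per-line parsing inside a map over the 
input and then deduplicate the store and customer records, but the shape of 
the problem is the same.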

I think it's a good thing that the Spark driver produces the "raw" data -- it 
is a realistic problem that data scientists face.

OK, so then the question is whether we should use MapReduce for ETL or write a 
Spark version.  I see upsides and downsides here.

Pros of a Spark ETL script:
* Good example for users
* Not all Spark users will have access to a MapReduce installation.  In fact, 
many users are either leaving MR for Spark or just starting on Spark alone.  I 
think forcing users to set up MR to use BPS Spark would cause headaches for 
users.
* Provides comparisons between Spark and MapReduce solutions (Pig, etc.)

Cons of a Spark ETL script:
* Greater risk of divergence of BPS MapReduce and BPS Spark.

If the primary concern is divergence, then I wonder if this can be addressed in 
other ways?  For example, can you add a MapReduce driver for the new data 
generator, output data in the same format as the Spark driver, and modify the 
Pig script to convert it to the same or a similar normalized representation so 
that the components are interchangeable?

If you're not comfortable with that solution, then let's find one that we're 
both happy with.  :)  I want to make sure we get on the same page.




> Add Spark ETL script to BigPetStore
> -----------------------------------
>
>                 Key: BIGTOP-1535
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1535
>             Project: Bigtop
>          Issue Type: Improvement
>          Components: blueprints
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> We should add a script that reads the results from the data generator and 
> normalizes the data and splits it into separate tables (ETL).  It would be 
> nice to use Spark SQL but it is not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
