Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread lucas.g...@gmail.com
As Keith said, it depends on what you want to do with your data. From a pipelining perspective the general flow (YMMV) is: Load dataset(s) -> Transform and/or Join -> Aggregate -> Write dataset. Each step in the pipeline does something distinct with the data. The end step is usually
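
A minimal PySpark sketch of that Load -> Transform -> Aggregate -> Write flow (bucket paths and column names are hypothetical), keeping everything in DataFrame operations instead of pulling rows to the driver with collectAsList():

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

    # Load dataset(s)
    orders = spark.read.parquet("s3a://example-bucket/orders/")
    customers = spark.read.parquet("s3a://example-bucket/customers/")

    # Transform and/or Join
    joined = orders.join(customers, on="customer_id", how="inner")

    # Aggregate
    daily_totals = joined.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

    # Write dataset -- nothing is ever funnelled through the driver
    daily_totals.write.mode("overwrite").parquet("s3a://example-bucket/daily_totals/")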

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread lucas.g...@gmail.com
"Building data products is a very different discipline from that of building software." That is a fundamentally incorrect assumption. There will always be a need for figuring out how to apply said principles, but saying 'we're different' has always turned out to be incorrect and I have seen no

Re: Does spark 2.1.0 structured streaming support jdbc sink?

2017-04-09 Thread lucas.g...@gmail.com
Interesting, does anyone know if we'll be seeing the JDBC sinks in upcoming releases? Thanks! Gary Lucas On 9 April 2017 at 13:52, Silvio Fiorito wrote: > JDBC sink is not in 2.1. You can see here for an example implementation > using the ForEachWriter sink
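
For anyone finding this later: Spark 2.4+ added foreachBatch to PySpark's structured streaming API, which is one common way to get JDBC output without a dedicated sink. A rough sketch, with the connection details purely illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-to-jdbc-sketch").getOrCreate()

    stream_df = (spark.readStream
                 .format("rate")                    # stand-in source for illustration
                 .option("rowsPerSecond", 10)
                 .load())

    def write_batch(batch_df, batch_id):
        # Each micro-batch is an ordinary DataFrame, so the batch JDBC writer applies
        batch_df.write.jdbc(
            url="jdbc:mysql://example-host:3306/db",   # hypothetical connection details
            table="events",
            mode="append",
            properties={"user": "spark", "password": "secret"})

    query = stream_df.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()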

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread lucas.g...@gmail.com
From our perspective, we have invested heavily in Kubernetes as our cluster manager of choice. We also make quite heavy use of spark. We've been experimenting with using these builds (2.1 with pyspark enabled) quite heavily. Given that we've already 'paid the price' to operate Kubernetes in

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread lucas.g...@gmail.com
That's great! On 12 July 2017 at 12:41, Felix Cheung wrote: > Awesome! Congrats!! > > -- > *From:* holden.ka...@gmail.com on behalf of > Holden Karau > *Sent:* Wednesday, July 12, 2017

Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread lucas.g...@gmail.com
I've been wondering about this for a while. We wanted to do something similar for generically saving thousands of individual homogeneous events into well-formed Parquet. Ultimately I couldn't find something I wanted to own and pushed back on the requirements. It seems the canonical answer is that
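
For what it's worth, the pattern that usually gets suggested (a sketch only, with hypothetical field names) is to address nested fields with dotted paths and explode any arrays, rather than trying to build a generic flattener:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.json("s3a://example-bucket/raw_events/")   # hypothetical path

    flat = (events
            .withColumn("item", F.explode("payload.items"))        # unpack a nested array
            .select(
                F.col("id"),
                F.col("payload.user.name").alias("user_name"),     # dotted path into a struct
                F.col("item.sku"),
                F.col("item.price")))

    flat.write.mode("overwrite").parquet("s3a://example-bucket/events_flat/")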

Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread lucas.g...@gmail.com
i <ferra...@gmail.com> > wrote: > >> What's against: >> >> df.rdd.map(...) >> >> or >> >> dataset.foreach() >> >> https://spark.apache.org/docs/2.0.1/api/scala/index.html#org >> .apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit

Re: Spark Testing Library Discussion

2017-04-25 Thread lucas.g...@gmail.com
Hi all, whoever (Sam I think) was going to do some work on doing a template testing pipeline. I'd love to be involved, I have a current task in my day job (data engineer) to flesh out our testing how-to / best practices for Spark jobs and I think I'll be doing something very similar for the next

Re: Spark Testing Library Discussion

2017-04-28 Thread lucas.g...@gmail.com
> > As promised here is the first blog post in a series of posts I hope to write > on how we build data pipelines > > Please feel free to retweet my original tweet and share because the more > ideas we have the better! > > Feedback is always welcome! > > Regards

Re: Spark Testing Library Discussion

2017-04-29 Thread lucas.g...@gmail.com
> Thank you very much for the feedback and I will be sure to add it once I > have more feedback > > > Maybe we can create a gist of all this or even a tiny book on best > practices if people find it useful > > Looking forward to the PR! > > Regards > Sam

Re: How can i remove the need for calling cache

2017-08-01 Thread lucas.g...@gmail.com
Hi Jeff, that looks sane to me. Do you have additional details? On 1 August 2017 at 11:05, jeff saremi wrote: > Calling cache/persist fails all our jobs (i have posted 2 threads on > this). > > And we're giving up hope in finding a solution. > So I'd like to find a

Spark <--> S3 flakiness

2017-05-10 Thread lucas.g...@gmail.com
Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. We're running into the following issues on a semi-regular basis: * These are intermittent errors, IE we have about 300 jobs that run nightly... And a fairly random but
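
For context, s3a exposes a few client settings that are commonly tuned for intermittent S3 errors. A sketch only: the keys below are standard s3a options, but the right values (and which ones matter) depend on your Hadoop version, so verify against the s3a documentation:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3a-tuning-sketch")
             # retry harder on throttled or flaky requests
             .config("spark.hadoop.fs.s3a.attempts.maximum", "20")
             # larger connection pool for jobs with many concurrent tasks
             .config("spark.hadoop.fs.s3a.connection.maximum", "100")
             # longer timeout (ms) before a request is considered failed
             .config("spark.hadoop.fs.s3a.connection.timeout", "200000")
             .getOrCreate())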

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
://issues.apache.org/jira/browse/HADOOP-9565 look relevant too. On 10 May 2017 at 22:24, Miguel Morales <therevolti...@gmail.com> wrote: > Try using the DirectParquetOutputCommiter: > http://dev.sortable.com/spark-directparquetoutputcommitter/ > > On Wed, May 10, 2017 at 10:07 PM, lu

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: > Looks like this isn't viable in spark 2.0.0 (and greater I presume). I'm > pretty sure I came across this blog and ignored it due to that. > > Any other thoughts? The linked tickets in: https://issues.apache.org/ > jira/bro

Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
> > df = spark.sqlContext.read.csv('out/df_in.csv') > > 17/05/09 15:51:29 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording > the schema version 1.2.0 > 17/05/09 15:51:29 WARN ObjectStore: Failed to get database
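
If the clash is between the external spark-csv package and the built-in Spark 2.x CSV reader, one workaround (a sketch; it assumes the conflict really is the duplicate "csv" short name) is to name the data source by its fully qualified class so there is no ambiguity about which implementation runs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Built-in reader, addressed by its fully qualified class name
    df = (spark.read
          .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
          .option("header", "true")
          .load("out/df_in.csv"))

    # Or, to force the external Databricks package instead:
    # df = (spark.read.format("com.databricks.spark.csv")
    #       .option("header", "true").load("out/df_in.csv"))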

Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
I'm a bit confused by that answer, I'm assuming it's spark deciding which lib to use. On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote: > This looks more like a matter for Databricks support than spark-user. > > On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gma

Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread lucas.g...@gmail.com
nality . > > > Thank you, > *Pushkar Gujar* > > > On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra <m...@clearstorydata.com> > wrote: > >> Looks to me like it is a conflict between a Databricks library and Spark >> 2.1. That's an issue for Databricks to resolve or pr

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread lucas.g...@gmail.com
Any chance of a link to that video :) Thanks! G On 10 May 2017 at 09:49, Holden Karau wrote: > So this Python side pipelining happens in a lot of places which can make > debugging extra challenging. Some people work around this with persist > which breaks the pipelining
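
A tiny sketch of the persist workaround mentioned above (raw_df is a hypothetical input): persisting and forcing an action materializes the intermediate result, which breaks the Python-side pipelining and makes it clearer where the time actually goes:

    from pyspark import StorageLevel

    intermediate = raw_df.filter("amount > 0").withColumn("amount_x2", raw_df["amount"] * 2)
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)
    intermediate.count()    # action forces materialization; timings now stop at this boundary

    result = intermediate.groupBy("amount_x2").count()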

Re: Spark <--> S3 flakiness

2017-05-17 Thread lucas.g...@gmail.com
Gary On 17 May 2017 at 03:19, Steve Loughran <ste...@hortonworks.com> wrote: > > On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote: > > Steve, thanks for the reply. Digging through all the documentation now. > > Much appreciated! > > > > FWIW, if you

Re: Spark <--> S3 flakiness

2017-05-16 Thread lucas.g...@gmail.com
Steve, thanks for the reply. Digging through all the documentation now. Much appreciated! On 16 May 2017 at 10:10, Steve Loughran <ste...@hortonworks.com> wrote: > > On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: > > Hi users, we have a bunch of pyspark jobs

Re: Deciphering spark warning "Truncated the string representation of a plan since it was too large."

2017-06-12 Thread lucas.g...@gmail.com
AFAIK the process a Spark program follows is: 1. A set of transformations is defined on a given input dataset. 2. At some point an action is called (in your case, writing to your parquet file). 3. When that happens Spark creates a logical plan and then a physical plan
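
If the goal is simply to see more of the plan text, the warning itself points at a setting; a sketch, assuming spark.debug.maxToStringFields (an internal, unsupported knob) is still what your version uses:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.debug.maxToStringFields", "200")   # default is much lower
             .getOrCreate())

    # df.explain(True) will now print wider plans before truncation kicks in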

Re: Using SparkContext in Executors

2017-05-31 Thread lucas.g...@gmail.com
+1 to Ayan's answer, I think this is a common distributed anti-pattern that trips us all up at some point or another. You definitely want to (in most cases) yield and create a new RDD/DataFrame/Dataset and then perform your save operation on that. On 28 May 2017 at 21:09, ayan guha
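
A sketch of the shape being described (names are hypothetical): do per-record work inside a transformation with plain Python, return rows, and let the driver issue the save on the resulting DataFrame, rather than touching SparkContext or SparkSession inside executor code:

    def enrich_partition(rows):
        # plain Python per partition; no SparkContext / SparkSession in here
        for row in rows:
            yield (row["id"], row["value"] * 2)

    enriched = (source_df.rdd                      # source_df is a hypothetical input DataFrame
                .mapPartitions(enrich_partition)
                .toDF(["id", "doubled_value"]))

    enriched.write.mode("overwrite").parquet("s3a://example-bucket/enriched/")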

Re: RDD order preservation through transformations

2017-09-13 Thread lucas.g...@gmail.com
I'm wondering why you need order preserved, we've had situations where keeping the source as an artificial field in the dataset was important and I had to run contortions to inject that (In this case the datasource had no unique key). Is this similar? On 13 September 2017 at 10:46, Suzen, Mehmet
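
In case it's the same situation: one of the contortions that works is to pin the original position to each record up front (zipWithIndex here; source_df is a hypothetical input), so nothing downstream has to preserve ordering at all:

    indexed = (source_df.rdd
               .zipWithIndex()                                   # (Row, original_position)
               .map(lambda pair: tuple(pair[0]) + (pair[1],))
               .toDF(source_df.columns + ["source_position"]))

    # Downstream, sort explicitly whenever the original order matters
    restored = indexed.orderBy("source_position")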

Re: How to pass sparkSession from driver to executor

2017-09-21 Thread lucas.g...@gmail.com
I'm not sure what you're doing. But I have in the past used spark to consume a manifest file and then execute a .mapPartitions on the result, like this: def map_key_to_event(s3_events_data_lake): def _map_key_to_event(event_key_list, s3_client=test_stub): print("Events in list")
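
The preview truncates the code, so here is a fuller sketch of that closure-over-mapPartitions pattern. The names (bucket, manifest_df) are hypothetical, boto3 is assumed for the S3 client, and the default s3_client argument mirrors the original's test-stub injection:

    import boto3

    def map_key_to_event(s3_events_data_lake):
        def _map_key_to_event(event_key_list, s3_client=None):
            # Client is created on the executor, per partition, unless a stub is injected
            s3_client = s3_client or boto3.client("s3")
            for event_key in event_key_list:
                obj = s3_client.get_object(Bucket=s3_events_data_lake, Key=event_key)
                yield (event_key, obj["Body"].read())
        return _map_key_to_event

    # manifest_df is a hypothetical one-column DataFrame of S3 keys
    events_rdd = (manifest_df.rdd
                  .map(lambda row: row[0])
                  .mapPartitions(map_key_to_event("example-data-lake-bucket")))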

Re: Question on partitionColumn for a JDBC read using a timestamp from MySql

2017-09-19 Thread lucas.g...@gmail.com
01' and '2017-11-01 00:00:01'"] Gets quite nice performance (better than I expected). Thanks! On 18 September 2017 at 13:21, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: > I'm pretty sure you can use a timestamp as a partitionColumn, It's > Timestamp type in MySQL. I

Question on partitionColumn for a JDBC read using a timestamp from MySql

2017-09-18 Thread lucas.g...@gmail.com
I'm pretty sure you can use a timestamp as a partitionColumn; it's a Timestamp type in MySQL. It's at base a numeric type, and Spark requires a numeric type passed in. This doesn't work, though, because the generated WHERE clauses come out as raw numerics, which won't compare correctly against the MySQL Timestamp.
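
The approach that ended up performing well (see the follow-up above) is the predicates argument: hand Spark explicit, non-overlapping WHERE clauses, one partition per clause, instead of a numeric partitionColumn. A sketch with hypothetical connection details:

    # spark is an existing SparkSession
    predicates = [
        "created_at >= '2017-09-01 00:00:01' AND created_at < '2017-10-01 00:00:01'",
        "created_at >= '2017-10-01 00:00:01' AND created_at < '2017-11-01 00:00:01'",
    ]

    df = spark.read.jdbc(
        url="jdbc:mysql://example-host:3306/db",
        table="schema.table",
        predicates=predicates,                      # one Spark partition per predicate
        properties={"user": "spark", "password": "secret"})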

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread lucas.g...@gmail.com
We use S3, there are caveats and issues with that but it can be made to work. If interested let me know and I'll show you our workarounds. I wouldn't do it naively though, there's lots of potential problems. If you already have HDFS use that, otherwise all things told it's probably less effort

Re: Spark based Data Warehouse

2017-11-17 Thread lucas.g...@gmail.com
On 17 November 2017 at 22:20, ashish rawat <dceash...@gmail.com> wrote: > Thanks everyone for their suggestions. Does any of you take care of auto > scale up and down of your underlying spark clusters on AWS? > > On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" <lucas.

Re: Spark based Data Warehouse

2017-11-13 Thread lucas.g...@gmail.com
Hi Ashish, bear in mind that EMR has some additional tooling available that smoothes out some S3 problems that you may / almost certainly will encounter. We are using Spark / S3 not on EMR and have encountered issues with file consistency, you can deal with it but be aware it's additional

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread lucas.g...@gmail.com
That sounds like a lot of work, and if I understand you correctly it sounds like a piece of middleware that already exists (I could be wrong). Alluxio? Good luck and let us know how it goes! Gary On 20 November 2017 at 14:10, Jim Carroll wrote: > Thanks. In the meantime

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread lucas.g...@gmail.com
You can expect to see some fixes for this sort of issue in the medium-term future (multiple months, probably not years). As Taylor notes, it's a Hadoop problem, not a Spark problem, so whichever version of Hadoop includes the fix will then have to wait for a Spark release to be built against it. Last
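
For later readers: the Hadoop-side work referred to became the S3A committers (Hadoop 3.1+). A configuration sketch, assuming a Spark build that includes the spark-hadoop-cloud module; the exact keys should be checked against the cloud-integration docs for your versions:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3a-committer-sketch")
             .config("spark.hadoop.fs.s3a.committer.name", "directory")
             .config("spark.sql.sources.commitProtocolClass",
                     "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
             .config("spark.sql.parquet.output.committer.class",
                     "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
             .getOrCreate())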

Re: What do you pay attention to when validating Spark jobs?

2017-11-21 Thread lucas.g...@gmail.com
I don't think these will blow anyone's mind, but: 1) Row counts. Most of our jobs 'recompute the world' nightly so we can expect to see fairly predictable row variances. 2) Rolling snapshots. We can also expect that for some critical datasets we can compute a rolling average for important
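
A minimal sketch of the row-count check in (1); the paths and the 10% threshold are illustrative assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    new_count = spark.read.parquet("s3a://example-bucket/users/latest/").count()
    prev_count = spark.read.parquet("s3a://example-bucket/users/previous/").count()

    # Jobs that 'recompute the world' nightly should vary within a predictable band
    change = abs(new_count - prev_count) / float(max(prev_count, 1))
    if change > 0.10:
        raise ValueError("Row count moved {:.1%} vs previous run ({} -> {})"
                         .format(change, prev_count, new_count))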

Re: spark session jdbc performance

2017-10-25 Thread lucas.g...@gmail.com
Gourav, I'm assuming you misread the code. It's 30 partitions, which isn't a ridiculous value. Maybe you misread the upperBound for the partitions? (That would be ridiculous) Why not use the PK as the partition column? Obviously it depends on the downstream queries. If you're going to be
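
For context, this is the shape of the partitioned JDBC read under discussion (values are hypothetical); the partition column just needs to be numeric and reasonably evenly distributed, which is why a numeric PK is often a sensible default:

    # spark is an existing SparkSession
    df = spark.read.jdbc(
        url="jdbc:mysql://example-host:3306/db",
        table="schema.orders",
        column="id",                                # numeric PK as the partition column
        lowerBound=1,
        upperBound=30000000,
        numPartitions=30,                           # 30 concurrent range queries
        properties={"user": "spark", "password": "secret"})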

Re: spark session jdbc performance

2017-10-25 Thread lucas.g...@gmail.com
ing things, can you please explain why the UI is showing > only one partition to run the query? > > > Regards, > Gourav Sengupta > > On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com < > lucas.g...@gmail.com> wrote: > >> Gourav, I'm assuming you misread the

Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Did you check the query plan / check the UI? That code looks same to me. Maybe you've only configured for one executor? Gary On Oct 24, 2017 2:55 PM, "Naveen Madhire" wrote: > > Hi, > > > > I am trying to fetch data from Oracle DB using a subquery and experiencing >

Re: Spark streaming for CEP

2017-10-24 Thread lucas.g...@gmail.com
This looks really interesting, thanks for linking! Gary Lucas On 24 October 2017 at 15:06, Mich Talebzadeh wrote: > Great thanks Steve > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >

Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Sorry, I meant to say: "That code looks SANE to me" Assuming that you're seeing the query running partitioned as expected then you're likely configured with one executor. Very easy to check in the UI. Gary Lucas On 24 October 2017 at 16:09, lucas.g...@gmail.com <lucas.g...@gma

Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
IE: If my JDBC table has an index on it, will the optimizer consider that when pushing predicates down? I noticed in a query like this: df = spark.hiveContext.read.jdbc( url=jdbc_url, table="schema.table", column="id", lowerBound=lower_bound_id, upperBound=upper_bound_id,
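
As far as I know, Spark doesn't read the database's statistics or indexes at all: it simply turns the partition bounds and any pushed filters into ordinary WHERE clauses, and whether an index gets used is then up to the database's own planner. One way to see what actually reaches the database (a sketch reusing the variable names above, with connection_properties as a hypothetical dict):

    df = spark.read.jdbc(
        url=jdbc_url,
        table="schema.table",
        column="id",
        lowerBound=lower_bound_id,
        upperBound=upper_bound_id,
        numPartitions=10,
        properties=connection_properties)

    filtered = df.where("status = 'ACTIVE'")
    filtered.explain()   # the JDBC scan node lists PushedFilters; each task additionally
                         # appends an 'id >= ... AND id < ...' range clause for its partition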

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
> On 19 October 2017 at 23:29, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: >> If the underlying table(s) have indexes on them. Does spark us

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
> On 19 October 2017 at 23:10, lucas.g...@gmai

Re: Suggestions on using scala/python for Spark Streaming

2017-10-26 Thread lucas.g...@gmail.com
I don't have any specific wisdom for you on that front. But I've always been served well by the 'Try both' approach. Set up your benchmarks, configure both setups... You don't have to go the whole hog, but just enough to get a mostly realistic implementation functional. Run them both with some

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
I think we'd need to see the code that loads the df. Parallelism and partition count are related but they're not the same. I've found the documentation fuzzy on this, but it looks like default.parallelism is what Spark uses for partitioning when it has no other guidance. I'm also under the

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-20 Thread lucas.g...@gmail.com
> On 20 October 2017 at 00:04, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: >> Ok, so when S

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
Thanks Daniel! I've been wondering that for ages! IE where my JDBC sourced datasets are coming up with 200 partitions on write to S3. What do you mean by "(except for the initial read)"? Can you explain that a bit further? Gary Lucas On 26 October 2017 at 11:28, Daniel Siegmann
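
If there is a join or aggregation between the JDBC read and the write, the 200 is almost certainly spark.sql.shuffle.partitions (default 200), which governs partition counts after any shuffle. A sketch of the two usual ways to control what lands on S3 (result_df is a hypothetical shuffled result):

    # Lower the global post-shuffle partition count for the session
    spark.conf.set("spark.sql.shuffle.partitions", "32")

    # ...or fix up a specific DataFrame right before writing
    (result_df
     .coalesce(8)                                   # reduce partitions without a full shuffle
     .write.mode("overwrite")
     .parquet("s3a://example-bucket/output/"))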

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
> On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: >> Thanks Daniel! >> I've been wondering that for ages! >> IE

Re: Spark Tuning Tool

2018-01-22 Thread lucas.g...@gmail.com
I'd be very interested in anything I can send to my analysts to assist them with their troubleshooting / optimization... Of course our engineers would appreciate it as well. However we'd be way more interested if it was OSS. Thanks! Gary Lucas On 22 January 2018 at 21:16, Holden Karau

Re: spark.sql call takes far too long

2018-01-24 Thread lucas.g...@gmail.com
Hi Michael. I haven't had this particular issue previously, but I have had other performance issues. Some questions which may help: 1. Have you checked the Spark Console? 2. Have you isolated the query in question, are you sure it's actually where the slowdown occurs? 3. How much data are you

Re: Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-21 Thread lucas.g...@gmail.com
Speaking from experience, if you're already operating a Kubernetes cluster, getting a spark workload operating there is nearly an order of magnitude simpler than working with / around EMR. That's not to say EMR is excessively hard, just that Kubernetes is easier, all the steps to getting your

Re: Question on Spark-kubernetes integration

2018-03-02 Thread lucas.g...@gmail.com
Oh interesting, given that pyspark was working in spark on kub 2.2 I assumed it would be part of what got merged. Is there a roadmap in terms of when that may get merged up? Thanks! On 2 March 2018 at 21:32, Felix Cheung wrote: > That’s in the plan. We should be

Re: Schema store for Parquet

2020-03-04 Thread lucas.g...@gmail.com
Or AWS glue catalog if you're in AWS On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson wrote: > Google hive metastore. > > On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: > >> Hi all, >> >> Has anyone explored efforts to have a centralized storage of schemas of >> different parquet files? I know