As Keith said, it depends on what you want to do with your data.
From a pipelining perspective the general flow (YMMV) is:
Load dataset(s) -> Transform and/or Join -> Aggregate -> Write dataset
Each step in the pipeline does something distinct with the data.
The end step is usually
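For illustration, the flow above might look like this as a toy sketch in plain Python. All the names and data here are hypothetical stand-ins, not Spark API:

```python
# Toy stand-ins for the Load -> Transform -> Aggregate -> Write steps.
def load():
    return [{"region": "us", "amount": 3}, {"region": "us", "amount": 4}]

def transform(rows):
    # Drop bad records; in Spark this would be a filter/select/join.
    return [r for r in rows if r["amount"] > 0]

def aggregate(rows):
    # Sum amounts per region; in Spark, a groupBy().agg().
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

result = aggregate(transform(load()))
```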
"Building data products is a very different discipline from that of
building software."
That is a fundamentally incorrect assumption.
There will always be a need for figuring out how to apply said principles,
but saying 'we're different' has always turned out to be incorrect and I
have seen no
Interesting, does anyone know if we'll be seeing the JDBC sinks in upcoming
releases?
Thanks!
Gary Lucas
On 9 April 2017 at 13:52, Silvio Fiorito
wrote:
> JDBC sink is not in 2.1. You can see here for an example implementation
> using the ForEachWriter sink
From our perspective, we have invested heavily in Kubernetes as our cluster
manager of choice.
We also make quite heavy use of Spark. We've been experimenting quite
heavily with these builds (2.1 with pyspark enabled). Given that we've
already 'paid the price' to operate Kubernetes in
That's great!
On 12 July 2017 at 12:41, Felix Cheung wrote:
> Awesome! Congrats!!
>
> --
> *From:* holden.ka...@gmail.com on behalf of
> Holden Karau
> *Sent:* Wednesday, July 12, 2017
I've been wondering about this for a while.
We wanted to do something similar for generically saving thousands of
individual homogeneous events into well-formed Parquet.
Ultimately I couldn't find something I wanted to own and pushed back on the
requirements.
It seems the canonical answer is that
i <ferra...@gmail.com>
> wrote:
>
>> What's against:
>>
>> df.rdd.map(...)
>>
>> or
>>
>> dataset.foreach()
>>
>> https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit
Hi all: whoever (Sam, I think) was going to do some work on a template
testing pipeline, I'd love to be involved. I have a current task in my day
job (data engineer) to flesh out our testing how-to / best practices for
Spark jobs, and I think I'll be doing something very similar for the next
>
> As promised here is the first blog post in a series of posts I hope to write
> on how we build data pipelines
>
> Please feel free to retweet my original tweet and share because the more
> ideas we have the better!
>
> Feedback is always welcome!
>
> Regards
>
> Thank you very much for the feedback and I will be sure to add it once I
> have more feedback
>
>
> Maybe we can create a gist of all this or even a tiny book on best
> practices if people find it useful
>
> Looking forward to the PR!
>
> Regards
> Sam
>
Hi Jeff, that looks sane to me. Do you have additional details?
On 1 August 2017 at 11:05, jeff saremi wrote:
> Calling cache/persist fails all our jobs (i have posted 2 threads on
> this).
>
> And we're giving up hope in finding a solution.
> So I'd like to find a
Hi users, we have a bunch of pyspark jobs that use S3 for loading /
intermediate steps and final output of parquet files.
We're running into the following issues on a semi-regular basis:
* These are intermittent errors, i.e. we have about 300 jobs that run
nightly... And a fairly random but
https://issues.apache.org/jira/browse/HADOOP-9565 looks relevant too.
On 10 May 2017 at 22:24, Miguel Morales <therevolti...@gmail.com> wrote:
> Try using the DirectParquetOutputCommiter:
> http://dev.sortable.com/spark-directparquetoutputcommitter/
>
> On Wed, May 10, 2017 at 10:07 PM, lu
lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
> Looks like this isn't viable in spark 2.0.0 (and greater I presume). I'm
> pretty sure I came across this blog and ignored it due to that.
>
> Any other thoughts? The linked tickets in: https://issues.apache.org/
> jira/bro
>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>
> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database
I'm a bit confused by that answer; I'm assuming it's Spark deciding which
library to use.
On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote:
> This looks more like a matter for Databricks support than spark-user.
>
> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gma
>
>
> Thank you,
> *Pushkar Gujar*
>
>
> On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Looks to me like it is a conflict between a Databricks library and Spark
>> 2.1. That's an issue for Databricks to resolve or pr
Any chance of a link to that video :)
Thanks!
G
On 10 May 2017 at 09:49, Holden Karau wrote:
> So this Python side pipelining happens in a lot of places which can make
> debugging extra challenging. Some people work around this with persist
> which breaks the pipelining
Gary
On 17 May 2017 at 03:19, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote:
>
> Steve, thanks for the reply. Digging through all the documentation now.
>
> Much appreciated!
>
>
>
> FWIW, if you
Steve, thanks for the reply. Digging through all the documentation now.
Much appreciated!
On 16 May 2017 at 10:10, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote:
>
> Hi users, we have a bunch of pyspark jobs
AFAIK the process a Spark program follows is:
1. A set of transformations is defined on a given input dataset.
2. At some point an action is called.
   - In your case this is writing to your parquet file.
3. When that happens Spark creates a logical plan and then a physical
plan
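As a rough plain-Python analogy (not Spark code), generators behave similarly: the transformation is recorded but nothing runs until something consumes the result:

```python
log = []

def doubled(rows):
    # Generator body does not run until the output is consumed,
    # much like Spark transformations before an action is called.
    for r in rows:
        log.append(r)
        yield r * 2

pipeline = doubled([1, 2, 3])  # "transformation" defined; nothing executed yet
lazy_check = list(log)         # still empty at this point
result = list(pipeline)        # the "action": forces execution
```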
+1 to Ayan's answer; I think this is a common distributed anti-pattern that
trips us all up at some point or another.
You definitely want to (in most cases) yield and create a new
RDD/DataFrame/Dataset and then perform your save operation on that.
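A minimal sketch of that shape, in plain Python with hypothetical names. In Spark such a function would typically go through rdd.mapPartitions, with the save happening once on the result rather than per record:

```python
def enrich_partition(rows):
    # Yield transformed rows instead of performing side-effecting saves
    # inside the loop; one save happens later on the new dataset.
    for row in rows:
        yield {**row, "processed": True}

# In Spark (hypothetical): out = df.rdd.mapPartitions(enrich_partition),
# followed by a single write on `out`. Locally we can exercise the function:
out = list(enrich_partition([{"id": 1}, {"id": 2}]))
```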
On 28 May 2017 at 21:09, ayan guha
I'm wondering why you need order preserved. We've had situations where
keeping the source as an artificial field in the dataset was important, and
I had to run contortions to inject that (in this case the data source had no
unique key).
Is this similar?
On 13 September 2017 at 10:46, Suzen, Mehmet
I'm not sure what you're doing. But I have in the past used spark to
consume a manifest file and then execute a .mapPartition on the result like
this:
def map_key_to_event(s3_events_data_lake):
    def _map_key_to_event(event_key_list, s3_client=test_stub):
        print("Events in list")
01' and '2017-11-01 00:00:01'"]
Gets quite nice performance (better than I expected).
Thanks!
On 18 September 2017 at 13:21, lucas.g...@gmail.com <lucas.g...@gmail.com>
wrote:
> I'm pretty sure you can use a timestamp as a partitionColumn, It's
> Timestamp type in MySQL. I
I'm pretty sure you can use a timestamp as a partitionColumn; it's a
Timestamp type in MySQL. It's at base a numeric type, and Spark requires a
numeric type to be passed in.
This doesn't work, though: the WHERE clauses generated for MySQL contain
raw numerics, which won't query against the MySQL Timestamp column.
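For context, my understanding (an assumption on my part, not a quote of Spark's code) is that Spark splits the numeric bounds into evenly strided per-partition ranges, roughly like the sketch below; one workaround is to convert the Timestamp bounds to unix epoch seconds first:

```python
def jdbc_partition_bounds(lower, upper, num_partitions):
    # Rough sketch of how a numeric partitionColumn is chunked into
    # per-partition ranges (assumed behavior, not Spark's exact code).
    stride = (upper - lower) // num_partitions
    bounds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = upper if i == num_partitions - 1 else lo + stride
        bounds.append((lo, hi))
    return bounds

# e.g. epoch-second bounds for one day, split four ways
ranges = jdbc_partition_bounds(1509494400, 1509580800, 4)
```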
We use S3; there are caveats and issues with that, but it can be made to
work.
If you're interested, let me know and I'll show you our workarounds. I
wouldn't do it naively though; there are lots of potential problems. If you
already have HDFS, use that; otherwise, all things told, it's probably less effort
On 17 November 2017 at 22:20, ashish rawat <dceash...@gmail.com> wrote:
> Thanks everyone for their suggestions. Does any of you take care of auto
> scale up and down of your underlying spark clusters on AWS?
>
> On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" <lucas.
Hi Ashish, bear in mind that EMR has some additional tooling available that
smooths out some S3 problems that you may / almost certainly will
encounter.
We are using Spark / S3 off EMR and have encountered issues with file
consistency; you can deal with it, but be aware it's additional
That sounds like a lot of work, and if I understand you correctly, it sounds
like a piece of middleware that already exists (I could be wrong). Alluxio?
Good luck and let us know how it goes!
Gary
On 20 November 2017 at 14:10, Jim Carroll wrote:
> Thanks. In the meantime
You can expect to see some fixes for this sort of issue in the medium-term
future (multiple months, probably not years).
As Taylor notes, it's a Hadoop problem, not a Spark problem, so whichever
version of Hadoop includes the fix will then wait for a Spark release to
get built against it. Last
I don't think these will blow anyone's minds, but:
1) Row counts. Most of our jobs 'recompute the world' nightly, so we can
expect to see fairly predictable row variances.
2) Rolling snapshots. We can also expect that for some critical datasets
we can compute a rolling average for important
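A toy sketch of the row-count check; the numbers and the 10% tolerance are arbitrary illustrations, not our actual thresholds:

```python
def row_count_ok(today_count, rolling_avg, tolerance=0.10):
    # Flag datasets whose nightly row count drifts more than `tolerance`
    # from the rolling average (10% here, an arbitrary choice).
    if rolling_avg <= 0:
        return False
    return abs(today_count - rolling_avg) / rolling_avg <= tolerance

ok = row_count_ok(10_500, 10_000)   # within 10% of the rolling average
bad = row_count_ok(12_000, 10_000)  # a 20% jump gets flagged
```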
Gourav, I'm assuming you misread the code. It's 30 partitions, which isn't
a ridiculous value. Maybe you misread the upperBound for the partitions?
(That would be ridiculous)
Why not use the PK as the partition column? Obviously it depends on the
downstream queries. If you're going to be
ing things, can you please explain why the UI is showing
> only one partition to run the query?
>
>
> Regards,
> Gourav Sengupta
>
> On Wed, Oct 25, 2017 at 6:03 PM, lucas.g...@gmail.com <
> lucas.g...@gmail.com> wrote:
>
>> Gourav, I'm assuming you misread the
Did you check the query plan / check the UI?
That code looks same to me. Maybe you've only configured for one executor?
Gary
On Oct 24, 2017 2:55 PM, "Naveen Madhire" wrote:
>
> Hi,
>
>
>
> I am trying to fetch data from Oracle DB using a subquery and experiencing
>
This looks really interesting, thanks for linking!
Gary Lucas
On 24 October 2017 at 15:06, Mich Talebzadeh
wrote:
> Great thanks Steve
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
Sorry, I meant to say: "That code looks SANE to me"
If you're seeing the query run partitioned as expected, then you're likely
configured with one executor. Very easy to check in the UI.
Gary Lucas
On 24 October 2017 at 16:09, lucas.g...@gmail.com <lucas.g...@gma
I.e.: if my JDBC table has an index on it, will the optimizer consider that
when pushing predicates down?
I noticed in a query like this:
df = spark.hiveContext.read.jdbc(
    url=jdbc_url,
    table="schema.table",
    column="id",
    lowerBound=lower_bound_id,
    upperBound=upper_bound_id,
will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 19 October 2017 at 23:29, lucas.g...@gmail.com <lucas.g...@gmail.com>
> wrote:
>
>> If the underlying table(s) have indexes on them. Does spark us
>
>
>
> On 19 October 2017 at 23:10, lucas.g...@gmai
I don't have any specific wisdom for you on that front, but I've always
been served well by the 'try both' approach.
Set up your benchmarks and configure both setups... You don't have to go the
whole hog, just enough to get a mostly realistic implementation
functional. Run them both with some
I think we'd need to see the code that loads the df.
Parallelism and partition count are related, but they're not the same. I've
found the documentation fuzzy on this, but it looks like
default.parallelism is what Spark uses for partitioning when it has no
other guidance. I'm also under the
>
>
>
> On 20 October 2017 at 00:04, lucas.g...@gmail.com <lucas.g...@gmail.com>
> wrote:
>
>> Ok, so when S
Thanks Daniel!
I've been wondering that for ages!
I.e. where my JDBC-sourced datasets are coming up with 200 partitions on
write to S3.
What do you mean by "(except for the initial read)"?
Can you explain that a bit further?
Gary Lucas
On 26 October 2017 at 11:28, Daniel Siegmann
>
>
> On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <
> lucas.g...@gmail.com> wrote:
>
>> Thanks Daniel!
>>
>> I've been wondering that for ages!
>>
>> IE
I'd be very interested in anything I can send to my analysts to assist them
with their troubleshooting / optimization... Of course our engineers would
appreciate it as well.
However we'd be way more interested if it was OSS.
Thanks!
Gary Lucas
On 22 January 2018 at 21:16, Holden Karau
Hi Michael.
I haven't had this particular issue previously, but I have had other
performance issues.
Some questions which may help:
1. Have you checked the Spark Console?
2. Have you isolated the query in question, are you sure it's actually
where the slowdown occurs?
3. How much data are you
Speaking from experience: if you're already operating a Kubernetes
cluster, getting a Spark workload operating there is nearly an order of
magnitude simpler than working with / around EMR.
That's not to say EMR is excessively hard, just that Kubernetes is easier;
all the steps to getting your
Oh interesting; given that pyspark was working in Spark-on-Kubernetes 2.2, I
assumed it would be part of what got merged.
Is there a roadmap in terms of when that may get merged up?
Thanks!
On 2 March 2018 at 21:32, Felix Cheung wrote:
> That’s in the plan. We should be
Or the AWS Glue catalog if you're in AWS.
On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson wrote:
> Google hive metastore.
>
> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote:
>
>> Hi all,
>>
>> Has anyone explored efforts to have a centralized storage of schemas of
>> different parquet files? I know