Re: High level explanation of dropDuplicates

2020-01-11 Thread Miguel Morales
I would just map to a pair keyed by the id. Then do a reduceByKey where you compare the scores and keep the highest. Then do .values and that should do it. Sent from my iPhone > On Jan 11, 2020, at 11:14 AM, Rishi Shah wrote: > >  > Thanks everyone for your contribution on this topic, I wanted to
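A minimal Scala sketch of that approach; the Record fields, sample data, and app name are illustrative, not from the thread:

    import org.apache.spark.sql.SparkSession

    case class Record(id: String, score: Double)

    val spark = SparkSession.builder.appName("dedup-sketch").master("local[*]").getOrCreate()
    val records = spark.sparkContext.parallelize(Seq(
      Record("a", 1.0), Record("a", 3.0), Record("b", 2.0)
    ))

    val deduped = records
      .map(r => (r.id, r))                                      // pair keyed by id
      .reduceByKey((x, y) => if (x.score >= y.score) x else y)  // keep the row with the highest score
      .values                                                   // drop the key

    deduped.collect().foreach(println)  // Record(a,3.0), Record(b,2.0)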

Re: HDFS or NFS as a cache?

2017-10-02 Thread Miguel Morales
See: https://github.com/rdblue/s3committer and https://www.youtube.com/watch?v=8F2Jqw5_OnI&feature=youtu.be On Mon, Oct 2, 2017 at 11:31 AM, Marcelo Vanzin wrote: > You don't need to collect data in the driver to save it. The code in > the original question doesn't use "collect()", so it's actu
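For reference, a minimal sketch of writing straight from the executors without a collect(); the bucket path is hypothetical and assumes the s3a connector is configured:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("s3-write-sketch").getOrCreate()
    val df = spark.range(1000).toDF("id")   // stand-in for the job's output

    // no collect(): each executor writes its own partition files directly
    df.write.mode("overwrite").parquet("s3a://my-bucket/tmp/ids/")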

Re: Spark <--> S3 flakiness

2017-05-13 Thread Miguel Morales
> > You mentioned that it required a lot of effort to get working. May I ask > what you ran into, and how you got it to work? > > Thanks, > Gene > > On Thu, May 11, 2017 at 11:55 AM, Miguel Morales > wrote: >> >> Might want to try to use gzip as opposed to parque

Re: Spark <--> S3 flakiness

2017-05-11 Thread Miguel Morales
he.org/jira/browse/SPARK-10063 >> https://issues.apache.org/jira/browse/HADOOP-13786 >> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too. >> >> On 10 May 2017 at 22:24, Miguel Morales wrote: >>> >>> Try using the DirectParquetOutputCommitter: &g

Re: Spark <--> S3 flakiness

2017-05-10 Thread Miguel Morales
Try using the DirectParquetOutputCommitter: http://dev.sortable.com/spark-directparquetoutputcommitter/ On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com wrote: > Hi users, we have a bunch of pyspark jobs that are using S3 for loading / > intermediate steps and final output of parquet files.
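A hedged sketch of how that committer was typically enabled on Spark 1.6.x; the class was removed in Spark 2.0 (SPARK-10063, linked elsewhere in this thread), and the config key below is the one the linked post describes, so treat it as an assumption to verify:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("direct-committer-sketch"))
    val sqlContext = new SQLContext(sc)

    // skip the rename-based commit that is slow and flaky on S3 (Spark 1.6.x only)
    sqlContext.setConf("spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")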

Re: Etl with spark

2017-02-12 Thread Miguel Morales
You can parallelize the collection of s3 keys and then pass that to your map function so that files are read in parallel. Sent from my iPhone > On Feb 12, 2017, at 9:41 AM, Sam Elamin wrote: > > thanks Ayan but i was hoping to remove the dependency on a file and just use > in memory list or d
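A hedged sketch of that pattern, assuming the AWS Java SDK (v1) is on the executors' classpath; the bucket and key names are made up:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import org.apache.spark.sql.SparkSession
    import scala.io.Source

    val spark = SparkSession.builder.appName("s3-keys-sketch").getOrCreate()
    val bucket = "my-bucket"                   // hypothetical bucket
    val keys = Seq("in/a.json", "in/b.json")   // hypothetical object keys

    val contents = spark.sparkContext
      .parallelize(keys, keys.size)            // one key per partition
      .map { key =>
        val s3 = AmazonS3ClientBuilder.defaultClient()   // client built on the executor
        val obj = s3.getObject(bucket, key)
        try Source.fromInputStream(obj.getObjectContent).mkString
        finally obj.close()
      }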

Re: TDD in Spark

2017-01-15 Thread Miguel Morales
I've also written a small blog post that may help you out: https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941#.ia6stbl6n On Sun, Jan 15, 2017 at 12:13 PM, Silvio Fiorito wrote: > You should check out Holden’s excellent spark-testing-base package: > https://githu
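For context, a minimal ScalaTest sketch on top of spark-testing-base's SharedSparkContext; the ScalaTest 3.x import style and the test data are assumptions:

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.funsuite.AnyFunSuite

    class WordCountSpec extends AnyFunSuite with SharedSparkContext {
      test("counts words") {
        // sc is provided (and torn down) by SharedSparkContext
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") == 2)
      }
    }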

Re: Error when loading json to spark

2016-12-31 Thread Miguel Morales
Looks like it's trying to treat that path as a folder; try omitting the file name and using just the folder path. On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie wrote: > Happy new year!!! > > I am trying to load a json file into spark, the json file is attached here. > > I received the following erro
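In other words, something like the following; the directory path is hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("json-sketch").master("local[*]").getOrCreate()

    // point spark.read.json at the directory holding the file, not at the file itself
    val df = spark.read.json("/data/json_input/")
    df.printSchema()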

Re: Dependency Injection and Microservice development with Spark

2016-12-28 Thread Miguel Morales
Hi, Not sure about Spring Boot, but trying to use DI libraries you'll run into serialization issues. I've had luck using an old version of Scaldi. Recently though I've been passing the class types as arguments with default values. Then in the Spark code it gets instantiated. So you're basic
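A hedged sketch of that pattern as I read it; the trait, names, and factory argument are illustrative:

    trait Scorer extends Serializable { def score(s: String): Int }
    class LengthScorer extends Scorer { def score(s: String): Int = s.length }

    // the dependency is passed as an argument with a default value, no DI container involved
    class ScoringJob(makeScorer: () => Scorer = () => new LengthScorer) extends Serializable {
      def run(rdd: org.apache.spark.rdd.RDD[String]): org.apache.spark.rdd.RDD[Int] =
        rdd.mapPartitions { it =>
          val scorer = makeScorer()   // instantiated inside the Spark code, once per partition
          it.map(scorer.score)
        }
    }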

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
> > But unfortunately that's not possible. All containers are connected to > an overlay network. > > Is there any other possibility to tell Spark that it is on the same *NODE* > as an hdfs data node? > > > On 28.12.2016 12:00, Miguel Morales wrote: >> It m

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
It might have to do with your container IPs; it depends on your networking setup. You might want to try host networking so that the containers share the IP with the host. On Wed, Dec 28, 2016 at 1:46 AM, Karamba wrote: > > Hi Sun Rui, > > thanks for answering! > > >> Although the Spark task sche

Re: unit testing in spark

2016-12-08 Thread Miguel Morales
make > sense for those of us that all care about testing to try and do a hangout at > some point so that we can exchange ideas? > >> On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales >> wrote: >> I would be interested in contributing. I've created my own library for

Re: unit testing in spark

2016-12-08 Thread Miguel Morales
I would be interested in contributing. I've created my own library for this as well. In my blog post I talk about testing with Spark in RSpec style: https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941 Sent from my iPhone > On Dec 8, 2016, at 4:09 PM, Holden Ka

Re: Spark app write too many small parquet files

2016-12-08 Thread Miguel Morales
Try coalescing with a value of 2 or so. You could dynamically calculate how many partitions to have to obtain an optimal file size. Sent from my iPhone > On Dec 8, 2016, at 1:03 PM, Kevin Tran wrote: > > How many partitions should it be when streaming? - As in streaming process the > data wi
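A minimal sketch of the dynamic sizing idea; the target file size, the size estimate, and the output path are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()
    val df = spark.range(1000000L).toDF("id")       // stand-in for the batch being written

    val targetFileBytes = 128L * 1024 * 1024        // aim for ~128 MB parquet files
    val estimatedBytes  = 512L * 1024 * 1024        // rough estimate of the batch size
    val numFiles = math.max(1, (estimatedBytes / targetFileBytes).toInt)

    df.coalesce(numFiles)
      .write.mode("append")
      .parquet("s3a://my-bucket/output/")           // hypothetical output path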

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-05 Thread Miguel Morales
One thing I've done before is to install Datadog's statsd agent on the nodes. Then you can emit metrics and stats to it and build dashboards on Datadog. Sent from my iPhone > On Dec 5, 2016, at 8:17 PM, Chawla,Sumit wrote: > > Hi Manish > > I am specifically looking for something similar to f
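A hedged sketch of emitting a counter from the executors, assuming DataDog's java-dogstatsd-client is on the classpath and a dogstatsd agent listens on localhost:8125 on every node; the metric name and data are made up:

    import com.timgroup.statsd.NonBlockingStatsDClient
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("statsd-sketch").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000)   // stand-in for the job's data

    rdd.foreachPartition { partition =>
      // build the client on the executor so it reports to that node's local agent
      val statsd = new NonBlockingStatsDClient("myjob", "localhost", 8125)
      partition.foreach(_ => statsd.increment("records.processed"))
      statsd.stop()
    }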

Re: Spark Standalone Cluster - Running applications in JSON format

2016-11-30 Thread Miguel Morales
history server indicates there was a > problem. > > I will keep digging around. Thanks for your help so far Miguel. > > On 1/12/2016 3:33 PM, Miguel Morales wrote: > > Try hitting: http://<host>:18080/api/v1 > > Then hit /applications. > > That should give you a list of run
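A minimal sketch of querying that endpoint from Scala; the host is a placeholder and 18080 is the default history-server port:

    import scala.io.Source

    // returns the list of applications known to the history server as JSON
    val json = Source.fromURL("http://localhost:18080/api/v1/applications").mkString
    println(json)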

Re: Spark Standalone Cluster - Running applications in JSON format

2016-11-30 Thread Miguel Morales
I don't have a running driver Spark instance since I am submitting jobs to > Spark using the SparkLauncher class. Or maybe I am missing something obvious. > Apologies if so. > > > > > On 1/12/2016 3:21 PM, Miguel Morales wrote: > > Check the Monitoring and Instr

Re: Spark Standalone Cluster - Running applications in JSON format

2016-11-30 Thread Miguel Morales
Check the Monitoring and Instrumentation API: http://spark.apache.org/docs/latest/monitoring.html On Wed, Nov 30, 2016 at 9:20 PM, Carl Ballantyne wrote: > Hi All, > > I want to get the running applications for my Spark Standalone cluster in > JSON format. The same information displayed on the w

Re: updateStateByKey -- when the key is multi-column (like a composite key )

2016-11-30 Thread Miguel Morales
I *think* you can return a map to updateStateByKey which would include your fields. Another approach would be to create a hash (e.g. a JSON string of the composite fields) and return that. On Wed, Nov 30, 2016 at 12:30 PM, shyla deshpande wrote: > updateStateByKey - Can this be used when the key is
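A hedged sketch using a plain tuple as the composite key; the field names and the input DStream are illustrative, and stateful operations also require checkpointing to be enabled on the StreamingContext:

    import org.apache.spark.streaming.dstream.DStream

    // keep a running count per composite (userId, deviceId) key
    def updateFn(newValues: Seq[Int], state: Option[Int]): Option[Int] =
      Some(newValues.sum + state.getOrElse(0))

    def totals(events: DStream[((String, String), Int)]): DStream[((String, String), Int)] =
      events.updateStateByKey(updateFn)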