I would just map to pairs keyed by the id. Then do a reduceByKey where you
compare the scores and keep the highest. Then .values should do it.
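A rough sketch (untested), assuming an RDD of records with id and score
fields; paste into spark-shell:

    // hypothetical Record type; keep the highest-scoring record per id
    case class Record(id: String, score: Double)

    val records = sc.parallelize(Seq(
      Record("a", 1.0), Record("a", 3.0), Record("b", 2.0)))

    val best = records
      .map(r => (r.id, r))                                      // pair keyed by id
      .reduceByKey((a, b) => if (a.score >= b.score) a else b)  // keep the higher score
      .values                                                   // drop the keys again

    best.collect().foreach(println)  // Record(a,3.0), Record(b,2.0)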
Sent from my iPhone
> On Jan 11, 2020, at 11:14 AM, Rishi Shah wrote:
>
>
> Thanks everyone for your contribution on this topic, I wanted to
See: https://github.com/rdblue/s3committer and
https://www.youtube.com/watch?v=8F2Jqw5_OnI&feature=youtu.be
On Mon, Oct 2, 2017 at 11:31 AM, Marcelo Vanzin wrote:
> You don't need to collect data in the driver to save it. The code in
> the original question doesn't use "collect()", so it's actu
>
> You mentioned that it required a lot of effort to get working. May I ask
> what you ran into, and how you got it to work?
>
> Thanks,
> Gene
>
> On Thu, May 11, 2017 at 11:55 AM, Miguel Morales
> wrote:
>>
>> Might want to try to use gzip as opposed to parquet.
>> https://issues.apache.org/jira/browse/SPARK-10063
>> https://issues.apache.org/jira/browse/HADOOP-13786
>> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
>>
>> On 10 May 2017 at 22:24, Miguel Morales wrote:
>>>
>>> Try using the DirectParquetOutputCommitter:
Try using the DirectParquetOutputCommitter:
http://dev.sortable.com/spark-directparquetoutputcommitter/
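If memory serves, on Spark 1.x you enabled it with a config along these lines
(the class was removed in Spark 2.0, see SPARK-10063, so treat the exact
package name as a guess):

    // Spark 1.x only; the committer was removed in 2.0 (SPARK-10063).
    // The package name is from memory and may differ across 1.x versions.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")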
On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
wrote:
> Hi users, we have a bunch of pyspark jobs that are using S3 for loading /
> intermediate steps and final output of parquet files.
You can parallelize the collection of S3 keys and then pass that to your map
function so that the files are read in parallel.
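Something like this (untested sketch; the bucket, the key list, and the use
of the AWS SDK v1 client are all assumptions):

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.io.Source

    // placeholders: build the key list however you like (e.g. listObjects)
    val bucket = "my-bucket"
    val keys = Seq("logs/part-0.json", "logs/part-1.json")

    val lines = sc.parallelize(keys, keys.size).mapPartitions { part =>
      // one client per partition, created on the executor
      // (S3 clients aren't serializable, so don't build it on the driver)
      val s3 = AmazonS3ClientBuilder.defaultClient()
      part.flatMap { key =>
        Source.fromInputStream(s3.getObject(bucket, key).getObjectContent)
          .getLines().toList
      }
    }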
Sent from my iPhone
> On Feb 12, 2017, at 9:41 AM, Sam Elamin wrote:
>
> thanks Ayan but I was hoping to remove the dependency on a file and just use
> an in-memory list or d
I've also written a small blog post that may help you out:
https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941#.ia6stbl6n
On Sun, Jan 15, 2017 at 12:13 PM, Silvio Fiorito
wrote:
> You should check out Holden’s excellent spark-testing-base package:
> https://github.com/holdenk/spark-testing-base
Looks like it's trying to treat that path as a folder; try omitting
the file name and just using the folder path.
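e.g. (path made up, assuming Spark 2.x):

    // point spark at the directory rather than a file inside it
    val df = spark.read.json("hdfs:///data/events/")  // not .../events/events.json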
On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie wrote:
> Happy new year!!!
>
> I am trying to load a json file into spark, the json file is attached here.
>
> I received the following erro
Hi
Not sure about Spring Boot, but with DI libraries you'll run into
serialization issues. I've had luck using an old version of Scaldi.
Recently, though, I've been passing the class types as arguments with default
values; then the Spark code instantiates them. So you're basically building
the dependency inside the job instead of injecting it.
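Roughly what I mean (hedged sketch; Writer, ConsoleWriter, and runJob are
made-up names):

    import org.apache.spark.rdd.RDD

    trait Writer extends Serializable { def write(s: String): Unit }
    class ConsoleWriter extends Writer { def write(s: String): Unit = println(s) }

    // pass the implementation class with a default instead of injecting an instance;
    // java.lang.Class is Serializable, so the closure ships fine
    def runJob(data: RDD[String],
               writerClass: Class[_ <: Writer] = classOf[ConsoleWriter]): Unit = {
      data.foreachPartition { part =>
        val writer = writerClass.getDeclaredConstructor().newInstance() // built on the executor
        part.foreach(writer.write)
      }
    }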
>
> But unfortunately that's not possible. All containers are connected to
> an overlay network.
>
> Is there any other possibility to tell Spark that it is on the same *NODE*
> as an HDFS data node?
>
>
> On 28.12.2016 12:00, Miguel Morales wrote:
>> It m
It might have to do with your container IPs; it depends on your
networking setup. You might want to try host networking so that the
containers share the IP with the host.
On Wed, Dec 28, 2016 at 1:46 AM, Karamba wrote:
>
> Hi Sun Rui,
>
> thanks for answering!
>
>
>> Although the Spark task sche
make
> sense for those of us that all care about testing to try and do a hangout at
> some point so that we can exchange ideas?
>
>> On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales
>> wrote:
>> I would be interested in contributing. I've created my own library for
I would be interested in contributing. I've created my own library for this as
well. In my blog post I talk about testing with Spark in RSpec style:
https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941
Sent from my iPhone
> On Dec 8, 2016, at 4:09 PM, Holden Karau wrote:
Try to coalesce with a value of 2 or so. You could dynamically calculate how
many partitions you need to hit an optimal file size.
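Something along these lines (sketch; estimateOutputBytes is a made-up helper
you'd have to supply, and the 128 MB target is an assumption):

    // pick the partition count from estimated size / target file size
    val targetFileBytes = 128L * 1024 * 1024  // ~128 MB per file (assumption)
    val totalBytes      = estimateOutputBytes(df)
    val numFiles        = math.max(1, math.ceil(totalBytes.toDouble / targetFileBytes).toInt)

    df.coalesce(numFiles).write.parquet("s3a://bucket/out/")  // path is a placeholder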
Sent from my iPhone
> On Dec 8, 2016, at 1:03 PM, Kevin Tran wrote:
>
> How many partition should it be when streaming? - As in streaming process the
> data wi
One thing I've done before is to install Datadog's statsd agent on the nodes.
Then you can emit metrics and stats to it and build dashboards in Datadog.
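A minimal sketch, assuming the agent listens on the default statsd UDP port
on every node (metric names are made up). statsd speaks a tiny line protocol
("name:value|type"), so no client library is strictly needed:

    import java.net.{DatagramPacket, DatagramSocket, InetAddress}

    def emitCounter(metric: String, value: Long,
                    host: String = "localhost", port: Int = 8125): Unit = {
      val payload = s"$metric:$value|c".getBytes("UTF-8")  // "|c" = statsd counter
      val socket = new DatagramSocket()
      try socket.send(new DatagramPacket(payload, payload.length,
                                         InetAddress.getByName(host), port))
      finally socket.close()
    }

    rdd.foreachPartition { part =>
      // agent runs on every node, so localhost works; note .size consumes the iterator
      emitCounter("myjob.records.processed", part.size)
    }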
Sent from my iPhone
> On Dec 5, 2016, at 8:17 PM, Chawla,Sumit wrote:
>
> Hi Manish
>
> I am specifically looking for something similar to f
history server indicates there was a
> problem.
>
> I will keep digging around. Thanks for your help so far Miguel.
>
> On 1/12/2016 3:33 PM, Miguel Morales wrote:
>
> Try hitting: http://<host>:18080/api/v1
>
> Then hit /applications.
>
> That should give you a list of run
I don't have a running Spark driver instance since I am submitting jobs to
> Spark using the SparkLauncher class. Or maybe I am missing something obvious.
> Apologies if so.
>
>
>
>
> On 1/12/2016 3:21 PM, Miguel Morales wrote:
>
> Check the Monitoring and Instr
Check the Monitoring and Instrumentation API:
http://spark.apache.org/docs/latest/monitoring.html
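For example (host and port are placeholders for wherever your history server
or master runs; see monitoring.html for the full endpoint list):

    import scala.io.Source

    val apps = Source.fromURL("http://localhost:18080/api/v1/applications").mkString
    println(apps)  // JSON array of applications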
On Wed, Nov 30, 2016 at 9:20 PM, Carl Ballantyne wrote:
> Hi All,
>
> I want to get the running applications for my Spark Standalone cluster in
> JSON format. The same information displayed on the w
I *think* you can return a map to updateStateByKey, which would include your
fields. Another approach would be to build a hash (e.g. a JSON version of the
hash) and return that.
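Rough sketch of the map-as-state idea (untested; assumes a
DStream[(String, Long)] of (key, amount) pairs):

    // requires checkpointing, i.e. ssc.checkpoint(...) must be set
    val state = events.updateStateByKey[Map[String, Long]] {
      (amounts: Seq[Long], prev: Option[Map[String, Long]]) =>
        val p = prev.getOrElse(Map("count" -> 0L, "sum" -> 0L))
        Some(Map(
          "count" -> (p("count") + amounts.size),  // several fields ride along in one Map
          "sum"   -> (p("sum") + amounts.sum)))
    }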
On Wed, Nov 30, 2016 at 12:30 PM, shyla deshpande
wrote:
> updateStateByKey - Can this be used when the key is