Hi,
I have two RDDs with CSV data as below:
RDD-1
101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
101970_17038953,546853f9-cf07-4700-b202-00f21
> do a join (or a variant like cogroup,
> leftOuterJoin, subtractByKey etc. found in PairRDDFunctions)
>
> the downside is this requires a shuffle of both your RDDs
>
> On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary
> wrote:
>
>> Hi,
>>
>> I have two RDDs
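The join suggested above could be illustrated like this. This is a plain-Python sketch of the semantics, not actual Spark code; in Spark it would simply be `rdd1.join(rdd2)` on two pair RDDs, at the cost of shuffling both RDDs by key. The sample keys and values below are hypothetical.

```python
from collections import defaultdict

def inner_join(left, right):
    """Inner join two lists of (key, value) pairs, like PairRDD join."""
    right_by_key = defaultdict(list)
    for k, v in right:
        right_by_key[k].append(v)
    # Emit one (key, (left_value, right_value)) tuple per matching pair of values
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]

left = [("101970_5854301840", "19229261643"),
        ("101970_17038953", "9229261645")]
right = [("101970_5854301840", "fbcf5485-e696-4100-9468-a17ec7c5bb43")]
print(inner_join(left, right))
# [('101970_5854301840', ('19229261643', 'fbcf5485-e696-4100-9468-a17ec7c5bb43'))]
```

Keys missing from one side simply drop out, which is why a variant like leftOuterJoin or subtractByKey is the right choice when you need to keep the unmatched rows.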
d the settings for the parameters spark.akka.frameSize (= 500),
spark.akka.timeout, spark.akka.askTimeout and
spark.core.connection.ack.wait.timeout
to get rid of any insufficient frame size and timeout errors.
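For reference, those settings could be applied like this in PySpark. This is a hedged sketch: only the frameSize value (500) comes from the message above, the timeout values here are placeholders, and the spark.akka.* keys applied to Spark 1.x (later versions removed Akka).

```python
# Requires a Spark installation; pyspark is not in the standard library.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.akka.frameSize", "500")                    # MB; value from the thread
        .set("spark.akka.timeout", "300")                      # placeholder, seconds
        .set("spark.akka.askTimeout", "300")                   # placeholder, seconds
        .set("spark.core.connection.ack.wait.timeout", "300"))  # placeholder, seconds
```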
Thanks
Himanish
On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary
wrote:
> Hi,
We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline
for orchestration of the different Spark jobs. AWS Data Pipeline provides
automatic EMR cluster provisioning, retry on failure, SNS notification etc.
out of the box and works well for us.
On Sun, Mar 1, 2015 at 7:02 PM, F
Hi,
I have a RDD of pairs of strings like below :
(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)
I need to transform/filter this into an RDD of pairs that does not repeat a
string once it has been used. So something like:
(A,B)
(C,D)
(E,F)
(B,C) is out because B has already been used in (A,B); likewise (A,D), because A has already been used.
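A sequential sketch of that filter in plain Python (not a distributed solution; as the replies below point out, the result depends on the order in which pairs are seen):

```python
def disjoint_pairs(pairs):
    """Greedily keep a pair only if neither element has appeared before."""
    used = set()
    kept = []
    for a, b in pairs:
        if a not in used and b not in used:
            kept.append((a, b))
            used.update((a, b))
    return kept

pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("E", "F"), ("B", "F")]
print(disjoint_pairs(pairs))  # [('A', 'B'), ('C', 'D'), ('E', 'F')]
```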
PM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:
> What would it do with the following dataset?
>
> (A, B)
> (A, C)
> (B, D)
>
>
> On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary
> wrote:
>
>> Hi,
>>
>> I have a RDD of pair
scalable solution.
Thanks
On Wed, Mar 25, 2015 at 3:13 PM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:
> You're generating all possible pairs?
>
> In that case, why not just generate the sequential pairs you want from the
> start?
>
> On Wed, Mar 25, 2015 a
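One way to read that suggestion, as a plain-Python sketch with a hypothetical items list: if the pairs are being generated from the elements of a single list, the disjoint pairs can be produced directly by stepping through the list two at a time, with no filtering step at all.

```python
items = ["A", "B", "C", "D", "E", "F"]

# Pair up even-indexed elements with their odd-indexed neighbours.
pairs = list(zip(items[::2], items[1::2]))
print(pairs)  # [('A', 'B'), ('C', 'D'), ('E', 'F')]
```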