Hi,
I am trying to use MultipleTextOutputFormat to rename the output files
of my Spark job to something other than the default part-N.
I have implemented a custom MultipleTextOutputFormat class as follows:
class DriveOutputRenameMultipleTextOutputFormat extends
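The class declaration above is cut off in this snippet. As context, a
minimal sketch of the usual pattern: subclass the old-API
MultipleTextOutputFormat and override generateFileNameForKeyValue. Only
the class name comes from the original mail; the body is an assumption.

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Sketch only: route each record to an output file named after its
    // key instead of the default part-NNNNN leaf name.
    class DriveOutputRenameMultipleTextOutputFormat
        extends MultipleTextOutputFormat[Any, Any] {
      // `name` is the default leaf file name (e.g. part-00000);
      // returning the key instead yields one file per key.
      override def generateFileNameForKeyValue(key: Any, value: Any,
                                               name: String): String =
        key.toString
    }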
Hi,
How can I implement a custom MultipleOutputFormat and specify it as the
output format of my Spark job, so that I can ensure there is a unique
output file per key (instead of a unique output file per reducer)?
Thanks
Arpan
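A hedged usage sketch for the question above (bucket path and sample
data are placeholders): pass the custom format to saveAsHadoopFile on a
pair RDD. Because the sketch above names files after the key, this
produces one output file per key rather than one per reducer.

    // In spark-shell, using its built-in sc. The output path is
    // hypothetical. TextOutputFormat writes keys and values via
    // toString, so plain String classes are enough here.
    val pairs = sc.parallelize(Seq(("17634", "row-a"), ("8407", "row-b")))
    pairs.saveAsHadoopFile(
      "s3n://my-bucket/output",
      classOf[String],
      classOf[String],
      classOf[DriveOutputRenameMultipleTextOutputFormat])

One caveat: if the same key can appear in several partitions, two tasks
may try to commit a file with the same name, so it is safest to
partitionBy on the key first.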
I am running a Spark job on ~124 GB of data in an S3 bucket. The job
runs fine but occasionally fails with the following exception during the
first map stage, which involves reading and transforming the data from
S3. Is there a config parameter I can set to increase this timeout
limit?
14/08/23
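The exception text is cut off above. If it is an S3 socket or connection
timeout, the knob lives in the Hadoop configuration rather than in Spark
itself. A hedged sketch, assuming the s3a connector; the older
s3n/jets3t connector reads different property names from a
jets3t.properties file.

    // In spark-shell, using its built-in sc. Values are in
    // milliseconds; hedged: these property names are s3a-specific.
    sc.hadoopConfiguration.set("fs.s3a.connection.timeout", "200000")
    sc.hadoopConfiguration.set("fs.s3a.attempts.maximum", "20")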
Is there any way to control the ordering of values for each key during a
groupByKey() operation? Is there some sort of implicit ordering in place
already?
Thanks
Arpan
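For what it is worth, groupByKey() makes no ordering guarantee for the
values of a key; any order you observe is an accident of partitioning
and merge order. A minimal sketch of imposing an explicit order after
grouping (sample data is a placeholder):

    // In spark-shell: groupByKey gives Iterable values in no
    // guaranteed order; sort each group explicitly.
    val pairs = sc.parallelize(Seq((1, 30L), (1, 10L), (2, 20L)))
    val ordered = pairs.groupByKey().mapValues(_.toSeq.sorted)
    ordered.collect().foreach(println)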
or the 1.1 branch, which includes spilling in Python.
Davies
On Wed, Aug 13, 2014 at 4:08 PM, Arpan Ghosh ar...@automatic.com wrote:
Here are the biggest keys:
[ (17634, 87874097),
(8407, 38395833),
(20092, 14403311),
(9295, 4142636),
(14359, 3129206
The errors occur at exactly the same point in the job as well, right at
the end of the groupByKey() when 5 tasks are left.
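A sketch of the kind of per-key count that produces a biggest-keys list
like the one above (the truncated sortBy(...).take(10) fragments later
in the thread are the Python tail of the same idea; sample data is a
placeholder):

    // In spark-shell: count values per key and take the 10 largest,
    // a quick skew check before a full groupByKey().
    val pairs = sc.parallelize(Seq((17634, "a"), (17634, "b"), (8407, "c")))
    pairs.mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(10)
      .foreach(println)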
On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh ar...@automatic.com wrote:
Hi Davies,
I tried the second option and launched my EC2 cluster with master on all
Hi,
I have launched an AWS Spark cluster using the spark-ec2 script
(--hadoop-major-version=1). The ephemeral-HDFS is set up correctly and I
can see the namenode UI at master-hostname:50070. When I try to copy
files from S3 into ephemeral-HDFS with distcp, using the following
command:
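The command itself is cut off in this snippet; a typical invocation from
the master looks like the following (bucket, credentials, and target
path are placeholders; keys can also go in core-site.xml instead of the
URI):

    hadoop distcp s3n://AWS_KEY:AWS_SECRET@my-bucket/input \
        hdfs:///user/root/input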
Hi,
Let me begin by describing my Spark setup on EC2 (launched using the
provided spark-ec2.py script):
- 100 c3.2xlarge workers (8 cores, 15 GB memory each)
- 1 c3.2xlarge Master (only running the master daemon)
- Spark 1.0.2
- 8 GB mounted at /, 80 GB mounted at /mnt
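For reference, a spark-ec2 launch command matching this setup might look
like the following (key pair, identity file, and cluster name are
placeholders; --hadoop-major-version=1 is taken from the distcp mail
above):

    ./spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem \
        --slaves=100 --instance-type=c3.2xlarge \
        --master-instance-type=c3.2xlarge \
        --hadoop-major-version=1 launch my-cluster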
], False).take(10)
Davies
On Wed, Aug 13, 2014 at 12:21 PM, Arpan Ghosh ar...@automatic.com wrote:
Hi,
Let me begin by describing my Spark setup on EC2 (launched using the
provided spark-ec2.py script):
- 100 c3.2xlarge workers (8 cores, 15 GB memory each)
- 1 c3.2xlarge Master (only
appreciate it if you could test it with your real case.
Davies
On Wed, Aug 13, 2014 at 1:57 PM, Arpan Ghosh ar...@automatic.com wrote:
Thanks Davies. I am running Spark 1.0.2 (which seems to be the latest
release). I'll try changing it to a reduceByKey() and check the size of
the largest
().sortBy(lambda x:x[1], False).take(10)
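Since the mail above mentions switching to reduceByKey(): a sketch of
the idea, assuming the per-key work is an associative aggregation
(sample data is a placeholder). reduceByKey combines map-side, so a hot
key never needs all of its values buffered by a single reducer the way
groupByKey() requires.

    // In spark-shell: aggregate per key without materializing each
    // key's full value list on one reducer.
    val pairs = sc.parallelize(Seq((17634, 2L), (17634, 3L), (8407, 5L)))
    val sums = pairs.reduceByKey(_ + _)
    sums.collect().foreach(println)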
Davies
On Wed, Aug 13, 2014 at 12:21 PM, Arpan Ghosh ar...@automatic.com wrote:
Hi,
Let me begin by describing my Spark setup on EC2 (launched using the
provided spark-ec2.py script):
- 100 c3.2xlarge workers (8 cores, 15 GB memory each)
- 1