[ https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200367#comment-16200367 ]
Steve Loughran commented on SPARK-22240:
----------------------------------------

Amazon EMR is Amazon's own fork of Spark & Hadoop, with its own S3 connectors; they explicitly say [don't use s3a|http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html].

I was about to deny there was any problem with Spark & S3A, but after looking into things more, I think HADOOP-14943 means that S3A is returning the file size as a single partition for a file, when really it should be splitting it up. As a result, the partitioning is going to be limited to whatever is set in {{SparkContext.defaultMinPartitions}} unless you pass a partition count in to your {{SparkContext.hadoopRDD}} calls. I'd do that until I can fix things in S3A.

> S3 CSV number of partitions incorrectly computed
> ------------------------------------------------
>
>                 Key: SPARK-22240
>                 URL: https://issues.apache.org/jira/browse/SPARK-22240
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>            Reporter: Arthur Baudry
>
> Reading CSV out of S3 using the S3A protocol does not compute the number of partitions correctly in Spark 2.2.0.
>
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
>
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
>
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a property that should be set to have the number of partitions computed correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2; that's not relevant here.
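For anyone hitting this before the S3A fix lands, here is a minimal sketch of the workaround suggested above: read the file as raw text via {{SparkContext.hadoopRDD}} and pass an explicit partition count rather than relying on {{SparkContext.defaultMinPartitions}}. The {{s3a://<s3_path>}} placeholder and the target of 115 partitions are taken from the report; this assumes a splittable (uncompressed, one-record-per-line) input, and it gives you raw lines you'd still have to parse as CSV yourself.

{code:java}
// Sketch only: force a minimum partition count via hadoopRDD instead of
// relying on SparkContext.defaultMinPartitions. Run in spark-shell (sc in scope).
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val conf = new JobConf(sc.hadoopConfiguration)
FileInputFormat.setInputPaths(conf, "s3a://<s3_path>")  // placeholder path, as above

// The trailing minPartitions argument (115 here, matching the 2.0.2 figure)
// forces the input to be split even when the filesystem reports the whole
// file as a single block.
val lines = sc.hadoopRDD(conf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], 115)

lines.getNumPartitions  // should be >= 115 for a splittable 14GB file
{code}

For plain text, {{sc.textFile("s3a://<s3_path>", 115)}} takes the same minPartitions argument and is equivalent with less ceremony.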