[ https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-16575:
--------------------------------
    Target Version/s: 2.1.0

> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>
>                 Key: SPARK-16575
>                 URL: https://issues.apache.org/jira/browse/SPARK-16575
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Suhas
>            Priority: Critical
>
> sc.binaryFiles always creates an RDD with 2 partitions.
> Steps to reproduce (tested on Databricks Community Edition):
> 1. Create an RDD using sc.binaryFiles. In this example, the airlines
> folder has 1922 files.
> Ex: {noformat}val binaryRDD =
> sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. Check the number of partitions of the RDD:
> binaryRDD.partitions.size = 2 (the expected value is greater than 2).
> 3. If the RDD is instead created with sc.textFile, the number of
> partitions is 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions on Spark 1.5.1.
> For an explanation with a screenshot, see the link below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
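The reproduction steps above can be sketched in the Spark shell as follows. This is a minimal sketch, assuming a running SparkContext `sc` and the Databricks sample dataset path from the report; the explicit `minPartitions` hint in the last line is an assumption of mine (the parameter exists on `SparkContext.binaryFiles`), and the report does not confirm whether it restores the 1.5.1 behavior on 1.6.x.

{noformat}
// Step 1-2: binaryFiles on a directory with 1922 files
val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
println(binaryRDD.partitions.size)  // observed: 2 on Spark 1.6.1/1.6.2

// Step 3: textFile over the same path yields the expected partition count
val textRDD = sc.textFile("/databricks-datasets/airlines/*")
println(textRDD.partitions.size)    // observed: 1921

// Possible mitigation (assumption, not from the report): pass an
// explicit minPartitions hint to binaryFiles
val hinted = sc.binaryFiles("/databricks-datasets/airlines/*",
                            minPartitions = 1922)
{noformat}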