[
https://issues.apache.org/jira/browse/SPARK-11441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009690#comment-15009690
]
koert kuipers commented on SPARK-11441:
---------------------------------------
One more place where the cache of FileStatus objects for all part files is used
is:
def sizeInBytes: Long
It seems that providing this information (the total size in bytes) forces the
driver to visit all the part files.
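A rough sketch (not the exact Spark source) of why that is: the only way to
answer sizeInBytes is to sum the length of every leaf FileStatus, which in turn
requires the full listing of part files to have happened on the driver.

    import org.apache.hadoop.fs.FileStatus

    // Assumed shape: cachedLeafStatuses holds one FileStatus per part file,
    // so computing the total size forces the full listing to be available.
    def sizeInBytes(cachedLeafStatuses: Seq[FileStatus]): Long =
      cachedLeafStatuses.map(_.getLen).sum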
> HadoopFsRelation is not scalable in number of files read/written
> ----------------------------------------------------------------
>
> Key: SPARK-11441
> URL: https://issues.apache.org/jira/browse/SPARK-11441
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: koert kuipers
>
> HadoopFsRelation includes a fileStatusCache which holds, in the driver
> program, information on all the data files (part files) for the data source.
> It is not unusual to be reading from 100k+ or even 1 million part files, in
> which case filling up this cache will take a very long time (days?) and
> require a lot of memory. See for example:
> https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201510.mbox/%3CCAG+ckK-FvWK=1b2jqc4s+zaz2zvkqehvos9myo0ssmgpc5-...@mail.gmail.com%3E
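> To make concrete why that listing is so slow, here is a rough sketch (not the
> actual fileStatusCache code) of the kind of recursive listing involved, with
> every FileStatus ending up in driver memory:
>
>     import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
>
>     // One listStatus call per directory, issued from the driver; the
>     // resulting FileStatus objects for all part files are kept around.
>     def listLeafFiles(fs: FileSystem, paths: Seq[Path]): Seq[FileStatus] =
>       paths.flatMap { p =>
>         val (dirs, files) = fs.listStatus(p).partition(_.isDirectory)
>         files.toSeq ++ listLeafFiles(fs, dirs.map(_.getPath).toSeq)
>       }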
> This is not the kind of behavior you would expect of a driver program. Also
> HadoopFsRelation passes this large list of part files into:
> def buildScan(inputFiles: Array[FileStatus]): RDD[Row]
> Almost all implementations of HadoopFsRelation do the following inside
> buildScan:
> FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
> This means an array of potentially millions of items now gets stored in the
> JobConf, which will be broadcast. I have not found any errors about this on
> the mailing list, but I believe this is simply because nobody with a large
> number of inputFiles has gotten this far.
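> To make the pattern concrete, here is a rough sketch (not any particular data
> source's actual code) of how those part file paths end up in the broadcast
> JobConf:
>
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.FileStatus
>     import org.apache.hadoop.mapreduce.Job
>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>
>     // Hypothetical helper, only to illustrate the pattern: the path of every
>     // single part file is written into the job configuration, which Spark
>     // then broadcasts to the executors.
>     def setAllPartFilesAsInput(conf: Configuration,
>                                inputFiles: Array[FileStatus]): Configuration = {
>       val job = Job.getInstance(conf)
>       FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>       job.getConfiguration // holds a comma-separated list with one path per part file
>     }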
> Generally, when using Hadoop InputFormats there should never be a need to
> list all the part files driver-side. It seems the reason it is done here is
> to facilitate a driver-side process in ParquetRelation that creates a merged
> data schema. I wonder if it's really necessary to look at all the part files
> for this, or if some assumption can be made that at least all the part files
> in a directory have the same schema (which would reduce the size of the
> problem by a factor of 100 or so).
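> If that assumption were acceptable, the footer reading could be limited to one
> representative part file per directory; a rough sketch of selecting those
> representatives (the schema reading and merging itself left out):
>
>     import org.apache.hadoop.fs.{FileStatus, Path}
>
>     // Pick a single part file per data directory, under the assumption that
>     // all part files in one directory share the same schema; only these
>     // would need their footers read and merged.
>     def representativeFiles(allPartFiles: Seq[FileStatus]): Seq[Path] =
>       allPartFiles
>         .groupBy(_.getPath.getParent) // one group per data directory
>         .values
>         .map(_.head.getPath)          // any one file from each directory
>         .toSeq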
> At the very least it seems to me that the caching of files is Parquet-specific
> and does not belong in HadoopFsRelation. And buildScan should just use the
> data paths (so directories, if one wants to read all part files in a
> directory) as it did before SPARK-7673 / PR #6225.
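> A rough sketch of that alternative, with buildScan receiving the data paths
> (directories) instead of the individual part files:
>
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.Path
>     import org.apache.hadoop.mapreduce.Job
>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>
>     // Hypothetical alternative: hand the InputFormat the data directories and
>     // let it expand them when splits are computed, so the broadcast JobConf
>     // holds one path per directory instead of one per part file.
>     def setDataPathsAsInput(conf: Configuration, dataPaths: Seq[Path]): Configuration = {
>       val job = Job.getInstance(conf)
>       FileInputFormat.setInputPaths(job, dataPaths: _*)
>       job.getConfiguration
>     }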
> I ran into this issue myself with spark-avro, which also does not handle the
> input of part files in buildScan well. Spark-avro actually tries to create an
> RDD (and jobConf broadcast) per part file, which is not scalable even for 1k
> part files. Note that it is difficult for spark-avro to create an RDD per
> data directory (as it probably should) since the data paths have been lost
> now that inputFiles is passed into buildScan instead. To me this again
> confirms that the change to buildScan is troubling.
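> Roughly what that anti-pattern looks like (not the actual spark-avro code,
> and with textFile standing in for the actual Avro read):
>
>     import org.apache.hadoop.fs.FileStatus
>     import org.apache.spark.SparkContext
>     import org.apache.spark.rdd.RDD
>
>     // One RDD (and therefore one jobConf broadcast) per part file, all
>     // unioned together; this falls over well before large file counts.
>     def scanPerFile(sc: SparkContext, inputFiles: Array[FileStatus]): RDD[String] =
>       inputFiles
>         .map(f => sc.textFile(f.getPath.toString))
>         .reduce(_ union _)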