[ https://issues.apache.org/jira/browse/SPARK-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen closed SPARK-13290.
-----------------------------
> wholeTextFile and binaryFiles are really slow
> ---------------------------------------------
>
> Key: SPARK-13290
> URL: https://issues.apache.org/jira/browse/SPARK-13290
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 1.6.0
> Environment: Linux stand-alone
> Reporter: mathieu longtin
>
> Reading biggish files (175MB) with wholeTextFiles or binaryFiles is extremely
> slow. It takes about 3.5 minutes through Spark (Java) versus 2.5 seconds
> with a plain read in Python.
> The Java process balloons to 4.3GB of memory and uses 100% CPU the whole
> time. I suspect Spark reads the file in small chunks and assembles it at the
> end, hence the large amount of CPU.
> {code}
> In [49]: rdd = sc.binaryFiles(pathToOneFile)
> In [50]: %time path, text = rdd.first()
> CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
> Wall time: 3min 32s
> In [51]: len(text)
> Out[51]: 191376122
> In [52]: %time text = open(pathToOneFile).read()
> CPU times: user 8 ms, sys: 691 ms, total: 699 ms
> Wall time: 2.43 s
> In [53]: len(text)
> Out[53]: 191376122
> {code}
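
The reporter's guess (small chunks assembled at the end) can be illustrated outside Spark. The sketch below is plain Python and makes no claim about what Spark actually does internally. It only contrasts one `read()` call with a loop that grows an immutable buffer chunk by chunk, which may recopy the accumulated data on every step. The helper names (`read_whole`, `read_chunked_concat`, `make_sample_file`, `timed`) and the 4 MiB sample size are assumptions made for this illustration.

```python
import os
import tempfile
import time


def read_whole(path):
    """Read the file with a single read() call."""
    with open(path, "rb") as f:
        return f.read()


def read_chunked_concat(path, chunk_size=4096):
    """Read in small chunks and append each one to an immutable bytes object.

    Each += may copy everything accumulated so far, so total work can
    grow much faster than the file size.
    """
    data = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data += chunk
    return data


def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def make_sample_file(size_bytes):
    """Create a temporary file of random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    # Hypothetical sample size; the report used a ~190 MB file.
    path = make_sample_file(4 * 1024 * 1024)
    try:
        whole, t_whole = timed(read_whole, path)
        chunked, t_chunked = timed(read_chunked_concat, path)
        assert whole == chunked  # both strategies yield identical bytes
        print("single read:    %.3f s" % t_whole)
        print("chunked concat: %.3f s" % t_chunked)
    finally:
        os.remove(path)
```

Timings depend on the machine and the Python build, so none are quoted here. Any gap between the two numbers shows only that the assembly strategy can matter, not that it is the cause of the slowdown reported above.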