[ https://issues.apache.org/jira/browse/SPARK-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182787#comment-15182787 ]
Masayoshi TSUZUKI commented on SPARK-3972:
------------------------------------------
In my environment, I can't reproduce this problem.
I tested sc.textFile and sc.wholeTextFiles with both the Spark 1.1.0 pre-built binary
and the Spark 1.6.0 pre-built binary, and all of them work fine.
I'm using Windows 8.1 and Python 2.7.
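For reference, the same check as a small standalone script (a minimal sketch of what the sessions below do interactively; the file name repro_check.py and the app name are illustrative, not part of the original test):
{noformat}
# repro_check.py -- standalone version of the interactive check below (illustrative).
# Run with e.g.: bin\spark-submit repro_check.py
from pyspark import SparkContext

sc = SparkContext("local", "wholeTextFiles-check")

# textFile: one record per line of the file
t1 = sc.textFile("README.md")
print(t1.count())     # number of lines
print(t1.first())     # first line, e.g. u'# Apache Spark'

# wholeTextFiles: one (path, content) pair per file
t2 = sc.wholeTextFiles("README.md")
print(t2.count())     # 1 -- a single file was read
print(t2.first()[0])  # the file URI

sc.stop()
{noformat}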
h3. Spark-1.1.0 local mode
{noformat}
C:\Users\tsudukim\Downloads\spark-1.1.0-bin-hadoop2.4>bin\pyspark.cmd --master local
Running C:\Python27\python.exe with
PYTHONPATH=C:\Users\tsudukim\Downloads\spark-1.1.0-bin-hadoop2.4\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\Users\tsudukim\Downloads\spark-1.1.0-bin-hadoop2.4\bin\..\python;
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m;
support was removed in 8.0
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
...(snip)...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Python version 2.7.8 (default, Jun 30 2014 16:03:49)
SparkContext available as sc.
>>> t1 = sc.textFile("README.md")
16/03/07 17:49:29 INFO MemoryStore: ensureFreeSpace(159118) called with
curMem=0, maxMem=278019440
16/03/07 17:49:29 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 155.4 KB, free 265.0 MB)
>>> t1.count()
...(snip)...
141
>>> t1.first()
...(snip)...
u'# Apache Spark'
>>> t2 = sc.wholeTextFiles("README.md")
16/03/07 17:49:52 INFO MemoryStore: ensureFreeSpace(214194) called with
curMem=168958, maxMem=278019440
16/03/07 17:49:52 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 209.2 KB, free 264.8 MB)
>>> t2.count()
...(snip)...
1
>>> t2.first()
...(snip)...
(u'file:/C:/Users/tsudukim/Downloads/spark-1.1.0-bin-hadoop2.4/README.md', u'# Apache Spark\n\nSpark is ...')
>>>
{noformat}
h3. Spark-1.1.0 yarn-client mode
This pattern doesn't work properly because of [SPARK-1825].
h3. Spark-1.6.0 local mode
{noformat}
C:\Users\tsudukim\Downloads\spark-1.6.0-bin-hadoop2.6>bin\pyspark --master local
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
...(snip)...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.8 (default, Jun 30 2014 16:03:49)
SparkContext available as sc, HiveContext available as sqlContext.
>>> t1 = sc.textFile("README.md")
16/03/07 18:02:06 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 127.4 KB, free 127.4 KB)
16/03/07 18:02:06 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in
memory (estimated size 13.9 KB, free 141.3 KB)
16/03/07 18:02:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on
localhost:51335 (size: 13.9 KB, free: 511.1 MB)
16/03/07 18:02:06 INFO SparkContext: Created broadcast 0 from textFile at
NativeMethodAccessorImpl.java:-2
>>> t1.count()
...(snip)...
95
>>> t1.first()
...(snip)...
u'# Apache Spark'
>>> t2 = sc.wholeTextFiles("README.md")
16/03/07 18:02:37 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 209.1 KB, free 367.4 KB)
16/03/07 18:02:37 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in
memory (estimated size 19.5 KB, free 386.8 KB)
16/03/07 18:02:37 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on
localhost:51335 (size: 19.5 KB, free: 511.1 MB)
16/03/07 18:02:37 INFO SparkContext: Created broadcast 3 from wholeTextFiles at
NativeMethodAccessorImpl.java:-2
>>> t2.count()
...(snip)...
1
>>> t2.first()
...(snip)...
(u'file:/C:/Users/tsudukim/Downloads/spark-1.6.0-bin-hadoop2.6/README.md', u'# Apache Spark\n\nSpark is ...')
>>>
{noformat}
h3. Spark-1.6.0 yarn-client mode
{noformat}
C:\Users\tsudukim\Downloads\spark-1.6.0-bin-hadoop2.6>set PYSPARK_PYTHON=python
C:\Users\tsudukim\Downloads\spark-1.6.0-bin-hadoop2.6>set HADOOP_CONF_DIR=C:\Users\tsudukim\Documents\hadoop\conf
C:\Users\tsudukim\Downloads\spark-1.6.0-bin-hadoop2.6>bin\pyspark --master yarn-client
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
...(snip)...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.8 (default, Jun 30 2014 16:03:49)
SparkContext available as sc, HiveContext available as sqlContext.
>>> t1 = sc.textFile('/tmp/README.md')
16/03/07 18:08:50 INFO MemoryStore: Block broadcast_1 stored as values in
memory (estimated size 212.5 KB, free 444.6 KB)
16/03/07 18:08:50 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in
memory (estimated size 19.7 KB, free 464.3 KB)
16/03/07 18:08:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on
172.16.0.19:51575 (size: 19.7 KB, free: 511.1 MB)
16/03/07 18:08:50 INFO SparkContext: Created broadcast 1 from textFile at
NativeMethodAccessorImpl.java:-2
>>> t1.count()
...(snip)...
97
>>> t1.first()
...(snip)...
u'# Apache Spark'
>>> t2 = sc.wholeTextFiles('/tmp/README.md')
16/03/07 18:09:19 INFO MemoryStore: Block broadcast_4 stored as values in
memory (estimated size 212.9 KB, free 694.1 KB)
16/03/07 18:09:19 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in
memory (estimated size 19.8 KB, free 713.9 KB)
16/03/07 18:09:19 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on
172.16.0.19:51575 (size: 19.8 KB, free: 511.1 MB)
16/03/07 18:09:19 INFO SparkContext: Created broadcast 4 from wholeTextFiles at
NativeMethodAccessorImpl.java:-2
>>> t2.count()
...(snip)...
1
>>> t2.first()
...(snip)...
(u'hdfs://ff01:8020/tmp/README.md', u'# Apache Spark\n\nSpark is ...')
>>>
{noformat}
> PySpark Error on Windows with sc.wholeTextFiles
> -----------------------------------------------
>
> Key: SPARK-3972
> URL: https://issues.apache.org/jira/browse/SPARK-3972
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, PySpark, Windows
> Affects Versions: 1.1.0
> Environment: Windows 8.1 x64
> Java SE Version 8 Update 20 (build 1.8.0_20-b26);
> Python 2.7.7
> Reporter: Michael Griffiths
> Priority: Minor
>
> When running sc.wholeTextFiles() on a directory, I can run the command but
> not do anything with the resulting RDD – specifically, I get an error in
> py4j.protocol.Py4JJavaError; the error is unspecified. This occurs even though
> I can read the text file(s) individually with sc.textFile().
> Steps followed:
> 1) Download Spark 1.1.0 (pre-built for Hadoop 2.4:
> [spark-1.1.0-bin-hadoop2.4.tgz|http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz])
> 2) Extract into folder at root of drive: **D:\spark**
> 3) Create test folder at **D:\testdata** with one (HTML) file contained
> within it.
> 4) Launch PySpark at **bin\PySpark**
> 5) Try to use sc.wholeTextFiles('d:/testdata'); it fails.
> Note: I followed instructions from the upcoming O'Reilly book [Learning
> Spark|http://shop.oreilly.com/product/0636920028512.do] for this. I do not
> have any related tools installed (e.g. Hadoop) on the Windows machine.
> See the session below with tracebacks from the errors.
> {noformat}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
>       /_/
> Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
> SparkContext available as sc.
> >>> file = sc.textFile("d:/testdata/0000cbcc5b470ec06f212990c68c8f76e887b884")
> >>> file.count()
> 732
> >>> file.first()
> u'<!DOCTYPE html>'
> >>> data = sc.wholeTextFiles('d:/testdata')
> >>> data.first()
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "D:\spark\python\pyspark\rdd.py", line 1167, in first
> return self.take(1)[0]
> File "D:\spark\python\pyspark\rdd.py", line 1126, in take
> totalParts = self._jrdd.partitions().size()
> File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line
> 538, in __call__
> File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300,
> in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
> : java.lang.NullPointerException
> at java.lang.ProcessBuilder.start(Unknown Source)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
> at org.apache.hadoop.util.Shell.run(Shell.java:418)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
> at
> org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
> at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
> at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
> at
> org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
> at
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at
> org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
> at
> org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Unknown Source)
> >>> data.count()
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "D:\spark\python\pyspark\rdd.py", line 847, in count
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
> File "D:\spark\python\pyspark\rdd.py", line 838, in sum
> return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
> File "D:\spark\python\pyspark\rdd.py", line 759, in reduce
> vals = self.mapPartitions(func).collect()
> File "D:\spark\python\pyspark\rdd.py", line 723, in collect
> bytesInJava = self._jrdd.collect().iterator()
> File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line
> 538, in __call__
> File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300,
> in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o28.collect.
> : java.lang.NullPointerException
> at java.lang.ProcessBuilder.start(Unknown Source)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
> at org.apache.hadoop.util.Shell.run(Shell.java:418)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
> at
> org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
> at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
> at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
> at
> org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
> at
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:774)
> at
> org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:305)
> at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Unknown Source)
> >>> data.map(lambda x: len(x)).take(1)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "D:\spark\python\pyspark\rdd.py", line 1126, in take
> totalParts = self._jrdd.partitions().size()
> File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line
> 538, in __call__
> File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300,
> in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o61.partitions.
> : java.lang.NullPointerException
> at java.lang.ProcessBuilder.start(Unknown Source)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
> at org.apache.hadoop.util.Shell.run(Shell.java:418)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534)
> at
> org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)
> at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697)
> at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679)
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
> at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
> at
> org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54)
> at
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at
> org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
> at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Unknown Source)
> {noformat}