[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

Shubhanshu Mishra (JIRA) Sat, 02 Apr 2016 11:52:20 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223006#comment-15223006
 ]


Shubhanshu Mishra edited comment on SPARK-14103 at 4/2/16 6:51 PM:
-------------------------------------------------------------------

[~hyukjin.kwon] thanks for pointing this out. I passed {code}quote=""{code} 
as an option and the DataFrame reader was able to parse the file correctly. 

{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", quote="", 
inferSchema="true", delimiter="\t") # WORKS
{code}

After your comment, I looked at the 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 file, which sets the default quote character to {code}"{code}. However, the 
{code}getChar{code} function specifies that if the length of the option value 
is 0, the character is set to the null Unicode character {code}\u0000{code}. 
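
For illustration, the option handling described above can be paraphrased in Python. This is only a sketch of the described behavior, not the Scala code in CSVOptions.scala; the function name get_char, its signature, and the error raised for values longer than one character are assumptions made for this example.

```python
def get_char(options, name, default):
    """Hypothetical paraphrase of the single-character option lookup."""
    value = options.get(name)
    if value is None:
        # Option not supplied: fall back to the default character.
        return default
    if len(value) == 0:
        # Empty string: use the null Unicode character.
        return "\u0000"
    if len(value) == 1:
        return value
    # Assumption for this sketch: longer values are rejected.
    raise ValueError("%s cannot be more than one character" % name)


# No quote option given: the default double quote is kept.
default_quote = get_char({}, "quote", '"')

# quote="" given: the null character is used instead.
empty_quote = get_char({"quote": ""}, "quote", '"')
```

Under these assumptions, passing an empty string replaces the default double quote with the null character.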

I think this resolves the parsing issue. However, the long error message, 
which dumps the parsed file content to stderr, should still be addressed. 




was (Author: [email protected]):
[~hyukjin.kwon] thanks for pointing this out. I used the `quote=""` as a value 
and the dataframe reader was able to correctly parse the file. 

{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", quote="", 
inferSchema="true", delimiter="\t") # WORKS
{code}

After your comment, I looked at the 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 file, which sets the default quote character to `"`. However, the `getChar` 
function specifies that if the length of the option is 0 then the value will 
be set to the null Unicode char `\u0000`. 

I think this fixes up this issue. However, the long error message should be 
taken care of. 



> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
>                 Key: SPARK-14103
>                 URL: https://issues.apache.org/jira/browse/SPARK-14103
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>         Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>            Reporter: Shubhanshu Mishra
>              Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch. When I run the following 
> command on a large tab-separated file, the contents of the file are 
> written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>                                                          (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (1000001) exceeds the maximum number of characters 
> defined in your parser settings (1000000). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
>         Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications       privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07              international conference on human computer 
> interaction  interact                43331058        19371[\n]        
> 3D4F6CA1        Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services       between the profiles 
> another such bias technology acceptance studies on social network services 
> 2015    2015/08/02      10.1007/978-3-319-21383-5_12    international 
> conference on human-computer interaction  interact                43331058    
>     19502[\n]
> .......
> .........
> web snippets    2008    2008/05/04      10.1007/978-3-642-01344-7_13    
> international conference on web information systems and technologies    
> webist          44F29802        19489
> 06FA3FFA        Interactive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration     2009        
>             internationa]
>         at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
>         at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
>         at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
>         at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
>         at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>         at org.apache.spark.scheduler.Task.run(Task.scala:82)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
> aborting job
> ^M[Stage 1:>                                                          (0 + 1) 
> / 2]
> {code}
> For a small sample (<10,000 lines) of the data, I do not get any error. 
> But as soon as I go above 100,000 samples, I start getting the 
> error. 
> I don't think Spark should ever write the actual data to stderr, 
> as it makes the output hard to read. 
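
The error above is consistent with a quote character opening a quoted field that is never closed, so the parser keeps reading across line separators. The following sketch reproduces that kind of behavior with Python's standard csv module, not with the univocity parser that Spark uses; the sample rows are made up for the example, and the two parsers may differ in detail.

```python
import csv
import io

# Made-up tab-separated sample: the second field of the first row starts
# with a quote character that is never closed.
data = 'id1\t"Privacy-shake: a title\t2010\nid2\tAnother title\t2015\n'


def parse(text, **kwargs):
    """Parse tab-separated text with the csv module and return all rows."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", **kwargs)
    return list(reader)


# Default quoting: the unclosed quote starts a quoted field, and the rest
# of the input (tabs and newlines included) ends up inside that one field.
with_quotes = parse(data)

# Quoting disabled: the quote is treated as an ordinary character and
# every line is split into its own row.
without_quotes = parse(data, quoting=csv.QUOTE_NONE)

print(len(with_quotes), len(without_quotes))
```

In this sketch, disabling quoting plays the role that quote="" plays in the DataFrame reader call discussed in this thread.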



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

