[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216541#comment-15216541 ]

Shubhanshu Mishra commented on SPARK-14103:
-------------------------------------------

Ok, I tried your suggestion of increasing maxCharsPerColumn to an insanely high 
value, and that made the code load my file into the DataFrame.

{code}

wc temp.txt
# Output: 100000 3181726 25693963 temp.txt  (lines, words, bytes)

# Any value of at least 2679360 works
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=3181726)  # works
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=10000000)  # works
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=2679360)  # works

# However, this one fails
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350)  # gives the error
{code}
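
As a sanity check, the longest tab-separated field in the file can be measured 
directly, which gives a principled value for maxCharsPerColumn instead of 
guessing. This is a minimal sketch that scans physical lines, so its answer may 
differ from what the parser sees if the parser interprets quote characters:

{code}
# Sketch: find the longest tab-separated field in temp.txt, so that
# maxCharsPerColumn can be set just above it rather than guessed.
max_field = 0
with open("temp.txt") as f:
    for line in f:
        for field in line.rstrip("\n").split("\t"):
            max_field = max(max_field, len(field))
print(max_field)  # set maxCharsPerColumn to at least this value
{code}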

I checked the file at that byte offset using the following bash command:

{code}
$ head -c2679360 temp.txt | tail -n3

5F59257A        Performance of multicarrier CDMA technique combined with 
space-time block coding over Rayleigh channel  performance of multicarrier cdma 
technique combined with space time block coding over rayleigh channel 2002    
2002    10.1109/ISSSTA.2002.1048562     international symposium on information 
theory and its applications      isita           44B587D1        17005
6C9A7181        Compressive receiver sidelobes suppression based on mismatching 
algorithms      compressive receiver sidelobes suppression based on mismatching 
algorithms      1998    1998  10.1109/ISSSTA.1998.722528       international 
symposium on information theory and its applications      isita           
44B587D1        17166
777FD068        UE Counting Mechanism for MBMS Considering PtM Macro Diversity 
Combining Support in UMTS Networks       ue counting mechanism for mbms 
considering ptm macro diversity combining support in umts networks      2006    
2006/08 10.1109/ISSSTA.2006.311795      internation
{code}


I don't see any issues with the data in these lines, especially around the 
characters at the failing boundary:

{code}
$ head -c2679360 temp.txt | tail -c20

6.311795\tinternation
{code}
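
One thing worth ruling out (a hypothetical check, not something the parser 
reported here) is an invisible or unbalanced character just before the failing 
offset: an unescaped quote character, for example, would make a CSV parser 
treat everything after it as one huge quoted field. Dumping the raw bytes with 
repr() makes such characters visible:

{code}
# Hypothetical diagnostic: print the raw bytes around the failing offset
# (between 2679350 and 2679360, from the experiments above) so stray quote
# or non-printing characters stand out.
with open("temp.txt", "rb") as f:
    f.seek(2679330)
    print(repr(f.read(60)))
{code}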

> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
>                 Key: SPARK-14103
>                 URL: https://issues.apache.org/jira/browse/SPARK-14103
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>         Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>            Reporter: Shubhanshu Mishra
>              Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file get written 
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>                                                          (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (1000001) exceeds the maximum number of characters 
> defined in your parser settings (1000000). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
>         Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications       privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07              international conference on human computer 
> interaction  interact                43331058        19371[\n]        
> 3D4F6CA1        Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services       between the profiles 
> another such bias technology acceptance studies on social network services 
> 2015    2015/08/02      10.1007/978-3-319-21383-5_12    international 
> conference on human-computer interaction  interact                43331058    
>     19502[\n]
> .......
> .........
> web snippets    2008    2008/05/04      10.1007/978-3-642-01344-7_13    
> international conference on web information systems and technologies    
> webist          44F29802        19489
> 06FA3FFA        Interactive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration     2009        
>             internationa]
>         at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
>         at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
>         at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
>         at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
>         at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
>         at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>         at org.apache.spark.scheduler.Task.run(Task.scala:82)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
> aborting job
> ^M[Stage 1:>                                                          (0 + 1) 
> / 2]
> {code}
> For a small sample (<10,000 lines) of the data, I do not get any error, but 
> as soon as I go above 100,000 lines, the error appears. 
> I don't think Spark should ever write the actual data to stderr, as it 
> decreases readability. 


