[
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216541#comment-15216541
]
Shubhanshu Mishra commented on SPARK-14103:
-------------------------------------------
OK, I tried your suggestion of increasing maxCharsPerColumn to a very high
value, and that allowed my file to load into the dataframe:
{code}
wc temp.txt
# Output is 100000 3181726 25693963 temp.txt  (lines, words, bytes)
# Any number larger than 3181726 works
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=3181726) # Works
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=10000000) # Works
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=2679360) # Works
# However, this one fails
df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error
{code}
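The same failure mode can be reproduced with the Python standard library's csv module, which enforces an analogous per-field cap via {{csv.field_size_limit}} (this is only an analogy for illustration; Spark's CSV reader uses the univocity parser, not Python's csv):
{code}
import csv
import io

# One row whose second tab-separated field is 200 characters long.
data = "id\t" + "x" * 200 + "\n"

# Cap below the field length: parsing raises csv.Error, much like
# univocity's "exceeds the maximum number of characters" error.
csv.field_size_limit(100)
try:
    list(csv.reader(io.StringIO(data), delimiter="\t"))
except csv.Error as e:
    print("parse failed:", e)

# Raise the cap above the field length: parsing succeeds.
csv.field_size_limit(1000)
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))
print(len(rows[0][1]))  # -> 200
{code}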
I checked the file around that byte offset using the following bash commands:
{code}
$ head -c2679360 temp1.txt | tail -n3
5F59257A Performance of multicarrier CDMA technique combined with
space-time block coding over Rayleigh channel performance of multicarrier cdma
technique combined with space time block coding over rayleigh channel 2002
2002 10.1109/ISSSTA.2002.1048562 international symposium on information
theory and its applications isita 44B587D1 17005
6C9A7181 Compressive receiver sidelobes suppression based on mismatching
algorithms compressive receiver sidelobes suppression based on mismatching
algorithms 1998 1998 10.1109/ISSSTA.1998.722528 international
symposium on information theory and its applications isita
44B587D1 17166
777FD068 UE Counting Mechanism for MBMS Considering PtM Macro Diversity
Combining Support in UMTS Networks ue counting mechanism for mbms
considering ptm macro diversity combining support in umts networks 2006
2006/08 10.1109/ISSSTA.2006.311795 internation
{code}
I don't see any issues with the data in these lines, especially around the
following characters:
{code}
$ head -c2679360 temp1.txt | tail -c20
6.311795\tinternation
{code}
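Rather than eyeballing byte offsets, the longest tab-separated field can be measured directly and compared against maxCharsPerColumn. Note that {{head -c}} above counts bytes while the parser limit counts characters, so the two can disagree on non-ASCII data. A small diagnostic sketch (the file name and a tiny demo file stand in for the real temp.txt):
{code}
def longest_field(path, delimiter="\t"):
    """Return the length in characters of the longest delimited field."""
    max_len = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            for field in line.rstrip("\n").split(delimiter):
                max_len = max(max_len, len(field))
    return max_len

# Tiny demo file in place of the real temp.txt:
with open("demo.txt", "w", encoding="utf-8") as f:
    f.write("A\t" + "x" * 50 + "\nB\tshort\n")

print(longest_field("demo.txt"))  # -> 50
{code}
Any maxCharsPerColumn value at or above the reported number should then be safe for that file.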
> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master
> branch
> Reporter: Shubhanshu Mishra
> Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following
> command on a large tab-separated file, the contents of the file are
> written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false",
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:> (0 + 2)
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input:
> Length of parsed input (1000001) exceeds the maximum number of characters
> defined in your parser settings (1000000). Identified line separator
> characters in the parsed content. This may be the cause of the error. The
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in
> mobile location sharing applications privacy shake a haptic interface
> for managing privacy settings in mobile location sharing applications 2010
> 2010/09/07 international conference on human computer
> interaction interact 43331058 19371[\n]
> 3D4F6CA1 Between the Profiles: Another such Bias. Technology
> Acceptance Studies on Social Network Services between the profiles
> another such bias technology acceptance studies on social network services
> 2015 2015/08/02 10.1007/978-3-319-21383-5_12 international
> conference on human-computer interaction interact 43331058
> 19502[\n]
> .......
> .........
> web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist 44F29802 19489
> 06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration
> interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times;
> aborting job
> ^M[Stage 1:> (0 + 1)
> / 2]
> {code}
> For a small sample (<10,000 lines) of the data, I do not get any error,
> but as soon as I go above 100,000 lines, I start getting the error.
> I don't think the Spark platform should ever write the actual data to
> stderr, as it decreases readability.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)