[
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan resolved SPARK-14480.
---------------------------------
Resolution: Fixed
> Remove meaningless StringIteratorReader for CSV data source for better
> performance
> ----------------------------------------------------------------------------------
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line
> by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think
> it was written this way for better performance. However, there appear to be two
> problems.
> Firstly, it was actually not faster than processing line by line with an
> {{Iterator}}, because of the additional logic needed to wrap the {{Iterator}} in a
> {{Reader}}.
> Secondly, it added complexity, because extra logic is needed to allow every
> line to be read byte by byte. This made it quite difficult to track down
> parsing issues (e.g. SPARK-14103). In fact, almost all of the code
> in {{CSVParser}} might not be needed.
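
To make the two approaches concrete, here is a minimal sketch of what wrapping an iterator of lines in a {{Reader}} involves, next to the line-by-line alternative. The names `LineIteratorReader` and `parseLines` are hypothetical and are not taken from the Spark code base; the real {{StringIteratorReader}} may differ in detail, and the naive split below ignores quoting and escaping.

```scala
import java.io.Reader
import java.util.regex.Pattern

// Hypothetical, simplified stand-in for a Reader that wraps an Iterator[String].
// It is NOT Spark's StringIteratorReader; it only illustrates the extra bookkeeping.
class LineIteratorReader(lines: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos: Int = 0

  // Copies characters out of the current line, re-appending the newline that
  // line splitting removed, and advances to the next line when exhausted.
  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (len == 0) return 0
    while (pos >= current.length) {
      if (!lines.hasNext) return -1 // end of stream
      current = lines.next() + "\n"
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}

// Line-by-line alternative: each line is handed to the parsing step directly.
// The split is deliberately naive (no quote or escape handling).
def parseLines(lines: Iterator[String], delimiter: Char): Iterator[Array[String]] = {
  val pattern = Pattern.quote(delimiter.toString)
  lines.map(_.split(pattern, -1)) // limit -1 keeps trailing empty fields
}
```

With the wrapper, the position and newline bookkeeping sits between the input and the parser; without it, the iterator is consumed as-is.
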
> I made a rough patch and tested this. The test results for the first problem
> are below:
> h4. Results
> - Original code with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New code with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
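
For reference, the relative difference between the two tables can be worked out directly from the reported figures. The helper name `reductionPercent` is hypothetical. By this arithmetic the end-to-end time drops by roughly 4.7% and the parse time by roughly 22.9%.

```scala
// Figures copied from the results tables above (nanoseconds).
val endToEndOld = 14116265034L
val endToEndNew = 13451699644L
val parseOld = 2008277960L
val parseNew = 1549050564L

// Hypothetical helper: percentage reduction going from `before` to `after`.
def reductionPercent(before: Long, after: Long): Double =
  (before - after).toDouble / before * 100.0

println(f"end-to-end reduction: ${reductionPercent(endToEndOld, endToEndNew)}%.1f%%")
println(f"parse time reduction: ${reductionPercent(parseOld, parseNew)}%.1f%%")
```
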
> In more detail:
> h4. Method
> - The TPC-H lineitem table is tested.
> - Only the first 1000000 rows are collected.
> - End-to-end tests and parsing time tests are each performed 10 times, and the
> average is calculated for each.
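
The averaging step described above could be scripted along these lines. This is only a sketch: `averageNanos` is a hypothetical helper and not part of the issue, which does not say how the ten runs were actually averaged.

```scala
// Hypothetical helper (not from the issue): evaluates `f` `runs` times and
// returns the mean elapsed wall-clock time in nanoseconds.
def averageNanos[A](runs: Int)(f: => A): Double = {
  require(runs > 0, "runs must be positive")
  val elapsed = (1 to runs).map { _ =>
    val start = System.nanoTime
    f // the by-name argument is re-evaluated on every run
    System.nanoTime - start
  }
  elapsed.sum.toDouble / runs
}
```

Under that assumption, an end-to-end measurement would look like `averageNanos(10)(df.take(1000000))`. Note that JIT warm-up and caching can make the first runs slower than later ones.
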
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with scale factor 1
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size : 724.66 MB
> h4. Test Codes
> - Function to measure time
> {code}
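> // Evaluates `f` once, prints the elapsed time in milliseconds, and returns the result.
> def time[A](f: => A): A = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: " + (System.nanoTime - s) / 1e6 + "ms")
>   ret
> }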
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(1000000))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}