Re: [csv] Performance comparison

2012-03-14 Thread Emmanuel Bourg
After more experiments I'm less enthusiastic about providing an optimized BufferedReader. The result of the performance test is significantly different if the test is run alone or after all the other unit tests (about 30% slower). When all the tests are executed, the removal of the

Re: [csv] Performance comparison

2012-03-14 Thread Christian Grobmeier
On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers ralph.go...@dslextreme.com wrote: I don't think we should be trying to recode JDK classes. If the implementations suck, why not? +1 -- http://www.grobmeier.de https://www.timeandbill.de

Re: [csv] Performance comparison

2012-03-13 Thread Emmanuel Bourg
On 13/03/2012 01:44, sebb wrote: I don't think we should be trying to recode JDK classes. I'd rather not, but we have done that in the past. FastDateFormat and StrBuilder come to mind. Emmanuel Bourg

Re: [csv] Performance comparison

2012-03-13 Thread sebb
On 13 March 2012 09:01, Emmanuel Bourg ebo...@apache.org wrote: On 13/03/2012 01:44, sebb wrote: I don't think we should be trying to recode JDK classes. I'd rather not, but we have done that in the past. FastDateFormat and StrBuilder come to mind. And now Java has StringBuilder, which

Re: [csv] Performance comparison

2012-03-13 Thread Emmanuel Bourg
On 13/03/2012 02:47, Niall Pemberton wrote: IMO performance should be taken out of the equation by using the Readable interface[1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance. If you mean
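
To illustrate the Readable suggestion quoted above, here is a minimal sketch of code that consumes a java.lang.Readable through a CharBuffer, so the caller decides which implementation (buffered or not) sits underneath. The class name ReadableDemo and the method countChar are hypothetical and are not part of Commons CSV; this only shows the shape of the idea, not how the parser was or would be written.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.CharBuffer;

// Hypothetical demo class, not part of Commons CSV.
class ReadableDemo {

    /** Counts occurrences of target in everything the Readable supplies. */
    static int countChar(Readable source, char target) throws IOException {
        CharBuffer buffer = CharBuffer.allocate(4096);
        int count = 0;
        // Readable.read(CharBuffer) returns -1 once the source is exhausted.
        while (source.read(buffer) != -1) {
            buffer.flip();                  // switch the buffer from filling to draining
            while (buffer.hasRemaining()) {
                if (buffer.get() == target) {
                    count++;
                }
            }
            buffer.clear();                 // make the whole buffer writable again
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Any Readable works here: a StringReader, a BufferedReader over a file, etc.
        System.out.println(countChar(new StringReader("a,b,c\n1,2,3\n"), ','));
    }
}
```

Because every java.io.Reader implements Readable, existing callers could keep passing their readers unchanged.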

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
I have identified the performance killer: it's the ExtendedBufferedReader. It implements complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look-ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance
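
For readers unfamiliar with the mark/reset technique mentioned in this message, a one-character look-ahead on a plain BufferedReader can be sketched roughly as follows. The class name LookAheadDemo and the method lookAhead are hypothetical; this is an illustration of the general idea, not the code that was committed to Commons CSV.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class, not the actual ExtendedBufferedReader.
class LookAheadDemo {

    /** Returns the next character without consuming it, or -1 at end of stream. */
    static int lookAhead(BufferedReader reader) throws IOException {
        reader.mark(1);         // remember the position; one char of read-ahead is enough
        int c = reader.read();  // read the next character (or -1 at end of stream)
        reader.reset();         // rewind to the marked position
        return c;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("a,b"));
        System.out.println((char) lookAhead(reader)); // peeks, consumes nothing
        System.out.println((char) reader.read());     // still returns the first character
    }
}
```

The appeal of this approach is that the reader keeps no extra state of its own: the look-ahead is only paid for when it is actually requested.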

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 12 March 2012 10:31, Emmanuel Bourg ebo...@apache.org wrote: I have identified the performance killer: it's the ExtendedBufferedReader. It implements complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look-ahead using

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
On 12/03/2012 16:44, sebb wrote: Java has a PushbackReader class - could that not be used? I considered it, but it doesn't mix well with line reading. The mark/reset solution is really simple and efficient. Emmanuel Bourg
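
As a point of comparison, a peek built on java.io.PushbackReader might look like the sketch below. The class name PushbackPeekDemo and the method peek are hypothetical. One plausible reading of "doesn't mix well with line reading" is that PushbackReader offers no readLine() method, so line-oriented code would need a separate wrapper; that interpretation is an assumption, not something stated in the thread.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Hypothetical demo class, shown only to contrast with the mark/reset approach.
class PushbackPeekDemo {

    /** Returns the next character without consuming it, or -1 at end of stream. */
    static int peek(PushbackReader reader) throws IOException {
        int c = reader.read();
        if (c != -1) {
            reader.unread(c);   // push the character back so the next read() sees it again
        }
        return c;
    }

    public static void main(String[] args) throws IOException {
        // The default pushback buffer holds exactly one character.
        PushbackReader reader = new PushbackReader(new StringReader("x;y"));
        System.out.println((char) peek(reader));
        System.out.println((char) reader.read());
    }
}
```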

Re: [csv] Performance comparison

2012-03-12 Thread Benedikt Ritter
On 12 March 2012 11:31, Emmanuel Bourg ebo...@apache.org wrote: I have identified the performance killer: it's the ExtendedBufferedReader. It implements complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look-ahead using

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
On 12/03/2012 17:03, Benedikt Ritter wrote: The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to identify bottlenecks? Yes, I started investigating in this direction. I filed a few bugs regarding the behavior of

Re: [csv] Performance comparison

2012-03-12 Thread James Carman
Would one of the parser libraries not work here? On Mar 12, 2012 12:22 PM, Emmanuel Bourg ebo...@apache.org wrote: On 12/03/2012 17:03, Benedikt Ritter wrote: The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to

Re: [csv] Performance comparison

2012-03-12 Thread Benedikt Ritter
On 12 March 2012 17:22, Emmanuel Bourg ebo...@apache.org wrote: On 12/03/2012 17:03, Benedikt Ritter wrote: The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to identify bottlenecks? Yes, I started

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
On 12/03/2012 17:28, James Carman wrote: Would one of the parser libraries not work here? Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually. Emmanuel Bourg

Re: [csv] Performance comparison

2012-03-12 Thread Christian Grobmeier
On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg ebo...@apache.org wrote: On 12/03/2012 17:28, James Carman wrote: Would one of the parser libraries not work here? Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is

Re: [csv] Performance comparison

2012-03-12 Thread James Carman
Yes, this is what I mean. It might be worth a shot. Folks who specialize in parsing have spent much time on these libraries. It would make sense that they are quite fast. It gets us out of the parsing business. On Mar 12, 2012 12:41 PM, Emmanuel Bourg ebo...@apache.org wrote: On 12/03/2012

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
I kept tickling ExtendedBufferedReader and I have some interesting results. First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally. But wait, isn't BufferedReader

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 00:12, Emmanuel Bourg ebo...@apache.org wrote: I kept tickling ExtendedBufferedReader and I have some interesting results. First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
On 13/03/2012 01:25, sebb wrote: I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK. By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes. I

Re: [csv] Performance comparison

2012-03-12 Thread Gary Gregory
On Mar 12, 2012, at 20:25, sebb seb...@gmail.com wrote: On 13 March 2012 00:12, Emmanuel Bourg ebo...@apache.org wrote: I kept tickling ExtendedBufferedReader and I have some interesting results. First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader.

Re: [csv] Performance comparison

2012-03-12 Thread Gary Gregory
On Mar 12, 2012, at 20:30, Emmanuel Bourg ebo...@apache.org wrote: On 13/03/2012 01:25, sebb wrote: I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK. By all means make sure the code is efficient in the way it uses the JDK

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 00:29, Emmanuel Bourg ebo...@apache.org wrote: On 13/03/2012 01:25, sebb wrote: I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK. By all means make sure the code is efficient in the way it uses the JDK

Re: [csv] Performance comparison

2012-03-12 Thread Niall Pemberton
On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg ebo...@apache.org wrote: On 13/03/2012 01:25, sebb wrote: I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK. By all means make sure the code is efficient in the way it uses

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 01:47, Niall Pemberton niall.pember...@gmail.com wrote: On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg ebo...@apache.org wrote: On 13/03/2012 01:25, sebb wrote: I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the

Re: [csv] Performance comparison

2012-03-12 Thread Ralph Goers
On Mar 12, 2012, at 5:44 PM, sebb wrote: On 13 March 2012 00:29, Emmanuel Bourg ebo...@apache.org wrote: On 13/03/2012 01:25, sebb wrote: I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK. By all means make sure the

[csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg
Hi, I compared the performance of Commons CSV with the other CSV parsers available. I used the world cities file from Maxmind as a test file [1]; it's a big file of 130 MB with 2.8 million records. Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT
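
The measurement approach described here (several iterations so the JIT can warm up before timing) can be sketched as below. This is not the benchmark used in the thread: the class WarmupBenchmark, the line-counting workload, and the iteration counts are all assumptions chosen to keep the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Supplier;

// Hypothetical harness; the workload is a stand-in for a real CSV parser.
class WarmupBenchmark {

    /** Stand-in workload: counts the lines available from the given reader. */
    static int countLines(Reader input) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        int lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        return lines;
    }

    /** Runs untimed warm-up passes, then returns the best time of the measured passes. */
    static long bestTimeNanos(Supplier<Reader> source, int warmup, int measured) throws IOException {
        long sink = 0; // accumulate results so the work cannot be optimized away
        for (int i = 0; i < warmup; i++) {
            sink += countLines(source.get());
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            sink += countLines(source.get());
            best = Math.min(best, System.nanoTime() - start);
        }
        if (sink < 0) {
            throw new IllegalStateException("unreachable: line counts are never negative");
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        String data = "id,city\n1,Paris\n2,Lyon\n";
        long nanos = bestTimeNanos(() -> new StringReader(data), 5, 5);
        System.out.println("best time: " + nanos + " ns");
    }
}
```

A dedicated harness such as JMH handles warm-up and dead-code elimination more rigorously; the sketch only shows why the first iterations are discarded.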

Re: [csv] Performance comparison

2012-03-11 Thread Benedikt Ritter
On 11 March 2012 15:05, Emmanuel Bourg ebo...@apache.org wrote: Hi, I compared the performance of Commons CSV with the other CSV parsers available. I used the world cities file from Maxmind as a test file [1]; it's a big file of 130 MB with 2.8 million records. Here are the results obtained

Re: [csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg
On 11/03/2012 16:53, Benedikt Ritter wrote: I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start? Hi Benedikt, thank you for helping. You can start by looking at the source of CSVParser to see if anything catches your eye, and then run

Re: [csv] Performance comparison

2012-03-11 Thread Benedikt Ritter
On 11 March 2012 21:21, Emmanuel Bourg ebo...@apache.org wrote: On 11/03/2012 16:53, Benedikt Ritter wrote: I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start? Hi Benedikt, thank you for helping. You can start looking

Re: [csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg
On 12/03/2012 00:02, Benedikt Ritter wrote: I've started to dig my way through the source. I've not done too much performance measuring in my career yet. I would use VisualVM for profiling, if you don't know anything better. Usually I work with JProfiler, it identifies the hotspots pretty