Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-16 Thread Tomasz Zielonka
On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
 I think that would only work if there was one column per line...I didn't 
 make it clear that as well as being comma separated, the delimiter is 
 around each column, of which there are several on a line so if the 
 delimiter is ~ a file might look like:
 
 ~sdlkfj~, ~dsdkjf~ #eo row1
 ~sdf
 dfkj~, ~dfsd~  #eo row 2

It would be easier to experiment if you could provide us with an
example input file. If you are worried about revealing sensitive
information, you can change all characters other than newline,
~ and , to 'A's, for example. An accompanying output file, for checking
correctness, would be even nicer.

Best regards
Tomek
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-16 Thread Donn Cave
Quoth Tomasz Zielonka [EMAIL PROTECTED]:
| On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
|  I think that would only work if there was one column per line...I didn't 
|  make it clear that as well as being comma separated, the delimiter is 
|  around each column, of which there are several on a line so if the 
|  delimiter is ~ a file might look like:
|  
|  ~sdlkfj~, ~dsdkjf~ #eo row1
|  ~sdf
|  dfkj~, ~dfsd~  #eo row 2
|
| It would be easier to experiment if you could provide us with an
| example input file. If you are worried about revealing sensitive
| information, you can change all characters other than newline,
| ~ and , to 'A's, for example. An accompanying output file, for checking
| correctness, would be even nicer.

Yes, especially if there's anyone else as little acquainted with CSV
files as I am!

I have never bothered to learn to work with multiple lines in sed, but
from what I gather so far, the following awk would do it --

   awk '{ if (/~$/) print; else printf "%s", $0 }'

(literal separator for legibility.)  I know we're not exactly looking
for an awk or sed solution here, but thought it might add some context
to the exercise anyway.
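
For comparison, a rough Haskell sketch of the same line-joining idea
(assuming a literal '~' separator, like the awk above; untested):

-- keep a physical line that ends with the delimiter;
-- otherwise glue it onto the next one and try again
joinRows :: Char -> [String] -> [String]
joinRows d (x:y:rest)
  | not (null x) && last x == d = x : joinRows d (y:rest)
  | otherwise                   = joinRows d ((x ++ y) : rest)
joinRows _ xs = xs

main :: IO ()
main = interact (unlines . joinRows '~' . lines)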

Donn Cave, [EMAIL PROTECTED]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-16 Thread Jim Burton

Tomasz Zielonka wrote:

On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
I think that would only work if there was one column per line...I didn't 
make it clear that as well as being comma separated, the delimiter is 
around each column, of which there are several on a line so if the 
delimiter is ~ a file might look like:


~sdlkfj~, ~dsdkjf~ #eo row1
~sdf
dfkj~, ~dfsd~  #eo row 2


It would be easier to experiment if you could provide us with an
example input file. If you are worried about revealing sensitive
information, you can change all characters other than newline,
~ and , to 'A's, for example. An accompanying output file, for checking
correctness, would be even nicer.

Hi Tomasz, I can do that but they do essentially look like the example 
above, except with 10 - 30 columns, more data in each column, and more 
rows, maybe this side of a million. They are produced by an Oracle 
export which escapes the delimiter (often a tilde) from within the cols. 
The output file should have exactly one row per line, with extra 
newlines replaced by a string given as a param (it might be a space or 
an HTML tag -- I only just remembered this and my initial effort doesn't 
do it).
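
Roughly, I imagine that extra requirement looking something like this
(just a sketch, untested -- 'rep' is the replacement string passed in
as the new parameter):

import qualified Data.ByteString.Lazy.Char8 as B

-- Join physical lines into logical rows (a row is finished when the line
-- ends with the delimiter), replacing each embedded newline with 'rep'.
cleanWith :: B.ByteString -> Char -> [B.ByteString] -> [B.ByteString]
cleanWith rep d = foldr join []
  where
    join x ys
      | B.null x || B.last x == d = x : ys
      | otherwise = case ys of
          []       -> [x]
          (y:rest) -> B.concat [x, rep, y] : rest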


Thanks,

Jim


Best regards
Tomek



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-16 Thread Tomasz Zielonka
On Sat, Jun 16, 2007 at 12:08:22PM +0100, Jim Burton wrote:
 Tomasz Zielonka wrote:
 It would be easier to experiment if you could provide us with an
 example input file. If you are worried about revealing sensitive
 information, you can change all characters other than newline,
 ~ and , to 'A's, for example. An accompanying output file, for checking
 correctness, would be even nicer.

 Hi Tomasz, I can do that but they do essentially look like the example 
 above, except with 10 - 30 columns, more data in each column, and more 
 rows, maybe this side of a million. They are produced by an Oracle 
 export which escapes the delimiter (often a tilde) from within the cols. 
 The output file should have exactly one row per line, with extra 
 newlines replaced by a string given as a param (it might be a space or 
 an HTML tag -- I only just remembered this and my initial effort doesn't 
 do it).

I guess you've tried to convince Oracle to produce the right format in
the first place, so there would be no need for post-processing...?

I wonder what you would get if you set the delimiter to be a newline ;-)

Best regards
Tomek
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-16 Thread Jim Burton

Tomasz Zielonka wrote:


I guess you've tried to convince Oracle to produce the right format in
the first place, so there would be no need for post-processing...?


We don't control that job or the first db.



I wonder what you would get if you set the delimiter to be a newline ;-)


eek! ;-)


Best regards
Tomek



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Thomas Schilling


On 15 jun 2007, at 18.13, Jim Burton wrote:



import qualified Data.ByteString.Char8 as B



Have you tried

import qualified Data.ByteString.Lazy.Char8 as B

?



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Jim Burton

Thomas Schilling wrote:


On 15 jun 2007, at 18.13, Jim Burton wrote:



import qualified Data.ByteString.Char8 as B



Have you tried

import qualified Data.ByteString.Lazy.Char8 as B

?



No -- I'll give it a try and compare them. Is laziness preferable here?

Thanks,






___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Jason Dagit

On 6/15/07, Jim Burton [EMAIL PROTECTED] wrote:


No -- I'll give it a try and compare them. Is laziness preferable here?


Laziness might give you constant space usage (if you are sufficiently
lazy), which would help with the thrashing.
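
Something like this is what I have in mind -- only the import changes,
but readFile then pulls the file in lazily, chunk by chunk (a sketch,
not tested):

import qualified Data.ByteString.Lazy.Char8 as B

-- With the lazy variant the whole ~250MB file never has to sit in
-- memory at once: writeFile consumes the transformed chunks as
-- readFile produces them.
process :: (B.ByteString -> B.ByteString) -> FilePath -> FilePath -> IO ()
process f inFile outFile = B.readFile inFile >>= B.writeFile outFile . f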

Jason
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Thomas Schilling


On 15 jun 2007, at 21.14, Jim Burton wrote:


Thomas Schilling wrote:

On 15 jun 2007, at 18.13, Jim Burton wrote:

import qualified Data.ByteString.Char8 as B


Have you tried
import qualified Data.ByteString.Lazy.Char8 as B
?



No -- I'll give it a try and compare them. Is laziness preferable  
here?


Yes, since you were talking of big files. If you don't have to keep
the data around, lazy bytestrings will keep the memory footprint low.


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Sebastian Sylvan

On 15/06/07, Jim Burton [EMAIL PROTECTED] wrote:



I need to remove newlines from csv files (within columns, not at the end
of entire lines). This is prior to importing into a database and was being
done at my workplace by a Java class for quite a while, until the files
processed got bigger and it proved to be too slow. (The files are up to
~250MB at the moment.) It was rewritten in PL/SQL, to run after the import,
which was an improvement, but it still has our creaky db server thrashing
away. (You may have lots of helpful suggestions in mind, but we can't clean
the data at source, and AFAIK we can't do it incrementally because there is
no timestamp or anything on the last change to a row from the legacy db.)

We don't need a general solution - if a line ends with a delimiter we can
be sure it's the end of the entire line, because that's the way the csv
files are generated.

I had a quick go with ByteString (with no attempt at robustness etc) and
although I haven't compared it properly it seems faster than what we have
now. But you can easily make it faster, surely! Hints for improvement
please (e.g. can I unbox anything, make anything strict, or is that handled
by ByteString, is there a more efficient library function to replace the
fold...?).

module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- remove newlines in the middle of 'columns'
clean :: Char -> [B.ByteString] -> [B.ByteString]
clean d = foldr (\x ys -> if B.null x || B.last x == d
                            then x : ys
                            else B.append x (head ys) : tail ys) []

main = do args <- getArgs
          if length args < 2
             then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
             else do bs <- B.readFile (args!!0)
                     let d = if length args == 3
                               then head (args!!2)
                               else '~'  -- default delimiter (assumed '~' here)
                     B.writeFile (args!!1) $ (B.unlines . clean d . B.lines) bs
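
For what it's worth, a hypothetical invocation of the compiled program
(file names made up for illustration; the first character of the third
argument is taken as the delimiter):

crunchFile infile.csv outfile.csv '~'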



Hi,
I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B
-- takes a bytestring representing the file, concats the lines,
-- then splits it up into real lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines



--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Jim Burton

Sebastian Sylvan wrote:

On 15/06/07, Jim Burton [EMAIL PROTECTED] wrote:

[snip]

Hi,

Hi Sebastian,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B
-- takes a bytestring representing the file, concats the lines
-- then splits it up into real lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines


I think that would only work if there was one column per line...I didn't 
make it clear that as well as being comma separated, the delimiter is 
around each column, of which there are several on a line so if the 
delimiter is ~ a file might look like:


~sdlkfj~, ~dsdkjf~ #eo row1
~sdf
dfkj~, ~dfsd~  #eo row 2






___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Jason Dagit

On 6/15/07, Jim Burton [EMAIL PROTECTED] wrote:

Sebastian Sylvan wrote:
 On 15/06/07, Jim Burton [EMAIL PROTECTED] wrote:
[snip]
 Hi,
Hi Sebastian,
 I haven't compiled this, but you get the general idea:

 import qualified Data.ByteString.Lazy.Char8 as B
 -- takes a bytestring representing the file, concats the lines
 -- then splits it up into real lines using the delimiter
 clean :: Char -> B.ByteString -> [B.ByteString]
 clean d = B.split d . B.concat . B.lines

I think that would only work if there was one column per line...I didn't
make it clear that as well as being comma separated, the delimiter is
around each column, of which there are several on a line so if the
delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row1
~sdf
dfkj~, ~dfsd~  #eo row 2


I love to see people using Haskell, especially professionally, but I
have to wonder if the real tool for this job is sed? :-)

Jason
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Jim Burton

Jason Dagit wrote:
[snip]


I love to see people using Haskell, especially professionally, but I
have to wonder if the real tool for this job is sed? :-)

Jason


Maybe it is -- I've never used sed. (Cue oohs and ahhs from the 
gallery?) But from the (unquantified) gains so far, Haskell may well be 
enough of an improvement to fit the bill, though I'd be interested in 
anything that improved on it further still.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Brandon S. Allbery KF8NH


On Jun 15, 2007, at 18:37, Jason Dagit wrote:


I love to see people using Haskell, especially professionally, but I
have to wonder if the real tool for this job is sed? :-)


Actually, while sed could do that, it'd be a nightmare.  You really  
want a parser to deal with general CSV like this, and while you can  
write parsers in sed, you *really* don't want to.  :)
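
To make that concrete, a rough Parsec sketch for the tilde-quoted format
described earlier (assuming the export really does escape every '~'
inside a field, and ignoring the '#eo ...' annotations, which I take to
be commentary rather than data):

import Text.ParserCombinators.Parsec

-- one field: delimited by tildes, may contain embedded newlines
field :: Parser String
field = do char '~'
           s <- many (noneOf "~")
           char '~'
           return s

-- one row: comma-separated fields, possibly with spaces after the commas
row :: Parser [String]
row = field `sepBy1` (char ',' >> skipMany (char ' '))

-- a whole file: rows terminated by newlines (trailing spaces allowed)
csv :: Parser [[String]]
csv = row `endBy` (skipMany (char ' ') >> newline)

Running parse csv on the input would then give back the rows with any
embedded newlines intact, to be re-joined however the import wants them.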


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon universityKF8NH


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Jim Burton

Sebastian Sylvan wrote:


Ah, sorry, I thought the delimiter was a line delimiter. I'm trying to get to
that fusion goodness by using built-in functions as much as possible...

How about this one:

clean del = B.map (B.filter (/='\n')) . B.groupBy (\x y -> (x,y) /= (del,'\n'))

That groupBy will group it into groups which don't have the delimiter
followed by a newline in them (which is the sequence your rows end with),
then it filters out newlines in each row. You might want to filter out
spaces first (if there are any) so that you don't get a space between the
delimiter and newline at the end...


I think you still need unlines after that, so is the time complexity 
different to the

unlines . foldr (function including `last') . lines

in my first post? Or is it better for another reason, such as fusion 
goodness?

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Sebastian Sylvan

On 16/06/07, Jim Burton [EMAIL PROTECTED] wrote:


Sebastian Sylvan wrote:

 Ah, sorry, I thought the delimiter was a line delimiter. I'm trying to get to
 that fusion goodness by using built-in functions as much as possible...

 How about this one:

 clean del = B.map (B.filter (/='\n')) . B.groupBy (\x y -> (x,y) /= (del,'\n'))

 That groupBy will group it into groups which don't have the delimiter
 followed by a newline in them (which is the sequence your rows end with),
 then it filters out newlines in each row. You might want to filter out
 spaces first (if there are any) so that you don't get a space between the
 delimiter and newline at the end...


 I think you still need unlines after that, so is the time complexity
different to the

unlines . foldr (function including `last') . lines

in my first post? Or is it better for another reason, such as fusion
goodness?




Benchmark it, I guess :-)
Both versions use non-bytestring recursive functions (the outer B.map
should just be a straight map, and yours uses a foldr), which may mess
fusion up... Not sure what would happen here...
I don't have a Haskell compiler at this computer so I can't try anything
out...
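
If someone wants to try, a throwaway harness along these lines would do
for a rough comparison (untested, written without a compiler to hand;
"sample.csv" is a made-up file name):

import System.CPUTime (getCPUTime)
import qualified Data.ByteString.Lazy.Char8 as B

-- Jim's foldr-based version
cleanFoldr :: Char -> B.ByteString -> B.ByteString
cleanFoldr d = B.unlines . foldr step [] . B.lines
  where step x ys | B.null x || B.last x == d = x : ys
                  | otherwise = B.append x (head ys) : tail ys

-- the groupBy-based version, with the outer map as a plain list map
cleanGroup :: Char -> B.ByteString -> B.ByteString
cleanGroup d = B.unlines . map (B.filter (/= '\n'))
             . B.groupBy (\x y -> (x, y) /= (d, '\n'))

-- time one cleaning function by forcing its whole output
timeIt :: String -> (B.ByteString -> B.ByteString) -> B.ByteString -> IO ()
timeIt name f input = do
  start <- getCPUTime
  print (B.length (f input))
  end <- getCPUTime
  putStrLn (name ++ ": " ++ show (fromIntegral (end - start) / (1e12 :: Double)) ++ "s")

main :: IO ()
main = do
  input <- B.readFile "sample.csv"
  timeIt "foldr version" (cleanFoldr '~') input
  timeIt "groupBy version" (cleanGroup '~') input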


--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

2007-06-15 Thread Jason Dagit

On 6/15/07, Sebastian Sylvan [EMAIL PROTECTED] wrote:


Benchmark it, I guess :-)
Both versions use non-bytestring recursive functions (the outer B.map
should just be a straight map, and yours uses a foldr), which may mess
fusion up... Not sure what would happen here...
I don't have a Haskell compiler at this computer so I can't try anything
out...


I just remembered this recent thread about fast bytestring parsing:
http://www.nabble.com/Fast-number-parsing-with-strict-bytestrings--Was%3A-Re%3A-Seemingly-subtle-change-causes-large-performance-variation--tf3887303.html

Perhaps there is an idea or two that can be applied here?

Jason
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe