Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
> I think that would only work if there was one column per line... I didn't
> make it clear that as well as being comma separated, the delimiter is
> around each column, of which there are several on a line, so if the
> delimiter is ~ a file might look like:
>
>   ~sdlkfj~, ~dsdkjf~ #eo row1
>   ~sdf
>   dfkj~, ~dfsd~ #eo row 2

It would be easier to experiment if you could provide us with an example input file. If you are worried about revealing sensitive information, you can change all characters other than newline, ~ and , to 'A's, for example. An accompanying output file, for checking correctness, would be even nicer.

Best regards
Tomek

___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
Quoth Tomasz Zielonka [EMAIL PROTECTED]:
| On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
| I think that would only work if there was one column per line... I didn't
| make it clear that as well as being comma separated, the delimiter is
| around each column, of which there are several on a line, so if the
| delimiter is ~ a file might look like:
|
|   ~sdlkfj~, ~dsdkjf~ #eo row1
|   ~sdf
|   dfkj~, ~dfsd~ #eo row 2
|
| It would be easier to experiment if you could provide us with an
| example input file. If you are worried about revealing sensitive
| information, you can change all characters other than newline,
| ~ and , to 'A's, for example. An accompanying output file, for checking
| correctness, would be even nicer.

Yes, especially if there's anyone else as little acquainted with CSV files as I am! I have never bothered to learn to work with multiple lines in sed, but from what I gather so far, the following awk would do it --

  awk '{ if (/~$/) print; else printf "%s", $0 }'

(literal separator for legibility.) I know we're not exactly looking for an awk or sed solution here, but thought it might add some context to the exercise anyway.

Donn Cave, [EMAIL PROTECTED]
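Donn's awk one-liner has a rough Haskell analogue using ordinary Strings. This is a sketch, not code from the thread; it hard-codes '~' as the delimiter for illustration.

```haskell
-- A rough Haskell analogue of the awk one-liner above (a sketch, not from
-- the thread). A line ending in the delimiter '~' is a complete row and
-- keeps its newline; any other line is emitted without one, so it joins
-- onto the following line, mirroring awk's `print` vs `printf "%s"`.
joinLine :: String -> String
joinLine l
  | not (null l) && last l == '~' = l ++ "\n"
  | otherwise                     = l

main :: IO ()
main = interact (concatMap joinLine . lines)
```

Like the awk version, this joins continuation lines with no separator at all, which may or may not be what the final tool should do.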
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
Tomasz Zielonka wrote:
> On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
> > I think that would only work if there was one column per line...
> [snip]
>
> It would be easier to experiment if you could provide us with an
> example input file. If you are worried about revealing sensitive
> information, you can change all characters other than newline, ~ and ,
> to 'A's, for example. An accompanying output file, for checking
> correctness, would be even nicer.

Hi Tomasz, I can do that, but they do essentially look like the example above, except with 10-30 columns, more data in each column, and more rows, maybe this side of a million. They are produced by an Oracle export which escapes the delimiter (often a tilde) within the cols. The output file should have exactly one row per line, with extra newlines replaced by a string given as a param (it might be a space or an html tag -- I only just remembered this and my initial effort doesn't do it).

Thanks,
Jim
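The spec Jim describes here (a line is a complete row only when it ends in the delimiter; embedded newlines become a caller-supplied string) could be sketched with lazy ByteStrings like this. The names and structure are illustrative, not code from the thread.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B

-- A sketch of the spec above (illustrative, not from the thread):
-- a line ending in the delimiter d is a finished row; any other line is
-- glued to its successor with the replacement string sep (a space, an
-- html tag, etc.), then rows are re-terminated with unlines.
joinRows :: Char -> B.ByteString -> B.ByteString -> B.ByteString
joinRows d sep = B.unlines . go . B.lines
  where
    go (x : y : rest)
      | not (B.null x) && B.last x == d = x : go (y : rest)
      | otherwise                       = go (B.concat [x, sep, y] : rest)
    go xs = xs
```

Being lazy-ByteString based, this should also stream rather than load a whole ~250MB file, though the recursive `go` is ordinary list recursion and won't fuse.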
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On Sat, Jun 16, 2007 at 12:08:22PM +0100, Jim Burton wrote:
> Tomasz Zielonka wrote:
> > It would be easier to experiment if you could provide us with an
> > example input file. [snip] An accompanying output file, for checking
> > correctness, would be even nicer.
>
> Hi Tomasz, I can do that, but they do essentially look like the example
> above, except with 10-30 columns, more data in each column, and more
> rows, maybe this side of a million. They are produced by an Oracle
> export which escapes the delimiter (often a tilde) within the cols. The
> output file should have exactly one row per line, with extra newlines
> replaced by a string given as a param (it might be a space or an html
> tag -- I only just remembered this and my initial effort doesn't do it).

I guess you've tried to convince Oracle to produce the right format in the first place, so there would be no need for post-processing...?

I wonder what you would get if you set the delimiter to be a newline ;-)

Best regards
Tomek
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
Tomasz Zielonka wrote:
> I guess you've tried to convince Oracle to produce the right format in
> the first place, so there would be no need for post-processing...?

We don't control that job or the first db.

> I wonder what you would get if you set the delimiter to be a newline ;-)

eek! ;-)
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On 15 jun 2007, at 18.13, Jim Burton wrote:
> import qualified Data.ByteString.Char8 as B

Have you tried

  import qualified Data.ByteString.Lazy.Char8 as B

?
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
Thomas Schilling wrote:
> On 15 jun 2007, at 18.13, Jim Burton wrote:
> > import qualified Data.ByteString.Char8 as B
>
> Have you tried
>
>   import qualified Data.ByteString.Lazy.Char8 as B
>
> ?

No -- I'll give it a try and compare them. Is laziness preferable here?

Thanks,
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On 6/15/07, Jim Burton [EMAIL PROTECTED] wrote:
> No -- I'll give it a try and compare them. Is laziness preferable here?

Laziness might give you constant space usage (if you are sufficiently lazy). Which would help with the thrashing.

Jason
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On 15 jun 2007, at 21.14, Jim Burton wrote:
> Thomas Schilling wrote:
> [snip]
>
> No -- I'll give it a try and compare them. Is laziness preferable here?

Yes, since you were talking of big files. If you don't have to keep the data around, lazy bytestrings will keep the memory footprint low.
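To illustrate the constant-space point (a sketch, not from the thread): a lazy-ByteString filter reads, transforms, and writes chunk by chunk, so peak memory tracks the chunk size rather than the file size.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B

-- Stream stdin to stdout, dropping carriage returns. With the lazy
-- interface the file is read in chunks and each chunk can be written
-- (and freed) before the next is demanded, so this runs in roughly
-- constant space even on very large inputs.
stripCR :: B.ByteString -> B.ByteString
stripCR = B.filter (/= '\r')

main :: IO ()
main = B.interact stripCR
```

The same shape applies to the cleaning function in this thread: as long as the transformation never forces the whole string (e.g. no global reverse or sort), laziness keeps the footprint low.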
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On 15/06/07, Jim Burton [EMAIL PROTECTED] wrote:
> I need to remove newlines from csv files (within columns, not at the end
> of entire lines). This is prior to importing into a database and was
> being done at my workplace by a java class for quite a while, until the
> files processed got bigger and it proved to be too slow. (The files are
> up to ~250MB at the moment.) It was rewritten in PL/SQL, to run after
> the import, which was an improvement, but it still has our creaky db
> server thrashing away. (You may have lots of helpful suggestions in
> mind, but we can't clean the data at source, and AFAIK we can't do it
> incrementally because there is no timestamp or anything on the last
> change to a row from the legacy db.) We don't need a general solution --
> if a line ends with a delimiter we can be sure it's the end of the
> entire line, because that's the way the csv files are generated.
>
> I had a quick go with ByteString (with no attempt at robustness etc)
> and although I haven't compared it properly it seems faster than what
> we have now. But you can easily make it faster, surely! Hints for
> improvement please (e.g. can I unbox anything, make anything strict, or
> is that handled by ByteString, is there a more efficient library
> function to replace the fold...?).
>
>   module Main where
>
>   import System.Environment (getArgs)
>   import qualified Data.ByteString.Char8 as B
>
>   -- remove newlines in the middle of 'columns'
>   clean :: Char -> [B.ByteString] -> [B.ByteString]
>   clean d = foldr (\x ys -> if B.null x || B.last x == d
>                               then x : ys
>                               else B.append x (head ys) : tail ys) []
>
>   main = do
>     args <- getArgs
>     if length args < 2
>       then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
>       else do
>         bs <- B.readFile (args !! 0)
>         let d = if length args == 3
>                   then head (args !! 2)
>                   else '~' -- default lost in the archive; '~' assumed
>         B.writeFile (args !! 1) $ (B.unlines . clean d . B.lines) bs

Hi, I haven't compiled this, but you get the general idea:

  import qualified Data.ByteString.Lazy.Char8 as B

  -- takes a bytestring representing the file, concats the lines
  -- then splits it up into real lines using the delimiter
  clean' :: Char -> B.ByteString -> [B.ByteString]
  clean' d = B.split d . B.concat . B.lines

-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
Sebastian Sylvan wrote:
> On 15/06/07, Jim Burton [EMAIL PROTECTED] wrote:
> [snip]

Hi Sebastian,

> Hi, I haven't compiled this, but you get the general idea:
>
>   import qualified Data.ByteString.Lazy.Char8 as B
>
>   -- takes a bytestring representing the file, concats the lines
>   -- then splits it up into real lines using the delimiter
>   clean' :: Char -> B.ByteString -> [B.ByteString]
>   clean' d = B.split d . B.concat . B.lines

I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

  ~sdlkfj~, ~dsdkjf~ #eo row1
  ~sdf
  dfkj~, ~dfsd~ #eo row 2
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On 6/15/07, Jim Burton [EMAIL PROTECTED] wrote:
> Sebastian Sylvan wrote:
> [snip]
>
> I think that would only work if there was one column per line... I
> didn't make it clear that as well as being comma separated, the
> delimiter is around each column, of which there are several on a line.
> [snip]

I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)

Jason
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
Jason Dagit wrote:
> [snip]
> I love to see people using Haskell, especially professionally, but I
> have to wonder if the real tool for this job is sed? :-)

Maybe it is -- I've never used sed. (Cue oohs and ahhs from the gallery?) But from the (unquantified) gains so far, haskell may certainly be enough of an improvement to fit the bill, even though I'd be interested in anything that improved on it further still.
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On Jun 15, 2007, at 18:37, Jason Dagit wrote:
> I love to see people using Haskell, especially professionally, but I
> have to wonder if the real tool for this job is sed? :-)

Actually, while sed could do that, it'd be a nightmare. You really want a parser to deal with general CSV like this, and while you can write parsers in sed, you *really* don't want to. :)

--
brandon s. allbery  [solaris,freebsd,perl,pugs,haskell]  [EMAIL PROTECTED]
system administrator  [openafs,heimdal,too many hats]  [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon university  KF8NH
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
Sebastian Sylvan wrote:
> Ah, sorry, I thought the delimiter was a line delimiter. I'm trying to
> get to that fusion goodness by using built-in functions as much as
> possible... How about this one:
>
>   clean del = B.map (B.filter (/= '\n'))
>             . B.groupBy (\x y -> (x, y) /= (del, '\n'))
>
> That groupBy will group it into groups which don't have the delimiter
> followed by a newline in them (which is the sequence your rows end
> with), then it filters out newlines in each row. You might want to
> filter out spaces first (if there are any) so that you don't get a
> space between the delimiter and newline at the end...

I think you still need unlines after that, so is the time complexity different to the `unlines . foldr (function including `last`) . lines` of my first post? Or is it better for another reason, such as fusion goodness?
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On 16/06/07, Jim Burton [EMAIL PROTECTED] wrote:
> Sebastian Sylvan wrote:
> [snip]
>
> I think you still need unlines after that, so is the time complexity
> different to the `unlines . foldr (function including `last`) . lines`
> of my first post? Or is it better for another reason, such as fusion
> goodness?

Benchmark it I guess :-) Both versions use non-bytestring recursive functions (the outer B.map should just be a straight map, and yours uses a foldr), which may mess fusion up... Not sure what would happen here... I don't have a Haskell compiler at this computer so I can't try anything out...

-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862
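For the "benchmark it" suggestion, a minimal timing wrapper might look like the sketch below. This is not from the thread; `timeIt` is a hypothetical helper, it measures CPU time only, and a dedicated benchmarking library would be more rigorous.

```haskell
import System.CPUTime (getCPUTime)

-- Time an IO action in CPU seconds, returning the elapsed time alongside
-- the action's result. getCPUTime is in picoseconds, hence the 1e12.
-- Caveat: lazy results must be forced inside the action (e.g. with
-- Control.Exception.evaluate, or by writing the output to a file) or the
-- work happens after the timer stops.
timeIt :: IO a -> IO (Double, a)
timeIt act = do
  t0 <- getCPUTime
  r  <- act
  t1 <- getCPUTime
  return (fromIntegral (t1 - t0) / 1e12, r)
```

One could then time each candidate as, say, `timeIt (B.writeFile out (cleanVariant d bs))`, so that writing the file forces the whole computation.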
Re: [Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files
On 6/15/07, Sebastian Sylvan [EMAIL PROTECTED] wrote:
> Benchmark it I guess :-) Both versions use non-bytestring recursive
> functions (the outer B.map should just be a straight map, and yours
> uses a foldr), which may mess fusion up... Not sure what would happen
> here... I don't have a Haskell compiler at this computer so I can't try
> anything out...

I just remembered this recent thread about fast bytestring parsing:

http://www.nabble.com/Fast-number-parsing-with-strict-bytestrings--Was%3A-Re%3A-Seemingly-subtle-change-causes-large-performance-variation--tf3887303.html

Perhaps there is an idea or two that can be applied here?

Jason