Hi Stuart,

Just a clarification up front: csv currently has two codebases. The one in the writer package is what I wrote (I don't remember whether anyone else has worked on it), and the main o.a.c.csv package is what the codebase started with. Because of lack of time, and because people seemed more interested in the main package, I just continued developing the writer package for private use. My focus is configurability and simple patterns, so that csv can be integrated easily into web applications and front-end applications. Afaik there is no interaction between the two packages :)

Stuart Robertson wrote:
> I just looked over the codebase and have a few questions.
>
> First, I'm wondering if some simple invalid format detection might be
> added as a configuration option. Something to detect whether a given
> input might even be theoretically parseable. I'd like to be able to
> detect, for instance, that this is a binary file, or maybe if it
> doesn't seem to contain a consistent separator pattern (line 1 has 10
> columns, line 2 only 6). Basically anything to detect upfront an
> invalid file condition rather than have garbage be passed into the
> file using CSVParser.

The ConfigGuesser could be reused to achieve this. The main goal of the ConfigGuesser is to limit user configuration.

> Second, any thoughts on how guessFieldSparator can infer if it's TDF
> or CSV? Or maybe what flavor of CSV format the file might be using
> (Excel or otherwise). I see the CSVConfigGuesser attempts to
> determine whether the file is fixed width. And the method
> guessFieldSeperator() seems to have a placeholder for guessing the
> file separator, but currently that portion is an empty for loop.

It's far from finished and very buggy :) It's the concept I wanted to draw attention to.

> Thinking about how that might be implemented, what if a regex counted
> the occurrences of common separators in each of the "guess input"
> lines. A reasonable heuristic might be that the separator guess is
> that separator that has a common occurrence count in each line, and we
> could go with that. Does this sound reasonable? Or maybe there's a
> better way to do it?

There are a lot of problems with guessing the format :)

> In general, I think it'd be a valuable feature for the guesser to be
> as robust as possible for a range of input types. Even if it weren't
> possible to make it perfect, for uses where the application can't
> completely control the format coming in, being fairly robust in the
> face of a variety of types would be outstanding.

Robust would be nice, but it is pretty hard to achieve. Maybe some way of setting a ConfigGuesser strategy could make it more robust for the scenario you are using it in. My usage scenario is that I don't have a clue what people want to use as their text format (and I don't care), so guessing should be as flexible as possible. With something like a wizard, people can change the behavior of the ConfigGuesser, for example by stating that this csv file has 10 fields, which makes the system more robust. For example, 1010101 probably means that the separator is 0 (1 is at the start and the end, so it is most likely a value). If the user specifies that the csv file has only one field, we know 0 is not the separator. So I prefer not to limit the options out of the box, but to have some kind of strategy for limiting them (in my case, users who specify that the csv has 10 fields). We could also provide standard strategies, like the default Excel export format.

> One last observation. CSVConfigGuesser looks intended to use the
> first 10 lines of input if available for inferring the right config.
> But looking at the code, it looks to me like it will actually read in
> the entire file. Here's the code (from SVN) I'm writing about:

Yeah, the code is bad, very bad :) I just committed the guesser as a concept. Almost every line is bad, to be honest :) I fixed the while loop in Subversion.

Mvgr,
Martin
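
As a rough illustration of the ideas discussed above (counting candidate separators over a bounded sample of lines, and narrowing the guess with a user-supplied field count), a minimal sketch could look like the following. Everything in it is hypothetical and not taken from the o.a.c.csv codebase: the class name `SeparatorGuess`, the method `guess`, the candidate list, and the 10-line sample limit are assumptions made for this example. It also ignores quoted fields, which is one of the many problems with guessing the format.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the o.a.c.csv codebase. */
final class SeparatorGuess {

    /** Candidate separators to try; this list is an assumption of the sketch. */
    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    /** Only this many leading lines are inspected, so the whole input is never scanned. */
    private static final int MAX_SAMPLE_LINES = 10;

    /**
     * Returns the candidate that occurs the same non-zero number of times in
     * every sampled line, or '\0' when no candidate qualifies.
     *
     * @param lines          input lines; only the first MAX_SAMPLE_LINES are used
     * @param expectedFields user hint for the number of fields, or 0 if unknown
     */
    static char guess(List<String> lines, int expectedFields) {
        List<String> sample = lines.subList(0, Math.min(lines.size(), MAX_SAMPLE_LINES));
        if (sample.isEmpty()) {
            return '\0';
        }
        char best = '\0';
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int count = consistentCount(sample, candidate);
            if (count <= 0) {
                continue; // absent, or not the same count on every line
            }
            if (expectedFields > 0 && count + 1 != expectedFields) {
                continue; // contradicts the user's field-count hint
            }
            if (count > bestCount) { // prefer the candidate that splits into more fields
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    /** Occurrences per line if identical across all sampled lines, otherwise -1. */
    private static int consistentCount(List<String> sample, char candidate) {
        int expected = -1;
        for (String line : sample) {
            int count = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == candidate) {
                    count++;
                }
            }
            if (expected == -1) {
                expected = count;
            } else if (count != expected) {
                return -1;
            }
        }
        return expected;
    }
}
```

For example, `SeparatorGuess.guess(lines, 0)` guesses without any hint, while `SeparatorGuess.guess(lines, 10)` only accepts a separator that yields 10 fields on every sampled line. A hint of one field rejects every candidate, which matches the point above that a single-field file has no separator at all.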
