Hi Stuart,

Just a clarification up front: csv currently has two codebases. The one
in the writer package is what I have written (I don't remember whether
anyone else has worked on it), and the main o.a.c.csv package is what
the codebase started with. Because of a lack of time, and because people
seemed more interested in the main package, I just continued developing
the writer package for private use. My focus is configurability and
simple patterns that make it easy to integrate csv into web applications
and front-end applications.
AFAIK there is no interaction between the two packages :)

Stuart Robertson wrote:
> I just looked over the codebase and have a few questions.
> 
> First, I'm wondering if some simple invalid format detection might be
> added as a configuration option.  Something to detect whether a given
> input might even be theoretically parseable.  I'd like to be able to
> detect, for instance, that this is a binary file, or maybe if it
> doesn't seem to contain a consistent separator pattern (line 1 has 10
> columns, line 2 only 6).  Basically anything to detect upfront an
> invalid file condition rather than have garbage be passed into the
> file using CSVParser.

The ConfigGuesser could be reused to achieve this. Its main goal is to
limit the amount of configuration the user has to do.
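
A minimal sketch of the two upfront checks asked about (binary content,
inconsistent column counts). CsvSanityCheck and both method names are
hypothetical and exist in neither package; the column check is naive on
purpose and ignores separators inside quoted values.

```java
import java.util.List;

// Hypothetical helper, not part of o.a.c.csv or the writer package.
class CsvSanityCheck {

    // Treats the sample as binary if it contains a control character
    // other than tab, carriage return or line feed.
    static boolean looksBinary(String sample) {
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\r' && c != '\n') {
                return true;
            }
        }
        return false;
    }

    // True if every non-empty line has the same number of columns.
    // Assumption: no quoting, so each separator starts a new column.
    static boolean hasConsistentColumns(List<String> lines, char separator) {
        int expected = -1;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            int columns = 1;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == separator) {
                    columns++;
                }
            }
            if (expected == -1) {
                expected = columns;
            } else if (columns != expected) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could run both checks on the guess input and refuse to hand the
file to the parser when either one fails.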

> 
> Second, any thoughts on how guessFieldSeperator() can infer if it's TDF
> or CSV?  Or maybe what flavor of CSV format the file might be using
> (Excel or otherwise).  I see the CSVConfigGuesser attempts to
> determine whether the file is fixed width.  And the method
> guessFieldSeperator() seems to have a placeholder for guessing the
> file separator, but currently that portion is an empty for loop.

It's far from finished and very buggy :) It's the concept I wanted to
draw attention to.

> 
> Thinking about how that might be implemented, what if a regex counted
> the occurrences of common separators in each of the "guess input"
> lines.  A reasonable heuristic might be that the separator guess is
> that separator that has a common occurrence count in each line, and we
> could go with that.  Does this sound reasonable?  Or maybe there's a
> better way to do it?

There are a lot of problems with guessing the format :)
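
For reference, the counting heuristic from the question could be
sketched roughly as below. SeparatorGuess, its candidate list and the
'\0' "no guess" return value are all assumptions made for illustration;
it uses a plain character count rather than a regex. It also shows some
of the problems: quoted values that contain a separator break the count,
and when two candidates are both consistent the first one in the list
simply wins.

```java
import java.util.List;

// Hypothetical sketch, not the CSVConfigGuesser implementation.
class SeparatorGuess {

    // Assumed candidate list; the order decides ties.
    static final char[] CANDIDATES = {',', ';', '\t', '|'};

    // Returns the first candidate that occurs the same non-zero number
    // of times on every line, or '\0' if no candidate qualifies.
    static char guess(List<String> lines) {
        if (lines.isEmpty()) {
            return '\0';
        }
        for (char candidate : CANDIDATES) {
            int expected = count(lines.get(0), candidate);
            boolean consistent = expected > 0;
            for (String line : lines) {
                if (count(line, candidate) != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return candidate;
            }
        }
        return '\0';
    }

    // Counts how often c occurs in line.
    static int count(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }
}
```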

> 
> In general, I think it'd be a valuable feature for the guesser to be
> as robust as possible for a range of input types.  Even if it weren't
> possible to make it perfect, for uses where the application can't
> completely control the format coming in, being fairly robust in the
> face of a variety of types would be outstanding.

Robust would be nice, but it is pretty hard to achieve. Maybe some way
of setting the ConfigGuesser strategy can make it more robust for the
scenario you are using it in. My usage scenario is that I don't have a
clue what people want to use as their text format (and I don't care), so
guessing should be as flexible as possible. Through, e.g., a wizard,
people can change the behavior of the ConfigGuesser, for instance by
stating that this csv file has 10 fields, and that makes the system more
robust. For example, 1010101 probably means that the separator is 0 (1
is at the start and the end, so it is most likely a value). If the user
specifies that the csv file has only one field, we know 0 is not the
separator.

So I prefer not to limit the options out of the box, but to have some
kind of strategy that can limit them (in my case, users who specify that
the csv has 10 fields). We could also provide standard strategies, like
the default Excel export format.
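
To make the field-count hint concrete, here is a small sketch of how
such a strategy could narrow the candidates for a single line.
FieldCountHint and candidates() are hypothetical names. The assumptions
are that a line with N fields contains the separator exactly N - 1
times, and that a character at the very start or end of the line is a
value rather than a separator.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical strategy sketch, not part of either package.
class FieldCountHint {

    // Returns the characters that could be the separator if the line
    // is known to hold expectedFields fields.
    static Set<Character> candidates(String line, int expectedFields) {
        Set<Character> result = new LinkedHashSet<>();
        if (line.isEmpty() || expectedFields < 2) {
            return result; // one field means there is no separator
        }
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c : line.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        char first = line.charAt(0);
        char last = line.charAt(line.length() - 1);
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            char c = e.getKey();
            if (c == first || c == last) {
                continue; // assumed to be a value, see above
            }
            if (e.getValue() == expectedFields - 1) {
                result.add(c);
            }
        }
        return result;
    }
}
```

With the line 1010101 and a hint of 4 fields, only 0 is left as a
candidate; with a hint of 1 field the result is empty, so 0 is ruled
out.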

> 
> One last observation.  CSVConfigGuesser looks intended to use the
> first 10 lines of input if available for inferring the right config.
> But looking at the code, it looks to me like it will actually read in
> the entire file.  Here's the code (from SVN) I'm writing about:

Yeah, the code is bad, very bad :) I just committed the guesser as a
concept. Almost every line is bad, to be honest :)

Fixed the while loop in Subversion.
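
The committed fix is not shown in this mail. As a sketch only, a loop
that stops after the first N lines instead of reading the whole file
could look like this. GuessInput and readFirstLines() are hypothetical
names, and wrapping the IOException is just a choice made for the
example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the code that was committed.
class GuessInput {

    // Reads at most maxLines lines; the size check comes first, so no
    // further line is read once the limit has been reached.
    static List<String> readFirstLines(Reader in, int maxLines) {
        BufferedReader reader = new BufferedReader(in);
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while (lines.size() < maxLines
                    && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Note that BufferedReader may buffer more characters than it returns, so
the underlying Reader should be reopened or reset before the real parse.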

Mvgr,
Martin
