It sounds like you may need to give up a little to make this work. Suppose, for example, that you placed a limit on the length of a quoted string, say 1024 characters. The reader can then either start at the beginning of the file or read back by, say, 1024 characters from the start of its split to see whether the split begins inside a quote, and proceed accordingly. If quoted strings can be of arbitrary length, there may be no good solution.
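A minimal sketch of the read-back idea, in Python for brevity rather than as a Hadoop RecordReader. It assumes the whole file is available as an in-memory string and that no quoted string is longer than `max_quoted` characters; the function name and parameters are made up for illustration. One caveat: looking back exactly `max_quoted` characters can still be ambiguous when quotes are dense, so this variant keeps reading back until it finds a quote-free run longer than the limit (which, under the assumption, must lie outside any quoted string) or reaches the start of the file, and then counts quotes forward from there.

```python
def starts_inside_quote(data, pos, max_quoted=1024, quote='"'):
    """Report whether offset `pos` falls inside a quoted string.

    Assumes (hypothetically) that no quoted string holds more than
    `max_quoted` characters between its opening and closing quote.
    """
    run = 0      # length of the current quote-free run, scanning backward
    anchor = 0   # offset 0 (start of file) is always outside a quote
    for i in range(pos - 1, -1, -1):
        if data[i] == quote:
            run = 0
        else:
            run += 1
            if run > max_quoted:
                # A quote-free run longer than the limit cannot sit inside
                # a quoted string, so data[i] is known to be outside.
                anchor = i
                break

    # Count quotes from the known-outside anchor up to pos.
    # A doubled quote ("") toggles twice, so it leaves the state unchanged.
    inside = False
    for ch in data[anchor:pos]:
        if ch == quote:
            inside = not inside
    return inside


sample = 'a,"x\ny",b\n'
print(starts_inside_quote(sample, 5, max_quoted=4))  # True: offset 5 is the 'y' inside "x\ny"
```

In a real record reader the same scan would run over bytes fetched by seeking backward from the split offset, and the worst case still reads back to the start of the file when no long quote-free run exists.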
On Wed, Feb 22, 2012 at 11:01 AM, Keith Wiley <kwi...@keithwiley.com> wrote:

> It seems nearly impossible to use CSV files as Hadoop input. I see that
> there is a CsvRecordInput class, but have found virtually no examples
> online of how to use it...and the one example I did find blatantly assumed
> that the CSV records were delimited by endlines...which is not CSV spec.
> Based on my analysis below, I don't see how CSV input is possible, so I
> don't understand how CsvRecordInput can work (and I am having trouble
> understanding the completely undocumented CsvRecordInput.java; It isn't
> clear how that class is intended to be used). If CsvRecordInput solves all
> my problems, then great, but how do I use it?
>
> I need to process CSV files which will almost certainly contain quoted
> endlines. I have attempted to derive my own record reader for this task
> and conclude that it is virtually impossible without reading from the
> beginning of the file. I explain below.
>
> Consider this: Assuming a split starts at some arbitrary point in the
> file, the standard record reader approach would be to initialize the record
> reader by reading to the end of the current mid-record and beginning the
> record reader at the start of the next full record...but there is no way to
> positively identify the end of CSV record if you start at an arbitrary
> location without potentially reading to the end of the file!
>
> For example, we must consider the possibility that the split begins in the
> middle of a quoted string (therefore, endlines do not delimit records
> because they may be within a string). We must therefore scan for a
> possible end-quote to close the string, but if we *didn't* begin within a
> string there may *be no end-quote at all* (the entire CSV file might not
> contain a single quoted string). The only way to identify that we did not
> begin within a quoted string is to scan to the end of the CSV file (not the
> end of the *split* mind you).
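To illustrate the point about quoted endlines, here is a minimal sketch using Python's standard `csv` module. The sample data and the helper names are made up for illustration; the sample holds three records, the second of which contains an endline inside a quoted string.

```python
import csv
import io

# Made-up sample: three records, the second holds a quoted endline.
SAMPLE = 'id,note\n1,"line one\nline two"\n2,plain\n'


def count_by_endline(text):
    """Count records the naive way: one record per endline."""
    return len(text.splitlines())


def count_by_csv_parser(text):
    """Count records with a quote-aware parser reading from the file start."""
    return sum(1 for _ in csv.reader(io.StringIO(text, newline='')))


print(count_by_endline(SAMPLE), count_by_csv_parser(SAMPLE))  # 4 3
```

The endline count is off by one because the quoted endline is treated as a record boundary. The quote-aware parser gets it right only because it starts at the beginning of the text and therefore knows the quote state at every position.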
>
> So, initializing a CSV record reader with absolute error-free confidence
> potentially requires reading not only the entire split at the time of
> initialization (grossly inefficient in itself), but potentially requires
> reading the entire file, which may not even reside on the current node!
>
> I'm at a loss. How can Hadoop take CSV files as input? It must be
> possible. CSV is a very plain and common way to arrange textual data,
> which is Hadoop's forte; I'm sure people are processing CSV data with
> Hadoop, it seems like a natural fit...but I can't imagine how to enable
> Hadoop to read it under the conditions of Hadoop file splits.
>
> Blech. Help!
>
> ________________________________________________________________________________
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
> -- Yoda
> ________________________________________________________________________________

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com